Comparing Riak MapReduce and Hadoop MapReduce

Xiaoming Gao mkobie at gmail.com
Fri Jul 19 20:07:58 EDT 2013


Hi everyone,

I am trying to learn about Riak MapReduce and compare it with Hadoop
MapReduce, and there are some details I am interested in that are not
covered in the online documentation. I am hoping to get some help here
with the following questions. Thanks in advance!

1. For a given MapReduce request (that is, a job), how does Riak decide how
many mappers to use? For example, if I have 8 nodes and my data is
distributed across all nodes with an "N" value of 2, will I have 4
mappers running on 4 nodes concurrently? Is it possible to have multiple
mappers (e.g., 4 or even 6) for the same MR job running on each node (for
better processing speed)?

2. If I run a MapReduce job over the results of a Riak Search query, how
does Riak schedule the mappers based on the search results?

3. How does Riak handle intermediate data generated by mappers?
Specifically:
(1) In Hadoop MapReduce, the output of the mappers is a set of <key, value>
pairs, and the output from all mappers is first grouped by key and then
handed over to the reducers. Does Riak do a similar grouping of intermediate
data?

(2) How are mapper outputs transmitted to the reducer? Does Riak use local
disks on the mapper nodes or reducer nodes to store the intermediate data
temporarily?

4. According to the documentation at
http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work ,
each MR job schedules only one reducer, which runs on the coordinating node.
Is there any way to configure an MR job to use multiple reducers?
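
To make question 3(1) concrete, here is a minimal Python sketch of the
grouping step as I understand it on the Hadoop side. It is only an
illustration and not Hadoop or Riak code: the function names (map_words,
group_by_key, reduce_counts, run_job) are hypothetical, and everything runs
in a single process rather than across nodes.

```python
from collections import defaultdict


def map_words(line):
    """Hypothetical mapper: emit one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def group_by_key(pairs):
    """Group intermediate (key, value) pairs so each key maps to all its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """Hypothetical reducer: sum all values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then group by key, then reduce, all in one process."""
    intermediate = [pair for line in lines for pair in map_words(line)]
    grouped = group_by_key(intermediate)
    # The reducer sees each key exactly once, with every value for that key.
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))
```

My question is whether Riak performs anything equivalent to group_by_key
between the map and reduce phases, or whether the reduce function simply
receives the mapper outputs as an ungrouped list.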

Best regards,
Xiaoming






