Comparing Riak MapReduce and Hadoop MapReduce

Jeremiah Peschka jeremiah.peschka at
Sun Jul 21 23:14:10 EDT 2013

Responses inline. Hopefully they shed some light on the subject.

Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <mkobie at> wrote:

> Hi everyone,
> I am trying to learn about Riak MapReduce and comparing it with Hadoop
> MapReduce, and there are some details that I am interested in but not
> covered in the online documents. So hopefully we can get some help here
> about the following questions? Thanks in advance!

They're not at all similar. Hadoop MR is optimized for sequential data
processing in large batches. Riak MR works better when you think of it as
a multi-processing engine - you can perform work across a matching set of
items, and that work will be distributed across the cluster during the map
phase.

Take a look at this thread for a bit of discussion about when you should
use Riak MapReduce:

Or, if you want to, you can run a Riak MR job across an entire bucket,
which really is like scanning every table in an RDBMS while looking for
rows from a single table. MR jobs run with an R of 1, so at least there's
that much savings on read overhead.
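As a rough sketch, a whole-bucket job submitted to Riak's HTTP `/mapred` endpoint looks something like the following (the bucket name and host are made up for illustration):

```python
import json

# Hypothetical bucket name. A whole-bucket input forces Riak to list
# every key in the cluster, which is the "scan every table" cost above.
job = {
    "inputs": "logs",
    "query": [
        # Built-in JavaScript map function that emits each object's
        # value parsed as JSON.
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
    ],
}

# The spec would be POSTed to http://<node>:8098/mapred with
# Content-Type: application/json.
print(json.dumps(job))
```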

> 1. For a given MapReduce request (or to say, job), how does Riak decide how
> many mappers to use for the job? For example, if I have 8 nodes and my data
> are distributed across all nodes with an "N" value of 2, will I have 4
> mappers running on 4 nodes concurrently? Is it possible to have multiple
> mappers (e.g., 4 or even 6) for the same MR job running on each node (for
> better processing speed)?

To the best of my recollection, this will be based on either:

1) If you're using JavaScript MR jobs, the number of mappers and reducers
is controlled by the map_js_vm_count and reduce_js_vm_count settings in
each node's app.config file.
2) If you're using Erlang: magic. This will be handled by the Erlang VM and
is based on number of processors and your overall Erlang VM configuration.
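For the JavaScript case, the relevant app.config fragment lives under the riak_kv section (the counts below are example values, not recommendations):

```erlang
%% app.config fragment - example values only
{riak_kv, [
    %% Number of JavaScript VMs available to map phases
    {map_js_vm_count, 8},
    %% Number of JavaScript VMs available to reduce phases
    {reduce_js_vm_count, 6}
]}
```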

> 2. If I run a MapReduce job over the results of a Riak Search query, how
> does Riak schedule the mappers based on the search results?

Riak Search uses document-based indices, so a search will query every node
in the cluster. Map phases run where the data lives, and the results are
then streamed to the coordinating node.
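A sketch of feeding search results into a job, using the legacy riak_search MapReduce input form (the bucket name and query string here are invented):

```python
import json

# Instead of a bucket name, the inputs field names the riak_search
# module, and the search results become the inputs to the map phase.
job = {
    "inputs": {
        "module": "riak_search",
        "function": "mapred_search",
        "arg": ["customers", "name:john*"],  # hypothetical bucket/query
    },
    "query": [
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
    ],
}
print(json.dumps(job))
```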

> 3. How does Riak handle intermediate data generated by mappers?
> Specifically:
> (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and
> the output from all mappers are first grouped based on keys, and then
> handed
> over to the reducer. Does Riak do similar grouping of intermediate data?

The only reason for the intermediate grouping/scratch work in Hadoop MR
jobs is to deal with multiple reducers. Although I'm not entirely sure how
this works in Riak, my suspicion is that data is streamed across the wire
as soon as it is read from disk.
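For contrast, the Hadoop-style shuffle the question describes boils down to grouping map output by key before reduce - a toy sketch:

```python
from collections import defaultdict

# Toy illustration of Hadoop's shuffle/sort step: map output pairs
# are grouped by key so each reducer sees (key, [values]).
map_output = [("a", 1), ("b", 2), ("a", 3)]

groups = defaultdict(list)
for key, value in map_output:
    groups[key].append(value)

# Riak MR has a single reduce phase per job, so it can simply stream
# map results to the coordinating node instead of doing this grouping.
print(dict(groups))  # {'a': [1, 3], 'b': [2]}
```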

> (2) How are mapper outputs transmitted to the reducer? Does Riak use local
> disks on the mapper nodes or reducer nodes to store the intermediate data
> temporarily?

Since large MR jobs can cause out-of-memory errors, you can bet good money
that the answer is "no" - intermediate results are held in memory rather
than spilled to disk.

> 4. According to the document
> each MR job only schedules one reducer, which runs on the coordinating node.
> Is there any way to configure a MR job to use multiple reducers?

Using Riak MR, there's no way to create a job that runs reducers on
multiple nodes. You can have multiple reducer processes on a single node -
the coordinating one - but the reduce phase never spreads across nodes.
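To make that concrete, here is a sketch of a two-phase job: the map phase fans out across the cluster, while the single reduce phase runs on the coordinating node (the bucket name is made up; Riak.reduceSum is one of Riak's built-in JavaScript reduce functions):

```python
import json

job = {
    "inputs": "sales",  # hypothetical bucket
    "query": [
        # Runs on every node holding matching data.
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
        # Runs only on the coordinating node.
        {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}},
    ],
}
print(json.dumps(job))
```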

> Best regards,
> Xiaoming
