Comparing Riak MapReduce and Hadoop MapReduce

Xiaoming Gao mkobie at gmail.com
Mon Jul 22 01:08:58 EDT 2013


Thanks a lot, Jeremiah! Your answers really help clarify the issues.

Just one more question, by "document-based indices", do you mean
document-based partitioning for the indices? Because what I found in the
online document
http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduceis
"Search
uses term-based partitioning – also known as a global index." I am not sure
if the implementation has changed for the latest version of Riak, but if
term-based partitioning is used, does that mean Riak will only schedule the
mappers after the whole list of <bucket, key> pair is returned from the
index?

Thanks,
Xiaoming


On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users] <
ml-node+s197444n4028474h7 at n3.nabble.com> wrote:

> Responses inline. Hopefully they shed some light on the subject.
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
>
> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]<http://user/SendEmail.jtp?type=node&node=4028474&i=0>
> > wrote:
>
>> Hi everyone,
>>
>> I am trying to learn about Riak MapReduce and comparing it with Hadoop
>> MapReduce, and there are some details that I am interested in but not
>> covered in the online documents. So hopefully we can get some help here
>> about the following questions? Thanks in advance!
>>
>
> They're not at all similar. Hadoop MR is optimized for sequential data
> processing in large batches. Riak MR works better when you think of it like
> a multi-processing engine - you can perform work across a matching set of
> items and that work will be distributed across the cluster during map
> phases.
>
> Take a look at this thread for a bit of discussion about when you should
> use Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v
>
> Or, if you want to, you can run a Riak MR job across an entire bucket,
> which really is like scanning every table in an RDBMS while looking for
> rows from a single table. MR jobs run with an R of 1. So, at least there's
> that.
>
>
>> 1. For a given MapReduce request (or to say, job), how does Riak decide
>> how
>> many mappers to use for the job? For example, if I have 8 nodes and my
>> data
>> are distributed across all nodes with an "N" value of 2, will I have 4
>> mappers running on 4 nodes concurrently? Is it possible to have multiple
>> mappers (e.g., 4 or even 6) for the same MR job running on each node (for
>> better processing speed)?
>>
>
> To the best of my recollection, this will be based on either:
>
> 1) If you're using JavaScript MR jobs, the number of mappers and reducers
> is controlled by the the map_js_vm_count and reduce_js_vm_count settings
> from each node's app.config file.
> 2) If you're using Erlang: magic. This will be handled by the Erlang VM
> and is based on number of processors and your overall Erlang VM
> configuration.
>
>
>>
>> 2. If I run a MapReduce job over the results of a Riak Search query, how
>> does Riak schedule the mappers based on the search results?
>>
>
> Riak Search uses document-based indices - search will query every node in
> the cluster. Map phases happen and then results are then streamed to the
> reducer.
>
>
>>
>> 3. How does Riak handle intermediate data generated by mappers?
>> Specifically:
>> (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and
>> the output from all mappers are first grouped based on keys, and then
>> handed
>> over to the reducer. Does Riak do similar grouping of intermediate data?
>>
>
> The only reason for the intermediate grouping/scratch work in Hadoop MR
> jobs is to deal with multiple reducers. Although, I'm not entirely sure how
> this works in Riak, my suspicion is that data is streamed across the wire
> after the data is read from disk.
>
>
>>
>> (2) How are mapper outputs transmitted to the reducer? Does Riak use local
>> disks on the mapper nodes or reducer nodes to store the intermediate data
>> temporarily?
>
>
> Since large MR jobs can cause out of memory errors, you can bet good money
> that the answer is "no".
>
>
>>
>> 4. According to the document
>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work,
>> each MR job only schedules one reducer, which runs on the coordinate node.
>> Is there any way to configure a MR job to use multiple reducers?
>>
>
> Using Riak MR, there's no way to create a job that runs reducers on
> multiple nodes. You can have multiple reducer processes on a single node,
> but not reducers on multiple nodes.
>
>
>>
>> Best regards,
>> Xiaoming
>>
>>
>>
>> --
>> View this message in context:
>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html
>> Sent from the Riak Users mailing list archive at Nabble.com.
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028474&i=1>
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
>
> _______________________________________________
> riak-users mailing list
> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028474&i=2>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
> ------------------------------
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028474.html
>  To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click
> here<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4028454&code=bWtvYmllQGdtYWlsLmNvbXw0MDI4NDU0fDUzMjM5MzU5OA==>
> .
> NAML<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>




--
View this message in context: http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028476.html
Sent from the Riak Users mailing list archive at Nabble.com.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130721/7335d6aa/attachment.html>


More information about the riak-users mailing list