Comparing Riak MapReduce and Hadoop MapReduce

Xiaoming Gao mkobie at gmail.com
Mon Jul 22 14:55:25 EDT 2013


Yeah, I also read about pre-reduce and will try it out. Thanks for all the
help! I really appreciate it.

Xiaoming


On Mon, Jul 22, 2013 at 1:43 PM, Jeremiah Peschka [via Riak Users] <
ml-node+s197444n4028497h20 at n3.nabble.com> wrote:

> Oh, I almost forgot, you can also supply the do_prereduce argument to your
> reduce phase - this performs a pre-reduce phase on the mapper. This can,
> depending on the workload, significantly decrease the network overhead
> between the mappers and the reducer.
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
>
> On Mon, Jul 22, 2013 at 10:21 AM, Jeremiah Peschka <[hidden email]<http://user/SendEmail.jtp?type=node&node=4028497&i=0>
> > wrote:
>
>> For JavaScript the number of reducers is configured in the app.config
>> file on each node with the reduce_js_vm_count property.
>>
>> ---
>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>> MCITP: SQL Server 2008, MVP
>> Cloudera Certified Developer for Apache Hadoop
>>
>>
>> On Mon, Jul 22, 2013 at 8:07 AM, Xiaoming Gao <[hidden email]<http://user/SendEmail.jtp?type=node&node=4028497&i=1>
>> > wrote:
>>
>>> Thanks for the clarification, Jeremiah!
>>>
>>> One last question: how should I configure the MR job to have multiple
>>> reducer processes on a single node?
>>>
>>> Regards,
>>> Xiaoming
>>>
>>>
>>> On Mon, Jul 22, 2013 at 1:33 AM, Jeremiah Peschka [via Riak Users] <[hidden
>>> email] <http://user/SendEmail.jtp?type=node&node=4028486&i=0>> wrote:
>>>
>>>> Ah, yeah, I'm mistaken about search partitioning. The docs are correct.
>>>>
>>>> I have no idea how the scheduling works.
>>>>
>>>> If I had to guess, I would guess that it is a streaming operation.
>>>>
>>>> --
>>>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>>>> MCITP: SQL Server 2008, MVP
>>>> Cloudera Certified Developer for Apache Hadoop
>>>>
>>>> On Jul 21, 2013, at 10:08 PM, Xiaoming Gao <[hidden email]<http://user/SendEmail.jtp?type=node&node=4028477&i=0>>
>>>> wrote:
>>>>
>>>> Thanks a lot, Jeremiah! Your answers really help clarify the issues.
>>>>
>>>> Just one more question, by "document-based indices", do you mean
>>>> document-based partitioning for the indices? Because what I found in the
>>>> online document
>>>> http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduceis "Search
>>>> uses term-based partitioning – also known as a global index." I am not sure
>>>> if the implementation has changed for the latest version of Riak, but if
>>>> term-based partitioning is used, does that mean Riak will only schedule the
>>>> mappers after the whole list of <bucket, key> pair is returned from the
>>>> index?
>>>>
>>>> Thanks,
>>>> Xiaoming
>>>>
>>>>
>>>> On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users] <[hidden
>>>> email] <http://user/SendEmail.jtp?type=node&node=4028476&i=0>> wrote:
>>>>
>>>>> Responses inline. Hopefully they shed some light on the subject.
>>>>>
>>>>> ---
>>>>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>>>>> MCITP: SQL Server 2008, MVP
>>>>> Cloudera Certified Developer for Apache Hadoop
>>>>>
>>>>>
>>>>> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]<http://user/SendEmail.jtp?type=node&node=4028474&i=0>
>>>>> > wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I am trying to learn about Riak MapReduce and comparing it with Hadoop
>>>>>> MapReduce, and there are some details that I am interested in but not
>>>>>> covered in the online documents. So hopefully we can get some help
>>>>>> here
>>>>>> about the following questions? Thanks in advance!
>>>>>>
>>>>>
>>>>> They're not at all similar. Hadoop MR is optimized for sequential data
>>>>> processing in large batches. Riak MR works better when you think of it like
>>>>> a multi-processing engine - you can perform work across a matching set of
>>>>> items and that work will be distributed across the cluster during map
>>>>> phases.
>>>>>
>>>>> Take a look at this thread for a bit of discussion about when you
>>>>> should use Riak MapReduce:
>>>>> http://markmail.org/message/qpoilvmm635inb5v
>>>>>
>>>>> Or, if you want to, you can run a Riak MR job across an entire bucket,
>>>>> which really is like scanning every table in an RDBMS while looking for
>>>>> rows from a single table. MR jobs run with an R of 1. So, at least there's
>>>>> that.
>>>>>
>>>>>
>>>>>> 1. For a given MapReduce request (or to say, job), how does Riak
>>>>>> decide how
>>>>>> many mappers to use for the job? For example, if I have 8 nodes and
>>>>>> my data
>>>>>> are distributed across all nodes with an "N" value of 2, will I have 4
>>>>>> mappers running on 4 nodes concurrently? Is it possible to have
>>>>>> multiple
>>>>>> mappers (e.g., 4 or even 6) for the same MR job running on each node
>>>>>> (for
>>>>>> better processing speed)?
>>>>>>
>>>>>
>>>>> To the best of my recollection, this will be based on either:
>>>>>
>>>>> 1) If you're using JavaScript MR jobs, the number of mappers and
>>>>> reducers is controlled by the the map_js_vm_count and reduce_js_vm_count
>>>>> settings from each node's app.config file.
>>>>> 2) If you're using Erlang: magic. This will be handled by the Erlang
>>>>> VM and is based on number of processors and your overall Erlang VM
>>>>> configuration.
>>>>>
>>>>>
>>>>>>
>>>>>> 2. If I run a MapReduce job over the results of a Riak Search query,
>>>>>> how
>>>>>> does Riak schedule the mappers based on the search results?
>>>>>>
>>>>>
>>>>> Riak Search uses document-based indices - search will query every node
>>>>> in the cluster. Map phases happen and then results are then streamed to the
>>>>> reducer.
>>>>>
>>>>>
>>>>>>
>>>>>> 3. How does Riak handle intermediate data generated by mappers?
>>>>>> Specifically:
>>>>>> (1) In Hadoop MapReduce, the output of mappers are <key, value>
>>>>>> pairs, and
>>>>>> the output from all mappers are first grouped based on keys, and then
>>>>>> handed
>>>>>> over to the reducer. Does Riak do similar grouping of intermediate
>>>>>> data?
>>>>>>
>>>>>
>>>>> The only reason for the intermediate grouping/scratch work in Hadoop
>>>>> MR jobs is to deal with multiple reducers. Although, I'm not entirely sure
>>>>> how this works in Riak, my suspicion is that data is streamed across the
>>>>> wire after the data is read from disk.
>>>>>
>>>>>
>>>>>>
>>>>>> (2) How are mapper outputs transmitted to the reducer? Does Riak use
>>>>>> local
>>>>>> disks on the mapper nodes or reducer nodes to store the intermediate
>>>>>> data
>>>>>> temporarily?
>>>>>
>>>>>
>>>>> Since large MR jobs can cause out of memory errors, you can bet good
>>>>> money that the answer is "no".
>>>>>
>>>>>
>>>>>>
>>>>>> 4. According to the document
>>>>>>
>>>>>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work,
>>>>>> each MR job only schedules one reducer, which runs on the coordinate
>>>>>> node.
>>>>>> Is there any way to configure a MR job to use multiple reducers?
>>>>>>
>>>>>
>>>>> Using Riak MR, there's no way to create a job that runs reducers on
>>>>> multiple nodes. You can have multiple reducer processes on a single node,
>>>>> but not reducers on multiple nodes.
>>>>>
>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Xiaoming
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html
>>>>>> Sent from the Riak Users mailing list archive at Nabble.com.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> riak-users mailing list
>>>>>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028474&i=1>
>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> riak-users mailing list
>>>>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028474&i=2>
>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>  If you reply to this email, your message will be added to the
>>>>> discussion below:
>>>>>
>>>>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028474.html
>>>>>  To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click
>>>>> here.
>>>>> NAML<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>>>
>>>>
>>>>
>>>> ------------------------------
>>>> View this message in context: Re: Comparing Riak MapReduce and Hadoop
>>>> MapReduce<http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028476.html>
>>>>
>>>> Sent from the Riak Users mailing list archive<http://riak-users.197444.n3.nabble.com/>at
>>>> Nabble.com.
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028477&i=1>
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028477&i=2>
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>> ------------------------------
>>>>  If you reply to this email, your message will be added to the
>>>> discussion below:
>>>>
>>>> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028477.html
>>>>  To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click
>>>> here.
>>>> NAML<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>>
>>>
>>>
>>> ------------------------------
>>> View this message in context: Re: Comparing Riak MapReduce and Hadoop
>>> MapReduce<http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028486.html>
>>> Sent from the Riak Users mailing list archive<http://riak-users.197444.n3.nabble.com/>at Nabble.com.
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028497&i=2>
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>
> _______________________________________________
> riak-users mailing list
> [hidden email] <http://user/SendEmail.jtp?type=node&node=4028497&i=3>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
> ------------------------------
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028497.html
>  To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click
> here<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4028454&code=bWtvYmllQGdtYWlsLmNvbXw0MDI4NDU0fDUzMjM5MzU5OA==>
> .
> NAML<http://riak-users.197444.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>




--
View this message in context: http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454p4028501.html
Sent from the Riak Users mailing list archive at Nabble.com.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130722/90a7311a/attachment.html>


More information about the riak-users mailing list