Comparing Riak MapReduce and Hadoop MapReduce
jeremiah.peschka at gmail.com
Mon Jul 22 01:30:45 EDT 2013
Ah, yeah, I'm mistaken about search partitioning. The docs are correct.
I have no idea how the scheduling works.
If I had to guess, I would guess that it is a streaming operation.
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop
On Jul 21, 2013, at 10:08 PM, Xiaoming Gao <mkobie at gmail.com> wrote:
> Thanks a lot, Jeremiah! Your answers really help clarify the issues.
> Just one more question, by "document-based indices", do you mean document-based partitioning for the indices? Because what I found in the online document http://docs.basho.com/riak/latest/dev/advanced/search/#Search-KV-and-MapReduce is "Search uses term-based partitioning – also known as a global index." I am not sure if the implementation has changed for the latest version of Riak, but if term-based partitioning is used, does that mean Riak will only schedule the mappers after the whole list of <bucket, key> pair is returned from the index?
> On Sun, Jul 21, 2013 at 11:20 PM, Jeremiah Peschka [via Riak Users] <[hidden email]> wrote:
>> Responses inline. Hopefully they shed some light on the subject.
>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>> MCITP: SQL Server 2008, MVP
>> Cloudera Certified Developer for Apache Hadoop
>> On Fri, Jul 19, 2013 at 5:07 PM, Xiaoming Gao <[hidden email]> wrote:
>>> Hi everyone,
>>> I am trying to learn about Riak MapReduce and comparing it with Hadoop
>>> MapReduce, and there are some details that I am interested in but not
>>> covered in the online documents. So hopefully we can get some help here
>>> about the following questions? Thanks in advance!
>> They're not at all similar. Hadoop MR is optimized for sequential data processing in large batches. Riak MR works better when you think of it like a multi-processing engine - you can perform work across a matching set of items and that work will be distributed across the cluster during map phases.
>> Take a look at this thread for a bit of discussion about when you should use Riak MapReduce: http://markmail.org/message/qpoilvmm635inb5v
>> Or, if you want to, you can run a Riak MR job across an entire bucket, which really is like scanning every table in an RDBMS while looking for rows from a single table. MR jobs run with an R of 1. So, at least there's that.
>>> 1. For a given MapReduce request (or to say, job), how does Riak decide how
>>> many mappers to use for the job? For example, if I have 8 nodes and my data
>>> are distributed across all nodes with an "N" value of 2, will I have 4
>>> mappers running on 4 nodes concurrently? Is it possible to have multiple
>>> mappers (e.g., 4 or even 6) for the same MR job running on each node (for
>>> better processing speed)?
>> To the best of my recollection, this will be based on either:
>> 2) If you're using Erlang: magic. This will be handled by the Erlang VM and is based on number of processors and your overall Erlang VM configuration.
>>> 2. If I run a MapReduce job over the results of a Riak Search query, how
>>> does Riak schedule the mappers based on the search results?
>> Riak Search uses document-based indices - search will query every node in the cluster. Map phases happen and then results are then streamed to the reducer.
>>> 3. How does Riak handle intermediate data generated by mappers?
>>> (1) In Hadoop MapReduce, the output of mappers are <key, value> pairs, and
>>> the output from all mappers are first grouped based on keys, and then handed
>>> over to the reducer. Does Riak do similar grouping of intermediate data?
>> The only reason for the intermediate grouping/scratch work in Hadoop MR jobs is to deal with multiple reducers. Although, I'm not entirely sure how this works in Riak, my suspicion is that data is streamed across the wire after the data is read from disk.
>>> (2) How are mapper outputs transmitted to the reducer? Does Riak use local
>>> disks on the mapper nodes or reducer nodes to store the intermediate data
>> Since large MR jobs can cause out of memory errors, you can bet good money that the answer is "no".
>>> 4. According to the document
>>> http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work ,
>>> each MR job only schedules one reducer, which runs on the coordinate node.
>>> Is there any way to configure a MR job to use multiple reducers?
>> Using Riak MR, there's no way to create a job that runs reducers on multiple nodes. You can have multiple reducer processes on a single node, but not reducers on multiple nodes.
>>> Best regards,
>>> View this message in context: http://riak-users.197444.n3.nabble.com/Comparing-Riak-MapReduce-and-Hadoop-MapReduce-tp4028454.html
>>> Sent from the Riak Users mailing list archive at Nabble.com.
>>> riak-users mailing list
>>> [hidden email]
>> riak-users mailing list
>> [hidden email]
>> If you reply to this email, your message will be added to the discussion below:
>> To unsubscribe from Comparing Riak MapReduce and Hadoop MapReduce, click here.
> View this message in context: Re: Comparing Riak MapReduce and Hadoop MapReduce
> Sent from the Riak Users mailing list archive at Nabble.com.
> riak-users mailing list
> riak-users at lists.basho.com
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users