Reduce phase only on one node?
dan at basho.com
Fri Jul 23 20:14:51 EDT 2010
Currently, reduce functions run on the coordinating node. The coordinating
node runs 2 processes per reduce phase to achieve some parallelism. There
are two feature requests open to improve the implementation:
* Allow users to configure the number of processes used during reduce
* Distribute reduce phases across the cluster
There is no target release set for either issue, but you should be able to
add yourself to the CC list in Bugzilla to receive notifications.
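To make the difference concrete, here is a minimal Python sketch (not Riak
code) comparing a reduce that runs only on the coordinator with the proposed
per-node reduce followed by a re-reduce. The node names, the sample data, and
the function names are all hypothetical, and the sketch assumes a reduce
function that takes a list and returns a list, so that its output can be fed
back in as input.

```python
from collections import Counter

def reduce_counts(values):
    """Merge a list of {state: count} dicts into a one-element list.

    Returning a list means the output can be re-reduced by the same function.
    """
    total = Counter()
    for partial in values:
        total.update(partial)  # adds counts key by key
    return [dict(total)]

# Hypothetical map-phase output, grouped by the node that produced it.
node_outputs = {
    "node_a": [{"CA": 1}, {"NY": 1}, {"CA": 1}],
    "node_b": [{"NY": 1}, {"TX": 1}],
    "node_c": [{"CA": 1}, {"TX": 1}, {"TX": 1}],
}

def reduce_on_coordinator(outputs):
    """Ship every raw map value to one place and reduce there."""
    all_values = [v for values in outputs.values() for v in values]
    return reduce_counts(all_values)

def reduce_per_node_then_rereduce(outputs):
    """Reduce on each node first, then re-reduce the per-node partials."""
    partials = []
    for values in outputs.values():
        partials.extend(reduce_counts(values))  # one partial per node
    return reduce_counts(partials)

if __name__ == "__main__":
    print(reduce_on_coordinator(node_outputs))
    print(reduce_per_node_then_rereduce(node_outputs))
```

Both paths give the same totals because merging counts is commutative and
associative. In the second path each node would send a single partial result
rather than every mapped value.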
Basho Technologies, Inc.
dan at basho.com
On Fri, Jul 23, 2010 at 2:34 PM, John D. Rowell <me at jdrowell.com> wrote:
> +1 to this, my understanding is that you can use the same reduce function to
> re-reduce a stream of data and still get the same results. Is this what
> actually happens in Riak internally (i.e. the coordinating node only
> re-reduces each node's reduce) or does the reduce function only run on the
> coordinating node, which receives the raw map data?
> 2010/7/23 John Butler <php_john at hotmail.com>
>> I went through the Riak Fast Track and overall I'm excited about the
>> possibilities of Riak. However, I did see one thing that concerns me in
>> regard to MapReduce jobs. From the Riak docs and this page here:
>> http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
>> it seems to suggest that while Map jobs are performed in parallel, Reduce
>> jobs are not.
>> Wouldn't it be better if each node performed the reduce function on its own
>> set of data before sending it to the main coordinator? Imagine if you are
>> mapping over a high number of objects (millions) and are trying to aggregate
>> the data grouped by some demographic (such as state). If all the mappers
>> send the data to the coordinating node to be reduced then you are going to send
>> millions of items from each data node all to one coordinating node. If
>> instead each node called reduce first, then sent it to the coordinating
>> node, then each node would be at most sending over 50 objects (for 50
>> states). Not only would this greatly reduce network traffic, it would be
>> able to execute in parallel greatly improving performance. Since all reduce
>> jobs are commutative, associative and idempotent, I see no reason why this
>> couldn't happen (of course I'm not familiar enough with the Riak internals to
>> strongly assert this).
>> Maybe I misunderstood and this is already happening. If it's not
>> happening, why not, and is that on the roadmap to be added later?
>> riak-users mailing list
>> riak-users at lists.basho.com