Reduce phase only on one node?

John Butler php_john at hotmail.com
Sat Jul 24 09:20:50 EDT 2010


Thank you.
-John

________________________________
> Date: Fri, 23 Jul 2010 17:14:51 -0700
> Subject: Re: Reduce phase only on one node?
> From: dan at basho.com
> To: me at jdrowell.com
> CC: php_john at hotmail.com; riak-users at lists.basho.com
>
> Currently, reduce functions run on the coordinating node. The coordinating node runs 2 processes per reduce phase to achieve some parallelism. There are two features requests open to improve the implementation:
>
>
> Allow users to toggle the number of processes used during reduce
> https://issues.basho.com/148
>
> Distribute reduce phases across the cluster
>
> https://issues.basho.com/149
>
> There is no target release set for either issue but you should be able to add yourself to the CC list in Bugzilla to receive notifications.
>
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> dan at basho.com
>
>
>
> On Fri, Jul 23, 2010 at 2:34 PM, John D. Rowell> wrote:
>
> +1 to this, my understanding is that you can use the same reduce funcion to re-reduce a stream of data and still get the same results. Is this what actually happens in Riak internally (i.e. the coordinating node only re-reduces each node's reduce) or does the reduce function only run on the coordinating node, which receives the raw map data?
>
>
>
> 2010/7/23 John Butler>
>
>
>
>
>
> Hello,
>
> I went through the Riak Fast Track and overall I'm excited about the possibilities of Riak. However, I did see one thing that concerns me in regard to MapReduce jobs. From the Riak docs and this page here: http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/ it seems to suggest that while Map jobs are performed in parallel Reduce jobs are not.
>
>
>
> Wouldn't be better if each node performed the reduce function on it's own set of data before sending it to the main coordinator? Imagine if you are mapping over a high number of objects (millions) and are trying to aggregate the data grouped by some demographic (such as state). If all the mappers send the data to coordinating node to be reduced then you are going to send millions of items from each data node all to one coordinating node. If instead each node called reduce first, then sent it to the coordinating node, then each node would be at most sending over 50 objects (for 50 states). Not only would this greatly reduce network traffic, it would be able to execute in parallel greatly improving performance. Since all reduce jobs are communicative, associative and idempotent I see no reason why this couldn't happen (of course I'm not familiar with the Riak internals to strongly assert this).
>
>
>
> Maybe I misunderstood and this is already happening. If it's not happening, why and is that on the roadmap to be added later?
>
> -John
>
> _________________________________________________________________
>
> Hotmail is redefining busy with tools for the New Busy. Get more from your inbox.
>
> http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2
>
>
>
> _______________________________________________
>
> riak-users mailing list
>
> riak-users at lists.basho.com
>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
>
> _______________________________________________
>
> riak-users mailing list
>
> riak-users at lists.basho.com
>
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
 		 	   		  
_________________________________________________________________
Hotmail is redefining busy with tools for the New Busy. Get more from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_2


More information about the riak-users mailing list