Distributed reduce phases via pre-reduce [Was "Secondary Indexes - Feedback?"]

Gordon Tillman gtillman at mezeo.com
Fri Nov 18 11:36:00 EST 2011


Bryan thanks very much for the info.  I am going to investigate this over the weekend.  What I am trying to achieve is the ability to sort a potentially very large dataset in the most efficient way possible.

Regards,

--g



On Nov 18, 2011, at 10:12 , Bryan Fink wrote:

> On Thu, Nov 17, 2011 at 9:45 AM, Gordon Tillman <gtillman at mezeo.com> wrote:
>> I'm really interested in being able to implement distributed
>> reduce phases (specifically to do a partial sort)  and then have that output
>> handle by a final reduce phase that could perform an efficient merge sort
>>  and stream results back to the client.  That would be really cool!
> 
> Hi, Gordon.  I just caught up on the 2i thread, and noticed your
> comment at the end.  As of 1.0 and RiakPipe-based MapReduce, you are
> able to do something like this with the "pre-reduce" tuning
> functionality:
> 
> http://wiki.basho.com/MapReduce.html#Pre-Reduce
> 
> With pre-reduce enabled, your reduce function is run in two stages.
> The first stage processes the outputs of the previous map stage in
> parallel, on the vnode where each output was produced.  The second
> stage is reduce as you know it, processing all of the results of that
> pre-reduce stage in one place.
> 
> For example, if you had these vnodes producing these outputs:
> 
>   A: 1,2,3
>   B: 4,5,6
>   C: 7,8,9
> 
> enabling pre-reduce would cause three parallel reduces:
> 
>   x = reduce(1,2,3)
>   y = reduce(4,5,6)
>   z = reduce(7,8,9)
> 
> followed by a final all-together reduce of those results:
> 
>   reduce(x,y,z)
> 
> So, it's the same function evaluated in both stages, but if you've
> written it to play well with re-reduce, it should "just work".
> 
> Hope that helps,
> Bryan





More information about the riak-users mailing list