Map reduce weirdness on Riak 1.3

Christian Dahlqvist christian at basho.com
Sat Apr 6 16:09:33 EDT 2013


Hi Kartik,

The reduce phase will normally run recursively a number of times as results come in from map phases on different nodes [1]. This allows Riak to start reducing the results before all data is available and will not require all input data to be stored in memory on the coordinating node. The output from the first iteration will be passed into the following iteration together with the next batch of map phase results. When processing large amounts of data this is the most efficient way to do it, and reduce functions should be written in a way so that they can handle this.

If you know that the number of results will be manageable or the reduce phase requires all results to be processed at once, there is an argument called 'reduce_phase_only_1' [1] that can be specified in order to force the reduce phase to run only once all map phase results are available.

In your case you might however be able to completely eliminate the reduce phase if you made the map phase perform filtering and formatting. If a record does not match the criteria, you just return [].

This would give you a list of JSON documents instead of a single one, but would most likely be more efficient.

Best regards,

Christian

[1] http://docs.basho.com/riak/1.2.0/references/appendices/MapReduce-Implementation/


On 6 Apr 2013, at 20:22, Kartik Thakore <kthakore at aimed.cc> wrote:

> I am not sure have you mean by re-reduce. Can you clarify what you mean? 
> 
> I was thinking the reduce phase happens after the whole map is done. Do I have to implement a count waiter for the map results in the reduce code?
> 
> 
> On Sat, Apr 6, 2013 at 3:19 PM, Christian Dahlqvist <christian at basho.com> wrote:
> Hi Kartik,
> 
> What you are seeing is a result of you not accounting for re-reduce in you reduce phase function. 
> 
> In Riak reduce phases generally run recursively and the input for each run may contain both values from preceding map phase as well as output from previous iterations of the reduce phase. In order for the reduce phase to behave correctly you will need to distinguish between the different types of input records in your reduce function. 
> 
> Best regards,
> 
> Christian
> 
> 
> 
> 
> On 6 Apr 2013, at 19:09, Kartik Thakore <kthakore at aimed.cc> wrote:
> 
>> Hello,
>> 
>> I recently setup a test cluster to try to do a tech demo web application on.
>> 
>> I have been having some weirdness with the map reduce functionality.
>> 
>> My database is here:
>> 
>> http://aimed.cc:8098/riak/rekon/go#/buckets/test_rand_docs
>> 
>> The cluster has 5 nodes
>> ulimit 4096
>> 
>> This is Riak 1.3.0 release on Debian with 663 of free memory.
>> 
>> I am running this map reduce:
>> 
>> curl -X POST -H "content-type: application/json" \
>>     http://aimed.cc:8098/mapred --data @-<<\EOF
>> {"inputs": "test_rand_docs",
>> "query":[{"map":{"language":"javascript","source":"
>>     function (v) {
>>         var r = {};
>>         var data = JSON.parse(v.values[0].data);
>>         r.data = data;
>>         r.key = v.key;
>>         return [ r ];
>>     }
>> "}},{"reduce":{"language":"javascript","source":"
>>     function (v) {
>>         var r = {};
>>         for( var i in v )
>>         {
>>             var doc = v[i];
>>             if( doc['data'] !== undefined) {
>>                 var age = doc['data']['age_int'];
>>                 if ( age !== undefined && age > 10 && age <25 ){
>>                     r[doc['key']] = doc['data'];
>>                 }
>>             }
>>         }
>>         return  [ r ];
>> 
>>     }
>> "}}]
>> 
>> 
>> my result is randomly:
>> 
>> [{"9DYMGV0B6Jdn5DivoTExiqyDYUC":{"age_int":24},"JQYUs2onC822EOzMaToz71j77e":{"age_int":18},"AcrUwotAdYaV5zitaMylnUgYsWY":{"age_int":24}}]
>> 
>> 
>> or
>> 
>> [{"LYJpg97ZA5qjZTTv2cfavmRgxLb":{"age_int":11}}]
>> 
>> but it is clear with this:
>> 
>> http://aimed.cc:8098/solr/test_rand_docs/select?q=age_int:[10%20TO%2025]
>> 
>> 
>> that there are 134 records ....
>> 
>> so what is going on?
>> 
>> 
>> Is it low memory? Or that it is on a XEN machine (Linode)? Is there a scaleable memory server vendor (AWS or w/e) I should consider?
>> 
>> Thanks,
>> Kartik Thakore
>> 
>> 
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130406/6e11667b/attachment.html>


More information about the riak-users mailing list