MapReduce performance

Miroslav Urbanek miroslav.urbanek at gmail.com
Thu Jun 21 12:19:39 EDT 2012


Dear Riak users,

We are evaluating Riak for storing data similar to the dataset in the
tutorial example ("goog.csv" at
http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html
). However, MapReduce seems to be very slow. We generated 200k lines
in the same format as goog.csv and loaded them into Riak. Using the
MapReduce queries from the example, we are unable to get any results:
every query returns a "timeout" error.
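
For reference, we loaded the rows roughly like this (a sketch,
assuming the default HTTP interface on 127.0.0.1:8098 and the stock
/riak URL prefix; the JSON field names simply mirror the goog.csv
header and are our own choice):

$ # key each row by its date; tail -n +2 skips the CSV header line
$ tail -n +2 goog-200k.csv | while IFS=, read -r d o h l c v a; do
>   curl -s -X PUT "http://127.0.0.1:8098/riak/goog/$d" \
>        -H 'Content-Type: application/json' \
>        -d "{\"Date\":\"$d\",\"Open\":$o,\"High\":$h,\"Low\":$l,\"Close\":$c,\"Volume\":$v,\"AdjClose\":$a}"
> done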

A sample query:
{
  "inputs": "goog",
  "query": [
    {"map": {"language": "javascript", "source":
      "function(value, keyData, arg) { var data = Riak.mapValuesJson(value)[0]; return [data.High]; }"}},
    {"reduce": {"language": "javascript", "name": "Riak.reduceMax", "keep": true}}
  ]
}
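
We submit the query by POSTing it to the /mapred resource of the HTTP
interface (query.json is a hypothetical file holding the JSON above):

$ curl -X POST http://127.0.0.1:8098/mapred \
>      -H 'Content-Type: application/json' \
>      -d @query.json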

For comparison, the following awk one-liner takes under a second:
$ awk 'NR==1 {next} max=="" || $3 > max {max=$3} END {print max}' FS=',' goog-200k.csv

I know that Riak has to execute JavaScript code and do a lot of
inter-node communication, so the comparison is not entirely fair.
However, the entire goog-200k.csv file is only 11 MB, and I expected
Riak to handle it without a problem. We have experimented with
different backends, with a 4-node cluster on a single machine, and
with a cluster of 3 physical nodes, but the results are the same.

I have several questions:
1. What setup do you recommend for this use case, specifically for
storing logs?
2. I know that MapReduce over an entire bucket is not recommended, but
how would you calculate statistics over an entire bucket, similar to
the queries in the tutorial?
3. We also tried Riak Search, but we were unable to express a query
like this one, i.e. finding the maximum value of a column. Is there a
way to do it?

Thanks,
Miro



