MapReduce performance problem

Kevin Burton rkevinburton at
Tue Feb 26 15:26:52 EST 2013

Right. I know it is not ideal. I have been able to split the VMs into
groups, so 2 of the 4 are running on separate hardware. When I ask for
anything more, the response is 'get real'. That being said, I want to get the
maximum performance out of the limited resources that I have. I have a
separate question for the group about getting basho_bench up and running (I
get a long string of errors). What more do you need to know about my
environment to "understand" it? I am new, so I am probably asking the wrong
questions; please tell me what you are missing that might help diagnose the
problem.


I agree. I will use ListAllKeysFromIndex to get a list of keys for now. The
only reason that I included the m/r code is because of the error. If I get
another m/r job with similar output, I need to know how to diagnose the
problem. I was using JavaScript m/r because I kind of understand JavaScript.
Is there a separate task to do Erlang m/r jobs? I assume that I will need to
know Erlang. Any recommendations on how best to learn what I need to know
about Erlang to write an m/r job? But before I do that, wouldn't it be prudent
to confirm that the source of the problem is indeed JavaScript? How would I
pinpoint that?


These two m/r jobs are basically an example of using m/r that would be
typical for our application. For maintenance reasons alone, we wouldn't want
to go down the path of maintaining a counter for all the fields that we have.
There could be departments, categories, celebrations, ... Basically a lot
of them. For all intents and purposes it is an ad hoc query. If that is one
of the limitations then we will have to note it and see if coping with this
limitation is too onerous.


From: riak-users [mailto:riak-users-bounces at] On Behalf Of
Jeremiah Peschka
Sent: Tuesday, February 26, 2013 1:33 PM
To: riak-users
Subject: Re: MapReduce performance problem


Before you go troubleshooting performance problems, I'd focus on getting
results out of basho_bench and getting a good understanding of your
environment. If you're running 4 guests with 1 vCPU each on the same VM host
with all guests sharing a single pool of disks, no amount of tuning will
solve that problem. Without an understanding of the operating environment,
we can't do much more than point at general best practices and say "these
might help you, not sure, though."


As far as your specifics - for the first query, if you're attempting to get
a list of keys, I still recommend using ListAllKeysFromIndex(string Bucket).
This will be pushed out as part of the IRiakClient interface in the next day
or two, and I'm sure the gods of OOP won't kill you for using an actual
RiakClient object instead of an IRiakClient interface between now and then.
Sending those results back directly from riak_kv is going to be far faster
than messing around with a JavaScript MapReduce job.


Always keep in mind that MR jobs are not going to be the most efficient way
to perform any kind of ad hoc querying - they're great for large scale data
transformations but if you really want performance, you'll want to write
Erlang MR jobs. 


If you need to maintain counts per department, a better approach will be
persisting counters and maintaining those counts via some kind of
caching/pre-aggregation mechanism, most likely outside of Riak because of
eventual consistency guarantees. Alex Siculars will eventually show up and
start chanting "use redis"; you'll be resistant at first, but his arguments
make a lot of sense. Riak does some things very well; maintaining consistent
counters isn't one of them... yet.
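
The pre-aggregation idea can be sketched roughly as follows. This is only an illustration, not a Riak or CorrugatedIron API: `DepartmentCounters`, `adjust`, `recordWrite`, and `snapshot` are hypothetical names, and the in-memory `Map` merely stands in for whatever external store (Redis, for instance) would actually hold the counts. The application would call `recordWrite` alongside every write it sends to Riak.

```javascript
// Hypothetical in-memory stand-in for an external counter store.
// None of these names come from Riak or CorrugatedIron.
class DepartmentCounters {
  constructor() {
    this.counts = new Map();
  }

  // Adjust one department's count by delta; drop entries that reach zero.
  adjust(department, delta) {
    if (!department) return; // ignore records without a department
    const next = (this.counts.get(department) || 0) + delta;
    if (next <= 0) this.counts.delete(department);
    else this.counts.set(department, next);
  }

  // oldRecord is null for a brand-new record; newRecord is null for a delete.
  recordWrite(oldRecord, newRecord) {
    if (oldRecord) this.adjust(oldRecord.Department, -1);
    if (newRecord) this.adjust(newRecord.Department, +1);
  }

  // Plain-object view of the current counts.
  snapshot() {
    return Object.fromEntries(this.counts);
  }
}

const counters = new DepartmentCounters();
counters.recordWrite(null, { Department: "Toys" });
counters.recordWrite(null, { Department: "Toys" });
counters.recordWrite(null, { Department: "Garden" });
counters.recordWrite({ Department: "Toys" }, { Department: "Garden" }); // moved
counters.recordWrite({ Department: "Garden" }, null); // deleted
console.log(counters.snapshot()); // { Toys: 1, Garden: 1 }
```

Reading a count then becomes a single lookup instead of a scan over every object in the bucket, at the price of keeping the counters in step with every write path.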


Jeremiah Peschka - Founder, Brent Ozar Unlimited

MCITP: SQL Server 2008, MVP

Cloudera Certified Developer for Apache Hadoop


On Tue, Feb 26, 2013 at 10:52 AM, Kevin Burton <rkevinburton at>

I have a simple CorrugatedIron client that makes the following request:


                IRiakClient riakClient = cluster.CreateClient();
                RiakBinIndexRangeInput bucketKeyInput = new RiakBinIndexRangeInput(
                    productBucketName, "$key", "00000000", "99999999");
                RiakMapReduceQuery query = new RiakMapReduceQuery()
                    .Inputs(bucketKeyInput)
                    .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
                RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);

So as you can see this is a very basic range m/r query. But the result comes
back as:


Riak returned an error. Code '0'. Message: timeout



Another type of m/r query I have:


                IRiakClient riakClient = cluster.CreateClient();
                var query = new RiakMapReduceQuery()
                    .MapJs(m => m.Source(@"function(v,d,a) {" +
                        "var p = JSON.parse(v.values[0].data);" +
                        "var r = [];" +
                        "d = escape(p.Department);" +
                        "if(d != '') {" +
                        "var o = {};" +
                        "o[d] = 1;" +
                        "r.push(o);" +
                        "}" +
                        "return r;" +
                        "}"))
                    .ReduceJs(m => m.Source(@"function(v,d,a) {" +
                        "var r = {};" +
                        "for(var i in v) {" +
                        "  for(var w in v[i]) {" +
                        "    if(w in r) r[w] += v[i][w];" +
                        "    else r[w] = v[i][w];" +
                        "  }" +
                        "}" +
                        "return [r];" +
                        "}"));
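
One way to separate the function logic from everything else is to run the same map and reduce functions locally, outside Riak. The sketch below does that in Node.js under stated assumptions: the 60,000 sample records and their department names are invented, the names `mapDepartment` and `reduceCounts` are hypothetical, and the input objects only mimic the `values[0].data` shape the map function reads. Because Riak may call a reduce function more than once on partial results, the harness reduces in batches and then reduces the partials again.

```javascript
// Map function from the query above, given a name so it can be called locally.
function mapDepartment(v, d, a) {
  var p = JSON.parse(v.values[0].data);
  var r = [];
  d = escape(p.Department);
  if (d != '') {
    var o = {};
    o[d] = 1;
    r.push(o);
  }
  return r;
}

// Reduce function from the query above.
function reduceCounts(v, d, a) {
  var r = {};
  for (var i in v) {
    for (var w in v[i]) {
      if (w in r) r[w] += v[i][w];
      else r[w] = v[i][w];
    }
  }
  return [r];
}

// Invented sample data: 60,000 records, one in four with an empty department.
var departments = ["Toys", "Garden", "Home & Kitchen", ""];
var inputs = [];
for (var n = 0; n < 60000; n++) {
  var doc = { Department: departments[n % departments.length] };
  inputs.push({ key: String(n), values: [{ data: JSON.stringify(doc) }] });
}

var start = Date.now();

// Map phase: collect every emitted {department: 1} object.
var mapped = [];
inputs.forEach(function (v) {
  Array.prototype.push.apply(mapped, mapDepartment(v));
});

// Reduce phase in batches of 1000, then a re-reduce over the partial results.
var partials = [];
for (var j = 0; j < mapped.length; j += 1000) {
  partials = partials.concat(reduceCounts(mapped.slice(j, j + 1000)));
}
var totals = reduceCounts(partials)[0];

var elapsedMs = Date.now() - start;
console.log(totals);
console.log("local map + reduce took " + elapsedMs + " ms");
```

A local run does not reproduce Riak's JavaScript VM, its data marshalling, or disk reads, so the timing is only a rough lower bound. If it finishes in well under a second, that suggests the 50-60 seconds are being spent somewhere other than in the function logic itself.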




This returns but it takes far too long. I have about 60,000 items in my
bucket and this takes about 50-60 seconds to execute. The results seem
valid. For these types of m/r jobs, what can I do on the server (or client)
to help diagnose the problem? I have basic tools like iostat and top to
give me data but some pointers on using the output of these tools might
