MapReduce performance problem

Jeremiah Peschka jeremiah.peschka at gmail.com
Tue Feb 26 14:33:21 EST 2013


Before you go troubleshooting performance problems, I'd focus on getting
results out of basho_bench and getting a good understanding of your
environment. If you're running 4 guests with 1 vCPU each on the same VM
host with all guests sharing a single pool of disks, no amount of tuning
will solve that problem. Without an understanding of the operating
environment, we can't do much more than point at general best practices and
say "these might help you, not sure, though."

As for your specifics: for the first query, if you're attempting to get
a list of keys, I still recommend using ListAllKeysFromIndex(string
Bucket). This will be pushed out as part of the IRiakClient interface in
the next day or two, and I'm sure the gods of OOP won't kill you for using
an actual RiakClient object instead of an IRiakClient interface between now
and then. Sending those results back directly from riak_kv is going to be
far faster than messing around with a JavaScript MapReduce job.

Always keep in mind that MR jobs are not going to be the most efficient way
to perform any kind of ad hoc querying. They're great for large-scale data
transformations, but if you really want performance, you'll want to write
Erlang MR jobs.
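
To make the cost concrete: the department-count job quoted below boils down
to roughly the following plain JavaScript, which has to parse every object
in the bucket on every run. This is only a sketch for reasoning about the
job outside Riak. fakeObject and countDepartments are hypothetical helpers
made up for illustration, and the input shape (v.values[0].data) is assumed
from the quoted map function.

```javascript
// Map phase from the quoted job: runs once per object in the bucket.
function mapDepartment(v) {
  var p = JSON.parse(v.values[0].data);
  var r = [];
  var d = escape(p.Department);
  if (d != '') {
    var o = {};
    o[d] = 1;
    r.push(o);
  }
  return r;
}

// Reduce phase from the quoted job: merges partial count objects into one.
function reduceCounts(v) {
  var r = {};
  for (var i in v) {
    for (var w in v[i]) {
      if (w in r) r[w] += v[i][w];
      else r[w] = v[i][w];
    }
  }
  return [r];
}

// Hypothetical helper: builds an object shaped like the map input above.
function fakeObject(department) {
  return { values: [{ data: JSON.stringify({ Department: department }) }] };
}

// Hypothetical driver: maps every object, then reduces once.
// Riak may run the reduce several times on partial results; this does not.
function countDepartments(objects) {
  var mapped = [];
  objects.forEach(function (obj) {
    mapped = mapped.concat(mapDepartment(obj));
  });
  return reduceCounts(mapped)[0];
}
```

Calling countDepartments with 60,000 objects means 60,000 JSON.parse calls
before a single count comes back, which is the work the cluster repeats each
time the job is submitted.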

If you need to maintain counts per department, a better approach will be
persisting counters and maintaining those counts via some kind of
caching/pre-aggregation mechanism, most likely outside of Riak because of
eventual consistency guarantees. Alex Siculars will eventually show up and
start chanting "use redis"; you'll be resistant at first, but his arguments
make a lot of sense. Riak does some things very well; maintaining
consistent counters isn't one of them... yet.
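
As a rough illustration of the pre-aggregation idea, here is a minimal
sketch that adjusts a per-department count at write time instead of
recomputing it with a MapReduce job. Everything in it is an assumption for
illustration: DepartmentCounters is a hypothetical in-memory class standing
in for whatever external store you pick, and it is not part of
CorrugatedIron or Riak.

```javascript
// Hypothetical in-memory stand-in for an external counter store.
class DepartmentCounters {
  constructor() {
    this.counts = new Map();
  }

  // Call whenever a product is written. oldDept is the department the
  // product had before the write, or undefined for a new product.
  recordWrite(newDept, oldDept) {
    if (oldDept === newDept) return; // department unchanged, nothing to do
    if (oldDept) this.adjust(oldDept, -1);
    if (newDept) this.adjust(newDept, +1);
  }

  // Call whenever a product is deleted.
  recordDelete(dept) {
    if (dept) this.adjust(dept, -1);
  }

  // Applies a delta and drops the entry once it reaches zero.
  adjust(dept, delta) {
    const next = (this.counts.get(dept) || 0) + delta;
    if (next <= 0) this.counts.delete(dept);
    else this.counts.set(dept, next);
  }

  // Reading a count is a single lookup, not a scan of the bucket.
  count(dept) {
    return this.counts.get(dept) || 0;
  }
}

const counters = new DepartmentCounters();
counters.recordWrite('Toys');             // new product in Toys
counters.recordWrite('Toys');             // another new product in Toys
counters.recordWrite('Hardware', 'Toys'); // one product moves to Hardware
```

After those three writes, counters.count('Toys') and
counters.count('Hardware') are both 1. The trade-off is that every write
path has to remember to update the counter, which is exactly the
bookkeeping the MapReduce job was avoiding.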

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Tue, Feb 26, 2013 at 10:52 AM, Kevin Burton <rkevinburton at charter.net> wrote:

> I have a simple CorrugatedIron client that makes the following request:
>
>                 IRiakClient riakClient = cluster.CreateClient();
>                 RiakBinIndexRangeInput bucketKeyInput = new
>                     RiakBinIndexRangeInput(productBucketName, "$key", "00000000", "99999999");
>                 RiakMapReduceQuery query = new RiakMapReduceQuery()
>                    .Inputs(bucketKeyInput)
>                    .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>                 RiakResult<RiakMapReduceResult> result =
>                     riakClient.MapReduce(query);
>
> So as you can see this is a very basic range m/r query. But the result
> comes back as:
>
> Riak returned an error. Code '0'. Message: timeout
> CommunicationError
>
> Another type of m/r query I have:
>
>                 IRiakClient riakClient = cluster.CreateClient();
>                 var query = new RiakMapReduceQuery()
>                     .Inputs(productBucketName)
>                     .MapJs(m => m.Source(@"function(v,d,a) {" +
>                         "var p = JSON.parse(v.values[0].data);" +
>                         "var r = [];" +
>                         "d = escape(p.Department);" +
>                         "if(d != '') {" +
>                         "var o = {};" +
>                         "o[d] = 1;" +
>                         "r.push(o);" +
>                         "}" +
>                         "return r;" +
>                         "}"))
>                     .ReduceJs(m => m.Source(@"function(v,d,a) {" +
>                         "var r = {};" +
>                         "for(var i in v) {" +
>                         "  for(var w in v[i]) {" +
>                         "    if(w in r) r[w] += v[i][w];" +
>                         "    else r[w] = v[i][w];" +
>                         "  }" +
>                         "}" +
>                         "return [r];" +
>                         "}")
>                         .Keep(true));
>
> This returns but it takes far too long. I have about 60,000 items in my
> bucket and this takes about 50-60 seconds to execute. The results seem
> valid. For these types of m/r jobs what can I do on the server (or client)
> to help diagnose the problem? I have basic tools like iostat and top to
> give me data but some pointers on using the output of these tools might
> help.