MapReduce performance problem

Jeremiah Peschka jeremiah.peschka at gmail.com
Tue Feb 26 16:17:37 EST 2013


Responses inline.

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Tue, Feb 26, 2013 at 12:26 PM, Kevin Burton <rkevinburton at charter.net> wrote:

> Right. I know it is not ideal. I have been able to split the VMs into
> groups, so 2 of the 4 are running on separate hardware. If I ask for
> anything more, I just get the response ‘get real’. That being said, I want
> to get the maximum performance out of the limited resources that I have. I
> have a separate question for the group about getting basho_bench up and
> running (I get a long string of errors). What more do you need to know
> about my environment to “understand” it? I am new, so I am probably asking
> the wrong questions; please tell me what you are missing that might help
> diagnose the problem.
>

For troubleshooting any environment it's good to know relevant hardware
details about CPU speed, core count, amount of RAM, disks, network cards,
etc. A working basho_bench benchmark would help, too, because it will
provide an indicator of how your environment will perform with Riak as
opposed to how your business logic performs, as implemented, with Riak.

For virtualization, it's also important to know how many other guests are
on the host, whether there are any CPU, memory, or network reservations in
place, and which version of virtualization you're running. Virtualization
makes performance tuning more complex, but not impossible.


> I agree. I will use ListAllKeysFromIndex to get a list of keys for now.
> The only reason that I included the m/r code is because of the error. If I
> get another m/r job with similar output, I need to know how to diagnose the
> problem. I was using JavaScript m/r because I kind of understand
> JavaScript. Is there a separate task to do Erlang m/r jobs?
>
Erlang phases can be added to a RiakMapReduceQuery using MapErlang and
ReduceErlang.

> I assume that I will need to know Erlang. Any recommendations on how best
> to learn what I need to know about Erlang to write an m/r job? But before I
> do that, wouldn’t it be prudent to confirm that the source of the problem
> is indeed JavaScript? How would I pinpoint that?
>
I think this has been answered on the list. I'd search the archive at
http://riak.markmail.org

Someone from Basho can probably handle this better than my handwaving:
using JavaScript involves an interpreter and type marshaling between Erlang
and JavaScript, and it won't multi-thread like Erlang will.
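
One rough way to start narrowing it down, offered as a sketch rather than
an official procedure: run the same map and reduce functions outside Riak
and time them. The harness below assumes Node.js, which is not the
JavaScript VM that Riak embeds, so the timing is only a loose lower bound.
The fakeObject helper and the department names are made up for
illustration; only the two function bodies come from the quoted query. If
the functions get through 60,000 objects quickly here, the function logic
itself is unlikely to be where the time goes.

```javascript
// Map and reduce bodies copied from the query quoted later in this thread.
const mapFn = function (v, d, a) {
  var p = JSON.parse(v.values[0].data);
  var r = [];
  d = escape(p.Department);
  if (d != '') {
    var o = {};
    o[d] = 1;
    r.push(o);
  }
  return r;
};

const reduceFn = function (v, d, a) {
  var r = {};
  for (var i in v) {
    for (var w in v[i]) {
      if (w in r) r[w] += v[i][w];
      else r[w] = v[i][w];
    }
  }
  return [r];
};

// Hypothetical helper: builds an object with the values[0].data shape that
// the map function above reads. The contents are synthetic.
function fakeObject(department) {
  return { values: [{ data: JSON.stringify({ Department: department }) }] };
}

// Synthetic data set: 60,000 objects, one in four with an empty department.
const departments = ['Toys', 'Books', 'Home & Garden', ''];
const objects = [];
for (let i = 0; i < 60000; i++) {
  objects.push(fakeObject(departments[i % departments.length]));
}

// Run the map phase over every object, then a single reduce over the results.
const start = Date.now();
const mapped = [];
for (const obj of objects) {
  mapped.push(...mapFn(obj, null, null));
}
const counts = reduceFn(mapped, null, null)[0];
const elapsedMs = Date.now() - start;

console.log(counts, `computed in ${elapsedMs} ms`);
```

This says nothing about marshaling or disk reads inside the cluster; it
only tells you whether the functions themselves are expensive.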

> These two m/r jobs are basically examples of the m/r usage that would be
> typical for our application. Just for sheer maintenance we wouldn’t want
> to go down the path of maintaining a counter for all the fields that we
> have. There could be departments, categories, celebrations, ... Basically
> a lot of them. For all intents and purposes it is an ad hoc query. If that
> is one of the limitations, then we will have to note it and see if coping
> with this limitation is too onerous.
>

MR queries are going to scan all of your data on disk. If you have 5 nodes
that can read at ~100 MB/s and you have 100GB of data, how long will it
take for your ad hoc query to run?
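
To put rough numbers on that, here is a best-case sketch. scanSeconds is a
made-up helper, and it assumes the data is spread evenly across the nodes,
every node reads at its full sequential rate for the whole scan, and
1 GB = 1000 MB. It ignores the JavaScript overhead entirely, so treat the
result as a floor, not a prediction.

```javascript
// Best-case time to read the whole data set once across the cluster.
// Assumptions (not measurements): even data distribution, every node
// sustains its full read rate, and 1 GB = 1000 MB.
function scanSeconds(totalGb, nodes, mbPerSecPerNode) {
  const totalMb = totalGb * 1000;
  const clusterMbPerSec = nodes * mbPerSecPerNode;
  return totalMb / clusterMbPerSec;
}

const seconds = scanSeconds(100, 5, 100);
console.log(`${seconds} s (~${(seconds / 60).toFixed(1)} min)`);
// prints "200 s (~3.3 min)"
```

Over three minutes before any map or reduce work is counted is a long
wait for something that is supposed to be ad hoc.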

Riak Search/Lucene/Yokozuna will be better options for ad hoc workloads
than MapReducing across the cluster.

>
> From: riak-users [mailto:riak-users-bounces at lists.basho.com] On Behalf
> Of Jeremiah Peschka
> Sent: Tuesday, February 26, 2013 1:33 PM
> To: riak-users
> Subject: Re: MapReduce performance problem
>
> Before you go troubleshooting performance problems, I'd focus on getting
> results out of basho_bench and getting a good understanding of your
> environment. If you're running 4 guests with 1 vCPU each on the same VM
> host with all guests sharing a single pool of disks, no amount of tuning
> will solve that problem. Without an understanding of the operating
> environment, we can't do much more than point at general best practices and
> say "these might help you, not sure, though."
>
> As far as your specifics - for the first query, if you're attempting to
> get a list of keys, I still recommend using ListAllKeysFromIndex(string
> Bucket). This will be pushed out as part of the IRiakClient interface in
> the next day or two, and I'm sure the gods of OOP won't kill you for using
> an actual RiakClient object instead of an IRiakClient interface between now
> and then. Sending those results back directly from riak_kv is going to be
> far faster than messing around with a JavaScript MapReduce job.
>
> Always keep in mind that MR jobs are not going to be the most efficient
> way to perform any kind of ad hoc querying - they're great for large scale
> data transformations but if you really want performance, you'll want to
> write Erlang MR jobs.
>
> If you need to maintain counts per department, a better approach will be
> persisting counters and maintaining those counts via some kind of
> caching/pre-aggregation mechanism, most likely outside of Riak because of
> eventual consistency guarantees. Alex Siculars will eventually show up and
> start chanting "use redis"; you'll be resistant at first, but his arguments
> make a lot of sense. Riak does some things very well; maintaining
> consistent counters isn't one of them... yet.
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
> On Tue, Feb 26, 2013 at 10:52 AM, Kevin Burton <rkevinburton at charter.net>
> wrote:
>
> I have a simple CorrugatedIron client that makes the following request:
>
>     IRiakClient riakClient = cluster.CreateClient();
>     RiakBinIndexRangeInput bucketKeyInput = new RiakBinIndexRangeInput(
>         productBucketName, "$key", "00000000", "99999999");
>     RiakMapReduceQuery query = new RiakMapReduceQuery()
>         .Inputs(bucketKeyInput)
>         .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>     RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);
>
> So as you can see, this is a very basic range m/r query. But the result
> comes back as:
>
>     Riak returned an error. Code '0'. Message: timeout
>     CommunicationError
>
> Another type of m/r query I have:
>
>     IRiakClient riakClient = cluster.CreateClient();
>     var query = new RiakMapReduceQuery()
>         .Inputs(productBucketName)
>         .MapJs(m => m.Source(@"function(v,d,a) {" +
>             "var p = JSON.parse(v.values[0].data);" +
>             "var r = [];" +
>             "d = escape(p.Department);" +
>             "if(d != '') {" +
>             "var o = {};" +
>             "o[d] = 1;" +
>             "r.push(o);" +
>             "}" +
>             "return r;" +
>             "}"))
>         .ReduceJs(m => m.Source(@"function(v,d,a) {" +
>             "var r = {};" +
>             "for(var i in v) {" +
>             "  for(var w in v[i]) {" +
>             "    if(w in r) r[w] += v[i][w];" +
>             "    else r[w] = v[i][w];" +
>             "  }" +
>             "}" +
>             "return [r];" +
>             "}")
>             .Keep(true));
>
> This returns, but it takes far too long. I have about 60,000 items in my
> bucket and this takes about 50-60 seconds to execute. The results seem
> valid. For these types of m/r jobs, what can I do on the server (or client)
> to help diagnose the problem? I have basic tools like iostat and top to
> give me data, but some pointers on using the output of these tools might
> help.