MapReduce performance problem

Jeremiah Peschka jeremiah.peschka at gmail.com
Thu Feb 28 20:39:17 EST 2013


I didn't want you to think that you've been forgotten, but I've been
swamped getting ready to head out of the country for 2 weeks on a company
trip. You're in good hands with the list, though.

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Tue, Feb 26, 2013 at 4:36 PM, Kevin Burton <rkevinburton at charter.net> wrote:

> I got the same error on an AWS instance (m1.xlarge) when using the
> MapReduce version of list all keys.
>
> Query failed with Riak returned an error. Code '0'. Message:
> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"buyseasons-products\">>,<<\"00113023\">>,[{r_content,{dict,4,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,49,54,99,97,90,90,72,106,56,50,100,77,85,75,66,114,76,50,109,88,89,109]],[[<<\"index\">>,{<<\"active_bin\">>,<<\"InactiveDiscontinued\">>},{<<\"definition_bin\">>,<<\"Costume\">>},{<<\"department_bin\">>,<<\"Adult
> Costumes\">>},{<<\"...\">>,...}]],...}}},...}],...},...}","type":"forward_preflist","stack":"[]"}
> – CommunicationError
>
> The ‘Department’ MapReduce job on an AWS instance also returned three
> ‘failed’ phases.
>
> Kevin
>
> From: riak-users [mailto:riak-users-bounces at lists.basho.com] On Behalf
> Of Jeremiah Peschka
> Sent: Tuesday, February 26, 2013 3:18 PM
> To: riak-users
> Subject: Re: MapReduce performance problem
>
> Responses inline.
>
> On Tue, Feb 26, 2013 at 12:26 PM, Kevin Burton <rkevinburton at charter.net> wrote:
>
> Right. I know it is not ideal. I have been able to split the VMs into
> groups, so 2 of the 4 are running on separate hardware. Anything more and I
> just get the response ‘get real’. That being said, I want to get the maximum
> performance out of the limited resources that I have. I have a separate
> question for the group about getting basho_bench up and running (I get a
> long string of errors). What more do you need to know about my environment
> to “understand” it? I am new, so I am probably asking the wrong questions;
> please tell me what you are missing that might help diagnose the problem.
>
> For troubleshooting any environment it's good to know relevant hardware
> details about CPU speed, core count, amount of RAM, disks, network cards,
> etc. A working basho_bench benchmark would help, too, because it will
> provide an indicator of how your environment will perform with Riak as
> opposed to how your business logic performs, as implemented, with Riak.
>
> For virtualization, it's also important to know how many other guests are
> on the host, whether there are any CPU, memory, or network reservations in
> place, and which version of virtualization you're running. Virtualization
> makes performance tuning more complex, but not impossible.
>
> I agree. I will use ListAllKeysFromIndex to get a list of keys for now.
> The only reason that I included the m/r code is because of the error. If I
> get another m/r job with similar output I need to know how to diagnose the
> problem. I was using JavaScript m/r because I kind of understand
> JavaScript. Is there a separate task to do Erlang m/r jobs?
>
> Erlang phases can be added to a RiakMapReduceQuery using MapErlang and
> ReduceErlang.
>
> I assume that I will need to know Erlang. Any recommendations on how best
> to learn what I need to know about Erlang to write an m/r job? But before I
> do that, wouldn’t it be prudent to know that the source of the problem is
> indeed JavaScript? How would I pinpoint that?
>
> I think this has been answered on the list. I'd search
> http://riak.markmail.org
>
> Someone from Basho can probably handle this better than my handwaving that
> using JavaScript involves an interpreter, type marshaling between Erlang
> and JavaScript, and won't multi-thread like Erlang will.
>
> These two m/r jobs are basically an example of using m/r that would be
> typical for our application. Just for sheer maintenance we wouldn’t want
> to go down the path of maintaining a counter for all the fields that we
> have. There could be departments, categories, celebrations, ... Basically
> a lot of them. For all intents and purposes it is an ad hoc query. If that
> is one of the limitations then we will have to note it and see if coping
> with this limitation is too onerous.
>
> MR queries are going to scan all of your data on disk. If you have 5 nodes
> that can read at ~100 MB/s and you have 100 GB of data, how long will it
> take for your ad hoc query to run?
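
To put a rough number on that question, here is a back-of-the-envelope sketch. It assumes the data is spread evenly across the nodes, every node reads in parallel at its full sequential rate, and 1 GB = 1000 MB; the function name is made up for illustration. A real job will be slower, since this ignores JavaScript evaluation and network transfer entirely.

```javascript
// Best-case time to read every byte once, under the assumptions above.
// estimateScanSeconds is a hypothetical helper, not part of Riak or any client.
function estimateScanSeconds(totalGb, nodeCount, mbPerSecondPerNode) {
  const totalMb = totalGb * 1000;                           // 1 GB = 1000 MB (assumed)
  const clusterMbPerSecond = nodeCount * mbPerSecondPerNode; // all nodes read in parallel
  return totalMb / clusterMbPerSecond;
}

// The figures from the question: 100 GB of data, 5 nodes, ~100 MB/s each.
const seconds = estimateScanSeconds(100, 5, 100);
console.log(`${seconds} seconds (~${(seconds / 60).toFixed(1)} minutes)`);
```

With those figures the floor is a few minutes per query, before any map or reduce work is done.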
>
> Riak Search/Lucene/Yokozuna will be better options for ad hoc workloads
> than MapReducing across the cluster.
>
> From: riak-users [mailto:riak-users-bounces at lists.basho.com] On Behalf
> Of Jeremiah Peschka
> Sent: Tuesday, February 26, 2013 1:33 PM
> To: riak-users
> Subject: Re: MapReduce performance problem
>
> Before you go troubleshooting performance problems, I'd focus on getting
> results out of basho_bench and getting a good understanding of your
> environment. If you're running 4 guests with 1 vCPU each on the same VM
> host with all guests sharing a single pool of disks, no amount of tuning
> will solve that problem. Without an understanding of the operating
> environment, we can't do much more than point at general best practices and
> say "these might help you, not sure, though."
>
> As far as your specifics - for the first query, if you're attempting to
> get a list of keys, I still recommend using ListAllKeysFromIndex(string
> Bucket). This will be pushed out as part of the IRiakClient interface in
> the next day or two, and I'm sure the gods of OOP won't kill you for using
> an actual RiakClient object instead of an IRiakClient interface between now
> and then. Sending those results back directly from riak_kv is going to be
> far faster than messing around with a JavaScript MapReduce job.
>
> Always keep in mind that MR jobs are not going to be the most efficient
> way to perform any kind of ad hoc querying - they're great for large scale
> data transformations but if you really want performance, you'll want to
> write Erlang MR jobs.
>
> If you need to maintain counts per department, a better approach will be
> persisting counters and maintaining those counts via some kind of
> caching/pre-aggregation mechanism, most likely outside of Riak because of
> eventual consistency guarantees. Alex Siculars will eventually show up and
> start chanting "use redis"; you'll be resistant at first, but his arguments
> make a lot of sense. Riak does some things very well; maintaining
> consistent counters isn't one of them... yet.
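
As a rough illustration of the pre-aggregation idea, the sketch below bumps a per-department counter whenever a product is written or deleted, so reading the counts never touches the bucket. Everything here is hypothetical: the class and method names are invented, and the in-memory Map is only a stand-in for an external store such as Redis.

```javascript
// Hypothetical pre-aggregation sketch; the Map stands in for an external store.
class DepartmentCounters {
  constructor() {
    this.counts = new Map();
  }

  // Call when a product in `department` is created.
  increment(department) {
    this.counts.set(department, (this.counts.get(department) || 0) + 1);
  }

  // Call when a product in `department` is deleted; drop the key at zero.
  decrement(department) {
    const current = this.counts.get(department) || 0;
    if (current <= 1) this.counts.delete(department);
    else this.counts.set(department, current - 1);
  }

  // Read all counts without scanning any product data.
  snapshot() {
    return Object.fromEntries(this.counts);
  }
}

const counters = new DepartmentCounters();
counters.increment("Adult Costumes");
counters.increment("Adult Costumes");
counters.increment("Accessories"); // made-up department name
counters.decrement("Adult Costumes");
console.log(counters.snapshot());
```

The trade-off is that every write path has to remember to update the counter, which is the maintenance cost raised earlier in the thread.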
>
> On Tue, Feb 26, 2013 at 10:52 AM, Kevin Burton <rkevinburton at charter.net> wrote:
>
> I have a simple CorrugatedIron client that makes the following request:
>
>     IRiakClient riakClient = cluster.CreateClient();
>     RiakBinIndexRangeInput bucketKeyInput =
>         new RiakBinIndexRangeInput(productBucketName, "$key", "00000000", "99999999");
>     RiakMapReduceQuery query = new RiakMapReduceQuery()
>         .Inputs(bucketKeyInput)
>         .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>     RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);
>
> So as you can see, this is a very basic range m/r query. But the result
> comes back as:
>
> Riak returned an error. Code '0'. Message: timeout
> CommunicationError
>
> Another type of m/r query I have:
>
>     IRiakClient riakClient = cluster.CreateClient();
>     var query = new RiakMapReduceQuery()
>         .Inputs(productBucketName)
>         // Map phase: emit { <escaped department>: 1 } for each object.
>         .MapJs(m => m.Source(@"function(v,d,a) {" +
>             "var p = JSON.parse(v.values[0].data);" +
>             "var r = [];" +
>             "d = escape(p.Department);" +
>             "if(d != '') {" +
>             "var o = {};" +
>             "o[d] = 1;" +
>             "r.push(o);" +
>             "}" +
>             "return r;" +
>             "}"))
>         // Reduce phase: sum the per-department counts into a single object.
>         .ReduceJs(m => m.Source(@"function(v,d,a) {" +
>             "var r = {};" +
>             "for(var i in v) {" +
>             "  for(var w in v[i]) {" +
>             "    if(w in r) r[w] += v[i][w];" +
>             "    else r[w] = v[i][w];" +
>             "  }" +
>             "}" +
>             "return [r];" +
>             "}")
>             .Keep(true));
>
> This returns, but it takes far too long. I have about 60,000 items in my
> bucket and this takes about 50-60 seconds to execute. The results seem
> valid. For these types of m/r jobs, what can I do on the server (or client)
> to help diagnose the problem? I have basic tools like iostat and top to
> give me data, but some pointers on using the output of these tools might
> help.
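
One cheap first check is to run the same map and reduce functions outside Riak on a few sample objects. This is only a sketch: the sample documents and the fakeRiakObject helper are made up, and the object shape is inferred from the v.values[0].data access in the map function above. It confirms the logic is correct and safe to re-reduce, but says nothing about time spent in the cluster.

```javascript
// Same logic as the map phase in the thread: emit { <department>: 1 } per object.
function mapDepartment(v, d, a) {
  var p = JSON.parse(v.values[0].data);
  var r = [];
  d = escape(p.Department);
  if (d != '') {
    var o = {};
    o[d] = 1;
    r.push(o);
  }
  return r;
}

// Same logic as the reduce phase: sum counts per department key.
function reduceCounts(v, d, a) {
  var r = {};
  for (var i in v) {
    for (var w in v[i]) {
      if (w in r) r[w] += v[i][w];
      else r[w] = v[i][w];
    }
  }
  return [r];
}

// Hypothetical helper: wrap a document the way the map function expects it.
function fakeRiakObject(doc) {
  return { values: [{ data: JSON.stringify(doc) }] };
}

// Made-up sample data; only the Department field matters here.
const inputs = [
  { Department: "Adult Costumes" },
  { Department: "Adult Costumes" },
  { Department: "Kids Costumes" },
].map(fakeRiakObject);

const mapped = inputs.flatMap((obj) => mapDepartment(obj));

// Reduce may be applied to partial results and then to its own output,
// so feed a partial result back in to check that the totals still add up.
const partial = reduceCounts(mapped.slice(0, 2));
const totals = reduceCounts(partial.concat(mapped.slice(2)));
console.log(totals);
```

If this runs instantly over a realistic sample, the 50-60 seconds is more likely spent reading objects and moving data in and out of the JavaScript VMs than in the functions themselves.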
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com