Using Riak to perform aggregate queries

Chris Corbyn chris at w3style.co.uk
Sun Apr 14 22:53:33 EDT 2013


So will Riak actually be good at performing this sort of aggregate query over 500K records? Or am I trying to stretch what is effectively a key-value store into doing something it's not really good at? ;) It feels like I'd need a lot of hardware to get Riak to do this in the time MySQL does it.


On 15 Apr 2013, at 12:41, Evan Vigil-McClanahan <emcclanahan at basho.com> wrote:

> The most common issue with large MapReduce jobs is one of the nodes
> starting to swap (typically the node the request was made on).
> 
> Streaming results and pre-reduction[0] (where applicable) can often
> lower the memory overhead of running MapReduce jobs on large numbers
> of objects.
> 
> That would be the first thing I would check.  Also potentially
> applicable: http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/
> 
> 
> 0: http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/,
> pre-reduce is down at the bottom.
> 
> 
> 
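[Editor's note: Evan's pre-reduce suggestion can be made concrete with a small, dependency-free simulation. It's in Python (the original post says any language is fine for an explanation); none of the names below are Riak API, they just sketch the shape of the idea: map emits ((month, product), price) pairs, and a re-runnable reduce is applied once per partition before the coordinator merges the partials.]

```python
from collections import defaultdict
from datetime import datetime, timezone

def map_phase(obj):
    # One ((month, product), price) pair per sale object, roughly what a
    # Riak map phase would emit per object. Illustrative only, not Riak API.
    month = datetime.fromtimestamp(obj["created_at"], tz=timezone.utc).strftime("%Y-%m")
    return [((month, obj["product_key"]), float(obj["price"]))]

def reduce_phase(pairs):
    # Sums prices per (month, product) key. A reduce function must be safe
    # to re-run on its own output; that property is exactly what makes
    # pre-reduce (reducing locally on each vnode) possible.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

# Two simulated vnode partitions holding sales objects (made-up data).
partitions = [
    [{"product_key": "cyber-pet-toy", "price": "10.00", "created_at": 1365931758},
     {"product_key": "cyber-pet-toy", "price": "10.00", "created_at": 1365931760}],
    [{"product_key": "widget", "price": "5.00", "created_at": 1365931765}],
]

# Without pre-reduce: every map output is shipped to the coordinating
# node, which holds all of them in memory before reducing.
all_pairs = [p for part in partitions for obj in part for p in map_phase(obj)]

# With pre-reduce: each partition reduces locally first, so the
# coordinator only merges one small partial result per partition.
partials = [reduce_phase([p for obj in part for p in map_phase(obj)])
            for part in partitions]
merged = reduce_phase([p for partial in partials for p in partial])

print(len(all_pairs))                  # pairs shipped without pre-reduce
print(sum(len(p) for p in partials))   # pairs shipped with pre-reduce
print(sorted(merged))
```

Streaming attacks the same memory problem from the other side: the coordinator hands results back to the client as they arrive instead of buffering the whole result set.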
> On Sun, Apr 14, 2013 at 6:47 PM, Chris Corbyn <chris at w3style.co.uk> wrote:
>> All,
>> 
>> Just copying this from my stackoverflow post, as the riak tag doesn't get
>> much love over there :) It's fine to just outright say that Riak is never
>> going to work efficiently in this case because I'm inherently depending on
>> MapReduce.
>> 
>> Everywhere I read, people say you shouldn't use Riak's MapReduce over an
>> entire bucket and that there are other ways of achieving your goals. I'm not
>> sure how, though. I'm also not clear on why using an entire bucket is slow,
>> if you only have one bucket in the entire system, so either way, you need to
>> go over all the entries. Passing in a list of key-bucket pairs has the same
>> effect. Maybe the rule should be "don't use MapReduce with more than a
>> handful of keys". Which makes me wonder (apart from link traversal) what use
>> it really has in the real world.
>> 
>> I have a list of 500K+ documents that represent sales data. I need to view
>> this data in different ways: for example, how much revenue was made in each
>> month the business was operating? How much revenue did each product raise?
>> How many of each product were sold in a given month? I always thought
>> MapReduce was supposed to be good at solving these types of aggregate
>> problems. That sounds like a myth now though, unless we're just looking at
>> Hadoop.
>> 
>> My documents are all in a bucket named 'sales'. Each one is a record with
>> the following fields (stored as native Erlang records, not JSON, though
>> I'm showing them here in JSON-like notation for readability):
>> 
>>    {"id": 1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}
>> 
>> Let's take the example where I need to report the total revenue for each
>> product in each month over the past 4 years (that's basically the entire
>> bucket, but that's just the requirement), how does one use Riak's MapReduce
>> to do that efficiently? Even just running an identity map operation over
>> the data, I get a timeout after ~30 seconds, whereas MySQL handles the
>> same query in milliseconds.
>> 
>> I'm doing this in Erlang (using the protocol buffers client), but any
>> language is fine for an explanation.
>> 
>> The equivalent SQL (MySQL) would be:
>> 
>>  SELECT SUM(price)                         AS revenue,
>>         FROM_UNIXTIME(created_at, '%Y-%m') AS month,
>>         product_key
>>    FROM sales
>> GROUP BY month, product_key
>> ORDER BY month ASC;
>> 
>> Even with secondary indexes, there's still a MapReduce involved in this
>> query, once you have the list of keys to process.
>> 
>> (Ordering not important right now).
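[Editor's note: the shape of the map/reduce equivalent of that SQL can be sketched without Riak at all. This is Python (the post says any language is fine); the in-memory `sales` list stands in for the bucket and its contents are made up for illustration.]

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the 'sales' bucket.
sales = [
    {"id": 1, "product_key": "cyber-pet-toy", "price": "10.00",
     "tax": "1.00", "created_at": 1365931758},
    {"id": 2, "product_key": "cyber-pet-toy", "price": "10.00",
     "tax": "1.00", "created_at": 1363000000},
    {"id": 3, "product_key": "gadget", "price": "7.50",
     "tax": "0.75", "created_at": 1365931999},
]

# Map step: each sale becomes a ((month, product_key), price) pair.
# This part is per-object and embarrassingly parallel.
pairs = []
for sale in sales:
    month = datetime.fromtimestamp(sale["created_at"], tz=timezone.utc).strftime("%Y-%m")
    pairs.append(((month, sale["product_key"]), float(sale["price"])))

# Reduce step: sum per group -- SUM(price) ... GROUP BY month, product_key.
revenue = defaultdict(float)
for key, price in pairs:
    revenue[key] += price

# ORDER BY month ASC (tuples sort month-first).
rows = sorted((month, product, total)
              for (month, product), total in revenue.items())
for row in rows:
    print(row)
```

The map step is cheap and distributes well; the expensive part in Riak is reading and shipping every object, which is the cost a covering index spares MySQL.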
>> 
>> Cheers,
>> 
>> Chris
>> 
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 




