Using Riak to perform aggregate queries

Chris Corbyn chris at
Sun Apr 14 22:53:33 EDT 2013

So then Riak will be good to perform this sort of aggregate query over 500K records? Or am I trying to flex what is effectively a key-value store to do something it's not really good at doing? ;) It feels like I probably need a lot of hardware to get Riak to do this in the same amount of time MySQL does it.

Il giorno 15/apr/2013, alle ore 12:41, Evan Vigil-McClanahan <emcclanahan at> ha scritto:

> The most common issue with large MapReduce jobs is one of the nodes
> starting to swap (typically the node the request was made on).
> Streaming results and pre-reduction[0] (where applicable) can often
> lower the memory overhead of running MapReduce jobs on large numbers
> of objects.
> That would be the first thing I would check.  Also potentially
> applicable:
> 0:,
> pre-reduce is down at the bottom.
> On Sun, Apr 14, 2013 at 6:47 PM, Chris Corbyn <chris at> wrote:
>> All,
>> Just copying this from my stackoverflow post, as the riak tag doesn't get
>> much love over there :) It's fine to just outright say that Riak is never
>> going to work efficiently in this case because I'm inherently depending on
>> MapReduce.
>> Everywhere I read, people say you shouldn't use Riak's MapReduce over an
>> entire bucket and that there are other ways of achieving your goals. I'm not
>> sure how, though. I'm also not clear on why using an entire bucket is slow,
>> if you only have one bucket in the entire system, so either way, you need to
>> go over all the entries. Passing in a list of key-bucket pairs has the same
>> effect. Maybe the rule should be "don't use MapReduce with more than a
>> handful of keys". Which makes me wonder (apart from link traversal) what use
>> it really has in the real world.
>> I have a list of 500K+ documents that represent sales data. I need to view
>> this data in different ways: for example, how much revenue was made in each
>> month the business was operating? How much revenue did each product raise?
>> How many of each product were sold in a given month? I always thought
>> MapReduce was supposed to be good at solving these types of aggregate
>> problems. That sounds like a myth now though, unless we're just looking at
>> Hadoop.
>> My documents are all in a bucket named 'sales' and they are records with the
>> following fields (as native Erlang records, not JSON):
>>    {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax":
>> "1.00", "created_at": 1365931758}.
>> Let's take the example where I need to report the total revenue for each
>> product in each month over the past 4 years (that's basically the entire
>> bucket, but that's just the requirement), how does one use Riak's MapReduce
>> to do that efficiently? Even just trying to use an identity map operation on
>> the data I get a timeout after ~30 seconds, which MySQL handles in
>> milliseconds.
>> I'm doing this in Erlang (using the protocol buffers client), but any
>> language is fine for an explanation.
>> The equivalent SQL (MySQL) would be:
>>  SELECT SUM(price)                         AS revenue,
>>         FROM_UNIXTIME(created_at, '%Y-%m') AS month,
>>         product_key
>>    FROM sales
>> GROUP BY month, product_key
>> ORDER BY month ASC;
>> Even with secondary indexes, there's still a MapReduce involved in this
>> query, once you have the list of keys to process.
>> (Ordering not important right now).
>> Cheers,
>> Chris
>> _______________________________________________
>> riak-users mailing list
>> riak-users at

More information about the riak-users mailing list