Using Riak to perform aggregate queries

Chris Corbyn chris at w3style.co.uk
Sun Apr 14 19:47:52 EDT 2013


All,

Just copying this from my stackoverflow post, as the riak tag doesn't get much love over there :) It's fine to just outright say that Riak is never going to work efficiently in this case because I'm inherently depending on MapReduce.

Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries. Passing in a list of key-bucket pairs has the same effect. Maybe the rule should be "don't use MapReduce with more than a handful of keys". Which makes me wonder (apart from link traversal) what use it really has in the real world.

I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems. That sounds like a myth now though, unless we're just looking at Hadoop.

My documents are all in a bucket named 'sales' and they are records with the following fields (as native Erlang records, not JSON):

    {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.

Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket, but that's just the requirement), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.

I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.

The equivalent SQL (MySQL) would be:

  SELECT SUM(price)                         AS revenue,
         FROM_UNIXTIME(created_at, '%Y-%m') AS month,
         product_key
    FROM sales
GROUP BY month, product_key
ORDER BY month ASC;

Even with secondary indexes, there's still a MapReduce involved in this query, once you have the list of keys to process.

(Ordering not important right now).

Cheers,

Chris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130415/53d03053/attachment.html>


More information about the riak-users mailing list