Using Riak to perform aggregate queries

Evan Vigil-McClanahan emcclanahan at basho.com
Sun Apr 14 22:41:31 EDT 2013


The most common issue with large MapReduce jobs is one of the nodes
starting to swap (typically the node the request was made on).

Streaming results and pre-reduction[0] (where applicable) can often
lower the memory overhead of running MapReduce jobs on large numbers
of objects.

That would be the first thing I would check.  Also potentially
applicable: http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/


0: http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/,
pre-reduce is down at the bottom.



On Sun, Apr 14, 2013 at 6:47 PM, Chris Corbyn <chris at w3style.co.uk> wrote:
> All,
>
> Just copying this from my stackoverflow post, as the riak tag doesn't get
> much love over there :) It's fine to just outright say that Riak is never
> going to work efficiently in this case because I'm inherently depending on
> MapReduce.
>
> Everywhere I read, people say you shouldn't use Riak's MapReduce over an
> entire bucket and that there are other ways of achieving your goals. I'm not
> sure how, though. I'm also not clear on why using an entire bucket is slow,
> if you only have one bucket in the entire system, so either way, you need to
> go over all the entries. Passing in a list of key-bucket pairs has the same
> effect. Maybe the rule should be "don't use MapReduce with more than a
> handful of keys". Which makes me wonder (apart from link traversal) what use
> it really has in the real world.
>
> I have a list of 500K+ documents that represent sales data. I need to view
> this data in different ways: for example, how much revenue was made in each
> month the business was operating? How much revenue did each product raise?
> How many of each product were sold in a given month? I always thought
> MapReduce was supposed to be good at solving these types of aggregate
> problems. That sounds like a myth now though, unless we're just looking at
> Hadoop.
>
> My documents are all in a bucket named 'sales' and they are records with the
> following fields (as native Erlang records, not JSON):
>
>     {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax":
> "1.00", "created_at": 1365931758}.
>
> Let's take the example where I need to report the total revenue for each
> product in each month over the past 4 years (that's basically the entire
> bucket, but that's just the requirement), how does one use Riak's MapReduce
> to do that efficiently? Even just trying to use an identity map operation on
> the data I get a timeout after ~30 seconds, which MySQL handles in
> milliseconds.
>
> I'm doing this in Erlang (using the protocol buffers client), but any
> language is fine for an explanation.
>
> The equivalent SQL (MySQL) would be:
>
>   SELECT SUM(price)                         AS revenue,
>          FROM_UNIXTIME(created_at, '%Y-%m') AS month,
>          product_key
>     FROM sales
> GROUP BY month, product_key
> ORDER BY month ASC;
>
> Even with secondary indexes, there's still a MapReduce involved in this
> query, once you have the list of keys to process.
>
> (Ordering not important right now).
>
> Cheers,
>
> Chris
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>




More information about the riak-users mailing list