Using Riak to perform aggregate queries

Jeremiah Peschka jeremiah.peschka at gmail.com
Sun Apr 14 22:52:43 EDT 2013


Hi Chris,

Responses are inline

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Sun, Apr 14, 2013 at 4:47 PM, Chris Corbyn <chris at w3style.co.uk> wrote:

> All,
>
> Just copying this from my stackoverflow post, as the riak tag doesn't get
> much love over there :) It's fine to just outright say that Riak is never
> going to work efficiently in this case because I'm inherently depending on
> MapReduce.
>

MapReduce works well when:

a) you know the set of objects you're going to MapReduce over
b) you need to return the entire object
c) when you plan to manipulate a lot of data inside Riak - read all sales
records and write them out as shipping invoices


>
> Everywhere I read, people say you shouldn't use Riak's MapReduce over an
> entire bucket and that there are other ways of achieving your goals. I'm
> not sure how, though. I'm also not clear on why using an entire bucket is
> slow, if you only have one bucket in the entire system, so either way, you
> need to go over all the entries. Passing in a list of key-bucket pairs has
> the same effect. Maybe the rule should be "don't use MapReduce with more
> than a handful of keys". Which makes me wonder (apart from link traversal)
> what use it really has in the real world.
>

>From my limited understanding of things, MR across an entire bucket is slow
because you need to scan the entire key space. Much like a SELECT with no
WHERE clause, a MapReduce without a key list is going to sweep across the
entire bucket pulling all data off of disk in the process. Someone better
versed in Riak internals would have to chime in on why this is so slow.

I do know, however, that having on master bucket o' things may not be the
best idea. If you're storing sales data, a better approach may be to
segment buckets out by primary querying criteria.


>
> I have a list of 500K+ documents that represent sales data. I need to view
> this data in different ways: for example, how much revenue was made in each
> month the business was operating? How much revenue did each product raise?
> How many of each product were sold in a given month? I always thought
> MapReduce was supposed to be good at solving these types of aggregate
> problems. That sounds like a myth now though, unless we're just looking at
> Hadoop.
>

MapReduce can aggregate well. RDBMSes can aggregate well, too. But there
are practical limits to each paradigm that you have to be aware of. From
watching the list and working with people in the real world there are some
paradigms that you want to use in Riak that don't immediately seem obvious
coming from the RDBMS world.

- Know your queries. Examine your application and make a list of common
queries that you're going to be running. You'll want to be familiar with
this list because you're going to try to write your data in a way that
answers these questions.
- Know your query pattern. Understand how these queries are working with
the data itself. If many queries retrieve data aggregated for a day, week,
month, store, sales region, or other logical division then you'll want to
make sure that you can answer them quickly. (See Choosing the Right Tool
[3])
- Pre-aggregate your data. If you know you're going to have to perform
averages, you may want to look interesting ways to write your data (see
Rolling with Eventual Consistency [1] and Little-Known Awesome Algorithms:
Fenwick Trees – Rapidly Find Cumulative Frequency Sums [2]).

MR can be slow since all records need to be streamed through the
originating node. Using pre-reduce[4] can ease the load


[1]: http://basho.com/rolling-with-eventual-consistency-2/
[2]:
http://www.swageroo.com/wordpress/little-known-awesome-algorithms-fenwick-range-trees-rapidly-find-cumulative-frequency-sums/
[3]: http://basho.com/choosing-the-right-tool/
[4]:
http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/


>
> My documents are all in a bucket named 'sales' and they are records with
> the following fields (as native Erlang records, not JSON):
>
>     {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax":
> "1.00", "created_at": 1365931758}.
>
> Let's take the example where I need to report the total revenue for each
> product in each month over the past 4 years (that's basically the entire
> bucket, but that's just the requirement), how does one use Riak's MapReduce
> to do that efficiently? Even just trying to use an identity map operation
> on the data I get a timeout after ~30 seconds, which MySQL handles in
> milliseconds.
>
> I'm doing this in Erlang (using the protocol buffers client), but any
> language is fine for an explanation.
>
> The equivalent SQL (MySQL) would be:
>
>   SELECT SUM(price)                         AS revenue,
>          FROM_UNIXTIME(created_at, '%Y-%m') AS month,
>          product_key
>     FROM sales
> GROUP BY month, product_key
> ORDER BY month ASC;
>
> Even with secondary indexes, there's still a MapReduce involved in this
> query, once you have the list of keys to process.
>
> (Ordering not important right now).
>
> Cheers,
>
> Chris
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130414/fded481d/attachment.html>


More information about the riak-users mailing list