Using Riak to perform aggregate queries
chris at w3style.co.uk
Sun Apr 14 23:09:20 EDT 2013
> MapReduce works well when:
> a) you know the set of objects you're going to MapReduce over
> b) you need to return the entire object
> c) when you plan to manipulate a lot of data inside Riak - read all sales records and write them out as shipping invoices
Right, thanks. I do know the set of objects… that set is the full bucket, because our finance department want to be able to request reports on the sales data over many years. I actually tried getting all the keys in a separate place, then requesting my data using a list of bucket-key pairs. This caused my app to run out of memory on the client side.
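For reference, here is a minimal sketch of the two ways a job's inputs can be given, assuming the JSON job format accepted by Riak's HTTP /mapred endpoint. The bucket name `sales`, the `amount` field, the keys, and the helper `build_job` are all made up for illustration, and nothing here is sent to a server.

```python
import json

# Hypothetical map phase: emit the "amount" field of each stored JSON object.
MAP_SOURCE = "function(v) { var d = Riak.mapValuesJson(v)[0]; return [d.amount]; }"


def build_job(inputs):
    """Build a MapReduce job description that sums one field (sketch only)."""
    return {
        "inputs": inputs,
        "query": [
            {"map": {"language": "javascript", "source": MAP_SOURCE}},
            {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}},
        ],
    }


# Whole bucket: inputs is just the bucket name, so every key has to be listed.
full_bucket_job = build_job("sales")

# Explicit key list: inputs is a list of [bucket, key] pairs.
keyed_job = build_job([["sales", "order-1"], ["sales", "order-2"]])

print(json.dumps(keyed_job, indent=2))
```

Either way the map phase runs once per object, so with a single bucket the two forms end up touching the same data.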
> Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries. Passing in a list of key-bucket pairs has the same effect. Maybe the rule should be "don't use MapReduce with more than a handful of keys". Which makes me wonder (apart from link traversal) what use it really has in the real world.
> From my limited understanding of things, MR across an entire bucket is slow because you need to scan the entire key space. Much like a SELECT with no WHERE clause, a MapReduce without a key list is going to sweep across the entire bucket pulling all data off of disk in the process. Someone better versed in Riak internals would have to chime in on why this is so slow.
> I do know, however, that having one master bucket o' things may not be the best idea. If you're storing sales data, a better approach may be to segment buckets out by primary querying criteria.
The thing is, the *only* querying criterion for this app is to aggregate the data for the purpose of reporting. Our finance guys want to be able to check a few boxes to specify how they want the data grouped, then generate the report. We actually do this data warehousing style in MySQL at the moment, where the data is denormalized about as far as we can, but still allowing for some run-time use of GROUP BY and ordering.
I could use separate buckets for each product, perhaps, but we'd still be looking at a lot of data and I'd have to repeatedly query for each product, then merge all those aggregates together.
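To make the grouping and merging concrete, here is a rough sketch in plain Python (no Riak involved). The record fields `product`, `month` and `revenue` and the sample figures are assumptions for illustration only. `aggregate` plays the role of the GROUP BY the finance guys pick with their checkboxes, and `merge` is the step of combining one partial result per product bucket.

```python
from collections import Counter

# Hypothetical sales records; the field names are assumptions for illustration.
SALES = [
    {"product": "widget", "month": "2013-01", "revenue": 100},
    {"product": "widget", "month": "2013-02", "revenue": 150},
    {"product": "gadget", "month": "2013-01", "revenue": 40},
    {"product": "gadget", "month": "2013-02", "revenue": 60},
]


def aggregate(records, group_by):
    """Sum revenue per combination of the chosen grouping fields."""
    totals = Counter()
    for rec in records:
        key = tuple(rec[field] for field in group_by)
        totals[key] += rec["revenue"]
    return totals


def merge(partials):
    """Combine partial aggregates, e.g. one per product bucket."""
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds values rather than replacing
    return merged


# One partial aggregate per product, as if each product had its own bucket.
per_product = [
    aggregate([r for r in SALES if r["product"] == p], ("month",))
    for p in ("widget", "gadget")
]
by_month = merge(per_product)
```

Merging the per-product partials gives the same totals as aggregating everything in one pass, so splitting by product changes how many queries are issued but not the amount of data that has to be read.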
> I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems. That sounds like a myth now though, unless we're just looking at Hadoop.
> MapReduce can aggregate well. RDBMSes can aggregate well, too. But there are practical limits to each paradigm that you have to be aware of. From watching the list and working with people in the real world there are some paradigms that you want to use in Riak that don't immediately seem obvious coming from the RDBMS world.
> - Know your queries. Examine your application and make a list of common queries that you're going to be running. You'll want to be familiar with this list because you're going to try to write your data in a way that answers these questions.
As I say, the only query we'll be running is to aggregate the data for the purpose of reporting. I was exploring Riak because its MapReduce attracted me, and the various articles I read show off Facebook-like networks (which are naturally huge), Twitter stream analysis, etc. But for aggregating data from huge record sets, it just seems like it can't do it. People say to use secondary indexes, but that only helps you to narrow down your recordset… if you *know* your recordset is going to contain hundreds of thousands of records, then it seems like Riak is going to give you hell. In other systems, the place where MapReduce shines is big data processing; but in Riak's case it's the inverse.
> - Know your query pattern. Understand how these queries are working with the data itself. If many queries retrieve data aggregated for a day, week, month, store, sales region, or other logical division then you'll want to make sure that you can answer them quickly. (See Choosing the Right Tool)
We're never requesting just a view on a single day, or month etc… on the UI we present a table with a row for each period that has been grouped by (e.g. a row for each month that the business has been operating). I'm actually ripping the data out of MySQL and into Riak for this, but keeping it in MySQL in the first place seems way more efficient. Especially if I have to employ the same data warehousing tricks to pre-aggregate the data as much as I can anyway. I could probably iterate through each month and perform repeated MapReduce queries across those months, but I suspect that would be extremely slow too, due to the repeat querying.
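For what it's worth, the pre-aggregation I mean would look roughly like the sketch below. It is plain Python with a dict standing in for a bucket of rollup objects keyed by month. The helpers `record_sale` and `report` and the rollup fields are hypothetical, and the sketch ignores the conflict handling that concurrent updates would need in a distributed store.

```python
from collections import defaultdict

# A plain dict stands in for a bucket of rollup objects keyed by month.
rollups = defaultdict(lambda: {"revenue": 0, "units": 0})


def record_sale(month, revenue, units=1):
    """Update the month's rollup at write time (hypothetical helper)."""
    rollup = rollups[month]
    rollup["revenue"] += revenue
    rollup["units"] += units


def report(months):
    """Build one report row per month by reading rollups, not raw sales."""
    return [
        (m, rollups[m]["revenue"], rollups[m]["units"])
        for m in months
        if m in rollups  # skip months with no sales recorded
    ]


record_sale("2013-01", 100)
record_sale("2013-01", 40, units=2)
record_sale("2013-02", 150)
rows = report(["2013-01", "2013-02", "2013-03"])
```

The report then reads one small object per month instead of every sale, but each new grouping the finance guys ask for needs its own set of rollups, which is the same warehousing work we already do in MySQL.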
Is this just a current limitation in Riak, or a fundamental design conflict that means Riak can never be used to solve these kinds of problems?