Schema Architecture, Map Reduce & Key Lists

Bryan Fink bryan at basho.com
Thu Feb 10 14:54:33 EST 2011


On Thu, Feb 10, 2011 at 12:35 PM, Mat Ellis <mat at tecnh.com> wrote:
> We are converting a mysql based schema to Riak using Ripple. We're tracking
> a lot of clicks, and each click belongs to a cascade of other objects:
> click -> placement -> campaign -> customer
> i.e. we do a lot of operations on these clicks grouped by placement or sets
> of placements.
… snip …
> On a related noob-note, what would be the best way of creating a set of the
> clicks for a given placement? Map Reduce or Riak Search or some other
> method?

Hi, Mat.  I have an alternative strategy I think you could try if
you're up for stepping outside of the Ripple interface.  Your incoming
clicks reminded me of other stream data I've processed before, so the
basic idea is to store clicks as a stream, and then process that
stream later.  The tools I'd use to do this are Luwak[1] and
luwak_mr[2].

First, store all clicks, as they arrive, in one Luwak file (or maybe
one Luwak file per host accepting clicks, depending on your service's
arrangement).  Luwak has a streaming interface that's available
natively in distributed Erlang, or over HTTP by exploiting the
"chunked" encoding type.  Roll over to a new file on whatever
convenient trigger you like (time period, timeout, manual
intervention, etc.).
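
A rough sketch of the rollover idea, assuming a time-based trigger.  This is
not Luwak's API: the file-naming scheme (stream_name), the comma-separated
record layout (encode_click), and the host name are all hypothetical, chosen
only for illustration.  The bytes produced would be appended to whichever
Luwak file the name selects.

```python
from datetime import datetime, timezone

# Hypothetical rollover periods: one stream file per day or per hour.
PERIOD_FORMATS = {"day": "%Y-%m-%d", "hour": "%Y-%m-%dT%H"}


def stream_name(host: str, when: datetime, period: str = "day") -> str:
    """Name of the stream file a click arriving at `when` should go to.

    `when` must be timezone-aware; it is normalized to UTC so that every
    host rolls over at the same instant.
    """
    stamp = when.astimezone(timezone.utc).strftime(PERIOD_FORMATS[period])
    return f"clicks-{host}-{stamp}"


def encode_click(click_id: str, placement_id: str) -> bytes:
    """Serialize one click as a newline-terminated record (assumed layout)."""
    return f"{click_id},{placement_id}\n".encode()


if __name__ == "__main__":
    now = datetime(2011, 2, 10, 19, 54, 33, tzinfo=timezone.utc)
    print(stream_name("web1", now), encode_click("c1", "p7"))
```

Naming the file after the period makes rollover implicit: the first click of
a new period simply lands in a new file, with no coordination between hosts.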

Next, use map/reduce to process the stream.  The luwak_mr utility will
allow you to specify a Luwak file by name, and it will handle tossing
each of the chunks of that file to various cluster nodes for
processing.  The first stage of your map/reduce query just needs to be
able to handle any single chunk of the file.
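
To illustrate what "handle any single chunk" can mean in practice, here is a
local simulation in Python.  It does not use luwak_mr or Riak at all.  It
assumes newline-delimited "click_id,placement_id" records (a hypothetical
layout) and assumes that a chunk boundary may fall in the middle of a record,
so each map result keeps its leading and trailing fragments for the reduce
step to stitch back together.

```python
from collections import Counter


def placement_of(record: bytes) -> str:
    # Hypothetical record layout: b"click_id,placement_id"
    return record.split(b",")[1].decode()


def map_chunk(offset: int, chunk: bytes) -> dict:
    """Process one chunk in isolation, knowing nothing about its neighbours."""
    parts = chunk.split(b"\n")
    if len(parts) == 1:
        # No record boundary here: the whole chunk is part of one record.
        return {"offset": offset, "whole": chunk,
                "head": b"", "tail": b"", "counts": Counter()}
    # Only the interior lines are guaranteed to be complete records.
    counts = Counter(placement_of(r) for r in parts[1:-1] if r)
    return {"offset": offset, "whole": None,
            "head": parts[0], "tail": parts[-1], "counts": counts}


def reduce_chunks(results) -> Counter:
    """Merge per-chunk counts and stitch records split across chunks."""
    total = Counter()
    carry = b""  # unfinished record carried over from earlier chunks
    for r in sorted(results, key=lambda r: r["offset"]):
        if r["whole"] is not None:
            carry += r["whole"]
            continue
        record = carry + r["head"]
        if record:
            total[placement_of(record)] += 1
        total.update(r["counts"])
        carry = r["tail"]
    if carry:  # final record without a trailing newline
        total[placement_of(carry)] += 1
    return total


def count_clicks(data: bytes, chunk_size: int) -> Counter:
    """Simulate chunked processing of one stream file."""
    results = [map_chunk(i, data[i:i + chunk_size])
               for i in range(0, len(data), chunk_size)]
    return reduce_chunks(results)
```

The per-placement totals come out the same whatever chunk size is used, which
is the property the first map stage needs: no chunk depends on seeing any
other chunk.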

I've posted a few examples about how to use the luwak_mr
utility.[3][4][5]  They deal with analyzing events in baseball games
(another sort of stream of events).

Pros:
 - No need to list keys.
 - The time to process a day's data should be proportional to the
number of clicks on that day (i.e. proportional to the size of the
file).

Caveats:
 - Luwak works best with write-once data.  Modifying a block of a
Luwak file after it has been written causes the block to be copied,
and the old version of the block is not deleted.  (Even if some of
your data is modification-heavy, this might work for the non-modified
parts … like the key list for a day's clicks?)
 - I don't have good numbers for Luwak's speed/efficiency.
 - I've only recently started experimenting with Luwak in this
map/reducing manner, so I'm not sure if there are other pitfalls.

[1] http://wiki.basho.com/Luwak.html
[2] http://contrib.basho.com/luwak_mr.html
[3] http://blog.beerriot.com/2011/01/16/mapreducing-luwak/
[4] http://blog.basho.com/2011/01/20/baseball-batting-average%2c-using-riak-map/reduce/
[5] http://blog.basho.com/2011/01/26/fixing-the-count/


