Schema Architecture, Map Reduce & Key Lists

Mat Ellis mat at
Thu Feb 10 19:13:15 EST 2011

Good idea, thanks.


On Feb 10, 2011, at 4:10 PM, Alexander Sicular wrote:

> I would change the model and have another stream for "converted" clicks.
> -Alexander Sicular
> @siculars
> On Feb 10, 2011, at 5:58 PM, Mat Ellis wrote:
>> Thanks Bryan, that certainly looks interesting. The clicks are amended, but just once, and only a tiny percentage of them (when they convert). We're basically doing what you describe: taking a click stream and processing it once into a set of summary tables for reporting & decision making. We'll take a look at it as soon as we've finished getting our heads around the Ripple goodness.
>> Cheers
>> M.
>> On Feb 10, 2011, at 11:54 AM, Bryan Fink wrote:
>>> On Thu, Feb 10, 2011 at 12:35 PM, Mat Ellis <mat at> wrote:
>>>> We are converting a mysql based schema to Riak using Ripple. We're tracking
>>>> a lot of clicks, and each click belongs to a cascade of other objects:
>>>> click -> placement -> campaign -> customer
>>>> i.e. we do a lot of operations on these clicks grouped by placement or sets
>>>> of placements.
>>> … snip …
>>>> On a related noob-note, what would be the best way of creating a set of the
>>>> clicks for a given placement? Map Reduce or Riak Search or some other
>>>> method?
>>> Hi, Mat.  I have an alternative strategy I think you could try if
>>> you're up for stepping outside of the Ripple interface.  Your incoming
>>> clicks reminded me of other stream data I've processed before, so the
>>> basic idea is to store clicks as a stream, and then process that
>>> stream later.  The tools I'd use to do this are Luwak[1] and
>>> luwak_mr[2].
>>> First, store all clicks, as they arrive, in one Luwak file (or maybe
>>> one Luwak file per host accepting clicks, depending on your service's
>>> arrangement).  Luwak has a streaming interface that's available
>>> natively in distributed Erlang, or over HTTP by exploiting the
>>> "chunked" encoding type.  Roll over to a new file on whatever
>>> convenient trigger you like (time period, timeout, manual
>>> intervention, etc.).
>>> Next, use map/reduce to process the stream.  The luwak_mr utility will
>>> allow you to specify a Luwak file by name, and it will handle tossing
>>> each of the chunks of that file to various cluster nodes for
>>> processing.  The first stage of your map/reduce query just needs to be
>>> able to handle any single chunk of the file.
>>> I've posted a few examples about how to use the luwak_mr
>>> utility.[3][4][5]  They deal with analyzing events in baseball games
>>> (another sort of stream of events).
>>> Pros:
>>> - No need to list keys.
>>> - The time to process a day's data should be proportional to the
>>> number of clicks on that day (i.e. proportional to the size of the
>>> file).
>>> Caveats:
>>> - Luwak works best with write-once data.  Modifying a block of a
>>> Luwak file after it has been written causes the block to be copied,
>>> and the old version of the block is not deleted.  (Even if some of
>>> your data is modification-heavy, this might work for the non-modified
>>> parts … like the key list for a day's clicks?)
>>> - I don't have good numbers for Luwak's speed/efficiency.
>>> - I've only recently started experimenting with Luwak in this
>>> map/reducing manner, so I'm not sure if there are other pitfalls.
>>> [1]
>>> [2]
>>> [3]
>>> [4]
>>> [5]
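
The chunk-wise processing Bryan describes (a first map/reduce stage that only has to handle one chunk of the file) can be sketched in plain Python. This is not the Luwak or luwak_mr API; it only simulates the idea locally. The record layout, the chunk size, and every function name below are assumptions made for illustration. In particular it assumes fixed-width records and a chunk size that is a multiple of the record size, so that no record straddles two chunks.

```python
import struct
from collections import Counter

# Hypothetical fixed-width click record: (click_id, placement_id),
# each an unsigned 32-bit little-endian integer.
RECORD = struct.Struct("<II")

# Assumption: the chunk size is a multiple of the record size,
# so every chunk holds only whole records.
CHUNK_SIZE = RECORD.size * 4


def encode_clicks(clicks):
    """Serialize (click_id, placement_id) pairs into one byte stream."""
    return b"".join(RECORD.pack(cid, pid) for cid, pid in clicks)


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the stream into chunks, standing in for the blocks of a stored file."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_chunk(chunk):
    """Map phase: count clicks per placement within a single chunk."""
    counts = Counter()
    for _click_id, placement_id in RECORD.iter_unpack(chunk):
        counts[placement_id] += 1
    return counts


def reduce_counts(partials):
    """Reduce phase: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def clicks_per_placement(data):
    """Run the map phase over every chunk, then reduce the results."""
    return reduce_counts(map_chunk(chunk) for chunk in split_chunks(data))
```

Because each chunk is mapped independently, the chunks could be handed to different nodes, and the total work grows with the size of the file rather than with the number of keys in a bucket.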

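
Alexander's suggestion of a separate stream for converted clicks keeps both streams write-once: a conversion is appended as a new event instead of amending the stored click. A minimal sketch of how the two streams might be joined when the summary tables are built, assuming clicks arrive as (click_id, placement_id) pairs and conversions as bare click ids (both shapes and the function name are hypothetical):

```python
from collections import defaultdict


def summarize(clicks, conversions):
    """Build per-placement totals from two append-only streams.

    clicks:      iterable of (click_id, placement_id) pairs
    conversions: iterable of click_id values that later converted
    """
    converted = set(conversions)
    summary = defaultdict(lambda: {"clicks": 0, "conversions": 0})
    for click_id, placement_id in clicks:
        row = summary[placement_id]
        row["clicks"] += 1
        # The click record itself is never modified; conversion status
        # comes from membership in the second stream.
        if click_id in converted:
            row["conversions"] += 1
    return dict(summary)
```

Since only a tiny percentage of clicks convert, the set of converted ids should stay small relative to the click stream.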