MapReduce filtering question

Sean Cribbs sean at
Fri Nov 19 17:43:21 EST 2010

On Nov 19, 2010, at 2:15 PM, Parker Thompson wrote:

> I'm experimenting with Riak by trying to port a simple a/b testing framework that's currently SQL backed. Since I'm using Ripple/riak-client my code below is in Ruby/JS.
> The domain model is fairly simple. I have visitors, which get created for any user who hits the site, visitors see alternatives (currently these are ActiveRecord objects) and are tracked by creating experiences (the joining of an alternative ID and a visitor). Finally, as visitors do things we track events, which are distinguished from one another by their classes.

The first concept you'll have to give up with Riak is "join tables", since you can't have indexes on them in the same way as you can with a relational DB.  A more natural model would be to have a "double" of the ActiveRecord object, which has the same key/id, and then links to all visitors who viewed that alternative.  That is, you'd have another model (or maybe just an RObject, depending on how you want to deal with it), like so:

class Riak::Alternative
  include Ripple::Document
  many :visitors, :class_name => "Riak::Visitor"
  property :alternative_id, Integer, :presence => true
  key_on :alternative_id
end

Then some portions of your MapReduce query will become simpler, some more difficult.  I'm using a technique below I blogged about called "forwarding", which puts the data you want to return at the end of the query in the keyData for subsequent phases.  In a relational DB you'd probably use a nested SELECT or some crazy group by/having combination.  The Riak version feels more like a fanout (and double-back).


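def visitors_who_shared
  # Assumes Ripple.client is configured; the two middle map phases use the helpers defined below.
  Riak::MapReduce.new(Ripple.client).
    add("riak_alternatives", ar_id.to_s).
    link(:bucket => 'riak_visitors').
    map(link_to_events_forward_visitor).
    map(map_share_event_to_visitor).
    reduce(["riak_kv_mapreduce", "reduce_set_union"]).
    map(map_identity, :keep => true).run
end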

# Inspect the links, select the ones that point to events, and put the visitor's key in the keyData.
# You could also put the whole object in the keyData, but forwarding only the key saves bandwidth and computation.
def link_to_events_forward_visitor
  <<-JS
    function(object, keyData, arg){
      return object.values[0].metadata.Links.reduce(function(acc, link){
        if(link[0] == "events")
          acc.push([link[0], link[1], object.key]);
        return acc;
      }, []);
    }
  JS
end

# If the data is a ShareEvent, map to the visitor who created it.
# The bucket name matches the one the link phase walked ('riak_visitors').
def map_share_event_to_visitor
  <<-JS
    function(v, keyData){
      var data = JSON.parse(v.values[0].data);
      if(data._type == "Riak::ShareEvent"){
        return [["riak_visitors", keyData]];
      } else {
        return [];
      }
    }
  JS
end

def map_identity
  "function(v){ return [v]; }"
end
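
To make the data flow concrete, here is a minimal pure-Ruby simulation of the forwarding pattern. It does not talk to Riak: the in-memory store, the sample keys, and every method name are hypothetical, and the two phase methods only mirror what the JavaScript phases above are meant to do.

```ruby
# Hypothetical in-memory stand-in for Riak: [bucket, key] => object.
# All keys and data here are made up for illustration.
def sample_store
  {
    ["riak_visitors", "v1"] => { links: [["events", "e1", "event"], ["events", "e2", "event"]] },
    ["riak_visitors", "v2"] => { links: [["events", "e3", "event"]] },
    ["riak_visitors", "v3"] => { links: [["events", "e4", "event"], ["events", "e5", "event"]] },
    ["events", "e1"] => { type: "Riak::ShareEvent" },
    ["events", "e2"] => { type: "Riak::ClickEvent" },
    ["events", "e3"] => { type: "Riak::ClickEvent" },
    ["events", "e4"] => { type: "Riak::ShareEvent" },
    ["events", "e5"] => { type: "Riak::ShareEvent" }
  }
end

# Run one map phase: look up each [bucket, key, key_data] input and
# concatenate whatever the block emits for it.
def run_map(store, inputs)
  inputs.flat_map do |bucket, key, key_data|
    yield key, store.fetch([bucket, key]), key_data
  end
end

# Phase 1 (fan out): emit one input per event link, forwarding the
# visitor's key as key_data for the next phase.
def forward_visitor_to_events(visitor_key, visitor)
  visitor[:links]
    .select { |bucket, _key, _tag| bucket == "events" }
    .map { |bucket, key, _tag| [bucket, key, visitor_key] }
end

# Phase 2 (double back): a share event maps to the forwarded visitor key;
# every other event maps to nothing.
def share_event_to_visitor(event, visitor_key)
  event[:type] == "Riak::ShareEvent" ? [["riak_visitors", visitor_key]] : []
end

def simulate_visitors_who_shared(store, visitor_keys)
  inputs = visitor_keys.map { |key| ["riak_visitors", key, nil] }
  events = run_map(store, inputs) { |key, obj, _kd| forward_visitor_to_events(key, obj) }
  shared = run_map(store, events) { |_key, obj, kd| share_event_to_visitor(obj, kd) }
  shared.uniq # stands in for reduce_set_union
end
```

With this sample data, visitor v3 has two share events, so the final `uniq` plays the role of the set-union reduce and collapses the duplicate bucket/key pair.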


The result of your visitors_who_shared method could then be used to vivify Visitor objects (it's straightforward, but I'm not putting the code here).
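
A minimal sketch of that vivification step, assuming each result is a hash with a "key" field (the shape an identity map phase would yield once the JSON response is parsed). The helper name `vivify_visitors` and the injectable `finder` are hypothetical; with Ripple you would probably pass something like `Riak::Visitor.method(:find)`, but verify that against your Ripple version.

```ruby
# Turn MapReduce results into model objects.
# results: array of hashes such as {"bucket" => "riak_visitors", "key" => "v1"}
# finder:  any callable that loads a model by key (hypothetical; injected so
#          this sketch runs without a Riak connection)
def vivify_visitors(results, finder)
  results
    .map { |obj| obj["key"] }
    .uniq                           # load each visitor only once
    .map { |key| finder.call(key) }
    .compact                        # skip keys whose lookup returned nil
end

# Usage with a stub finder standing in for a real model lookup.
stub_finder = ->(key) { key == "gone" ? nil : "visitor:#{key}" }
rows = [{ "key" => "v1" }, { "key" => "v1" }, { "key" => "gone" }, { "key" => "v3" }]
visitors = vivify_visitors(rows, stub_finder)
```

This version re-fetches each visitor by key, which costs one extra lookup per result. Building the objects directly from the returned data would avoid that, at the price of depending on how the model layer deserializes documents.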

Long term, you'll want to create your own built-in JavaScript functions instead of passing the source along with every query. I've also only solved one issue with your schema above (denormalizing the "experiences" into "alternatives"). Please ask again if you have other questions/issues.

Sean Cribbs <sean at>
Developer Advocate
Basho Technologies, Inc.

More information about the riak-users mailing list