SV: Riak + Disco (MapReduce alternative)

Antonio Rohman Fernandez rohman at
Wed Apr 17 11:52:03 EDT 2013


I think it might not be much of a problem to avoid repeating maps... From
this example on their site:

from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    input = ["http://discoproject.org/media/text/chekhov.txt"]
    job = Job().run(input=input, map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count

I can imagine putting the URL of an index in the __main__ section to
get an array of keys... split the plain-text array of keys by lines
and send it to the map functions... I deduce from the example that each
map call represents one line of the input... in which I would do a Riak
GET of that key ( line ) and select the data I want from that object
to send to the reduce function...
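As a minimal sketch of the first step — turning a Riak secondary-index (2i)
query result into line-per-key Disco input. The bucket and index names are
hypothetical; the only Riak-specific assumption is that a 2i query answers
with a JSON body of the form {"keys": [...]}:

```python
import json

# Sketch (assumed names, not tested against a live node): a 2i query such as
# GET /buckets/sales/index/date_bin/20130417 returns {"keys": ["k1", ...]}.

def index_keys(response_body):
    # Pull the object keys out of a 2i query response body.
    return json.loads(response_body)["keys"]

def as_disco_input(keys):
    # One key per line, so Disco hands each map call exactly one key.
    return "\n".join(keys)

if __name__ == "__main__":
    sample = '{"keys": ["sale-001", "sale-002", "sale-003"]}'
    print(as_disco_input(index_keys(sample)))
```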

RIAK INDEX: get all sales of today
MAP: get a key, check if the customer is a woman between 18 and 25
years old, then return 1 ( anything else returns 0 )
REDUCE: sum all the 1s to give a total counter
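The workflow above could look roughly like this as Disco-style map/reduce
functions. The bucket URL and the object fields ( "gender", "age" ) are
illustrative assumptions, not anything Riak prescribes:

```python
import json

RIAK_OBJECT_URL = "http://127.0.0.1:8098/buckets/sales/keys/%s"  # assumed bucket

def is_target_customer(sale):
    # The MAP predicate above: a woman between 18 and 25 years old.
    return sale.get("gender") == "female" and 18 <= sale.get("age", -1) <= 25

def map(line, params):
    # Each map call receives one index key, GETs that object from Riak,
    # and emits 1 or 0 depending on the predicate.
    import urllib2  # Python 2, matching the Disco example above
    key = line.strip()
    sale = json.loads(urllib2.urlopen(RIAK_OBJECT_URL % key).read())
    yield "matches", 1 if is_target_customer(sale) else 0

def reduce(iter, params):
    # Sum the 1s into a single counter.
    from disco.util import kvgroup
    for tag, ones in kvgroup(sorted(iter)):
        yield tag, sum(ones)
```

Only the predicate is pure; the map needs a live Riak node to run.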

Riak MR can do it also, but from what we saw on the list some time ago,
it is not wise to use MR for this kind of operation ( moreover on an
on-demand basis )

I still have to try it and see... but it would be a nice way to do a
distributed multiGET and reduce the data to a result without hassling
Riak...
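The "distributed multiGET plus reduce" idea can also be sketched
client-side with a plain thread pool; the fetch function is injected here
so the sketch runs without a Riak node, and all names are hypothetical:

```python
from multiprocessing.pool import ThreadPool

def multiget_reduce(keys, fetch, select, combine, zero, workers=8):
    # Fan the GETs out over a thread pool, pick the wanted value from each
    # object, then fold the values into one result.
    pool = ThreadPool(workers)
    try:
        values = pool.map(lambda k: select(fetch(k)), keys)
    finally:
        pool.close()
        pool.join()
    result = zero
    for value in values:
        result = combine(result, value)
    return result

if __name__ == "__main__":
    # Stand-in for Riak GETs: a plain dict of sale objects.
    store = {"s1": {"amount": 10}, "s2": {"amount": 5}, "s3": {"amount": 7}}
    total = multiget_reduce(sorted(store), store.get,
                            lambda sale: sale["amount"],
                            lambda a, b: a + b, 0)
    print(total)  # prints 22
```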


On 17.04.2013 16:19, Jens Rantil wrote: 

> Hi,
> I've been following the Disco Project for a couple of years. The tricky
> part with using Disco with Riak would be to make sure each map phase is
> not executed multiple times over the same data*. Also, since each map
> phase would (preferably) run on the same host as its data (for data
> locality), you would also have to make sure to only iterate over data
> that is associated with the vnode for that physical host.
> If you can easily extract host-specific keys for a specific vnode, then
> this is doable. However, either the Disco master or the Disco job
> submitter will need to have all this data when a job is submitted.
> Also, I'm not sure that it will help very much that both are written in
> Erlang.
> Some ideas,
> Jens
> * Obviously, you could also chain your mapreduce jobs in Disco to
> remove duplicate maps, but this introduces
> From: riak-users [mailto:riak-users-bounces at] On Behalf Of Antonio Rohman
> Sent: 17 April 2013 13:15
> To: riak-users at
> Subject: Riak + Disco (MapReduce alternative)

> Hello everybody,
> Has anyone tried to use Riak with Disco? [ [1] ] I was looking for
> Hadoop alternatives ( as the RIAK-HADOOP connector project seems not to
> be going anywhere ) and I think Disco is quite interesting; moreover,
> it is written in Erlang, same as Riak. Looks like it would be a good
> match!
> As seen in the mailing list, it seems that Riak's built-in MapReduce is
> not suitable for many of the queries I would be interested in doing...
> My idea would be to delegate the MapReduce work to a Hadoop ( or Disco,
> or another ) cluster that would do the GETs on the Riak cluster through
> an index ( as suggested on this list... do multi-gets instead of MR )
> and reduce the data independently. Does anybody have suggestions about
> this?
> Thanks,



CEO, Founder & Lead Engineer
rohman at 		 

PROJECTS [3] [4]
Wedding Album [5] 






More information about the riak-users mailing list