Re: Riak + Disco (MapReduce alternative)

Antonio Rohman Fernandez rohman at mahalostudio.com
Wed Apr 17 11:52:03 EDT 2013


 

I think avoiding repeated maps might not be much of a problem... From
this example on their site:


---------------------------------------------------------------
from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    input = ["http://discoproject.org/media/text/chekhov.txt"]
    job = Job().run(input=input, map=map, reduce=reduce)
    for word, count in result_iterator(job.wait()):
        print word, count
---------------------------------------------------------------


I can imagine putting the URL of an index query in the __main__ section to
get the list of keys as plain text, one key per line, and sending it to
the map functions... I deduce from the example that each map call gets one
line of the input... in it I would do a Riak GET of that key ( the line ),
select the data I want to reduce from that object, and send it to the
reduce function...

Example:
RIAK INDEX: get all sales of today
MAP: get a key, check if the customer is a woman between 18 and 25 years old, then return 1 ( any other return 0 )
REDUCE: sum all the 1s to give a total counter
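
A rough sketch of how that example could look, just to make the idea concrete. It is not tested against Disco or Riak: the Riak GET is replaced by a lookup in a plain dict, the record fields ( "gender", "age" ), the key names and the helper names ( fetch, map_sale, reduce_count, run_locally ) are all made up for illustration, and the job is run locally instead of being submitted to a Disco cluster. The map and reduce functions only follow the ( input, params ) shape of the word count example above.

```python
# Sketch only: FAKE_RIAK stands in for the Riak cluster, and the record
# fields ("gender", "age") are hypothetical.
FAKE_RIAK = {
    "sale-1": {"gender": "F", "age": 22},
    "sale-2": {"gender": "M", "age": 30},
    "sale-3": {"gender": "F", "age": 41},
    "sale-4": {"gender": "F", "age": 18},
}


def fetch(key):
    """Placeholder for a Riak GET of one key; returns None if missing."""
    return FAKE_RIAK.get(key)


def map_sale(line, params):
    """One input line is one key: emit 1 for a matching sale, else 0."""
    sale = fetch(line.strip())
    match = (
        sale is not None
        and sale["gender"] == "F"
        and 18 <= sale["age"] <= 25
    )
    yield "women_18_25", 1 if match else 0


def reduce_count(pairs, params):
    """Sum all the emitted values into a single counter."""
    total = 0
    for _key, value in pairs:
        total += value
    yield "women_18_25", total


def run_locally(key_lines):
    """Local stand-in for the Disco job: map every line, then reduce once."""
    mapped = [pair for line in key_lines for pair in map_sale(line, None)]
    return dict(reduce_count(mapped, None))


if __name__ == "__main__":
    # The list of keys would come from the Riak index query.
    print(run_locally(["sale-1", "sale-2", "sale-3", "sale-4"]))
```

In a real job the list of keys would come from the index query, and fetch would be a GET against the Riak cluster from inside each map.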

Riak MR can do it too, but from what we saw on the list some time ago,
it is not wise to use MR for this kind of operation ( especially on an
on-demand basis )

I still have to try it and see... but it would be a nice way to do a
distributed multiGET and reduce the data to a result without putting
much load on Riak...

Thanks,
Rohman


On 17.04.2013 16:19, Jens Rantil wrote: 

> Hi,
>
> I've been following the Disco Project for a couple of years. The tricky
> part with using Disco with Riak would be to make sure each map phase is
> not executed multiple times over the same data*. Also, since each map
> phase would (preferably) run on the same host as its data (for data
> locality), you would also have to make sure to only iterate over data
> that is associated with the vnode for that physical host.
>
> If you can easily extract host-specific keys for a specific vnode, then
> this is doable. However, either the Disco master or the Disco job
> submitter will need to have all this data when a job is submitted.
>
> Also, I'm not sure that it will help very much that both are written in
> Erlang.
>
> Some ideas,
>
> Jens
>
> * Obviously, you could also chain your mapreduce jobs in Disco to remove
> duplicate maps, but this introduces overhead.
> 
> FROM: riak-users [mailto:riak-users-bounces at lists.basho.com] ON BEHALF OF Antonio Rohman Fernandez
> SENT: 17 April 2013 13:15
> TO: riak-users at lists.basho.com
> SUBJECT: Riak + Disco (MapReduce alternative)

> 
> Hello everybody,
>
> Has anyone tried to use Riak with Disco? [ http://discoproject.org [1] ]
> I was looking for Hadoop alternatives ( as the RIAK-HADOOP connector
> project does not seem to be going anywhere ) and I think Disco is quite
> interesting; moreover, it is written in Erlang, same as Riak. Looks like
> it would be a good match!
>
> As seen on the mailing list, it seems that Riak's built-in MapReduce is
> not suitable for many of the queries I would be interested in doing...
> My idea would be to offload the MapReduce work to a Hadoop ( or Disco,
> or another ) cluster that will do the GETs on the Riak cluster through
> an Index ( as suggested on this list... do multi-gets instead of MR )
> and reduce the data independently. Does anybody have suggestions about
> this?
> 
> Thanks,
> Rohman
>
> [2]
>
> ANTONIO ROHMAN FERNANDEZ
> CEO, Founder & Lead Engineer
> rohman at mahalostudio.com
>
> PROJECTS
> MaruBatsu.es [3]
> PupCloud.com [4]
> Wedding Album [5]

-- 

[2]

ANTONIO ROHMAN FERNANDEZ
CEO, Founder & Lead Engineer
rohman at mahalostudio.com

PROJECTS
MaruBatsu.es [3]
PupCloud.com [4]
Wedding Album [5]


Links:
------
[1] http://discoproject.org
[2] http://mahalostudio.com
[3] http://marubatsu.es
[4] http://pupcloud.com
[5] http://wedding.mahalostudio.com