multi-get (yet again)

Kresten Krab Thorup krab at trifork.com
Thu Aug 9 05:11:06 EDT 2012


AFAIK the only issue with this approach is that M/R effectively runs with R=1, i.e. it doesn't ensure that a value is consistent across replicas.

IMHO riak_kv_mapreduce should have a map_get_object_value which does a proper RiakClient:get. It will be slower, but it will honour the bucket's default R value. Something like this:

%% Like map_object_value, but re-reads the object through the local
%% client so that the bucket's default R value is honoured.
map_get_object_value({error, notfound}=NF, KD, Action) ->
    notfound_map_action(NF, KD, Action);
map_get_object_value(RO, KD, Action) ->
    {ok, RiakClient} = riak:local_client(),
    %% Fetch by bucket *and* key.
    case RiakClient:get(riak_object:bucket(RO), riak_object:key(RO)) of
        {error, notfound}=NF ->
            notfound_map_action(NF, KD, Action);
        {ok, RiakObject} ->
            [riak_object:get_value(RiakObject)]
    end.
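
To make the R=1 concern concrete, here is a toy model in Python. It is only a sketch and not how Riak is implemented: the Replica class, the integer version (standing in for a vector clock), and the read function are all hypothetical names made up for illustration, and "consult the first r replicas" is a simplification of waiting for R replies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """One replica's copy of a key (hypothetical toy model, not a Riak API)."""
    value: Optional[str]  # None would mean this replica has no copy
    version: int          # stand-in for a vector clock; higher is newer


def read(replicas: List[Replica], r: int) -> Optional[str]:
    """Consult the first `r` replicas and return the newest value among them."""
    if r < 1:
        raise ValueError("r must be at least 1")
    consulted = replicas[:r]
    newest = max(consulted, key=lambda rep: rep.version)
    return newest.value


# Three replicas (N=3); the first one missed the latest write.
replicas = [Replica("old", 1), Replica("new", 2), Replica("new", 2)]

stale = read(replicas, r=1)  # what a read of a single replica might see
fresh = read(replicas, r=2)  # consulting a second replica finds the newer copy
print(stale, fresh)
```

With r=1 the answer depends entirely on which replica happens to be asked, which is the behaviour the proposed map_get_object_value is meant to avoid.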
                                                                                              
                                                                                              


Kresten


On Aug 9, 2012, at 10:46 AM, Parnell Springmeyer <ixmatus at gmail.com> wrote:

> Jeremy,
> 
> I was looking for something similar and first built an extra handler onto an internal erlang cowboy API server that used maelstrom (my own worker pool OTP application).
> 
> It was used to make a simple POST with a string of the {bucket, key} pairs and the server would concurrently GET and combine the results and send it back. This was very fast (thousands of keys GET in ms).
> 
> Since that seemed gross, I then decided (based on some input from someone else on the list) to try a simple Map/Reduce phase that uses the Erlang functions instead of JavaScript (since those are going to be really fast and take advantage of Erlang's concurrency better than the JavaScript VMs).
> 
> In Python, you can run that type of M/R phase without writing any Erlang code:
> 
> import riak
> 
> client = riak.RiakClient()
> 
> # Add your KNOWN bucket and key pairs (you can do this in a loop)
> query = client.add(bucket, key)
> query.add(bucket, key)
> query.add(bucket, key)
> # etc… (as many as you like)
> 
> # Now tell the map and reduce phases to use the Erlang module "riak_kv_mapreduce"
> # and its functions "map_object_value" and "reduce_set_union".
> # Chain the phases onto `query` so the inputs added above are used.
> results = query.map(["riak_kv_mapreduce", "map_object_value"]) \
>                .reduce(["riak_kv_mapreduce", "reduce_set_union"]) \
>                .run()
> 
> The above returns results faster for me than the brokered multi-get approach I used. (I guarantee my brokered multi-get is faster than anything you can do with python + gevent, so if that's the case, the M/R phase is definitely the route you want to go.)
> 
> So IMHO, it is very fast as long as you know the buckets and keys you want to get.
> 
> On Aug 9, 2012, at 12:11 AM, Jeremy Dunck wrote:
> 
>> I'm new to riak and need multi-get (that is, getting the value and/or
>> existence of keys in a single network-trip latency).
>> 
>> I was wondering what the latency of the map-reduce approach is?
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-February/003229.html
>> 
>> Alternatively, has anyone tried scaling concurrent gets (perhaps with
>> evented io) to do many concurrent requests and combining results on
>> the client?
>> 
>> I am toying with a python+gevent multiget function.  If the stance is
>> still that a multiget operation doesn't belong in core, I'm a little
>> surprised that there doesn't seem to at least be a nice client-lib API
>> func to do it.  It sure seems useful...
>> 
>> In my use-case, the immediate need is to know whether a db insert
>> needs to be done.  We're handling too many keys to want to store in
>> memory (so no redis, etc), and we don't want to go to the db more than
>> we need to, so it seems riak would be good here.  But we're getting
>> 1000s of potential insert keys and want to whittle down all those to a
>> relative few db inserts.
>> 
>> So I was thinking riak key-per-id, and insert to the db iff the riak
>> key doesn't exist, then add the riak key.  We'll get some race
>> conditions on the insert, but that's OK in our case.
>> 
>> We do need low latency on the riak check, though, hence either
>> multiplexing w/ eventing or map-reduce (if that latency is actually
>> good).
>> 
>> Am I doing it wrong?
>> 
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 



Mobile: + 45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S  |  Margrethepladsen 4  | DK- 8000 Aarhus C |  Phone : +45 8732 8787  |  www.trifork.com
 






