MapReduce scalability

Elias Levy fearsome.lucidity at gmail.com
Thu Feb 28 12:49:13 EST 2013


On Thu, Feb 28, 2013 at 5:53 AM, <riak-users-request at lists.basho.com> wrote:

> The recommended way to implement efficient multiget is to perform normal
> GET operations in parallel. If you are retrieving 20 objects, you don't
> necessarily need to do all 20 GETs in parallel, but could set it up to use
> perhaps 3 or 4 connections. If you then pair this with a connection pool
> that can grow and shrink in size (perhaps between a minimum and a maximum
> value) as load requires, you should be able to retrieve the objects in a
> reasonable time without overloading the cluster.
>

That's easier said than done with many of the existing Riak clients.  E.g.
the Ruby client is a blocking client and, as far as I know, does not support
pipelining.  You can spin up threads to do the work, but if you have to
fetch a few hundred items, spinning up that many threads may not
be feasible or performant, and the lack of pipelining means you incur
increased latency to service such large requests.
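
For illustration, here is a minimal sketch of the bounded-parallelism
approach from the quoted advice: a small, fixed pool of workers shared across
all keys rather than one thread per key. It is written in Python rather than
Ruby, and `multi_get`, `fetch_one`, and `max_workers` are hypothetical names,
not part of any Riak client. `fetch_one` stands in for whatever blocking GET
your client provides.

```python
from concurrent.futures import ThreadPoolExecutor


def multi_get(fetch_one, keys, max_workers=4):
    """Fetch every key using at most max_workers concurrent blocking GETs.

    fetch_one is a hypothetical callable (key -> value) that wraps a
    blocking client GET. Returns a dict mapping each key to its value.
    """
    keys = list(keys)
    if not keys:
        return {}
    # The pool caps concurrency: a few hundred keys are served by
    # max_workers threads instead of a few hundred threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input keys.
        values = list(pool.map(fetch_one, keys))
    return dict(zip(keys, values))


if __name__ == "__main__":
    # An in-memory dict stands in for the cluster in this example.
    store = {f"key{i}": i * i for i in range(300)}
    result = multi_get(store.__getitem__, store.keys(), max_workers=4)
    print(len(result))
```

Each worker still pays a full round trip per key, so this caps the thread
count but does nothing about the latency cost of missing pipelining.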

In general we've found that while MR is slower when bulk fetching a few
records, it is faster when fetching a large number of them.  I am willing to
take the R=1 hit on some types of data access to cut down the overall fetch
time.

In addition, MR has the benefit that you can implement map or reduce
phases to limit what data is transferred to the client.  We keep track of
some stats for some object types in a stats object, but we usually want to
fetch only a subset of the whole object.  We have implemented some Erlang map
functions that give us just what we want when using MR to fetch in bulk.
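
As a rough sketch of what such a bulk fetch looks like on the wire, the
Python snippet below builds the JSON body for a MapReduce job with explicit
bucket/key inputs and a single Erlang map phase. The helper name
`build_bulk_fetch_job` and the bucket and key values are made up for
illustration. The default phase uses the built-in
`riak_kv_mapreduce:map_object_value`, which returns whole values; a custom map
function that returns only a subset would be named in its place. Our own
module and function names are not shown here.

```python
import json


def build_bulk_fetch_job(bucket, keys,
                         module="riak_kv_mapreduce",
                         function="map_object_value",
                         arg=None):
    """Build a MapReduce job spec that fetches the given keys in one request.

    This is a hypothetical helper. module/function name the Erlang map
    function to run; arg is an optional static argument passed to it.
    """
    phase = {
        "language": "erlang",
        "module": module,
        "function": function,
        "keep": True,  # return this phase's output to the client
    }
    if arg is not None:
        phase["arg"] = arg
    return {
        # Explicit [bucket, key] pairs avoid listing the whole bucket.
        "inputs": [[bucket, key] for key in keys],
        "query": [{"map": phase}],
    }


if __name__ == "__main__":
    job = build_bulk_fetch_job("stats", ["user1", "user2", "user3"])
    # This JSON would be the body of the MapReduce request.
    print(json.dumps(job, indent=2))
```

Because the map phase runs on the nodes that hold the data, anything it drops
never crosses the network to the client.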

Now, if Basho implemented a proper multi-get call, I think many customers
would be rather happy.

Elias Levy

