MapReduce scalability

Boris Okner boris.okner at gmail.com
Tue Feb 26 21:18:16 EST 2013


Thanks Christian,

The problem I'm trying to solve is how to retrieve values for a
limited number of keys with the best possible latency (or at least with decent
latency balanced against decent throughput). Let's say we have keys
stored in some cache on top of Riak, and want to retrieve values 20 at a time
to implement pagination. An alternative to mapreduce would be to send
multiple asynchronous gets, but then we'd have to worry about the connection
pool being exhausted if there are too many such "page" requests. So what
would be the proper way to deal with the situation when we need to emulate
multiple-key retrieval?
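
To make the "multiple asynchronous gets" alternative concrete, here is a
rough sketch of what I have in mind. It is not tied to any particular Riak
client: `fetch_one` is a hypothetical callable standing in for a single-key
get, and `multi_get`, `POOL` and `MAX_IN_FLIGHT` are names made up for
illustration. The assumption is that one bounded executor is shared by all
"page" requests and sized no larger than the connection pool, so extra gets
queue up instead of exhausting connections.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional

# Assumed cap on concurrent single-key gets; pick a value no larger than
# the size of the client's connection pool (hypothetical number).
MAX_IN_FLIGHT = 8

# One executor shared by every page request, so the cap is global rather
# than per page.
POOL = ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT)


def multi_get(
    keys: Iterable[str],
    fetch_one: Callable[[str], Optional[str]],
    pool: ThreadPoolExecutor = POOL,
) -> Dict[str, Optional[str]]:
    """Fetch the values for `keys` concurrently through a bounded pool.

    `fetch_one` is a stand-in for a single-key get against the store.
    Executor.map yields results in input order, so zipping them back
    onto the keys pairs each key with its own value.
    """
    key_list = list(keys)
    values = pool.map(fetch_one, key_list)
    return dict(zip(key_list, values))
```

A page of 20 keys would then be fetched with something like
`multi_get(page_keys, fetch_one)`. Latency for a page is roughly that of
the slowest get when the pool is idle, and grows with queueing once more
than `MAX_IN_FLIGHT` gets are outstanding.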

On Tue, Feb 26, 2013 at 1:57 AM, Christian Dahlqvist <christian at basho.com> wrote:

> Hi Boris,
>
> MapReduce is a very flexible and powerful way of querying Riak and allows
> processing to be performed locally where the data resides, which allows for
> efficient processing of larger data sets. A result of this is that every
> mapreduce job requires a covering set of vnodes (all vnodes that hold the
> data required for processing) to participate, meaning that it puts
> considerably more load on the system compared to straight K/V access and
> therefore does not scale quite as well. It is primarily designed for batch
> type processing over reasonably large amounts of data and scales well with
> increased data volumes as new nodes are added. We do, however, usually not
> recommend using it as an interface for realtime queries where low and
> predictable latencies are required and the concurrency level, and therefore
> load level on the cluster, cannot be controlled.
>
> I am not sure I understand what you mean by the performance degrading with
> the number of nodes, unless you are strictly measuring latency rather than
> throughput. As the number of nodes increases, it gets more and more likely
> that multiple physical nodes will be involved in the job, which will add to
> the amount of communication and coordination required between the nodes,
> thereby increasing latency. Could you please explain in more detail what
> you are trying to achieve?
>
> Best regards,
>
> Christian
>
>
> On 25 Feb 2013, at 16:41, Boris Okner <boris.okner at gmail.com> wrote:
>
> Hello,
>
> I'm experimenting with 2 Riak 1.3.0 nodes (both are "bare metal"), and it
> looks like mapreduce performs better when one of the nodes is down. The
> mapreduce requests are running on 20-key blocks. So am I doing something
> wrong, or is this expected behaviour, i.e. does mapreduce degrade as the
> number of nodes increases? If the former, could you give me some pointers
> on how to set it up to take advantage of multiple nodes?
>
> Thanks in advance for your help,
> Boris
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com