MapReduce scalability

Christian Dahlqvist christian at basho.com
Thu Feb 28 04:32:20 EST 2013


Hi Boris,

Apart from not scaling quite as well as straight K/V access, emulating multiGET through MapReduce also has another significant drawback. MapReduce has no concept of quorum reads, and only work on a single copy of the data, which can be thought of basically as a read with R=1 that does not trigger read-repair. It is therefore possible that it can give inconsistent or incorrect results if all replicas do not have the same data. It is worth noting that MapReduce was designed as a way to efficiently spread compute work across the cluster, and re-appropriating it for use with data collection is not its designed purpose.

The recommended way to implement efficient multiget is to perform normal GET operations in parallel. If you are retrieving 20 objects, you don't necessarily need to do all 20 GETs in parallel, but could set it up to use perhaps 3 or 4 connections. If you then pair this with a connection pool that can grow and shrink in size (perhaps between a minimum and a maximum value) as load requires, you should be able to retrieve the objects in a reasonable time without overloading the cluster.

Best regards,

Christian


On 27 Feb 2013, at 02:18, Boris Okner <boris.okner at gmail.com> wrote:

> Thanks Christian,
> 
> The problem I'm trying to solve is to find the way to retrieve values for limited number of keys with the best possible latency (or maybe with decent latency which is balanced with decent throughput). Let's say we have keys stored in some cache 
> on top of Riak, and want to retrieve values, 20 at the time, to be able to implement pagination. Another alternative to mapreduce would to send multiple asynchronous gets, but then we'd have to worry about connection pool being exhausted if there's too many such "page" requests. So what would be the proper way to deal with the situation when we need to emulate multiple key retrieval?
> 
> On Tue, Feb 26, 2013 at 1:57 AM, Christian Dahlqvist <christian at basho.com> wrote:
> Hi Boris,
> 
> MapReduce is a very flexible and powerful way of querying Riak and allows processing to be performed locally where the data resides, which allows for efficient processing of larger data sets. A result of this is that every mapreduce job requires a covering set of vnodes (all vnodes that hold the data required for processing) to participate, meaning that it puts considerable more load on the system compared to straight K/V access and therefore does not scale quite as well. It is primarily designed for batch type processing over reasonably large amounts of data and scales well with increased data volumes as new nodes are added. We do however usually not recommended using it as an interface for realtime queries where low and predictable latencies are required and the concurrency level, and therefore load level on the cluster, can not be controlled.
> 
> I am not sure I understand what you mean by the performance degrading with the number of nodes, unless you are strictly measuring latency rather than throughput. As the number of nodes increase, it gets more and more likely that multiple physical nodes will be involved in the job, which will add to the amount of communication and coordination required between the nodes, thereby increasing latency. Could you please explain in more detail what you are trying to achieve?
> 
> Best regards,
> 
> Christian 
> 
> 
> On 25 Feb 2013, at 16:41, Boris Okner <boris.okner at gmail.com> wrote:
> 
>> Hello,
>> 
>> I'm experimenting with 2 Riak 1.3.0 nodes (both are "bare metal"), and it looks like mapreduce performs better when one of the nodes is down. The mapreduce requests are running on 20-key blocks. So am I doing something wrong, or is it an expected behaviour, i.e. mapreduce degrades with the the number of nodes increased? If the former, could 
>> you give me some pointers on how to set up it to get advantage of multiple nodes?
>> 
>> Thanks in advance for your help,
>> Boris
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20130228/dc90567a/attachment.html>


More information about the riak-users mailing list