Riak doesn't use consistent hashing

Ben Tilly btilly at gmail.com
Thu May 26 10:21:36 EDT 2011


Performance is fine.  However requests get a "not found" response for an
extended period of time.  See
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/thread.html#4078for
previous discussion of what sounds like the same issue.

On Thu, May 26, 2011 at 6:57 AM, Jonathan Langevin <
jlangevin at loomlearning.com> wrote:

> That sounds quite disconcerting. What happens to the performance of the
> cluster when this occurs?*
>
>  <http://www.loomlearning.com/>
>  Jonathan Langevin
> Systems Administrator
> Loom Inc.
> Wilmington, NC: (910) 241-0433 - jlangevin at loomlearning.com -
> www.loomlearning.com - Skype: intel352
> *
>
>
> On Thu, May 26, 2011 at 1:54 AM, Greg Nelson <grourk at dropcam.com> wrote:
>
>>  I've been doing some digging through the details of how a node joins a
>> cluster.  When you hear that Riak uses consistent hashing, you'd expect it
>> to distribute keys to nodes by hashing keys onto the ring AND hashing nodes
>> onto the ring.  Keys belong to the closest node on the ring, in the
>> clockwise direction.  Add a node, it hashes onto the ring and takes over
>> some keys.  Ordinarily the node would hash onto the ring in several places,
>> to achieve better spread.  Some data (roughly 1 / #nodes) moves to the new
>> node from each of the other nodes, and everything else stays the same.
>>
>> In what Amazon describes as operationally simpler (strategy 3 in the
>> Dynamo paper), the ring is instead divided into equally-sized partitions.
>>  Nodes are hashed onto the ring, and preflists are calculated by walking
>> clockwise from a partition, skipping partitions on already visited nodes.
>>  Riak does something similar: it divides the ring into equally-sized
>> partitions, then nodes "randomly" claim partitions.  However, the skipping
>> bit isn't part of Riak's preflist calculation.  Instead, nodes claim
>> partitions in such a way as to be spaced out by target_n_val, to obviate the
>> need for skipping.
>>
>> Now, getting back to what happens when a node joins.  The new node
>> calculates a new ring state that maintains the target_n_val invariant, as
>> well as trying to keep even spread of partitions per node.  The algorithm
>> (default_choose_claim) is heuristic and greedy in nature, and recursively
>> transfers partitions to the new node until optimal spread is achieved,
>> maintaining target_n_val along the way.  But if -- during one of those
>> recursive calls -- it can't meet the target_n_val, it will throw up its
>> hands and completely re-do the whole ring (by calling claim_rebalance_n).
>>  Striping the partitions across nodes, in a round-robin fashion.  When that
>> happens, most of the data needs to be handed off between nodes.
>>
>> This happens a lot, with many ring sizes.  With ring_creation_size=128
>> (i.e., 128 partitions), it will happen when adding node 9 (87.5% of data
>> moves), adding node 12 (82%), adding node 15 (80%), adding node 19 (94%).
>>  It happens with all ring sizes >= 128 (256, 512, 1024, ...).  It appears
>> that any ring_creation_size (64 by default) is safe for growing to 8 nodes
>> or so.  But if you want to go beyond that...  A ring size of >= 128 with
>> more than 8 nodes doesn't seem all that unusual, surely someone has hit this
>> before?  I've filed a bug report here:
>> https://issues.basho.com/show_bug.cgi?id=1111
>>
>> Anyway, this feels like a bit of a departure from consistent hashing.  In
>> fact, could this not be replaced by normal hashing + a lookup table mapping
>> intervals of the hash space to nodes?  And isn't that simply sharding?
>>
>> At any rate, I believe the claim algorithm can be improved to avoid those
>> "throw up hands and stripe everything" scenarios.  In fact, here is such an
>> implementation:  https://github.com/basho/riak_core/pull/55.  It is still
>> heuristic and greedy, but it seems to do a better job of avoiding re-stripe.
>>  Test results are attached in a zip on the bug linked above.  I'd love to
>> get the riak_core gurus at Basho to look at this and help validate it.  It
>> probably could use some cleaning up, but I want to make sure there aren't
>> other invariants or considerations I'm leaving out -- besides maintaining
>> target_n_val, keeping optimal partition spread, and minimizing handoff
>> between ring states.
>>
>> -Greg
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110526/536342ff/attachment.html>


More information about the riak-users mailing list