Riak doesn't use consistent hashing
jlangevin at loomlearning.com
Thu May 26 09:57:47 EDT 2011
That sounds quite disconcerting. What happens to the performance of the
cluster when this occurs?*
Wilmington, NC: (910) 241-0433 - jlangevin at loomlearning.com -
www.loomlearning.com - Skype: intel352
On Thu, May 26, 2011 at 1:54 AM, Greg Nelson <grourk at dropcam.com> wrote:
> I've been doing some digging through the details of how a node joins a
> cluster. When you hear that Riak uses consistent hashing, you'd expect it
> to distribute keys to nodes by hashing keys onto the ring AND hashing nodes
> onto the ring. Keys belong to the closest node on the ring, in the
> clockwise direction. Add a node, it hashes onto the ring and takes over
> some keys. Ordinarily the node would hash onto the ring in several places,
> to achieve better spread. Some data (roughly 1 / #nodes) moves to the new
> node from each of the other nodes, and everything else stays the same.
> In what Amazon describes as operationally simpler (strategy 3 in the Dynamo
> paper), the ring is instead divided into equally-sized partitions. Nodes
> are hashed onto the ring, and preflists are calculated by walking clockwise
> from a partition, skipping partitions on already visited nodes. Riak does
> something similar: it divides the ring into equally-sized partitions, then
> nodes "randomly" claim partitions. However, the skipping bit isn't part of
> Riak's preflist calculation. Instead, nodes claim partitions in such a way
> as to be spaced out by target_n_val, to obviate the need for skipping.
> Now, getting back to what happens when a node joins. The new node
> calculates a new ring state that maintains the target_n_val invariant, as
> well as trying to keep even spread of partitions per node. The algorithm
> (default_choose_claim) is heuristic and greedy in nature, and recursively
> transfers partitions to the new node until optimal spread is achieved,
> maintaining target_n_val along the way. But if -- during one of those
> recursive calls -- it can't meet the target_n_val, it will throw up its
> hands and completely re-do the whole ring (by calling claim_rebalance_n).
> Striping the partitions across nodes, in a round-robin fashion. When that
> happens, most of the data needs to be handed off between nodes.
> This happens a lot, with many ring sizes. With ring_creation_size=128
> (i.e., 128 partitions), it will happen when adding node 9 (87.5% of data
> moves), adding node 12 (82%), adding node 15 (80%), adding node 19 (94%).
> It happens with all ring sizes >= 128 (256, 512, 1024, ...). It appears
> that any ring_creation_size (64 by default) is safe for growing to 8 nodes
> or so. But if you want to go beyond that... A ring size of >= 128 with
> more than 8 nodes doesn't seem all that unusual, surely someone has hit this
> before? I've filed a bug report here:
> Anyway, this feels like a bit of a departure from consistent hashing. In
> fact, could this not be replaced by normal hashing + a lookup table mapping
> intervals of the hash space to nodes? And isn't that simply sharding?
> At any rate, I believe the claim algorithm can be improved to avoid those
> "throw up hands and stripe everything" scenarios. In fact, here is such an
> implementation: https://github.com/basho/riak_core/pull/55. It is still
> heuristic and greedy, but it seems to do a better job of avoiding re-stripe.
> Test results are attached in a zip on the bug linked above. I'd love to
> get the riak_core gurus at Basho to look at this and help validate it. It
> probably could use some cleaning up, but I want to make sure there aren't
> other invariants or considerations I'm leaving out -- besides maintaining
> target_n_val, keeping optimal partition spread, and minimizing handoff
> between ring states.
> riak-users mailing list
> riak-users at lists.basho.com
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users