Understanding Riaks rebalancing and handoff behaviour

Sven Riedel sven.riedel at scoreloop.com
Fri Nov 12 02:43:54 EST 2010

On Nov 11, 2010, at 4:05 PM, Nico Meyer wrote:

>>> I discovered another problem while debugging this. I you restart (or
>>> it
>>> crashes) a node that you removed from the cluster which still has
>>> data,
>>> it won't start handing off it's data afterwards. The reason being,
>>> that
>>> is the node watcher also does not get notified that the other nodes
>>> are
>>> up, and so all of them are considered down. This also can only be
>>> worked
>>> around manually via the erlang console.
>> Why would that have to be worked around at all? My understanding is
>> through the data duplication within the ring having a single node
>> encounter a messy and fatal accident shouldn't destabilize the entire
>> ring.  The nodes which contain the duplicate data would just take over
>> until a replacement node gets added, and the newly dead node is
>> removed (ok, via console).
> It's not a problem right away. But since the replicated data is not
> actively synchronized in the background the keys that were not copied
> until the node dies have one less replica. That is until they are read
> at least once, at which point read repair does replicate the key again.
> So it depends on your setup and requirements, if this is acceptable or
> not.

So if the relevant data isn't read in a while and two more nodes go down (with 
an n_val of 3), there is a chance that some data is lost.

>> So this still leaves me with some of my original questions open:
>>>> 1. What would normally trigger a rebalancing of the nodes? 
>>>> 2. Is there a way to manually trigger a rebalancing?
>>>> 3. Did I do anything wrong with the procedure described above to
>>>> be left in the current odd state by riak?
> Every vnode, which is responsible for one partition, checks after 60
> seconds of inactivity if its is residing one the node where it should
> be, according to the ring state. If not, the data is send to the correct
> node. So the rebalancing of data is triggered by rebalancing the
> partitions among the nodes in the ring.
> The ring is balanced during the gossiping of the ring state, which is
> done by every node with another random node at every 0-60 (also
> randomly) seconds.
> In the worst case it could take some minutes before the ring stabilizes,
> but its statistically likely to converge faster.

Ah, if the partition ownership changes, the data should get relocated straight away.
I had gossip put down only for immediate topology changes (which worked fine in my case, it just never got around to redistributing the data).

> So there's is nothing to trigger manually really. One problem I see has
> to do again with restarting a node, which still has data that should be
> handed off to another node. Initially only the vnodes that are owned by
> a node started, which by definition don't include the ones to be handed
> off. But if the vnodes are never started, they won't perform the
> handoff.

This sounds less confidence inspiring. On the off-chance that the node (or hardware) crashes between telling a node to leave the ring and the data handoff having been completed, you'll have more work on your hands than just restarting the node, and you have to know about this as well. This isn't a scenario that will happen often, but from the point of view of someone who wants the least amount of manual intervention when something goes wrong, I'd prefer resilience over performance.

> It kind of works anyway, but the vnodes are started and transfered
> sequentially. Normally four partitions are transferred in parallel, so I
> don't know if this is by design or by accident. The details are
> convoluted enough to suspect the latter.
> In any case this would also make also have the effect that those
> partitions won't show up in the output of riak-admin transfers, since
> only running vnodes are considered.

From what I saw in my case the number of handoffs were displayed correctly in the beginning, however the numbers didn't decrease (or change at all) as data got handed around.

> I also forgot, that I patched another problem in my own version of riak,
> which will prevent any further handoff of data after four of them failed
> with an error or timeout. This probably happened in your case, if your A
> nodes became unresponsive for 30 minutes (did the machine swap by the
> way?).

A was up and responsive the entire time. No swapping; the machines don't have any swap space configured. However, they were rather weak in the IO department, so the load did rise to 10 and above on bulk inserts, and to around 6 during the handoffs IIRC (on a two core machine). One of the reasons I wanted to check things out on instances with more IO performance.

> I should probably create a bug report for this, with my patch attached.
> Stupid laziness!
> After reading your original post again, I think almost all of the things
> you saw can be explained by the bug that I mentioned in my first answer
> (the ring status of removed nodes is not synchronized with the remaining
> nodes). The problem obviously becomes worse if you remove several nodes
> at a time.

Which means that I shouldn't just wait for riak-admin ringready to return TRUE, but for the data handoffs to have completed as well before changing the network topology again?

Thanks for your answers, they have been enlightening.


Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
sven.riedel at scoreloop.com

Sitz der Gesellschaft: München, Registergericht: Amtsgericht München, HRB 174805 
Vorstand: Dr. Marc Gumpinger (Vorsitzender), Dominik Westner, Christian van der Leeden, Vorsitzender des Aufsichtsrates: Olaf Jacobi 

More information about the riak-users mailing list