Understanding Riaks rebalancing and handoff behaviour

Nico Meyer nico.meyer at adition.com
Fri Nov 12 09:57:57 EST 2010

Am Freitag, den 12.11.2010, 08:43 +0100 schrieb Sven Riedel:
> > 
> > It's not a problem right away. But since the replicated data is not
> > actively synchronized in the background the keys that were not copied
> > until the node dies have one less replica. That is until they are read
> > at least once, at which point read repair does replicate the key again.
> > So it depends on your setup and requirements, if this is acceptable or
> > not.
> So if the relevant data isn't read in a while and two more nodes go down (with 
> an n_val of 3), there is a chance that some data is lost.

Correct. This might be highly unlikely though.

> > 
> >> 
> >> So this still leaves me with some of my original questions open:
> >>>> 
> >>>> 1. What would normally trigger a rebalancing of the nodes? 
> >>>> 2. Is there a way to manually trigger a rebalancing?
> >>>> 3. Did I do anything wrong with the procedure described above to
> >>>> be left in the current odd state by riak?
> >> 
> > 
> > Every vnode, which is responsible for one partition, checks after 60
> > seconds of inactivity if its is residing one the node where it should
> > be, according to the ring state. If not, the data is send to the correct
> > node. So the rebalancing of data is triggered by rebalancing the
> > partitions among the nodes in the ring.
> > The ring is balanced during the gossiping of the ring state, which is
> > done by every node with another random node at every 0-60 (also
> > randomly) seconds.
> > In the worst case it could take some minutes before the ring stabilizes,
> > but its statistically likely to converge faster.
> Ah, if the partition ownership changes, the data should get relocated straight away.
> I had gossip put down only for immediate topology changes (which worked fine in my case, it just never got around to redistributing the data).
> > 
> > So there's is nothing to trigger manually really. One problem I see has
> > to do again with restarting a node, which still has data that should be
> > handed off to another node. Initially only the vnodes that are owned by
> > a node started, which by definition don't include the ones to be handed
> > off. But if the vnodes are never started, they won't perform the
> > handoff.
> This sounds less confidence inspiring. On the off-chance that the node (or hardware) crashes between telling a node to leave the ring and the data handoff having been completed, you'll have more work on your hands than just restarting the node, and you have to know about this as well. This isn't a scenario that will happen often, but from the point of view of someone who wants the least amount of manual intervention when something goes wrong, I'd prefer resilience over performance.
As I mentioned in my next paragraph (not very clearly I must confess)
the handoff would eventually start, if not for the node watcher problem
I also mentioned in my first post. Scott Lystig Fritchie created a bug
report for this (https://issues.basho.com/show_bug.cgi?id=878). But this
problem only exists for nodes that are to be removed.
Other nodes might also need to handoff data, if some of their partitions
should be moved to another node. This happens also if you add a node.

The handoffs will eventually start, but it will take longer then normal.
At the start of the node and at every ring update a random vnode for a
parition not owned by the node is started, which might be empty, and
performs a handoff, which consequently might not have to do anything.
When the handoff finishes another random vnode is started (could be the
same vnode though).

Effectively all partitions in the ring are eventually tested, one after
another, and transferred if necessary. But how long it will take to
transfer all partitions is somewhat unpredictable because of the random
part. Also, only one at a time will be transferred, which might take
longer than usual.

> > It kind of works anyway, but the vnodes are started and transfered
> > sequentially. Normally four partitions are transferred in parallel, so I
> > don't know if this is by design or by accident. The details are
> > convoluted enough to suspect the latter.
> > In any case this would also make also have the effect that those
> > partitions won't show up in the output of riak-admin transfers, since
> > only running vnodes are considered.
> From what I saw in my case the number of handoffs were displayed correctly in the beginning, however the numbers didn't decrease (or change at all) as data got handed around.

If the node is not restarted, all vnodes are still running, so the
number would be correct. Most likely no handoffs were being done
anymore, because four of them crashed and the locks are not cleared in
this case. By default only four handoffs are allowed at the same time.
Have you looked in for the error messages I mentioned in your logs?

> > 
> > I also forgot, that I patched another problem in my own version of riak,
> > which will prevent any further handoff of data after four of them failed
> > with an error or timeout. This probably happened in your case, if your A
> > nodes became unresponsive for 30 minutes (did the machine swap by the
> > way?).
> A was up and responsive the entire time. No swapping; the machines don't have any swap space configured. However, they were rather weak in the IO department, so the load did rise to 10 and above on bulk inserts, and to around 6 during the handoffs IIRC (on a two core machine). One of the reasons I wanted to check things out on instances with more IO performance.

You wrote riak-admin status said the nodes were down, thats what I
meant. It means at least that the Erlang VM was unresponsive.

> > I should probably create a bug report for this, with my patch attached.
> > Stupid laziness!
> > 
> > After reading your original post again, I think almost all of the things
> > you saw can be explained by the bug that I mentioned in my first answer
> > (the ring status of removed nodes is not synchronized with the remaining
> > nodes). The problem obviously becomes worse if you remove several nodes
> > at a time.
> Which means that I shouldn't just wait for riak-admin ringready to return TRUE, but for the data handoffs to have completed as well before changing the network topology again?

Yes. By using e.g. 'df' or listing the directories in your bitcask dir,
which should be empty if everything went well.
But with the two bugs that are still present it might never finish.
The workaround for this is quite involved at the moment.
I will try to create bug reports in the next few days.

> Thanks for your answers, they have been enlightening.
> Regards,
> Sven
> ------------------------------------------
> Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
> sven.riedel at scoreloop.com
> Sitz der Gesellschaft: München, Registergericht: Amtsgericht München, HRB 174805 
> Vorstand: Dr. Marc Gumpinger (Vorsitzender), Dominik Westner, Christian van der Leeden, Vorsitzender des Aufsichtsrates: Olaf Jacobi 

More information about the riak-users mailing list