Understanding Riak's rebalancing and handoff behaviour

Nico Meyer nico.meyer at adition.com
Thu Nov 11 10:05:03 EST 2010


Hi,

On Thursday, 11.11.2010 at 07:59 +0100, Sven Riedel wrote:
> Hi,
> thanks for the detailed reply. So you would suggest that somehow the
> partition allocation got into an inconsistent state across nodes. I'll
> have to check the logs to see if anything similar to your dump pops
> up.
> 
> > So I compared the ring states manually using the console, and in the
> > ring state on the removed node quite a few partitions were assigned
> > to
> > different nodes than what the other nodes thought.
> > After I manually synced the ring on the leaving node with the rest
> > of
> > the cluster by doing this on the console:
> > 
> > {ok,Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
> > get_my_ring, []).
> > riak_core_ring_manager:set_my_ring(R).
> > 
> > 
> 
> 
> That ought to be 
> 
> 
> riak_core_ring_manager:set_my_ring( Ring ).
> 
> 
> right? Just verifying because my Erlang is rather rudimentary :)

You are right.
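
For reference, the full corrected sequence on the console of the node
being removed would then be ('riak@otherNode' standing for any node that
is still a member of the cluster):

{ok, Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
             get_my_ring, []).
riak_core_ring_manager:set_my_ring(Ring).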

> > Also riak-admin ringready will not recognize this problem, as far as
> > I
> > read the code, because only the ring states of the current ring
> > members
> > are compared. I haven't tried it, cause I am still on 0.12.0. 
> > The same is apparently true for riak-admin transfers, which might
> > tell
> > you that there are no handoffs left, even if the removed node still
> > has
> > data.
> > 
> 
> 
> I'm running 0.13.0, so if we're stumbling over the same cause it's
> still there.
> 
> > 
> > I discovered another problem while debugging this. If you restart (or
> > it crashes) a node that you removed from the cluster which still has
> > data, it won't start handing off its data afterwards. The reason
> > being that the node watcher also does not get notified that the other
> > nodes are up, and so all of them are considered down. This also can
> > only be worked around manually via the erlang console.
> > 
> 
> 
> Why would that have to be worked around at all? My understanding is
> that, with the data duplication within the ring, having a single node
> encounter a messy and fatal accident shouldn't destabilize the entire
> ring. The nodes which contain the duplicate data would just take over
> until a replacement node gets added, and the newly dead node is
> removed (ok, via console).
> 

It's not a problem right away. But since the replicated data is not
actively synchronized in the background, the keys that were not copied
before the node died have one less replica. That is, until they are read
at least once, at which point read repair replicates the key again.
So whether this is acceptable or not depends on your setup and
requirements.
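
Reading the affected keys once from the console should be enough to
trigger that read repair without waiting for normal traffic. A minimal
sketch, assuming the local Erlang client API of the 0.12/0.13 releases
(riak:local_client/0 and the client's get/3); bucket, key and the R
value are placeholders:

{ok, C} = riak:local_client().
%% Any successful read of a key triggers read repair for that key.
{ok, _Obj} = C:get(<<"bucket">>, <<"key">>, 2).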

> 
> So this still leaves me with some of my original questions open:
> > > 
> > > 1. What would normally trigger a rebalancing of the nodes? 
> > > 2. Is there a way to manually trigger a rebalancing?
> > > 3. Did I do anything wrong with the procedure described above to
> > > be left in the current odd state by riak?
> 

Every vnode, which is responsible for one partition, checks after 60
seconds of inactivity whether it is residing on the node where it should
be, according to the ring state. If not, the data is sent to the correct
node. So the rebalancing of data is triggered by rebalancing the
partitions among the nodes in the ring.
The ring is balanced during the gossiping of the ring state, which every
node does with another random node every 0-60 seconds (the interval is
also random).
In the worst case it could take some minutes before the ring stabilizes,
but it is statistically likely to converge faster.
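
If you want to watch the ring converge, you can inspect the partition
assignment from the console. A rough sketch; riak_core_ring:all_owners/1
should be available in these versions of riak_core, but treat the exact
function names as an assumption:

{ok, Ring} = riak_core_ring_manager:get_my_ring().
Owners = riak_core_ring:all_owners(Ring).
%% Owners is a list of {PartitionIndex, OwnerNode} tuples.
length([I || {I, N} <- Owners, N =:= node()]).
%% Number of partitions the local node should own according to the ring.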

So there is really nothing to trigger manually. One problem I see has
to do again with restarting a node which still has data that should be
handed off to another node. Initially, only the vnodes that are owned by
a node are started, which by definition don't include the ones to be
handed off. But if the vnodes are never started, they won't perform the
handoff.
It kind of works anyway, but then the vnodes are started and transferred
sequentially. Normally four partitions are transferred in parallel, so I
don't know if this is by design or by accident. The details are
convoluted enough to suspect the latter.
In any case this also has the effect that those partitions won't show up
in the output of riak-admin transfers, since only running vnodes are
considered.

I also forgot that I patched another problem in my own version of Riak:
it prevents any further handoff of data after four of them have failed
with an error or timeout. This probably happened in your case, if your A
nodes became unresponsive for 30 minutes (did the machine swap, by the
way?).
I should probably create a bug report for this, with my patch attached.
Stupid laziness!

After reading your original post again, I think almost all of the things
you saw can be explained by the bug that I mentioned in my first answer
(the ring status of removed nodes is not synchronized with the remaining
nodes). The problem obviously becomes worse if you remove several nodes
at a time.
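
A quick way to check for that divergence is to compare the partition
ownership the removed node believes in with what a remaining member
believes, along the lines of the snippet quoted above. Again only a
sketch (comparing the raw ring terms directly can give false alarms
because of the gossip metadata, hence the all_owners comparison):

%% Run on the removed node; 'riak@otherNode' is any remaining member.
{ok, LocalRing} = riak_core_ring_manager:get_my_ring().
{ok, RemoteRing} = rpc:call('riak@otherNode', riak_core_ring_manager,
                            get_my_ring, []).
riak_core_ring:all_owners(LocalRing) =:= riak_core_ring:all_owners(RemoteRing).
%% false means the two nodes disagree about partition ownership.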


I should really file some bug reports for all the problems I found, but
it is just too much effort for me right now. I have fixed the problems
that bugged us the most myself, so I should at least provide the patches
like a good open source citizen :-)

Bye,

Nico







