Understanding Riaks rebalancing and handoff behaviour

Nico Meyer nico.meyer at adition.com
Wed Nov 10 07:05:45 EST 2010


I have also seen similar (not quite the same though) problems when
removing nodes from the cluster. Normally what happens, is that most
partitions are moved away from the node that was removed but the handoff
of some partitions will always produce an error like this:

=ERROR REPORT==== 10-Nov-2010::11:13:28 ===
Handoff sender riak_kv_vnode 1097553475690883148533821910506661759878332153856 failed error:{badmatch,

On the node to which the data is handed off an error like this occurs:

=ERROR REPORT==== 10-Nov-2010::11:36:14 ===

** Generic server <0.9664.1698> terminating

** Last message in was {tcp,#Port<0.12814448>,




** When Server state == {state,#Port<0.12814448>,

** Reason for termination ==

** {normal,{gen_fsm,sync_send_all_state_event,




I cut the large binaries here for the sake of brevity.

Digging a little deeper I noticed that the distribution of partitions on
the node that was removed, as shown by riak-admin status, was always a
little different from the other nodes, which were still members of the
ring. Usually the partitions are quite unbalanced on the removed node.

So I compared the ring states manually using the console, and in the
ring state on the removed node quite a few partitions where assigned to
different nodes than what the other nodes thought.
After I manually synced the ring on the leaving node with the rest of
the cluster by doing this on the console:

{ok,Ring} = rpc:call('riak at otherNode', riak_core_ring_manger,
get_my_ring, []).

the remaining partitions could be handed off successfully. Of course the
nodename in the ring is wrong after that.

My analysis of what happens was as follows:

-when the node is removed the ring state is updated accordingly and sent
to all nodes, including the one that was just removed

-during some latter round of gossip, the ring is changed again. But this
time it is not updated on the node that was just removed, since only
members of the ring are considered for gossiping.

-the leaving node does handoff some partitions to nodes which don't own
that partition (anymore)

-now two things can happen:
  -the (wrong) node happens to have data for the partition in question
(so the vnode is running) -> the handoff works, although the data is
shipped off again shortly afterwards to the correct node. This is very
inefficient, but it works
  -the node has no data for that partition -> the vnode is startet on
demand an immediately exits again, because it tries to do a handoff
itself (after a timeout of 0 seconds when first started) which
immediately completes

I am not sure why the ring changes twice. Maybe it has to do with the
fact that the target_n_val setting is not consistent within at least one
of our clusters, which I just discovered myself. It doesn't seem to
depend on the number of nodes, or the ring_creations size, as I have
seen this problem consistently even after removing several nodes in a
row and on two different clusters with a ring size of 1024 and 64

Also riak-admin ringready will not recognize this problem, as far as I
read the code, because only the ring states of the current ring members
are compared. I haven't tried it, cause I am still on 0.12.0. 
The same is apparently true for riak-admin transfers, which might tell
you that there are no handoffs left, even if the removed node still has

I discovered another problem while debugging this. I you restart (or it
crashes) a node that you removed from the cluster which still has data,
it won't start handing off it's data afterwards. The reason being, that
is the node watcher also does not get notified that the other nodes are
up, and so all of them are considered down. This also can only be worked
around manually via the erlang console.

If any of this problems have been fixed in a recent version, I am sorry
for taking up your time and bandwidth :-). A short skimming of the
relevant modules didn't reveal anything hinting in that way.


Am Dienstag, den 09.11.2010, 16:08 +0100 schrieb Sven Riedel:
> Hi,
> I'm currently assessing how well riak fits our needs as a large scale data store. 
> In the course of testing riak, I've set up a cluster in Amazons with 6 nodes across two EC2 instances (m2.xlarge). After seeing surprisingly a surprisingly bad write performance (which I'll write more on in a separate post once I've finished my tests), I wanted to migrate the cluster to instances with a better IO performance.
> Lets call the original EC2 instances A and B. The plan was to migrate the cluster to new EC2 instances called C and D. During the following actions no other processes were reading/writing from/to the cluster. All instances are in the same availability zone.
> What I did so far was to tell all riak nodes on B to leave the ring and let the ring re-stabilize. One surprising behaviour here was that the riak nodes on A suddenly all went into deep sleep mode (process state D) for about 30 minutes, and all riak-admin status/transfer calls claimed all nodes were down when in fact they weren't and were quite busy. But left to themselves they sorted everything out in the end.
> Then I set up 3 new riak nodes on C and told them to join the cluster.
> So far everything went well. riak-admin transfers showed me that both the nodes on A and the nodes on C were waiting on/for handoffs. However, the handoffs didn't start. I gave the cluster an hour, but no data transfer got initiated to the new nodes. 
> Since I didn't find any way to manually trigger the handoff, I told all the nodes on A (riak01, riak02 and riak03) to leave the cluster and after the last node on A left the ring, the handoffs started.
> After all the data in riak01 got moved to the nodes on C, the master process shut down and the handoff for the remaining data from riak02 and riak03 stopped. I tried restarting riak01 manually, however riak-admin ringready claims that riak01 and riak04 (on C) disagree on the partition owners. riak-admin transfers still lists the same amount of partitions awaiting handoff as when the the handoff to the nodes on C started.
> My current data distribution is as follows (via du -c):
> On A:
> 1780 riak01/data
> 188948 riak02/data
> 3766736 riak03/data
> On B:
> 13215908 riak04/data
> 1855584 riak05/data
> 5745076 riak06/data
> riak04 and riak05 are awaiting the handoff of 341 partitions, riak06 of 342 partitions.
> The ring_creation_size is 512, n_val for the bucket is 3, w is set to 1.
> My questions at this point are:
> 1. What would normally trigger a rebalancing of the nodes? 
> 2. Is there a way to manually trigger a rebalancing?
> 3. Did I do anything wrong with the procedure described above to be left in the current odd state by riak?
> 4. How would I rectify this situation in a production environment?
> Regards,
> Sven
> ------------------------------------------
> Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
> sven.riedel at scoreloop.com
> Sitz der Gesellschaft: München, Registergericht: Amtsgericht München, HRB 174805 
> Vorstand: Dr. Marc Gumpinger (Vorsitzender), Dominik Westner, Christian van der Leeden, Vorsitzender des Aufsichtsrates: Olaf Jacobi 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

More information about the riak-users mailing list