Understanding Riak's rebalancing and handoff behaviour

Sven Riedel sven.riedel at scoreloop.com
Mon Nov 15 03:32:45 EST 2010


On Nov 12, 2010, at 3:57 PM, Nico Meyer wrote:

> On Friday, 12.11.2010, at 08:43 +0100, Sven Riedel wrote:
>>> 
>>> It's not a problem right away. But since the replicated data is not
>>> actively synchronized in the background, any keys that were not copied
>>> before the node died have one less replica. That is, until they are read
>>> at least once, at which point read repair does replicate the key again.
>>> So it depends on your setup and requirements whether this is acceptable
>>> or not.
>> 
>> So if the relevant data isn't read for a while and two more nodes go down
>> (with an n_val of 3), there is a chance that some data is lost.
>> 
> 
> Correct. This might be highly unlikely though.

I agree. But it's still good to know, so that one is fully informed about the risks involved when choosing a small n_val for large, write-heavy datasets on few nodes :)
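
As an aside, if I understand read repair correctly, a read with R equal to n_val touches every replica, so vulnerable keys could be "healed" proactively by reading them once. A rough sketch using the internal Erlang client from 'riak attach' (the bucket and key names are made up, and the client API is my assumption, not something I've verified):

%% A sketch, not verified: reading with R = n_val (3 here) consults
%% all replicas; a missing replica should then be written back by
%% read repair as a side effect.
{ok, Client} = riak:local_client(),
{ok, Obj} = Client:get(<<"my_bucket">>, <<"some_key">>, 3).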

> 
>>> It kind of works anyway, but the vnodes are started and transferred
>>> sequentially. Normally four partitions are transferred in parallel, so I
>>> don't know if this is by design or by accident. The details are
>>> convoluted enough to suspect the latter.
>>> In any case this would also have the effect that those partitions
>>> won't show up in the output of riak-admin transfers, since only
>>> running vnodes are considered.
>> 
>> From what I saw in my case, the number of handoffs was displayed correctly in the beginning; however, the numbers didn't decrease (or change at all) as data got handed around.
>> 
> 
> If the node is not restarted, all vnodes are still running, so the
> number would be correct. Most likely no handoffs were being done
> anymore, because four of them crashed and the locks are not cleared in
> this case. By default only four handoffs are allowed at the same time.
> Have you looked in your logs for the error messages I mentioned?
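
A side note for the archives: I assume the four-at-a-time limit you mention is the handoff_concurrency setting in riak_core's section of app.config. If that assumption is right, raising it would look roughly like this excerpt (the setting name is my guess, unverified on my side):

%% app.config excerpt (a sketch; 'handoff_concurrency' is assumed
%% to be the riak_core setting behind the four-transfers limit,
%% with 4 as the default)
{riak_core, [
    {handoff_concurrency, 4}
]}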

I just had a look and I see a slightly different exception: 
On riak01 (which had been sending out data, and subsequently shut down):

ERROR: "Handoff receiver for partition ~p exiting abnormally after processing ~p objects: ~p\n" - [ 0,
{ noproc, 
  {gen_fsm,
   sync_send_all_state_event,
   [<0.24196.2404>,
    {handoff_data, 
    <<...binary...>>
   },
   60000]}}]

Nothing interesting that I can see on the receiving end(s).


> 
> 
>>> I should probably create a bug report for this, with my patch attached.
>>> Stupid laziness!
>>> 
>>> After reading your original post again, I think almost all of the things
>>> you saw can be explained by the bug that I mentioned in my first answer
>>> (the ring status of removed nodes is not synchronized with the remaining
>>> nodes). The problem obviously becomes worse if you remove several nodes
>>> at a time.
>> 
>> Which means that I shouldn't just wait for riak-admin ringready to return TRUE, but for the data handoffs to have completed as well before changing the network topology again?
>> 
> 
> Yes. You can check by watching e.g. 'df', or by listing the directories in
> your bitcask dir, which should be empty if everything went well.
> But with the two bugs that are still present it might never finish.
> The workaround for this is quite involved at the moment.
> I will try to create bug reports in the next few days.
> 

I see; this wasn't clear to me, and it is probably what caused everything to go off track.
I think I'll throw away the test cluster and start over, this time not just waiting for ringready. Maybe this should be mentioned more prominently in the documentation, alongside the description of riak-admin ringready. Having a "settled ring" implied to me that it was OK to proceed with topology changes :)
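
For the next attempt, this is the check I plan to run from 'riak attach' before touching the topology again; a rough sketch that assumes the packaged default bitcask data root (adjust the path to whatever your app.config says):

%% Each subdirectory under the bitcask root corresponds to a partition
%% that still holds data on this node; an empty list should mean all
%% handoffs have completed.
{ok, Partitions} = file:list_dir("/var/lib/riak/bitcask"),
length(Partitions).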

Regards,
Sven

------------------------------------------
Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
sven.riedel at scoreloop.com

Registered office: Munich; court of registration: Amtsgericht München, HRB 174805
Management board: Dr. Marc Gumpinger (chairman), Dominik Westner, Christian van der Leeden; chairman of the supervisory board: Olaf Jacobi
