riak handoffs stalled

Ciprian Manea ciprian at basho.com
Mon Jul 14 09:01:52 EDT 2014


Hi Leonid,

Let's try increasing handoff_timeout and see whether that solves your
problem.

Could you please paste the code below into a $ riak attach session:

riak_core_util:rpc_every_member_ann(application,set_env,[riak_core,
handoff_timeout, 5400000],infinity).
riak_core_util:rpc_every_member_ann(application,set_env,[riak_core,
handoff_receive_timeout, 5400000],infinity).

You should be able to exit back to the shell prompt by pressing ^D.
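
To put those values in context: assuming both settings are in milliseconds
(the usual unit for Erlang timeouts; this is an assumption, not something
confirmed in this thread), 5400000 corresponds to 90 minutes. The Python
sketch below is not part of Riak. It only does that conversion and works out
the average object rate for the three failed handoff attempts quoted further
down, with the figures copied from those log lines. The function names are
made up for illustration.

```python
# Assumption: the riak_core handoff timeouts are expressed in milliseconds.
HANDOFF_TIMEOUT_MS = 5_400_000

# (objects sent, seconds elapsed, failure reason), copied from the
# riak_core_handoff_sender error lines quoted in this thread.
FAILED_HANDOFFS = [
    (1_318_000, 1455.15, "closed"),
    (911_000, 2734.48, "closed"),
    (32_000, 963.06, "timeout"),
]


def timeout_minutes(timeout_ms):
    """Convert a millisecond timeout to minutes."""
    return timeout_ms / 1000 / 60


def objects_per_second(objects, seconds):
    """Average transfer rate of one handoff attempt."""
    return objects / seconds


def report():
    """Print the timeout in minutes and one summary line per failed attempt."""
    print(f"timeout: {timeout_minutes(HANDOFF_TIMEOUT_MS):.0f} minutes")
    for objects, seconds, reason in FAILED_HANDOFFS:
        rate = objects_per_second(objects, seconds)
        print(f"{objects:>9} objects in {seconds / 60:5.1f} min "
              f"({reason}): {rate:7.1f} objects/s")


if __name__ == "__main__":
    report()
```

The three attempts average roughly 900, 330 and 33 objects per second, and
each one ran for somewhere between 16 and 46 minutes before failing.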

Could you also please archive/compress the following and send it to me
directly by email:

+ the ring directory (including its content) from one of your riak nodes
+ recent log files (console.log, error.log, crash.log if any), same node
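
If it helps, the two items above can be bundled into one compressed file. The
following is only a sketch using Python's standard tarfile module; the
function name bundle_for_support and the paths in the usage note are
hypothetical, so substitute the ring and log directories from your own
installation.

```python
import tarfile
from pathlib import Path

# Log files requested above; crash.log may not exist on every node.
LOG_NAMES = ("console.log", "error.log", "crash.log")


def bundle_for_support(ring_dir, log_dir, out_path):
    """Write a gzip-compressed tar holding the ring directory and recent logs.

    ring_dir and log_dir are supplied by the caller; this sketch does not
    assume where Riak keeps them.
    """
    with tarfile.open(str(out_path), "w:gz") as tar:
        # Adding a directory is recursive by default, so its contents come along.
        tar.add(str(ring_dir), arcname="ring")
        for name in LOG_NAMES:
            log_file = Path(log_dir) / name
            if log_file.exists():
                tar.add(str(log_file), arcname=f"log/{name}")
    return out_path
```

For example, bundle_for_support("/var/lib/riak/ring", "/var/log/riak",
"riak-support.tar.gz") would apply if those happen to be the directories on
your node; both paths are assumptions here, not taken from this thread.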


Thanks,
Ciprian


On Mon, Jul 14, 2014 at 3:33 PM, Леонид Рябоштан <
leonid.riaboshtan at twiket.com> wrote:

> Hello,
>
> Riak version is 1.1.4-1. We set the transfer limit in the config to 4.
>
> I don't think we have riak-admin transfer-limit or riak-admin cluster plan.
>
> The problem is that the nodes can't hand partitions off to each other,
> probably because the partitions are too big. Each one has about 5k files
> (leveldb backend) and weighs about 10GB. There are no problems with smaller
> partitions. We can't find anything useful about the handoff failures in the
> Riak or system logs. The ulimit and Erlang port limits seem to be well above
> what is needed; we increased them 4 times today.
>
> It begins like:
> 2014-07-14 12:22:45.518 UTC [info]
> <0.10544.0>@riak_core_handoff_sender:start_fold:83 Starting handoff of
> partition riak_kv_vnode 68507889249886074290797726533575766546371837952
> from 'riak at 192.168.153.182' to 'riak at 192.168.164.133'
>
> And ends like:
> 2014-07-14 08:43:28.829 UTC [error]
> <0.2264.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition
> riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
> riak at 192.168.153.182' to 'riak at 192.168.164.133' FAILED after sending
> 1318000 objects in 1455.15 seconds: closed
> 2014-07-14 10:40:18.294 UTC [error]
> <0.11555.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition
> riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
> riak at 192.168.153.182' to 'riak at 192.168.164.133' FAILED after sending
> 911000 objects in 2734.48 seconds: closed
> 2014-07-14 09:43:43.197 UTC [error]
> <0.26922.2>@riak_core_handoff_sender:start_fold:152 Handoff of partition
> riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
> riak at 192.168.153.182' to 'riak at 192.168.164.133' FAILED after sending
> 32000 objects in 963.06 seconds: timeout
>
> Maybe we need to check something else on the target node? It keeps running
> into GC problems:
> 2014-07-14 12:30:03.579 UTC [info]
> <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.468.0>
> [{initial_call,{riak_kv_js_vm,init,1}},{almost_current_function,{xmerl_ucs,expand_utf8_1,3}},{message_queue_len,0}]
> [{timeout,118},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,136165}]
> 2014-07-14 12:30:44.386 UTC [info]
> <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.713.0>
> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
> [{timeout,126},{old_heap_block_size,0},{heap_block_size,1597},{mbuf_size,0},{stack_size,38},{old_heap_size,0},{heap_size,658}]
>
> We probably have some CPU issues here, but the node is not under load
> currently.
>
> Thank you,
> Leonid
>
>
> 2014-07-14 16:11 GMT+04:00 Ciprian Manea <ciprian at basho.com>:
>
>> Hi Leonid,
>>
>> Which Riak version are you running?
>>
>> Have you committed* the cluster plan after issuing the cluster
>> force-remove <node> commands?
>>
>> What is the output of $ riak-admin transfer-limit, run from one of your
>> riak nodes?
>>
>>
>> *Do not run this command yet if you have not done it already.
>> Please run a riak-admin cluster plan and attach its output here.
>>
>>
>> Thanks,
>> Ciprian
>>
>>
>> On Mon, Jul 14, 2014 at 2:41 PM, Леонид Рябоштан <
>> leonid.riaboshtan at twiket.com> wrote:
>>
>>> Hello, guys,
>>>
>>> It seems we have run into an emergency. I wonder if anyone can help
>>> with it.
>>>
>>> Everything described below happened because we were trying to rebalance
>>> the space used by nodes that were running out of space.
>>>
>>> Cluster is 7 machines now, member_status looks like:
>>> Attempting to restart script through sudo -u riak
>>> ================================= Membership ==================================
>>> Status     Ring    Pending    Node
>>> -------------------------------------------------------------------------------
>>> valid      15.6%     20.3%    'riak at 192.168.135.180'
>>> valid       0.0%      0.0%    'riak at 192.168.152.90'
>>> valid       0.0%      0.0%    'riak at 192.168.153.182'
>>> valid      26.6%     23.4%    'riak at 192.168.164.133'
>>> valid      27.3%     21.1%    'riak at 192.168.177.36'
>>> valid       8.6%     15.6%    'riak at 192.168.194.138'
>>> valid      21.9%     19.5%    'riak at 192.168.194.149'
>>> -------------------------------------------------------------------------------
>>> Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>>>
>>> The 2 nodes with 0% Ring were made to force-leave the cluster. They have
>>> plenty of data on them, which now seems to be inaccessible. Handoffs
>>> seem to be stuck. Node 'riak at 192.168.152.90' (in the same situation as
>>> 'riak at 192.168.153.182') tries to hand off partitions to
>>> 'riak at 192.168.164.133' but fails for an unknown reason after long
>>> delays (from 5 to 40 minutes). The partition it is trying to move is about
>>> 10GB in size. It grows slowly on the target node, but that is probably just
>>> the usual writes from normal operation. It doesn't get any smaller on the
>>> source node.
>>>
>>> Is there any way to let the cluster know that we want those nodes to
>>> actually remain members, so that there is no need to transfer their
>>> partitions? How can we redo the cluster ownership balance and revert
>>> this force-leave?
>>>
>>> Thank you,
>>> Leonid
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>