Riak cluster unresponsive after single node failure

Armon Dadgar armon.dadgar at gmail.com
Tue May 8 13:25:14 EDT 2012


We are currently running a 4 node cluster with 1.1.2 on Ubuntu 10.04, and
are experiencing an issue where losing a single node has caused the entire
cluster to fail.

Nagios reported that node 1 had failed. Shortly after, the logs on all nodes filled with:
2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to 'riak at east-riak-001.cluster.kiip.me' - nodedown
2012-05-08 08:21:11.890 [error] <0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition riak_kv_vnode 456719261665907161938651510223838443642478919680 from 'riak at east-riak-004.cluster.kiip.me' to 'riak at east-riak-001.cluster.kiip.me' failed exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'riak at east-riak-001.cluster.kiip.me'},handoff_port,infinity]}}
2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to 'riak at east-riak-001.cluster.kiip.me' - timeout
2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to 'riak at east-riak-001.cluster.kiip.me' - timeout
...
2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4921.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak at east-riak-001.cluster.kiip.me'}
2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,7}
2012-05-08 08:30:26.616 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4930.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak at east-riak-001.cluster.kiip.me'}
2012-05-08 08:30:27.565 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,4}
2012-05-08 08:30:27.668 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.3151.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak at east-riak-001.cluster.kiip.me'}
...
2012-05-08 10:20:30.088 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31200.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak at east-riak-001.cluster.kiip.me'}
2012-05-08 10:20:31.261 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31202.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak at east-riak-001.cluster.kiip.me'}
2012-05-08 10:20:32.736 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.1629.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak at east-riak-001.cluster.kiip.me'}
2012-05-08 10:20:33.552 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,3}




The logs now consist almost entirely of "monitor busy_dist_port <0.22647.2610> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35563927>,'riak at east-riak-001.cluster.kiip.me'}" or similar lines.
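
For anyone who wants to check whether the busy_dist_port messages all point at the same node, a minimal Python sketch along these lines could tally them per target node. The function name and regular expression are illustrative assumptions based only on the log lines quoted above; they are not part of Riak or its tooling.

```python
import re
from collections import Counter

# Assumption: each busy_dist_port line ends with {#Port<...>,'<node name>'},
# as in the log excerpts quoted above. The node name is captured verbatim.
BUSY_RE = re.compile(r"monitor busy_dist_port .*'([^']+)'\}\s*$")


def count_busy_dist_port(lines):
    """Return a Counter mapping node name -> number of busy_dist_port events."""
    counts = Counter()
    for line in lines:
        match = BUSY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to count_busy_dist_port and printing the result of most_common() would show which node the distribution port events refer to most often.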

riak-admin is unable to report any information about the cluster, and neither is Riak Control.
Both just time out and return:

production-vpc east-riak-002 riak $ riak-admin ring_status
Attempting to restart script through sudo -u riak
RPC to 'riak at east-riak-002.cluster.kiip.me' failed: {'EXIT',
                                                     {timeout,
                                                      {gen_server,call,
                                                       [riak_core_gossip,
                                                        legacy_gossip]}}}


At this point, the cluster has stopped responding to any requests as far as I can tell,
and any operations that do complete take well over 60 seconds for a single put with w=1.

Has anybody else seen this, and if so, is there any advice for getting it resolved?

Best Regards,

Armon Dadgar
