Riak cluster unresponsive after single node failure

Armon Dadgar armon.dadgar at gmail.com
Tue May 8 14:18:19 EDT 2012


Hey,

The cluster is back up and running by going through the following steps:
  1) Force terminate east-riak-001 using the AWS console
  2) "riak-admin down riak@east-riak-001.cluster.kiip.me" on ALL nodes
  3) "riak stop && riak start" on ALL nodes
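
For anyone repeating this, the sequence can be written down as an ordered
dry-run plan. This is only a sketch: recovery_plan is a hypothetical helper,
the host list is an assumption (only 002 and 004 are named in the logs), and
running the "down" pass on every node before any restart is one reading of
the steps above, not something the thread confirms. Step 1 (terminating the
instance in the AWS console) stays manual. Nothing is executed; the commands
are only printed.

```python
def recovery_plan(failed_node, hosts):
    """Return (host, command) pairs: mark the dead node down everywhere, then restart."""
    # Pass 1: tell every surviving node that the failed node is down.
    plan = [(host, f"riak-admin down {failed_node}") for host in hosts]
    # Pass 2: restart Riak on every surviving node.
    plan += [(host, "riak stop && riak start") for host in hosts]
    return plan


# Hypothetical list of surviving hosts (assumption, see note above).
HOSTS = ["east-riak-002", "east-riak-003", "east-riak-004"]

for host, command in recovery_plan("riak@east-riak-001.cluster.kiip.me", HOSTS):
    print(f"{host}: {command}")
```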

All the nodes appear to have been blocked trying to talk to east-riak-001, which
was the ring claimant at the time. These steps seem to have cleared enough state
for the cluster to make progress again.
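
One way to check that the nodes really were blocked on a single peer is to
tally the busy_dist_port events by the remote node they name. The sketch
below assumes log lines in the format quoted further down, with node names
in their real riak@host form; the function name and the regular expression
are illustrative, not part of Riak.

```python
import re
from collections import Counter

# Captures the remote node at the end of a busy_dist_port event, e.g.
#   ... monitor busy_dist_port <0.4921.2570> [...] {#Port<0.35446433>,'riak@host'}
BUSY_RE = re.compile(r"monitor busy_dist_port .*\{#Port<[\d.]+>,'([^']+)'\}")


def busy_dist_port_targets(lines):
    """Count busy_dist_port events per remote node name."""
    counts = Counter()
    for line in lines:
        match = BUSY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the logs in this thread.
SAMPLE = [
    "2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 "
    "monitor busy_dist_port <0.4921.2570> [{initial_call,{riak_kv_put_fsm,init,1}},"
    "{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] "
    "{#Port<0.35446433>,'riak@east-riak-001.cluster.kiip.me'}",
    "2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 "
    "Monitor got {suppressed,port_events,7}",
    "2012-05-08 08:30:27.668 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 "
    "monitor busy_dist_port <0.3151.2570> [{initial_call,{riak_core_vnode,init,1}},"
    "{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] "
    "{#Port<0.35446433>,'riak@east-riak-001.cluster.kiip.me'}",
]

for node, count in busy_dist_port_targets(SAMPLE).most_common():
    print(node, count)
```

If nearly every event names the same node, that node is the one the rest of
the cluster is stuck on.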

Regarding the other questions:
  * Backend: LevelDB
  * OS: Ubuntu 10.04
  * Value size: 500 bytes to 1 KB
  * Traffic: 300 ops/sec

I will also send the vm.args file off-list.

Best Regards,

Armon Dadgar


On Tuesday, May 8, 2012 at 11:12 AM, Mark Phillips wrote:

> Hey Armon, 
> 
> So "monitor busy_dist_port" means your nodes aren't talking, but we need to figure out why. Specifically, it looks like your kv vnodes aren't able to communicate.
> 
> First questions
> 
> * Which backend are you using? 
> * What OS?
> * What size are your values?
> * What is the typical traffic (ops/second) on the cluster?
> 
> Also, if you could send a copy of your vm.args (probably best off-list) that would be helpful, too. 
> 
> Mark 
> 
> On Tue, May 8, 2012 at 10:25 AM, Armon Dadgar <armon.dadgar at gmail.com> wrote:
> > We are currently running a 4 node cluster with 1.1.2 on Ubuntu 10.04, and
> > are experiencing an issue where losing a single node has caused the entire
> > cluster to fail.
> > 
> > Nagios reported that node 1 had failed. Shortly afterward, all the logs filled with:
> > 2012-05-08 08:13:22.319 [error] <0.27873.2568>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"3a538aaa-b503-4a2e-94f9-7b62074815c7">>} to 'riak@east-riak-001.cluster.kiip.me' - nodedown
> > 2012-05-08 08:21:11.890 [error] <0.16614.2569>@riak_core_handoff_sender:start_fold:178 Handoff of partition riak_kv_vnode 456719261665907161938651510223838443642478919680 from 'riak@east-riak-004.cluster.kiip.me' to 'riak@east-riak-001.cluster.kiip.me' failed exit:{noproc,{gen_server2,call,[{riak_kv_handoff_listener,'riak@east-riak-001.cluster.kiip.me'},handoff_port,infinity]}}
> > 2012-05-08 08:23:18.005 [error] <0.19071.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"7a015cff-d361-4a38-a624-004f3e7bc76a">>} to 'riak@east-riak-001.cluster.kiip.me' - timeout
> > 2012-05-08 08:23:21.312 [error] <0.19178.2569>@riak_kv_put_fsm:prepare:199 Unable to forward put for {<<"session">>,<<"b9bd30fa-ddba-4219-b1d9-e4b761797615">>} to 'riak@east-riak-001.cluster.kiip.me' - timeout
> > ...
> > 2012-05-08 08:30:26.379 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4921.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak@east-riak-001.cluster.kiip.me'}
> > 2012-05-08 08:30:26.556 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,7}
> > 2012-05-08 08:30:26.616 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.4930.2570> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak@east-riak-001.cluster.kiip.me'}
> > 2012-05-08 08:30:27.565 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,4}
> > 2012-05-08 08:30:27.668 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.3151.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35446433>,'riak@east-riak-001.cluster.kiip.me'}
> > ...
> > 2012-05-08 10:20:30.088 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31200.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak@east-riak-001.cluster.kiip.me'}
> > 2012-05-08 10:20:31.261 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.31202.2576> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak@east-riak-001.cluster.kiip.me'}
> > 2012-05-08 10:20:32.736 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:85 monitor busy_dist_port <0.1629.2570> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35534018>,'riak@east-riak-001.cluster.kiip.me'}
> > 2012-05-08 10:20:33.552 [info] <0.77.0>@riak_core_sysmon_handler:handle_event:89 Monitor got {suppressed,port_events,3}
> > 
> > The logs are now almost entirely filled with "monitor busy_dist_port <0.22647.2610> [{initial_call,{riak_kv_put_fsm,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,0}] {#Port<0.35563927>,'riak@east-riak-001.cluster.kiip.me'}" or similar.
> > 
> > riak-admin is unable to report any information about the cluster, and neither is Riak Control.
> > Both just time out and return:
> > 
> > production-vpc east-riak-002 riak $ riak-admin ring_status 
> > Attempting to restart script through sudo -u riak
> > RPC to 'riak@east-riak-002.cluster.kiip.me' failed: {'EXIT',
> >                                                      {timeout,
> >                                                       {gen_server,call,
> >                                                        [riak_core_gossip,
> >                                                         legacy_gossip]}}}
> > 
> > 
> > At this point, the cluster has stopped responding to any requests as far as I can tell,
> > or any operations that do complete take well over 60 seconds for a single put with w=1.
> > 
> > Has anybody else seen this, and if so, any advice for getting it resolved?
> > 
> > Best Regards,
> > 
> > Armon Dadgar
> > 
> > 
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > 
> 
