full cluster failure

Gil Teixeira gil at pixelbeam.org
Wed Jun 13 00:42:40 EDT 2012


Let me try to provide some more information about my problem here.
As I explained, the full cluster becomes unresponsive in this scenario (when one node abruptly fails).

This is the console.log of one cluster node, right after boot:
2012-06-13 12:25:19.548 [info] <0.1721.0> Application bitcask started on node 'riak at 192.168.1.201'
2012-06-13 12:25:19.574 [info] <0.1975.0>@riak_core:wait_for_application:390 Wait complete for application riak_pipe (0 seconds)
2012-06-13 12:25:19.579 [info] <0.2119.0>@riak_core:wait_for_application:396 Waiting for application riak_kv to start (0 seconds).
2012-06-13 12:25:19.581 [info] <0.1721.0> Application riak_kv started on node 'riak at 192.168.1.201'
2012-06-13 12:25:19.583 [info] <0.1721.0> Application riak_search started on node 'riak at 192.168.1.201'
2012-06-13 12:25:19.583 [info] <0.1721.0> Application basho_stats started on node 'riak at 192.168.1.201'
2012-06-13 12:25:19.591 [info] <0.1721.0> Application runtime_tools started on node 'riak at 192.168.1.201'
2012-06-13 12:25:19.652 [info] <0.1906.0>@riak_core:wait_for_service:410 Wait complete for service riak_pipe (0 seconds)
2012-06-13 12:25:19.680 [info] <0.2119.0>@riak_core:wait_for_application:390 Wait complete for application riak_kv (0 seconds)
2012-06-13 12:25:19.802 [info] <0.2009.0>@riak_core:wait_for_service:410 Wait complete for service riak_kv (0 seconds)


Seems normal to me ;)
Nothing in error.log or crash.log.

I disconnect a node from the network, and when I try to retrieve data from the cluster, this is what appears in console.log:


2012-06-13 12:34:55.714 [info] <0.7.0> Application lager started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.742 [info] <0.7.0> Application public_key started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.749 [info] <0.7.0> Application ssl started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.757 [info] <0.7.0> Application riak_core started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.786 [info] <0.7.0> Application riak_control started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.787 [info] <0.7.0> Application basho_metrics started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.788 [info] <0.7.0> Application cluster_info started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.794 [info] <0.7.0> Application merge_index started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.799 [info] <0.192.0>@riak_core:wait_for_service:416 Waiting for service riak_pipe to start (0 seconds)
2012-06-13 12:34:55.809 [info] <0.261.0>@riak_core:wait_for_application:396 Waiting for application riak_pipe to start (0 seconds).
2012-06-13 12:34:55.810 [info] <0.7.0> Application riak_pipe started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.820 [info] <0.7.0> Application inets started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.822 [info] <0.7.0> Application mochiweb started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.825 [info] <0.7.0> Application erlang_js started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.829 [info] <0.7.0> Application luke started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.837 [info] <0.295.0>@riak_core:wait_for_service:416 Waiting for service riak_kv to start (0 seconds)
2012-06-13 12:34:55.855 [info] <0.312.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.312.0>)
2012-06-13 12:34:55.857 [info] <0.313.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.313.0>)
2012-06-13 12:34:55.858 [info] <0.314.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.314.0>)
2012-06-13 12:34:55.860 [info] <0.315.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.315.0>)
2012-06-13 12:34:55.862 [info] <0.316.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.316.0>)
2012-06-13 12:34:55.864 [info] <0.317.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.317.0>)
2012-06-13 12:34:55.866 [info] <0.318.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.318.0>)
2012-06-13 12:34:55.868 [info] <0.319.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_map) host starting (<0.319.0>)
2012-06-13 12:34:55.870 [info] <0.321.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.321.0>)
2012-06-13 12:34:55.871 [info] <0.322.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.322.0>)
2012-06-13 12:34:55.873 [info] <0.323.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.323.0>)
2012-06-13 12:34:55.875 [info] <0.324.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.324.0>)
2012-06-13 12:34:55.877 [info] <0.325.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.325.0>)
2012-06-13 12:34:55.879 [info] <0.326.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_reduce) host starting (<0.326.0>)
2012-06-13 12:34:55.880 [info] <0.328.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.328.0>)
2012-06-13 12:34:55.882 [info] <0.329.0>@riak_kv_js_vm:init:76 Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.329.0>)
2012-06-13 12:34:55.887 [info] <0.7.0> Application bitcask started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.922 [info] <0.261.0>@riak_core:wait_for_application:390 Wait complete for application riak_pipe (0 seconds)
2012-06-13 12:34:55.934 [info] <0.431.0>@riak_core:wait_for_application:396 Waiting for application riak_kv to start (0 seconds).
2012-06-13 12:34:55.938 [info] <0.7.0> Application riak_kv started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.940 [info] <0.7.0> Application riak_search started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.941 [info] <0.7.0> Application basho_stats started on node 'riak at 192.168.1.201'
2012-06-13 12:34:55.945 [info] <0.7.0> Application runtime_tools started on node 'riak at 192.168.1.201'
2012-06-13 12:34:56.015 [info] <0.192.0>@riak_core:wait_for_service:410 Wait complete for service riak_pipe (0 seconds)
2012-06-13 12:34:56.047 [info] <0.431.0>@riak_core:wait_for_application:390 Wait complete for application riak_kv (0 seconds)
2012-06-13 12:34:56.156 [info] <0.295.0>@riak_core:wait_for_service:410 Wait complete for service riak_kv (0 seconds)
2012-06-13 12:35:24.556 [error] <0.729.0> gen_fsm <0.729.0> in state wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
2012-06-13 12:35:24.564 [error] <0.729.0> CRASH REPORT Process <0.729.0> with 0 neighbours crashed with reason: {sink_died,shutdown}
2012-06-13 12:35:24.572 [error] <0.193.0> Supervisor riak_pipe_builder_sup had child undefined started with {riak_pipe_builder,start_link,undefined} at <0.729.0> exit with reason {sink_died,shutdown} in context child_terminated
2012-06-13 12:35:24.573 [error] <0.724.0> gen_fsm <0.724.0> in state wait_pipeline_shutdown terminated with reason: {sink_died,normal}
2012-06-13 12:35:24.575 [error] <0.724.0> CRASH REPORT Process <0.724.0> with 2 neighbours crashed with reason: {sink_died,normal}
2012-06-13 12:35:24.577 [error] <0.193.0> Supervisor riak_pipe_builder_sup had child undefined started with {riak_pipe_builder,start_link,undefined} at <0.724.0> exit with reason {sink_died,normal} in context child_terminated
2012-06-13 12:35:24.578 [error] <0.195.0> Supervisor riak_pipe_qcover_sup had child undefined started with {riak_core_coverage_fsm,start_link,undefined} at <0.731.0> exit with reason {sink_died,normal} in context child_terminated


In error.log I get:

2012-06-13 12:34:41.710 [error] <0.1880.0> Supervisor riak_control_sup had child riak_control_session started with riak_control_session:start_link() at <0.1881.0> exit with reason killed in context shutdown_error
2012-06-13 12:35:24.556 [error] <0.729.0> gen_fsm <0.729.0> in state wait_pipeline_shutdown terminated with reason: {sink_died,shutdown}
2012-06-13 12:35:24.564 [error] <0.729.0> CRASH REPORT Process <0.729.0> with 0 neighbours crashed with reason: {sink_died,shutdown}
2012-06-13 12:35:24.572 [error] <0.193.0> Supervisor riak_pipe_builder_sup had child undefined started with {riak_pipe_builder,start_link,undefined} at <0.729.0> exit with reason {sink_died,shutdown} in context child_terminated
2012-06-13 12:35:24.573 [error] <0.724.0> gen_fsm <0.724.0> in state wait_pipeline_shutdown terminated with reason: {sink_died,normal}
2012-06-13 12:35:24.575 [error] <0.724.0> CRASH REPORT Process <0.724.0> with 2 neighbours crashed with reason: {sink_died,normal}
2012-06-13 12:35:24.577 [error] <0.193.0> Supervisor riak_pipe_builder_sup had child undefined started with {riak_pipe_builder,start_link,undefined} at <0.724.0> exit with reason {sink_died,normal} in context child_terminated
2012-06-13 12:35:24.578 [error] <0.195.0> Supervisor riak_pipe_qcover_sup had child undefined started with {riak_core_coverage_fsm,start_link,undefined} at <0.731.0> exit with reason {sink_died,normal} in context child_terminated
2012-06-13 12:36:25.529 [error] <0.159.0> ** Node 'riak at 192.168.1.204' not responding **
** Removing (timedout) connection **
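
For reference (this is a generic check, not output from my logs): which nodes the Erlang VM still considers connected can be inspected from the console of any surviving node, assuming riak attach is available there:

riak attach
nodes().

After the disconnect, the failed node should be missing from the list that nodes() returns.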


Some more info:
This is the state after disconnecting one node:
root at db1:/var/log/riak# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      18.8%      --      'riak at 192.168.1.201'
valid      18.8%      --      'riak at 192.168.1.202'
valid      18.8%      --      'riak at 192.168.1.203'
valid      25.0%      --      'riak at 192.168.1.204'
valid      18.8%      --      'riak at 192.168.1.205'
-------------------------------------------------------------------------------
Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
root at db1:/var/log/riak# riak-admin ring_status
Attempting to restart script through sudo -u riak
================================== Claimant ===================================
Claimant:  'riak at 192.168.1.201'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
No pending changes.

============================== Unreachable Nodes ==============================
The following nodes are unreachable: ['riak at 192.168.1.204']

WARNING: The cluster state will not converge until all nodes
are up. Once the above nodes come back online, convergence
will continue. If the outages are long-term or permanent, you
can either mark the nodes as down (riak-admin down NODE) or
forcibly remove the nodes from the cluster (riak-admin
force-remove NODE) to allow the remaining nodes to settle.
root at db1:/var/log/riak# 
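
As the warning suggests, and as I mentioned before, the only thing that brings the cluster back without reconnecting the failed node is to mark it down by hand from one of the surviving nodes, using the node name shown in member_status:

riak-admin down riak@192.168.1.204

Once that is done the remaining nodes settle and the cluster answers requests again.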







On Jun 12, 2012, at 11:10 PM, Gil Teixeira wrote:

> Anything… please do tell me which files I should send…
> 
> On Jun 12, 2012, at 10:55 PM, Mark Phillips wrote:
> 
>> Hi Gil,
>> 
>> Anything in the logs you can share?
>> 
>> Mark
>> 
>> 
>> On Jun 12, 2012, at 1:28, Gil Teixeira <gil at pixelbeam.org> wrote:
>> 
>>> Hi,
>>> 
>>> I am fairly new to Riak and am experiencing a very peculiar problem.
>>> We are evaluating Riak to see whether it fits our purposes, because on paper it seems to be a perfect fit.
>>> 
>>> To create an evaluation scenario I have set up 5 nodes under VMware (the network is bridged).
>>> 
>>> The cluster runs just fine: I can take nodes down for maintenance while the cluster keeps serving data, and node reboots are also tolerated well, as expected.
>>> 
>>> But if a node unexpectedly fails (hard power off or sudden network disconnection), the other nodes all become unresponsive and the full cluster is inaccessible until the failed node is back up or until I manually mark the failed node as down with riak-admin.
>>> 
>>> I was expecting a cluster (n=3) with 5 nodes to simply tolerate a single node failure transparently.
>>> 
>>> Is there something I may be doing wrong? Is this the expected behavior?
>>> 
>>> I would appreciate any light anyone could shine on this subject.
>>> 
>>> Thank you,
>>> Gil
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




