Bitcask node won't restart

Keith Dreibelbis kdreibel at gmail.com
Fri Apr 1 17:34:19 EDT 2011


Thanks for the response, Dan.  Yes, the problem is that it *looks* like
starting node 2 was successful (says it's ALIVE, shows up in ps).  But it's
not responding to pings, isn't usable, and nodes 1 and 3 say node 2 isn't
connected.
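
(For what it's worth, the checks I'm running are just the stock commands from
each dev directory, e.g.:

kratos:dev2 keith$ bin/riak ping              # node 2 never answers
kratos:dev1 keith$ bin/riak-admin ringready   # reports node 2 as down

so there's nothing exotic on my end.)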

As you suggested, here is the output of riak-admin status for the 3 nodes,
and I'll attach a tarball of node 2's log directory.

Keith

kratos:dev1 keith$ bin/riak-admin status
1-minute stats for 'dev1@127.0.0.1'
-------------------------------------------
vnode_gets : 0
vnode_puts : 0
read_repairs : 0
vnode_gets_total : 6251
vnode_puts_total : 1064
node_gets : 0
node_gets_total : 4786
node_get_fsm_time_mean : 0
node_get_fsm_time_median : 0
node_get_fsm_time_95 : 0
node_get_fsm_time_99 : 0
node_get_fsm_time_100 : 0
node_puts : 0
node_puts_total : 774
node_put_fsm_time_mean : 0
node_put_fsm_time_median : 0
node_put_fsm_time_95 : 0
node_put_fsm_time_99 : 0
node_put_fsm_time_100 : 0
read_repairs_total : 354
cpu_nprocs : 127
cpu_avg1 : 164
cpu_avg5 : 202
cpu_avg15 : 205
mem_total : 3264444000
mem_allocated : 3155680000
disk : [{"/",488050672,13}]
nodename : 'dev1@127.0.0.1'
connected_nodes : ['dev3@127.0.0.1']
sys_driver_version : <<"1.5">>
sys_global_heaps_size : 0
sys_heap_type : private
sys_logical_processors : 2
sys_otp_release : <<"R14B01">>
sys_process_count : 206
sys_smp_support : true
sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
[smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
sys_system_architecture : <<"i386-apple-darwin10.7.0">>
sys_threads_enabled : true
sys_thread_pool_size : 64
sys_wordsize : 8
ring_members : ['dev1@127.0.0.1','dev2@127.0.0.1','dev3@127.0.0.1']
ring_num_partitions : 64
ring_ownership : <<"[{'dev3@127.0.0.1',21},{'dev2@127.0.0.1',21},{'dev1@127.0.0.1',22}]">>
ring_creation_size : 64
storage_backend : riak_kv_bitcask_backend
pbc_connects_total : 350
pbc_connects : 0
pbc_active : 0
riak_err_version : <<"1.0.1">>
runtime_tools_version : <<"1.8.4.1">>
basho_stats_version : <<"1.0.1">>
luwak_version : <<"1.0.0">>
skerl_version : <<"1.0.0">>
riak_kv_version : <<"0.14.0">>
bitcask_version : <<"1.1.5">>
riak_core_version : <<"0.14.0">>
riak_sysmon_version : <<"0.9.0">>
luke_version : <<"0.2.3">>
erlang_js_version : <<"0.5.0">>
mochiweb_version : <<"1.7.1">>
webmachine_version : <<"1.8.0">>
crypto_version : <<"2.0.2">>
os_mon_version : <<"2.2.5">>
cluster_info_version : <<"1.1.0">>
sasl_version : <<"2.1.9.2">>
stdlib_version : <<"1.17.2">>
kernel_version : <<"2.14.2">>
executing_mappers : 0

kratos:dev2 keith$ bin/riak-admin status
Node is not running!

kratos:dev3 keith$ bin/riak-admin status
1-minute stats for 'dev3@127.0.0.1'
-------------------------------------------
vnode_gets : 0
vnode_puts : 0
read_repairs : 0
vnode_gets_total : 7061
vnode_puts_total : 1198
node_gets : 0
node_gets_total : 0
node_get_fsm_time_mean : 0
node_get_fsm_time_median : 0
node_get_fsm_time_95 : 0
node_get_fsm_time_99 : 0
node_get_fsm_time_100 : 0
node_puts : 0
node_puts_total : 0
node_put_fsm_time_mean : 0
node_put_fsm_time_median : 0
node_put_fsm_time_95 : 0
node_put_fsm_time_99 : 0
node_put_fsm_time_100 : 0
read_repairs_total : 0
cpu_nprocs : 134
cpu_avg1 : 118
cpu_avg5 : 161
cpu_avg15 : 184
mem_total : 3264252000
mem_allocated : 3189744000
disk : [{"/",488050672,13}]
nodename : 'dev3@127.0.0.1'
connected_nodes : ['dev1@127.0.0.1']
sys_driver_version : <<"1.5">>
sys_global_heaps_size : 0
sys_heap_type : private
sys_logical_processors : 2
sys_otp_release : <<"R14B01">>
sys_process_count : 205
sys_smp_support : true
sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
[smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
sys_system_architecture : <<"i386-apple-darwin10.7.0">>
sys_threads_enabled : true
sys_thread_pool_size : 64
sys_wordsize : 8
ring_members : ['dev1@127.0.0.1','dev2@127.0.0.1','dev3@127.0.0.1']
ring_num_partitions : 64
ring_ownership : <<"[{'dev3@127.0.0.1',21},{'dev2@127.0.0.1',21},{'dev1@127.0.0.1',22}]">>
ring_creation_size : 64
storage_backend : riak_kv_bitcask_backend
pbc_connects_total : 0
pbc_connects : 0
pbc_active : 0
riak_err_version : <<"1.0.1">>
runtime_tools_version : <<"1.8.4.1">>
basho_stats_version : <<"1.0.1">>
luwak_version : <<"1.0.0">>
skerl_version : <<"1.0.0">>
riak_kv_version : <<"0.14.0">>
bitcask_version : <<"1.1.5">>
riak_core_version : <<"0.14.0">>
riak_sysmon_version : <<"0.9.0">>
luke_version : <<"0.2.3">>
erlang_js_version : <<"0.5.0">>
mochiweb_version : <<"1.7.1">>
webmachine_version : <<"1.8.0">>
crypto_version : <<"2.0.2">>
os_mon_version : <<"2.2.5">>
cluster_info_version : <<"1.1.0">>
sasl_version : <<"2.1.9.2">>
stdlib_version : <<"1.17.2">>
kernel_version : <<"2.14.2">>
executing_mappers : 0



On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <dan at basho.com> wrote:

> Hi Keith,
>
> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
> indicates that an Erlang node was already running with that name; node 2 may
> have been running in the background without you realizing it.
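>
> (A quick way to confirm that is to ask epmd, which keeps track of the node
> names registered on the machine; for example:
>
>     epmd -names
>
> will list any dev1/dev2/dev3 names that are still registered, along with the
> ports their distribution listeners hold.)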
>
> The second error, which occurred when you chose a different name, was
> probably due to a port binding issue; that is, the ports node 2 tried to bind
> (handoff, web, pb) were already occupied. Again, node 2 may have already been
> running in the background.
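>
> (If you want to check the ports directly, lsof will show what is holding
> them; the exact numbers come from dev2's etc/app.config, so these are only
> examples:
>
>     lsof -n -i TCP:8092    # e.g. node 2's web port from app.config
>     lsof -n -i TCP:8082    # e.g. node 2's pb_port from app.config
>
> If a beam.smp process shows up there, that is the stale node 2.)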
>
> After rebooting the machine it looks like starting node 2 was successful.
> Regarding the ringready failure, can you run "riak-admin status" on all
> three nodes? Also, can you send in the log files for node 2 (the entire log
> directory would be great)?
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> dan at basho.com
>
>
> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <kdreibel at gmail.com> wrote:
>
>> Hi riak-users,
>>
>> I have a node in a cluster of 3 that failed and won't come back up.  This
>> is in a dev environment, so it's not like there's critical data on there.
>> However, rather than start over with a new install, I want to learn how to
>> recover from such a failure in production.  I figured there was enough
>> redundancy such that node 2 could recover with (at worst) a little help from
>> nodes 1 and 3.
>>
>> When I tried to restart/reboot (I tried both), this showed up in
>> erlang.log.1:
>>
>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>
>> Root: /Users/keith/src/riak/dev/dev2
>>
>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,456}],[]]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['dev2@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>
>> {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>
>> Crash dump was written to: erl_crash.dump
>>
>> Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>
>>
>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting the
>> node in console mode with a different name. This didn't help; it just
>> crashed again. I'm using bitcask (the default), while the example on that
>> page shows the kind of output InnoDB would return.
>>
>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak             -embedded
>> -config /Users/keith/src/riak/dev/dev2/etc/app.config             -args_file
>> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name
>> differentname@nohost
>> Root: /Users/keith/src/riak/dev/dev2
>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>> [async-threads:64] [hipe] [kernel-poll:true]
>>
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>     alarm_handler: {set,{system_memory_high_watermark,[]}}
>> ** Found 0 name clashes in code paths
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>     application: riak_core
>>     exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>>     type: permanent
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>     alarm_handler: {clear,system_memory_high_watermark}
>> {"Kernel pid
>> terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>
>> Crash dump was written to: erl_crash.dump
>> Kernel pid terminated (application_controller)
>> ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>> kratos:dev2 keith$
>>
>> After I rebooted my machine and started the three riak nodes again, node 2
>> is still not responding to pings, and "riak-admin ringready" from nodes 1
>> and 3 complains that node 2 is down. But in its log, node 2 says it's ALIVE.
>> Also, I can see processes for all 3 nodes in ps:
>>
>> kratos:~ keith$ ps auxww | grep riak
>> keith      360   0.2  3.4  2606932 143044 s006  Ss+  12:05PM   3:21.61
>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 --
>> -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith --
>> -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config
>> /Users/keith/src/riak/dev/dev1/etc/app.config -name dev1@127.0.0.1 -setcookie riak -- console
>> keith      580   0.1  2.0  2549924  85492 s008  Ss+  12:05PM   2:24.08
>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 --
>> -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith --
>> -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config
>> /Users/keith/src/riak/dev/dev3/etc/app.config -name dev3@127.0.0.1 -setcookie riak -- console
>> keith      380   0.0  0.0  2435004    268   ??  S    12:05PM   0:00.08
>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>> keith      358   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01
>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
>> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
>> exec /Users/keith/src/riak/dev/dev1/bin/riak console
>> keith     1633   0.0  0.0  2435548      0 s010  R+    1:34PM   0:00.00
>> grep riak
>> keith      578   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.00
>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
>> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
>> exec /Users/keith/src/riak/dev/dev3/bin/riak console
>> keith      470   0.0  2.0  2548688  83584 s007  Ss+  12:05PM   0:33.41
>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 --
>> -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith --
>> -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config
>> /Users/keith/src/riak/dev/dev2/etc/app.config -name dev2@127.0.0.1 -setcookie riak -- console
>> keith      468   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01
>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
>> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
>> exec /Users/keith/src/riak/dev/dev2/bin/riak console
>> kratos:~ keith$
>>
>> I've attached the erl_crash.dump file.  Anyone have an explanation or
>> suggestions on how to proceed?
>>
>>
>> Keith
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dev2-log.tar.gz
Type: application/x-gzip
Size: 15704 bytes
Desc: not available
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110401/dbade3fe/attachment.gz>

