Bitcask node won't restart

Keith Dreibelbis kdreibel at gmail.com
Fri Apr 1 20:16:50 EDT 2011


Hi Dan,

It seems I have to say "never mind, it fixed itself".  I killed it and ran
the console, like you suggested, and after it output some messages about
handoffs and merges, I did the commands you mentioned:

(dev2@127.0.0.1)1> node().
'dev2@127.0.0.1'
(dev2@127.0.0.1)2> erlang:get_cookie().
riak
(dev2@127.0.0.1)3> q().

and then "riak start" and the node is now happily back in the ring.  What's
surprised me was that "riak restart" and "riak reboot" didn't seem to do
anything in this situation.  It just got into an unresponsive state, and the
process had to be killed to fix it.  But perhaps this is the normal thing to
do for an unresponsive node?  Anyway, thanks for the help, my problem is
resolved.
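
For reference, the full sequence that recovered the node, as a minimal
sketch (paths assume the dev/dev2 layout from this thread, and the pid
lookup line is illustrative):

kratos:~ keith$ ps auxww | grep 'dev2.*beam.smp'  # find node 2's OS pid
kratos:~ keith$ kill <pid>     # plain TERM first; -9 only if it's ignored
kratos:dev2 keith$ bin/riak console
(dev2@127.0.0.1)1> node().                % verify the node name
(dev2@127.0.0.1)2> erlang:get_cookie().   % verify the cluster cookie
(dev2@127.0.0.1)3> q().                   % clean shutdown
kratos:dev2 keith$ bin/riak start         # rejoin the ring as a daemon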


Keith


On Fri, Apr 1, 2011 at 4:41 PM, Dan Reverri <dan at basho.com> wrote:

> Hi Keith,
>
> Can you try attaching to node 2 using "riak attach"? If that doesn't work,
> manually kill node 2 and run "riak console".
>
> Once you have access to the console, type the following:
> 1> node().
> % the console will output the node name here
>
> 2> erlang:get_cookie().
> % the console will output the cookie here
>
> Let me know what those commands output.
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> dan at basho.com
>
>
> On Fri, Apr 1, 2011 at 2:34 PM, Keith Dreibelbis <kdreibel at gmail.com> wrote:
>
>> Thanks for the response, Dan.  Yes, the problem is that it *looks* like
>> starting node 2 was successful (says it's ALIVE, shows up in ps).  But it's
>> not responding to pings, isn't usable, and nodes 1 and 3 say node 2 isn't
>> connected.
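>>
>> As a quick sketch, these were the checks behind those symptoms (output
>> abbreviated, and the exact wording is from memory, so treat it as
>> approximate):
>>
>> kratos:dev2 keith$ bin/riak ping
>> Node 'dev2@127.0.0.1' not responding to pings.
>> kratos:dev1 keith$ bin/riak-admin ringready
>> FALSE ['dev2@127.0.0.1'] down.  All nodes need to be up to check consistency.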
>>
>> As you suggested, here is the output of riak-admin status for the 3 nodes,
>> and I'll attach a tarball for node 2's log directory.
>>
>> Keith
>>
>> kratos:dev1 keith$ bin/riak-admin status
>> 1-minute stats for 'dev1@127.0.0.1'
>> -------------------------------------------
>> vnode_gets : 0
>> vnode_puts : 0
>> read_repairs : 0
>> vnode_gets_total : 6251
>> vnode_puts_total : 1064
>> node_gets : 0
>> node_gets_total : 4786
>> node_get_fsm_time_mean : 0
>> node_get_fsm_time_median : 0
>> node_get_fsm_time_95 : 0
>> node_get_fsm_time_99 : 0
>> node_get_fsm_time_100 : 0
>> node_puts : 0
>> node_puts_total : 774
>> node_put_fsm_time_mean : 0
>> node_put_fsm_time_median : 0
>> node_put_fsm_time_95 : 0
>> node_put_fsm_time_99 : 0
>> node_put_fsm_time_100 : 0
>> read_repairs_total : 354
>> cpu_nprocs : 127
>> cpu_avg1 : 164
>> cpu_avg5 : 202
>> cpu_avg15 : 205
>> mem_total : 3264444000
>> mem_allocated : 3155680000
>> disk : [{"/",488050672,13}]
>> nodename : 'dev1@127.0.0.1'
>> connected_nodes : ['dev3@127.0.0.1']
>> sys_driver_version : <<"1.5">>
>> sys_global_heaps_size : 0
>> sys_heap_type : private
>> sys_logical_processors : 2
>> sys_otp_release : <<"R14B01">>
>> sys_process_count : 206
>> sys_smp_support : true
>> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
>> [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
>> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
>> sys_threads_enabled : true
>> sys_thread_pool_size : 64
>> sys_wordsize : 8
>> ring_members : ['dev1@127.0.0.1','dev2@127.0.0.1','dev3@127.0.0.1']
>> ring_num_partitions : 64
>> ring_ownership : <<"[{'dev3@127.0.0.1',21},{'dev2@127.0.0.1',21},{'dev1@127.0.0.1',22}]">>
>> ring_creation_size : 64
>> storage_backend : riak_kv_bitcask_backend
>> pbc_connects_total : 350
>> pbc_connects : 0
>> pbc_active : 0
>> riak_err_version : <<"1.0.1">>
>> runtime_tools_version : <<"1.8.4.1">>
>> basho_stats_version : <<"1.0.1">>
>> luwak_version : <<"1.0.0">>
>> skerl_version : <<"1.0.0">>
>> riak_kv_version : <<"0.14.0">>
>> bitcask_version : <<"1.1.5">>
>> riak_core_version : <<"0.14.0">>
>> riak_sysmon_version : <<"0.9.0">>
>> luke_version : <<"0.2.3">>
>> erlang_js_version : <<"0.5.0">>
>> mochiweb_version : <<"1.7.1">>
>> webmachine_version : <<"1.8.0">>
>> crypto_version : <<"2.0.2">>
>> os_mon_version : <<"2.2.5">>
>> cluster_info_version : <<"1.1.0">>
>> sasl_version : <<"2.1.9.2">>
>> stdlib_version : <<"1.17.2">>
>> kernel_version : <<"2.14.2">>
>> executing_mappers : 0
>>
>> kratos:dev2 keith$ bin/riak-admin status
>> Node is not running!
>>
>> kratos:dev3 keith$ bin/riak-admin status
>> 1-minute stats for 'dev3@127.0.0.1'
>> -------------------------------------------
>> vnode_gets : 0
>> vnode_puts : 0
>> read_repairs : 0
>> vnode_gets_total : 7061
>> vnode_puts_total : 1198
>> node_gets : 0
>> node_gets_total : 0
>> node_get_fsm_time_mean : 0
>> node_get_fsm_time_median : 0
>> node_get_fsm_time_95 : 0
>> node_get_fsm_time_99 : 0
>> node_get_fsm_time_100 : 0
>> node_puts : 0
>> node_puts_total : 0
>> node_put_fsm_time_mean : 0
>> node_put_fsm_time_median : 0
>> node_put_fsm_time_95 : 0
>> node_put_fsm_time_99 : 0
>> node_put_fsm_time_100 : 0
>> read_repairs_total : 0
>> cpu_nprocs : 134
>> cpu_avg1 : 118
>> cpu_avg5 : 161
>> cpu_avg15 : 184
>> mem_total : 3264252000
>> mem_allocated : 3189744000
>> disk : [{"/",488050672,13}]
>> nodename : 'dev3@127.0.0.1'
>> connected_nodes : ['dev1@127.0.0.1']
>> sys_driver_version : <<"1.5">>
>> sys_global_heaps_size : 0
>> sys_heap_type : private
>> sys_logical_processors : 2
>> sys_otp_release : <<"R14B01">>
>> sys_process_count : 205
>> sys_smp_support : true
>> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
>> [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
>> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
>> sys_threads_enabled : true
>> sys_thread_pool_size : 64
>> sys_wordsize : 8
>> ring_members : ['dev1@127.0.0.1','dev2@127.0.0.1','dev3@127.0.0.1']
>> ring_num_partitions : 64
>> ring_ownership : <<"[{'dev3@127.0.0.1',21},{'dev2@127.0.0.1',21},{'dev1@127.0.0.1',22}]">>
>> ring_creation_size : 64
>> storage_backend : riak_kv_bitcask_backend
>> pbc_connects_total : 0
>> pbc_connects : 0
>> pbc_active : 0
>> riak_err_version : <<"1.0.1">>
>> runtime_tools_version : <<"1.8.4.1">>
>> basho_stats_version : <<"1.0.1">>
>> luwak_version : <<"1.0.0">>
>> skerl_version : <<"1.0.0">>
>> riak_kv_version : <<"0.14.0">>
>> bitcask_version : <<"1.1.5">>
>> riak_core_version : <<"0.14.0">>
>> riak_sysmon_version : <<"0.9.0">>
>> luke_version : <<"0.2.3">>
>> erlang_js_version : <<"0.5.0">>
>> mochiweb_version : <<"1.7.1">>
>> webmachine_version : <<"1.8.0">>
>> crypto_version : <<"2.0.2">>
>> os_mon_version : <<"2.2.5">>
>> cluster_info_version : <<"1.1.0">>
>> sasl_version : <<"2.1.9.2">>
>> stdlib_version : <<"1.17.2">>
>> kernel_version : <<"2.14.2">>
>> executing_mappers : 0
>>
>>
>>
>> On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <dan at basho.com> wrote:
>>
>>> Hi Keith,
>>>
>>> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
>>> indicates that an Erlang node was already running with this name; node 2
>>> may have been running in the background without you realizing it.
>>>
>>> The second error, which occurred when choosing a different name, was
>>> probably due to a port-binding issue; this means the ports node 2 tried
>>> to bind to (handoff, web, pb) were already occupied. Again, node 2 may
>>> have already been running in the background.
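>>>
>>> A quick way to confirm either case, as a rough sketch (the web, pb, and
>>> handoff ports are whatever dev2's app.config sets, so check there first;
>>> 8099 below is only the stock handoff_port default, not the dev value):
>>>
>>> epmd -names                       # a leftover "dev2" entry means the
>>>                                   # old node is still registered
>>> ps auxww | grep 'dev2.*beam.smp'  # look for a surviving beam.smp process
>>> lsof -i :8099                     # which process holds a given port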
>>>
>>> After rebooting the machine, it looks like starting node 2 was successful.
>>> Regarding the ringready failure, can you run "riak-admin status" on all
>>> three nodes? Also, can you send in the log files for node 2 (the entire log
>>> directory would be great)?
>>>
>>> Thanks,
>>> Dan
>>>
>>> Daniel Reverri
>>> Developer Advocate
>>> Basho Technologies, Inc.
>>> dan at basho.com
>>>
>>>
>>> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <kdreibel at gmail.com> wrote:
>>>
>>>> Hi riak-users,
>>>>
>>>> I have a node in a cluster of 3 that failed and won't come back up.
>>>>  This is in a dev environment, so it's not like there's critical data on
>>>> there. However, rather than start over with a new install, I want to learn
>>>> how to recover from such a failure in production.  I figured there was
>>>> enough redundancy such that node 2 could recover with (at worst) a little
>>>> help from nodes 1 and 3.
>>>>
>>>> When I tried to restart/reboot (I tried both), this showed up in
>>>> erlang.log.1:
>>>>
>>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>>> -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>>>
>>>> Root: /Users/keith/src/riak/dev/dev2
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,456}],[]]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['dev2@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>>>
>>>> {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>>>
>>>> Crash dump was written to: erl_crash.dump
>>>>
>>>> Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>>>
>>>>
>>>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting
>>>> the node in console mode with a different name.  This didn't help; it
>>>> just crashed again.  (I'm using bitcask, the default, while the example
>>>> on that page shows InnoDB-style output.)
>>>>
>>>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file
>>>> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name
>>>> differentname@nohost
>>>> Root: /Users/keith/src/riak/dev/dev2
>>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>>>> [async-threads:64] [hipe] [kernel-poll:true]
>>>>
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>>     alarm_handler: {set,{system_memory_high_watermark,[]}}
>>>> ** Found 0 name clashes in code paths
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>>     application: riak_core
>>>>     exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>>>>     type: permanent
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>>     alarm_handler: {clear,system_memory_high_watermark}
>>>> {"Kernel pid
>>>> terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>>>
>>>> Crash dump was written to: erl_crash.dump
>>>> Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>>>> kratos:dev2 keith$
>>>>
>>>> After I rebooted my machine and tried starting the trio of riak nodes,
>>>> node 2 is again not responding to pings, and "riak-admin ringready" from
>>>> nodes 1 and 3 complains that node 2 is down.  But in its log, node 2
>>>> says it's ALIVE.  Also, I can see processes for all 3 nodes in ps:
>>>>
>>>> kratos:~ keith$ ps auxww | grep riak
>>>> keith      360   0.2  3.4  2606932 143044 s006  Ss+  12:05PM   3:21.61
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>>> -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith --
>>>> -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config
>>>> /Users/keith/src/riak/dev/dev1/etc/app.config -name dev1@127.0.0.1 -setcookie riak -- console
>>>> keith      580   0.1  2.0  2549924  85492 s008  Ss+  12:05PM   2:24.08
>>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>>> -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith --
>>>> -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config
>>>> /Users/keith/src/riak/dev/dev3/etc/app.config -name dev3@127.0.0.1 -setcookie riak -- console
>>>> keith      380   0.0  0.0  2435004    268   ??  S    12:05PM   0:00.08
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>>>> keith      358   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
>>>> exec /Users/keith/src/riak/dev/dev1/bin/riak console
>>>> keith     1633   0.0  0.0  2435548      0 s010  R+    1:34PM   0:00.00
>>>> grep riak
>>>> keith      578   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.00
>>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
>>>> exec /Users/keith/src/riak/dev/dev3/bin/riak console
>>>> keith      470   0.0  2.0  2548688  83584 s007  Ss+  12:05PM   0:33.41
>>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>>> -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith --
>>>> -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config
>>>> /Users/keith/src/riak/dev/dev2/etc/app.config -name dev2@127.0.0.1 -setcookie riak -- console
>>>> keith      468   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01
>>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
>>>> exec /Users/keith/src/riak/dev/dev2/bin/riak console
>>>> kratos:~ keith$
>>>>
>>>> I've attached the erl_crash.dump file.  Anyone have an explanation or
>>>> suggestions on how to proceed?
>>>>
>>>>
>>>> Keith
>>>>
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users at lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>
>>
>