Bitcask node won't restart

Keith Dreibelbis kdreibel at gmail.com
Fri Apr 1 16:57:45 EDT 2011


Hi riak-users,

I have a node in a cluster of 3 that failed and won't come back up.  This is
in a dev environment, so it's not like there's critical data on there.
However, rather than start over with a new install, I want to learn how to
recover from such a failure in production.  I figured there was enough
redundancy such that node 2 could recover with (at worst) a little help from
nodes 1 and 3.

When I tried to restart/reboot (I tried both), this showed up in
erlang.log.1:

Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak             -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config             -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console

Root: /Users/keith/src/riak/dev/dev2

{error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}

{error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,456}],[]]}

{error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['dev2@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}

{error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}

{error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}

{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump

Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
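
Because the reports in a log like the one above are long and easy to misread, a small filter can pull out just the {error,Reason} tuples. This is a minimal sketch, not part of Riak: error_atoms is a hypothetical helper name, and it assumes each report sits on a single line (a tuple split across a line wrap will be missed). In this log it would surface {error,duplicate_name}, which the Erlang distribution layer typically reports when epmd already has a node registered under the requested name.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct {error,Reason} tuple found in
# error_logger text read from stdin. Relies on `grep -o`, which is available
# in GNU and BSD grep but is not required by POSIX.
error_atoms() {
    # "{error," must be followed by a lowercase atom and a closing brace,
    # so terms such as {error_logger,...} or {error_info,...} do not match.
    grep -o '{error,[a-z_]*}' | sort -u
}
```

Assuming the log lives at log/erlang.log.1 under the node directory, it could be run as: error_atoms < log/erlang.log.1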


http://wiki.basho.com/Recovering-a-failed-node.html suggests starting the
node in console mode with a different name.  This didn't help; it just
crashed again.  I'm using Bitcask (the default backend), while the example
on that page shows output that looks like what InnoDB would return.

kratos:dev2 keith$ bin/riak console -name differentname@nohost
Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak             -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config             -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name differentname@nohost
Root: /Users/keith/src/riak/dev/dev2
Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]


=INFO REPORT==== 31-Mar-2011::17:35:05 ===
    alarm_handler: {set,{system_memory_high_watermark,[]}}
** Found 0 name clashes in code paths

=INFO REPORT==== 31-Mar-2011::17:35:05 ===
    application: riak_core
    exited: {shutdown,{riak_core_app,start,[normal,[]]}}
    type: permanent

=INFO REPORT==== 31-Mar-2011::17:35:05 ===
    alarm_handler: {clear,system_memory_high_watermark}
{"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
kratos:dev2 keith$

After I rebooted my machine and tried starting the trio of riak nodes, node 2
is again not responding to pings, and "riak-admin ringready" from nodes 1
and 3 complains that node 2 is down.  But in the log, node 2 is saying it's
ALIVE.  Also, I can see processes for all 3 nodes in ps:

kratos:~ keith$ ps auxww | grep riak
keith      360   0.2  3.4  2606932 143044 s006  Ss+  12:05PM   3:21.61 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev1/etc/app.config -name dev1@127.0.0.1 -setcookie riak -- console
keith      580   0.1  2.0  2549924  85492 s008  Ss+  12:05PM   2:24.08 /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev3/etc/app.config -name dev3@127.0.0.1 -setcookie riak -- console
keith      380   0.0  0.0  2435004    268   ??  S    12:05PM   0:00.08 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
keith      358   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log exec /Users/keith/src/riak/dev/dev1/bin/riak console
keith     1633   0.0  0.0  2435548      0 s010  R+    1:34PM   0:00.00 grep riak
keith      578   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.00 /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log exec /Users/keith/src/riak/dev/dev3/bin/riak console
keith      470   0.0  2.0  2548688  83584 s007  Ss+  12:05PM   0:33.41 /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config -name dev2@127.0.0.1 -setcookie riak -- console
keith      468   0.0  0.0  2434988    264   ??  S    12:05PM   0:00.01 /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log exec /Users/keith/src/riak/dev/dev2/bin/riak console
kratos:~ keith$
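
The ps check above can be condensed into a small filter that names only the nodes with a live Erlang VM. This is a rough sketch under stated assumptions: running_nodes is a hypothetical helper name, and the pattern assumes the dev/devN directory layout visible in the ps output.

```shell
#!/bin/sh
# Hypothetical helper: read `ps auxww` style text on stdin and print the
# devN directory of every node that has a beam.smp (Erlang VM) process.
# epmd, run_erl and grep lines are ignored because they do not match.
# Relies on `grep -o` (GNU and BSD grep; not required by POSIX).
running_nodes() {
    grep -o 'dev/dev[0-9]*/erts-[^/]*/bin/beam\.smp' |
        sed 's|^dev/\(dev[0-9]*\)/.*|\1|' |
        sort -u
}
```

Run as "ps auxww | running_nodes", this would list dev1, dev2 and dev3 for the output above. Comparing that list with "bin/riak ping" in each node directory and with "epmd -names" (which prints the node names currently registered with epmd) may help show whether dev2 is running but unreachable.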

I've attached the erl_crash.dump file.  Anyone have an explanation or
suggestions on how to proceed?


Keith
-------------- next part --------------
A non-text attachment was scrubbed...
Name: erl_crash.dump.gz
Type: application/x-gzip
Size: 71019 bytes
Desc: not available
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110401/373bc121/attachment.gz>