Crashed node has Bitcask merge errors on restart

Jeff Pollard jeff.pollard at gmail.com
Fri Aug 5 08:49:36 EDT 2011


Update: the node has now crashed, with the following errors in the
sasl-error.log (see below).  I've also attached the crash dump to this
email.

Real quickly though, just to confirm: if we wanted to restore the node from
a recent backup, is the procedure as simple as:

   1. Stop the node.
   2. Restore the bitcask and ring directories from a recent backup (~12
   hours old) to the node.
   3. Start the node.

That correct?  Any gotchas or anything else I should know about that
process?
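
The three steps above can be sketched as a small script. This is only an
illustration under stated assumptions, not a confirmed restore procedure:
the function name `restore_node`, the `.crashed` suffix, and the example
paths are hypothetical, and the data directory location depends on how Riak
was installed. `riak stop` and `riak start` are the standard control
commands. The sketch moves the damaged directories aside instead of deleting
them, so they remain available for inspection.

```python
import shutil
import subprocess
from pathlib import Path


def restore_node(backup_root, data_root, run=subprocess.check_call,
                 subdirs=("bitcask", "ring")):
    """Stop the node, swap in the backed-up directories, start the node.

    backup_root and data_root are assumptions: point them at wherever the
    backup and the live Riak data directory actually are.
    """
    backup_root, data_root = Path(backup_root), Path(data_root)

    # Check the backup is complete before touching the running node.
    for name in subdirs:
        if not (backup_root / name).is_dir():
            raise FileNotFoundError(f"backup is missing the {name!r} directory")

    run(["riak", "stop"])                      # step 1: stop the node

    for name in subdirs:                       # step 2: restore directories
        live = data_root / name
        if live.exists():
            aside = data_root / (name + ".crashed")   # hypothetical suffix
            if aside.exists():
                shutil.rmtree(aside)
            live.rename(aside)                 # keep the damaged copy around
        shutil.copytree(backup_root / name, live)

    run(["riak", "start"])                     # step 3: start the node


# Example call (both paths are assumptions, adjust to the actual layout):
# restore_node("/backups/riak-latest", "/var/lib/riak")
```

Passing a different `run` callable makes it possible to dry-run the sequence
without actually stopping or starting anything.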

=SUPERVISOR REPORT==== 5-Aug-2011::03:01:20 ===
     Supervisor: {local,riak_kv_sup}
     Context:    child_terminated
     Reason:
{{badmatch,{error,{{badmatch,{error,emfile}},[{bitcask,scan_key_files,3},{bitcask,init_keydir,2},{bitcask,open,2},{riak_kv_bitcask_backend,start,2},{riak_kv_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}}},[{riak_core_vnode_master,get_vnode,2},{riak_core_vnode_master,handle_cast,2},{gen_server,handle_msg,5},{proc_lib,init_p_do_apply,3}]}
     Offender:
[{pid,<0.32266.2>},{name,riak_kv_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_kv_vnode,riak_kv_legacy_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]


=SUPERVISOR REPORT==== 5-Aug-2011::03:01:20 ===
     Supervisor: {local,riak_kv_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:
[{pid,<0.32266.2>},{name,riak_kv_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_kv_vnode,riak_kv_legacy_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
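
The `emfile` atom in the first report corresponds to the POSIX EMFILE error:
the process ran out of file descriptors, here while Bitcask was scanning its
key files. As a rough check, the sketch below compares the open-file limit
with the number of files under the Bitcask directory. The function name
`fd_headroom` and the example path are hypothetical, and the limit reported
is that of the Python process itself, so it should be run as the same user
and from the same shell that starts Riak. Bitcask need not hold every file
open at once, so treat the comparison as an indicator only.

```python
import os
import resource


def fd_headroom(bitcask_dir):
    """Compare the open-file limit with the file count under bitcask_dir."""
    # Limits of the current process; Riak inherits the limits of whatever
    # shell or init script launches it, which may differ.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Count every regular file in all partition subdirectories.
    n_files = sum(len(files) for _, _, files in os.walk(bitcask_dir))

    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "bitcask_files": n_files,
        "over_soft_limit": (not unlimited) and n_files >= soft,
    }


# Example call (the path is an assumption, adjust to the actual data dir):
# print(fd_headroom("/var/lib/riak/bitcask"))
```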

On Fri, Aug 5, 2011 at 1:12 AM, Jeff Pollard <jeff.pollard at gmail.com> wrote:

> Hey All,
>
> We had one of our Riak node servers crash, and since being booted back up
> it has been in a very inconsistent state: it responds to requests for a
> while (a minute or two), then all requests time out for a little while,
> then it goes back to responding to requests.  It's been ~90 minutes since
> the crash and reboot of the server, and we're still in this bad state.
>
> We use the Bitcask data store, and looking through the logs I see a lot of
> merge failures in the sasl-error.log file.  See this gist for the output of
> tail -n 2000 on the sasl-error.log.  The interesting bit is mostly at the
> bottom:
>
> https://gist.github.com/1127104
>
> I'm not really sure how to proceed and would love some help on the matter.
> For the time being we have this node pulled out of our load balancer and
> the rest of the nodes see this node as down, so we're still functional in
> production, but I'd obviously like to fix this up ASAP.
>
> One final thing to note is that we have backups of the entire Riak data
> directory from before the crash, which we could restore from if that helps.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: erl_crash.dump
Type: application/octet-stream
Size: 671162 bytes
Desc: not available
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110805/021df833/attachment.dump>

