Crashed node has Bitcask merge errors on restart

David Smith dizzyd at basho.com
Fri Aug 5 09:01:53 EDT 2011


Hi Jeff,

I believe you are encountering BZ 1097 (http://issues.basho.com/1097),
where a suddenly truncated bitcask file can cause problems when
attempting to merge. The truncation is typically the result of
underlying O/S or hardware failure and simply means that the last
record in a bitcask file didn't get fully written. Generally, bitcask
recovers from this by ignoring the last incomplete record, but there
was a case in the merging (fixed by this bug report) where this didn't
happen properly.
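
To illustrate the recovery behavior described above, here is a minimal
Python sketch of scanning an append-only file and ignoring an incomplete
final record. The record layout, `encode_record`, and `scan_records` are
hypothetical and simplified for illustration; this is not bitcask's actual
on-disk format or code.

```python
import struct
import zlib

# Hypothetical, simplified record layout (NOT bitcask's real format):
#   4-byte CRC32 of the rest | 2-byte key size | 4-byte value size | key | value
HEADER = struct.Struct(">IHI")


def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one record with a CRC32 covering sizes, key, and value."""
    body = struct.pack(">HI", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


def scan_records(data: bytes):
    """Return (records, valid_length).

    Scanning stops at the first record that is incomplete (file truncated
    mid-write) or whose CRC does not match; everything before it is kept.
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        crc, key_size, value_size = HEADER.unpack_from(data, offset)
        end = offset + HEADER.size + key_size + value_size
        if end > len(data):
            break  # last record was not fully written: ignore it
        if zlib.crc32(data[offset + 4:end]) != crc:
            break  # corrupt record: stop rather than trust later bytes
        key_start = offset + HEADER.size
        key = data[key_start:key_start + key_size]
        value = data[key_start + key_size:end]
        records.append((key, value))
        offset = end
    return records, offset


# Usage: two complete records followed by a third cut off mid-write.
log = encode_record(b"a", b"1") + encode_record(b"b", b"2")
truncated = log + encode_record(b"c", b"3")[:-2]
records, valid_length = scan_records(truncated)
print(records, valid_length == len(log))
```

The point of the sketch is that a reader, or a merge, must apply the same
"stop at the incomplete tail" rule; a code path that skips that check is
the kind of case where a truncated file would cause errors.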

So, you have a few options:

1. You can restore your last known good bitcask directory on this
node. This is the easiest fix and the other Riak nodes will
read-repair any out-of-date values as the data is accessed.

2. You can grab the latest bitcask source, build it, and drop that in
place on the bjorked node, replacing the existing bitcask code. This
is a bit more legwork (since compilation is involved), but should
allow the node to recover without further intervention.

Hope that helps,

D.

On Fri, Aug 5, 2011 at 2:12 AM, Jeff Pollard <jeff.pollard at gmail.com> wrote:
> Hey All,
> We had one of our Riak node servers crash, and since being booted back up
> it's been in a very inconsistent state: it responds to requests for a while
> (a minute or two), then all requests time out for a little while, then it
> goes back to responding to requests.  It's been ~90 minutes since the crash
> and reboot of the server, and we're still in this bad state.
> We use the bitcask data store, and looking through the logs I see a lot of
> merge failures in the sasl-error.log file.  See this gist for the tail -n
> 2000 of the sasl-error.log.  The interesting bit is mostly at the bottom:
> https://gist.github.com/1127104
>
> I'm not really sure how to proceed and would love some help on the matter.
> For the time being we have this node pulled out of our load balancer and
> the rest of the nodes see this node as down, so we're still functional in
> production, but I'd obviously like to fix this up ASAP.
> One final thing to note is that we have backups of the entire Riak data
> directory from before the crash, which we could restore from if that helps.



-- 
Dave Smith
Director, Engineering
Basho Technologies, Inc.
dizzyd at basho.com



