Crashed node has Bitcask merge errors on restart

Jeff Pollard jeff.pollard at gmail.com
Fri Aug 5 09:03:49 EDT 2011


Hey David,

Thanks for the reply.  I'm in the process of downloading our backup data to
the node as we speak.  I'll restore the bitcask directory to that data and
boot the node, and will let you know how it goes.

Thanks again for your help.

On Fri, Aug 5, 2011 at 6:01 AM, David Smith <dizzyd at basho.com> wrote:

> Hi Jeff,
>
> I believe you are encountering BZ 1097 (http://issues.basho.com/1097),
> where a suddenly truncated bitcask file can cause problems when
> attempting to merge. The truncation is typically the result of
> underlying O/S or hardware failure and simply means that the last
> record in a bitcask file didn't get fully written. Generally, bitcask
> recovers from this by ignoring the last incomplete record, but there
> was a case in the merging (fixed by this bug report) where this didn't
> happen properly.
>
> So, you have a few options:
>
> 1. You can restore your last known good bitcask directory on this
> node. This is the easiest fix and the other Riak nodes will
> read-repair any out-of-date values as the data is accessed.
>
> 2. You can grab the latest bitcask source, build it, and drop that in
> place on the bjorked node (replacing the existing bitcask code). This
> is a bit more legwork (since compilation is involved), but should
> allow the node to recover without further intervention.
>
> Hope that helps,
>
> D.
>
> On Fri, Aug 5, 2011 at 2:12 AM, Jeff Pollard <jeff.pollard at gmail.com> wrote:
> > Hey All,
> > We had one of our Riak node servers crash, and when booted back up it's
> > now in this very inconsistent state where it responds to requests for a
> > while (a minute or two), then all requests time out for a little while,
> > then it goes back to responding to requests.  It's been ~90 minutes since
> > the crash and reboot of the server, and we're still in this bad state.
> > We use the bitcask data store, and looking through the logs I see a lot
> > of merge failures in the sasl-error.log file.  See this gist for the
> > tail -n 2000 of the sasl-error.log.  The interesting bit is mostly at the
> > bottom:
> > https://gist.github.com/1127104
> >
> > I'm not really sure how to proceed and would love some help on the
> > matter.  For the time being we have this node pulled out of our load
> > balancer and the rest of the nodes see this node as down, so we're still
> > functional in production, but I'd obviously like to fix this up ASAP.
> > One final thing to note is that we have backups of the entire Riak data
> > directory from before the crash, which we could restore from if that
> > helps.
>
> --
> Dave Smith
> Director, Engineering
> Basho Technologies, Inc.
> dizzyd at basho.com
>
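
The recovery behaviour David describes above (a crash leaves the last record in a file only partly written, and the reader ignores that incomplete record) can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (a 4-byte CRC32, a 4-byte key size, a 4-byte value size, then key and value) is a simplified, hypothetical one rather than Bitcask's actual on-disk format, and encode_record and read_records are made-up names, not Bitcask functions.

```python
import struct
import zlib

# Hypothetical record layout (NOT Bitcask's real on-disk format):
#   crc32 (4 bytes) | key size (4 bytes) | value size (4 bytes) | key | value
HEADER = struct.Struct(">III")


def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one record, with a CRC32 over the sizes, key and value."""
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


def read_records(data: bytes):
    """Return the complete records in data, ignoring a truncated or corrupt tail."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        crc, key_size, value_size = HEADER.unpack_from(data, offset)
        end = offset + HEADER.size + key_size + value_size
        if end > len(data):
            break  # final record was only partially written: stop here
        body = data[offset + 4:end]  # everything the CRC covers
        if zlib.crc32(body) != crc:
            break  # checksum mismatch: treat the rest of the file as unusable
        key = body[8:8 + key_size]
        value = body[8 + key_size:]
        records.append((key, value))
        offset = end
    return records
```

With two records written back to back and the final byte chopped off, read_records returns only the first record instead of raising an error, which is the "ignore the last incomplete record" behaviour in miniature.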