Problems with bitcask, file merge errors, too many 0 byte files

Ryan Zezeski rzezeski at basho.com
Tue May 29 23:47:10 EDT 2012


Jacob,

I only glanced at this but have some comments inline.

On Tue, May 29, 2012 at 8:44 PM, Jacob Chapel <jacob.chapel at gmail.com> wrote:

> Our Riak server which is running 1.0.2 at the moment using bitcask backend
> and search is crashing often and when restarted will crash again
> immediately due to system_limit error.
>
> 2012-05-29 19:28:54.808 [error] <0.1001.0>@riak_kv_vnode:init:245 Failed
> to start riak_kv_bitcask_backend Reason:
> {{badmatch,{error,system_limit}},[{bitcask,scan_key_files,3},{bitcask,init_keydir,2},{bitcask,open,2},{riak_kv_bitcask_backend,start,2},{riak_kv_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}
>

I imagine the system_limit error comes from the 20k 0 byte files pushing
you past the open file limit.  That is, the files are not the cause but a
symptom of the underlying problem.
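
To see how close a node is to that limit, something like the following
rough sketch could count the 0 byte files under each bitcask partition
directory and print the open file limit next to them.  The data path in
the usage comment is an assumption (adjust it to your install), and the
limit reported is the one for the process running the script, which is
not necessarily the limit the Riak process was started with.

```python
import os
import resource


def count_zero_byte_files(bitcask_root):
    """Return {directory: number of 0 byte files} for every directory
    under bitcask_root that contains at least one empty file."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(bitcask_root):
        empty = sum(
            1
            for name in filenames
            if os.path.getsize(os.path.join(dirpath, name)) == 0
        )
        if empty:
            counts[dirpath] = empty
    return counts


def open_file_soft_limit():
    """Soft open file limit (ulimit -n) of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# Example usage (the path is an assumption, not taken from this thread):
#   counts = count_zero_byte_files("/var/lib/riak/bitcask")
#   print(sum(counts.values()), "empty files; limit", open_file_soft_limit())
```

If the total is anywhere near the limit, the node will likely fail to
start the same way again.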


>
> Before, we were getting emfile errors, so we upped the ulimit for open
> files which helped. Soon after (about a week) it crashed again but due to
> the above error. After looking into it and asking on IRC, there wasn't much
> information but looked to be due to a ton of 0 (zero) byte files in the
> bitcask folder. In fact when counting, at first there were over 20k 0 byte
> files. Someone who had similar issues was instructed to delete them, and
> common sense says that 0 byte files don't hold any data. So I backed them
> up (just in case) and removed them from the bitcask folder. That allowed
> the server to startup and run again.
>
>
The fact that you have many 0 byte files is an indication that bitcask is
crashing a lot.  You probably haven't noticed because Riak kept working
until now, although a look at your logs would likely have shown lots of
vnode/bitcask crashes.
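
A quick way to check is to tally the [error] lines in the log by where
they were raised.  This is only a sketch: it assumes every log line
follows the same layout as the error line quoted above (timestamp,
[level], pid, optional @module:function:line), and count_errors is a
made-up helper name, not part of Riak.

```python
import re
from collections import Counter

# Layout assumed from the sample line quoted above, e.g.
# 2012-05-29 19:28:54.808 [error] <0.1001.0>@riak_kv_vnode:init:245 Failed ...
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} "
    r"\[(\w+)\] <[\d.]+>(?:@([\w.]+):([\w.]+):\d+)? "
)


def count_errors(lines):
    """Count [error] lines per 'module:function'; lines that carry no
    location are counted under 'unknown'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group(1) != "error":
            continue
        module, function = match.group(2), match.group(3)
        counts[f"{module}:{function}" if module else "unknown"] += 1
    return counts


# Example usage (the log file name is an assumption):
#   with open("console.log") as log:
#       for where, n in count_errors(log).most_common(10):
#           print(n, where)
```

A large count for riak_kv_vnode:init or anything bitcask-related would
match the crash loop described here.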


> Fast forward a few days to now, it appears to have crashed due to the same
> issue. Having over 20k new 0 byte files. I asked on IRC again but not much
> could be helped since they didn't know. I backed up and removed the files
> again and it runs. How can I prevent these files?
>
> Also, as this sample log output shows:
> https://gist.github.com/a2e3c473e1d582bd87a2
>

This, I believe, is the heart of your problem.  It looks like you have
corrupted data files.  IIRC bitcask merging is still susceptible to
crashing when dealing with corrupted data files.  So every time a merge is
triggered, the bitcask instance crashes and restarts, causing a new 0 byte
file to be created.  Once this happens enough times, you have enough of
these files to reach the system limit.

I just checked, and I'm pretty sure this is the bug you are running into:

https://issues.basho.com/show_bug.cgi?id=1160


>
> We are getting a lot of file merge errors and child processes dying
> randomly (not sure how to read the error).
>
> I am not really sure where to go from here, we can't keep removing 0 byte
> files to keep the server up, and I am sure there is some setting or
> configuration problem that just isn't apparent. Help would be very much
> appreciated.
>

Depending on how many partitions have corrupt data files, you could delete
the bitcask data for those partitions, kill the owning vnode, and perform
list-keys + read-repair to repair the replicas.  However, looking at your
log it seems like you have a lot of partitions with corrupt data files
(I'm guessing this was because of your previous emfile issue).  If you
feel brave you could probably remove the bad bits with a hex editor, but I
imagine we have some code to automate this somewhere.  I gotta go right
now but perhaps someone else can point you to an easy fix.
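
For the list-keys + read-repair step, a minimal sketch over the HTTP
interface might look like this.  It assumes the /buckets/<bucket>/keys
URL scheme and the default HTTP port 8098; read_repair_bucket and
http_get are hypothetical helper names, and the bucket name in the usage
comment is made up.  Reading each key is what gives Riak the chance to
repair the replica that lost its data.  Keep in mind that listing keys
is expensive on a large cluster.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def http_get(url):
    """GET url and return the body, or None if Riak answers with an
    HTTP error status (for example 404 for a missing key)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None


def read_repair_bucket(base_url, bucket, get=http_get):
    """List every key in bucket and read each one once so that read
    repair can run.  Returns the number of keys read."""
    quoted_bucket = urllib.parse.quote(bucket, safe="")
    listing = get(f"{base_url}/buckets/{quoted_bucket}/keys?keys=true")
    if listing is None:
        return 0
    keys = json.loads(listing)["keys"]
    for key in keys:
        quoted_key = urllib.parse.quote(key, safe="")
        get(f"{base_url}/buckets/{quoted_bucket}/keys/{quoted_key}")
    return len(keys)


# Example usage (host and bucket name are assumptions):
#   read_repair_bucket("http://127.0.0.1:8098", "mybucket")
```

The get parameter is there only so the function can be tried against a
fake fetcher before pointing it at a real node.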

-Z