Corrupted Erlang binary term inside LevelDB

Vladimir Shabanov vshabanoff at gmail.com
Thu Jul 25 16:32:42 EDT 2013


Good. I will wait for the doctor.

A month ago I wrote to the list about a segmentation fault:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012245.html
After looking at the core dumps, you found the problem with CRC checks
being skipped. I enabled paranoid_checks and got my node up and running.
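The CRC problem mentioned here can be illustrated with a small simulation. This is a minimal Python sketch, not LevelDB code, and it rests on several assumptions:

- `Block`, `write_block`, `read_block` and `compact` are hypothetical names.
- `zlib.crc32` stands in for the masked CRC32C that LevelDB actually stores with each block.
- The `verify` flag plays roughly the role of the check that paranoid_checks enables.

The sketch shows what happens when a compaction copies a block without verifying its checksum. The damaged bytes are written out with a fresh, valid CRC, so the corruption can no longer be detected.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    payload: bytes
    crc: int  # checksum stored next to the payload (CRC32 here, not CRC32C)


def write_block(payload: bytes) -> Block:
    """Store a payload together with a freshly computed checksum."""
    return Block(payload, zlib.crc32(payload))


def read_block(block: Block, verify: bool) -> bytes:
    """Return the payload; if verify is set, reject it on a CRC mismatch."""
    if verify and zlib.crc32(block.payload) != block.crc:
        raise ValueError("corrupted block: CRC mismatch")
    return block.payload


def compact(block: Block, verify: bool) -> Block:
    """Copy a block the way a compaction rewrite would: read it, then write
    it out again with a newly computed CRC."""
    return write_block(read_block(block, verify))


# A block is written correctly, then one bit of its payload flips later.
good = write_block(b"riak object bytes")
damaged = Block(b"riak objecT bytes", good.crc)

# Without verification, the rewrite re-checksums the damaged bytes, so a
# later verified read accepts them as valid data.
hidden = compact(damaged, verify=False)
assert read_block(hidden, verify=True) == b"riak objecT bytes"

# With verification, the corruption is caught during the rewrite instead.
try:
    compact(damaged, verify=True)
except ValueError as err:
    print(err)  # prints: corrupted block: CRC mismatch
```

Once the damaged payload has been re-checksummed, no later verified read can tell it apart from good data. This is why verifying during compaction stops new corruption from being hidden but cannot help with blocks that were already rewritten.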

I've also found that lost/BLOCKS.bad files sometimes appear in partitions,
and I have sent you these blocks for further analysis.

I'd very much like to understand why corrupted data appears in the first
place. Nodes didn't crash and hardware didn't fail. As I mentioned
previously, all my machines have ECC memory and Riak data is kept on a ZFS
filesystem (which also checksums all data and doesn't report any checksum
errors). So it looks like the data is somehow being corrupted by Riak itself.

The lost/BLOCKS.bad files are usually small (2-8 KB) and appear very
infrequently (once a week, once a month, or never for many partitions). I
found these BLOCKS.bad files in both data/leveldb and data/anti_entropy, so I
suspect there is a bug in LevelDB.

Judging from the LOGs, they are created during compactions:
"Moving corrupted block to lost/BLOCKS.bad (size 2393)"
but there is no further information: what kind of block it is, or where it
was found.

Is it possible to somehow find the source of those BLOCKS.bad files? I'm
building Riak from source, so maybe it's possible to enable some additional
logging to find out what these BLOCKS.bad blocks are?
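A first step before adding any logging is to list the partitions that contain lost/BLOCKS.bad files. Each one can then be matched against the compaction messages in that partition's LOG.

The helper below is a hypothetical sketch; `scan_partitions` is not part of Riak or LevelDB. It makes these assumptions:

- Each partition is a directory directly under data/leveldb or data/anti_entropy.
- Each partition directory holds its own LOG file and lost/ subdirectory.
- Only the current LOG is read, not a rotated LOG.old.

It shows where and how often blocks were moved aside, not what those blocks contained.

```python
import re
import sys
from pathlib import Path

# Matches the compaction log line quoted above and captures the block size.
LOST_BLOCK = re.compile(r"Moving corrupted block to lost/BLOCKS\.bad \(size (\d+)\)")


def scan_partitions(data_root: Path):
    """Yield (partition dir, BLOCKS.bad size on disk, sizes reported in LOG).

    Assumes the layout <data_root>/<partition>/lost/BLOCKS.bad and
    <data_root>/<partition>/LOG; other layouts are silently skipped.
    """
    for bad in sorted(data_root.glob("*/lost/BLOCKS.bad")):
        partition = bad.parent.parent
        log = partition / "LOG"
        logged = []
        if log.exists():
            text = log.read_text(errors="replace")
            logged = [int(size) for size in LOST_BLOCK.findall(text)]
        yield partition, bad.stat().st_size, logged


if __name__ == "__main__":
    # Pass one or more data roots, e.g. data/leveldb data/anti_entropy.
    for root in sys.argv[1:]:
        for partition, size, logged in scan_partitions(Path(root)):
            print(f"{partition}: BLOCKS.bad={size} bytes, logged sizes={logged}")
```

A possible invocation is `python scan_lost_blocks.py data/leveldb data/anti_entropy` (the script name is arbitrary). Comparing the on-disk size with the logged sizes gives a quick indication of whether a BLOCKS.bad file has accumulated several moved-aside blocks over time.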


2013/7/25 Matthew Von-Maszewski <matthewv at basho.com>

> Vladimir,
>
> I can explain what happened, but not how to correct the problem.  The
> gentleman that can walk you through a repair is tied up on another project,
> but he intends to respond as soon as he is able.
>
> We recently discovered / realized that Google's leveldb code does not
> check the CRC of each block rewritten during a compaction.  This means that
> blocks with bad CRCs get read without being flagged as bad, then rewritten
> to a new file with a new, valid CRC.  The corruption is now hidden.
>
> A more thorough discussion of the problem is found here:
>
> https://github.com/basho/leveldb/wiki/mv-verify-compactions
>
>
> We added code to the 1.3.2 and 1.4 Riak releases to have the block CRC
> checked during both read (Get) requests and compaction rewrites.  This
> prevents future corruption hiding.  Unfortunately, it does NOTHING for
> blocks already corrupted and rewritten with valid CRCs.  You are
> encountering this latter condition.  We have a developer advocate / client
> services person that has walked others through a fix via the Riak data
> replicas …
>
> … please hold and the doctor will be with you shortly.
>
> Matthew
>
>
> On Jul 24, 2013, at 9:39 PM, Vladimir Shabanov <vshabanoff at gmail.com>
> wrote:
>
> Hello,
>
> Recently I started expanding my Riak cluster and found that handoffs for
> one partition were being retried continuously.
>
> Here are the logs from two nodes:
> https://gist.github.com/vshabanov/41282e622479fbe81974
>
> The most interesting parts of the logs are
> "Handoff receiver for partition ... exited abnormally after processing
> 2860338 objects: {{badarg,[{erlang,binary_to_term,..."
> and
> "bad argument in call to erlang:binary_to_term(<<131,104,...."
>
> Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
>
>
> When I printed the corrupted binary, I found that it corresponds to a
> single value.
>
> When I tried to "get" it, it was read OK, but the node holding the
> corrupted value showed the same binary_to_term error.
>
> When I tried to delete the corrupted value, I got a timeout.
>
>
> My machines have ECC memory and a ZFS filesystem (which doesn't report
> any checksum failures), so I doubt the data was silently corrupted on
> disk.
>
> The LOG from the corresponding LevelDB partition doesn't show any errors,
> but there is a lost/BLOCKS.bad file in this partition (7 KB, created more
> than a month ago; it doesn't appear to contain the corrupted value).
>
> For now I've stopped handoffs using "riak-admin transfer-limit 0".
>
> Why was the value corrupted? Is there any way to remove or fix it?
>  _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>