Corrupted Erlang binary term inside LevelDB

Vladimir Shabanov vshabanoff at gmail.com
Thu Jul 25 20:12:11 EDT 2013


I prefer the second option, since it will show whether the corrupted blocks
are related to the race condition. The first option would need to run for a
long time before we could be completely sure that it really fixes the issue.


2013/7/26 Matthew Von-Maszewski <matthewv at basho.com>

> Vladimir,
>
> I apologize for not recognizing your name and previous contribution.  I
> just tend to think in terms of code and performance bottlenecks, not people.
>
> Your June contribution resulted in changes that were released in 1.4 and
> 1.3.2.  The team and I thank you.  However, we have not isolated the source
> of the corruption.  We only know today that it does not happen very often.
> We have a second, high-transaction site that has seen the same issue.
>
> I can offer you two non-release options:
>
> - I have a branch to 1.4.0 that fixes a potential, but unproven, race
> condition.  Details are here:
>
> https://github.com/basho/leveldb/wiki/mv-sst-fadvise
>
> You would have to build eleveldb locally and copy it into your executable
> tree.  The 1.4 leveldb and eleveldb work fine with Riak 1.3.x, should you
> desire to limit changes to your production environment.
>
>
> - I have code, soon to be a branch against 1.3.2, that only adds syslog
> error messages to prove / disprove the race condition.  You could take this
> code and see if it reports problems.  This route would help the community
> and mostly me know the root cause is within the race condition addressed by
> the mv-sst-fadvise branch.
>
>
> The two options above are what I currently have to offer.  I am actively
> working to find the corruption source.  The good news is that Riak will
> naturally recover from a "bad CRC" when detected.  The bad news is that the
> Google defaults let some bad CRCs become good CRCs.  Riak 1.4 and 1.3.2
> cannot identify those bad CRCs that became good CRCs.
>
> Matthew
>
>
>
>
> On Jul 25, 2013, at 4:32 PM, Vladimir Shabanov <vshabanoff at gmail.com>
> wrote:
>
> Good. Will wait for doctor.
>
> A month ago I mailed about a segmentation fault:
>
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012245.html
> After looking at the core dumps you found this problem with CRC checks
> being skipped. I enabled paranoid_checks and got my node up and running.
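>
> For reference, enabling it is just an eleveldb setting in app.config,
> something like the fragment below (the data_root path here is only an
> example, not my real layout):
>
>     {eleveldb, [
>         %% example path, adjust to your install
>         {data_root, "/var/lib/riak/leveldb"},
>         %% re-verify block checksums on reads instead of trusting them
>         {paranoid_checks, true}
>     ]},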
>
> I've also found that lost/BLOCKS.bad files sometimes appear in partitions,
> and I have sent you these blocks for further analysis.
>
> It's very interesting why corrupted data appears in the first place. The
> nodes didn't crash and the hardware didn't fail. As I mentioned previously,
> all my machines have ECC memory and the Riak data is kept on a ZFS
> filesystem (which also checksums all the data and doesn't report any CRC
> errors). So it looks like the data is somehow being corrupted by Riak
> itself.
>
> The lost/BLOCKS.bad files are usually small (2-8 KB) and appear very
> infrequently (once a week, once a month, or never at all for many
> partitions). I found these BLOCKS.bad files in both data/leveldb and
> data/anti_entropy, so I suspect there is a bug in LevelDB.
>
> Looking at the LOG files, they are created during compactions:
> "Moving corrupted block to lost/BLOCKS.bad (size 2393)"
> but there is no more information about what kind of block it is or where it
> was found.
>
> Is it possible to somehow find the source of those BLOCKS.bad files? I'm
> building Riak from source, so maybe it's possible to enable some additional
> logging to find out what these BLOCKS.bad files contain?
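>
> One thing I'm thinking of trying is to open a partition directly with
> eleveldb and fold over it, looking for values that no longer decode. A
> rough sketch (untested; it assumes the values are plain term_to_binary
> blobs, which matches the <<131,...>> prefix in the logs, and it has to run
> against a copy of the partition or with the node stopped, since leveldb
> holds a LOCK file):
>
>     find_bad_values(PartitionDir) ->
>         {ok, Db} = eleveldb:open(PartitionDir, [{create_if_missing, false}]),
>         Bad = eleveldb:fold(Db,
>                   fun({Key, Val}, Acc) ->
>                       %% keep keys whose values no longer decode
>                       try binary_to_term(Val) of
>                           _Term -> Acc
>                       catch
>                           error:badarg -> [{Key, byte_size(Val)} | Acc]
>                       end
>                   end, [], []),
>         ok = eleveldb:close(Db),
>         Bad.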
>
>
> 2013/7/25 Matthew Von-Maszewski <matthewv at basho.com>
>
>> Vladimir,
>>
>> I can explain what happened, but not how to correct the problem.  The
>> gentleman that can walk you through a repair is tied up on another project,
>> but he intends to respond as soon as he is able.
>>
>> We recently discovered / realized that Google's leveldb code does not
>> check the CRC of each block rewritten during a compaction.  This means that
>> blocks with bad CRCs get read without being flagged as bad, then rewritten
>> to a new file with a new, valid CRC.  The corruption is now hidden.
>>
>> A more thorough discussion of the problem is found here:
>>
>> https://github.com/basho/leveldb/wiki/mv-verify-compactions
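>>
>> Conceptually the added check is nothing more than recomputing a block's
>> checksum and comparing it against the stored one before the block is
>> allowed to be rewritten.  leveldb itself is C++ and uses a masked CRC32C,
>> so the small Erlang sketch below is only an illustration of the idea, not
>> the actual code:
>>
>>     %% Illustration only: refuse to rewrite a block whose payload no
>>     %% longer matches its stored checksum.
>>     verify_before_rewrite(BlockData, StoredCrc) ->
>>         case erlang:crc32(BlockData) of
>>             StoredCrc -> {ok, BlockData};   % safe to copy into the new file
>>             _BadCrc   -> {error, bad_crc}   % quarantine instead of rewriting
>>         end.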
>>
>>
>> We added code to the 1.3.2 and 1.4 Riak releases to have the block CRC
>> checked during both read (Get) requests and compaction rewrites.  This
>> prevents future corruption hiding.  Unfortunately, it does NOTHING for
>> blocks already corrupted and rewritten with valid CRCs.  You are
>> encountering this latter condition.  We have a developer advocate / client
>> services person that has walked others through a fix via the Riak data
>> replicas …
>>
>> … please hold and the doctor will be with you shortly.
>>
>> Matthew
>>
>>
>> On Jul 24, 2013, at 9:39 PM, Vladimir Shabanov <vshabanoff at gmail.com>
>> wrote:
>>
>> Hello,
>>
>> Recently I've started expanding my Riak cluster and found that handoffs
>> were continuously retried for one partition.
>>
>> Here are logs from two nodes
>> https://gist.github.com/vshabanov/41282e622479fbe81974
>>
>> The most interesting parts of logs are
>> "Handoff receiver for partition ... exited abnormally after processing
>> 2860338 objects: {{badarg,[{erlang,binary_to_term,..."
>> and
>> "bad argument in call to erlang:binary_to_term(<<131,104,...."
>>
>> Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
>>
>>
>> When I printed the corrupted binary string, I found that it corresponds to
>> a single value.
>>
>> When I tried to "get" it, it was read OK, but the node with the corrupted
>> value showed the same binary_to_term error.
>>
>> When I tried to delete the corrupted value, I got a timeout.
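>>
>> For what it's worth, the equivalent via the Erlang PB client looks roughly
>> like this (bucket and key names are placeholders, not my real ones):
>>
>>     {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
>>     %% the get succeeds, served from a healthy replica...
>>     {ok, _Obj} = riakc_pb_socket:get(Pid, <<"my_bucket">>, <<"my_key">>).
>>     %% ...while the node holding the corrupted copy logs the binary_to_term
>>     %% badarg, and the delete just times out for me:
>>     riakc_pb_socket:delete(Pid, <<"my_bucket">>, <<"my_key">>).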
>>
>>
>> I'm running machines with ECC memory and a ZFS filesystem (which doesn't
>> report any checksum failures), so I doubt the data was silently corrupted
>> on disk.
>>
>> The LOG from the corresponding LevelDB partition doesn't show any errors,
>> but there is a lost/BLOCKS.bad file in this partition (7 KB, created more
>> than a month ago, and it doesn't look like it contains the corrupted value).
>>
>> At the moment I've stopped handoffs using "riak-admin transfer-limit 0".
>>
>> Why was the value corrupted? Is there any way to remove it or fix it?
>>
>>
>>
>
>

