Corrupted Erlang binary term inside LevelDB

Vladimir Shabanov vshabanoff at gmail.com
Tue Jul 30 12:56:46 EDT 2013


I've built and installed mv-error-logging-hack on one of my nodes. I will
look into syslog once a new BLOCKS.bad file appears.

Meanwhile I'm still waiting for the "doctor" to solve the problem with the
partition handoff. When will he arrive? Or are there already some "do it
yourself" instructions available?


2013/7/26 Matthew Von-Maszewski <matthewv at basho.com>

> Vladimir,
>
> I have created a branch off the 1.3.2 release tag:  mv-error-logging-hack
>
> This has two changes:
>
> - removes a late fix for database-level locking that was added in 1.3.2
> (to see whether that code was the source of the problem prior to its fix)
>
> - adds tests of all background file operations and logs errors to syslog
> (since the LOG handle is not available there)
>
>
> When I build a new version of leveldb, I make sure eleveldb also rebuilds.
> I do this via "rm eleveldb/c_src/*.o" followed by "cd
> eleveldb/c_src/leveldb; make clean".  There is a pull request from another
> community user that makes the entire process cleaner; I just have not had
> time to review and approve it.
>
> I typically "grep beam /var/log/syslog" on my Debian system.  The exact
> system log location may vary depending on your Linux distribution.
>
> Let me know if this turns up any bugs.
>
> Matthew
>
>
> On Jul 25, 2013, at 8:12 PM, Vladimir Shabanov <vshabanoff at gmail.com>
> wrote:
>
> I prefer the second option since it will show whether the corrupted blocks
> are related to the race condition. The first option would need to run for a
> long time before I could be completely sure that it really fixes the issue.
>
>
> 2013/7/26 Matthew Von-Maszewski <matthewv at basho.com>
>
>> Vladimir,
>>
>> I apologize for not recognizing your name and previous contribution.  I
>> just tend to think in terms of code and performance bottlenecks, not people.
>>
>> Your June contribution resulted in changes that were released in 1.4 and
>> 1.3.2.  The team and I thank you.  However, we have not isolated the source
>> of the corruption.  We only know today that it does not happen very often.
>> We have a second, high-transaction site that has seen the same issue.
>>
>> I can offer you two non-release options:
>>
>> - I have a branch off 1.4.0 that fixes a potential, but unproven, race
>> condition.  Details are here:
>>
>> https://github.com/basho/leveldb/wiki/mv-sst-fadvise
>>
>> You would have to build eleveldb locally and copy it into your executable
>> tree.  The 1.4 leveldb and eleveldb work fine with Riak 1.3.x, should you
>> desire to limit changes to your production environment.
>>
>>
>> - I have code, soon to be a branch against 1.3.2, that only adds syslog
>> error messages to prove / disprove the race condition.  You could take this
>> code and see if it reports problems.  This route would help the community,
>> and mostly me, confirm whether the root cause is the race condition
>> addressed by the mv-sst-fadvise branch.
>>
>>
>> The two options above are what I currently have to offer.  I am actively
>> working to find the corruption source.  The good news is that Riak will
>> naturally recover from a "bad CRC" when detected.  The bad news is that the
>> Google defaults let some bad CRCs become good CRCs.  Riak 1.4 and 1.3.2
>> cannot identify those bad CRCs that became good CRCs.
>>
>> Matthew
>>
>>
>>
>>
>> On Jul 25, 2013, at 4:32 PM, Vladimir Shabanov <vshabanoff at gmail.com>
>> wrote:
>>
>> Good. I will wait for the doctor.
>>
>> A month ago I mailed the list about a segmentation fault:
>>
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-June/012245.html
>> After looking at the core dumps, you found the problem with CRC checks
>> being skipped. I enabled paranoid_checks and got my node up and running.
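>>
>> For reference, enabling it amounts to something like this in the eleveldb
>> section of app.config (the data_root path is only an example from my
>> setup):
>>
>>     %% excerpt from app.config: eleveldb backend settings
>>     {eleveldb, [
>>         %% example path; point this at the real data directory
>>         {data_root, "/var/lib/riak/leveldb"},
>>         %% aggressive internal checking; the store stops early on errors
>>         {paranoid_checks, true}
>>     ]}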
>>
>> I've also found that lost/BLOCKS.bad files sometimes appear in partitions
>> and have sent you these blocks for further analysis.
>>
>> It's very interesting why the corrupted data appears in the first place.
>> The nodes didn't crash and the hardware didn't fail. As I mentioned
>> previously, all my machines have ECC memory and the Riak data is kept on a
>> ZFS filesystem (which also checksums all data and doesn't report any CRC
>> errors). So it looks like the data is somehow corrupted by Riak itself.
>>
>> The lost/BLOCKS.bad files are usually small (2-8 KB) and appear very
>> infrequently (once a week, once a month, or never for many partitions). I
>> found these BLOCKS.bad files in both data/leveldb and data/anti_entropy, so
>> I suspect there is a bug in LevelDB.
>>
>> Looking at the LOG files, they are created during compactions:
>> "Moving corrupted block to lost/BLOCKS.bad (size 2393)"
>> but there is no more information about what kind of block it is or where it
>> was found.
>>
>> Is it possible to somehow find the source of those BLOCKS.bad files? I'm
>> building Riak from source, so maybe it's possible to enable some additional
>> logging to find out what these BLOCKS.bad files are?
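>>
>> Something like this throwaway Erlang helper (scan_logs is just a name I
>> made up, and the wildcard pattern is only an example for my layout) at
>> least pulls those messages out of the partition LOG files:
>>
>>     %% list every "Moving corrupted block" line together with the
>>     %% LOG file it came from
>>     scan_logs(Glob) ->   %% e.g. "/var/lib/riak/leveldb/*/LOG*"
>>         [{File, Line}
>>          || File <- filelib:wildcard(Glob),
>>             {ok, Bin} <- [file:read_file(File)],
>>             Line <- binary:split(Bin, <<"\n">>, [global]),
>>             binary:match(Line, <<"Moving corrupted block">>) =/= nomatch].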
>>
>>
>> 2013/7/25 Matthew Von-Maszewski <matthewv at basho.com>
>>
>>> Vladimir,
>>>
>>> I can explain what happened, but not how to correct the problem.  The
>>> gentleman who can walk you through a repair is tied up on another project,
>>> but he intends to respond as soon as he is able.
>>>
>>> We recently discovered / realized that Google's leveldb code does not
>>> check the CRC of each block rewritten during a compaction.  This means that
>>> blocks with bad CRCs get read without being flagged as bad, then rewritten
>>> to a new file with a new, valid CRC.  The corruption is now hidden.
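>>>
>>> To illustrate the mechanism in plain Erlang (a sketch of the idea only;
>>> leveldb actually uses crc32c, not erlang:crc32):
>>>
>>>     Block     = <<"original block contents">>,
>>>     GoodCrc   = erlang:crc32(Block),
>>>     %% flip one byte to simulate on-disk corruption
>>>     <<A:8, _:8, Rest/binary>> = Block,
>>>     Corrupted = <<A:8, 255:8, Rest/binary>>,
>>>     erlang:crc32(Corrupted) =:= GoodCrc,   %% false: still detectable
>>>     NewCrc    = erlang:crc32(Corrupted),   %% compaction recomputes the CRC
>>>     erlang:crc32(Corrupted) =:= NewCrc.    %% true: corruption now hidden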
>>>
>>> A more thorough discussion of the problem is found here:
>>>
>>> https://github.com/basho/leveldb/wiki/mv-verify-compactions
>>>
>>>
>>> We added code to the 1.3.2 and 1.4 Riak releases to have the block CRC
>>> checked during both read (Get) requests and compaction rewrites.  This
>>> prevents future corruption from hiding.  Unfortunately, it does NOTHING for
>>> blocks already corrupted and rewritten with valid CRCs.  You are
>>> encountering this latter condition.  We have a developer advocate / client
>>> services person who has walked others through a fix via the Riak data
>>> replicas …
>>>
>>> … please hold and the doctor will be with you shortly.
>>>
>>> Matthew
>>>
>>>
>>> On Jul 24, 2013, at 9:39 PM, Vladimir Shabanov <vshabanoff at gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>
>>> Recently I started expanding my Riak cluster and found that handoff was
>>> being continuously retried for one partition.
>>>
>>> Here are logs from two nodes
>>> https://gist.github.com/vshabanov/41282e622479fbe81974
>>>
>>> The most interesting parts of the logs are:
>>> "Handoff receiver for partition ... exited abnormally after processing
>>> 2860338 objects: {{badarg,[{erlang,binary_to_term,..."
>>> and
>>> "bad argument in call to erlang:binary_to_term(<<131,104,...."
>>>
>>> Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
>>>
>>>
>>> When I printed the corrupted binary I found that it corresponds to a
>>> single value.
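>>>
>>> The check itself is nothing Riak-specific, just binary_to_term wrapped in
>>> a try/catch (probe is only a name for the helper):
>>>
>>>     %% {ok, Term} for a healthy value, {error, corrupted_term} for a
>>>     %% binary that binary_to_term refuses to decode
>>>     probe(Bin) when is_binary(Bin) ->
>>>         try
>>>             {ok, binary_to_term(Bin)}
>>>         catch
>>>             error:badarg -> {error, corrupted_term}
>>>         end.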
>>>
>>> When I tried to "get" it, it was read OK, but the node with the corrupted
>>> value showed the same binary_to_term error.
>>>
>>> When I tried to delete the corrupted value I got a timeout.
>>>
>>>
>>> I'm running machines with ECC memory and a ZFS filesystem (which doesn't
>>> report any checksum failures), so I doubt the data was silently corrupted
>>> on disk.
>>>
>>> The LOG from the corresponding LevelDB partition doesn't show any errors.
>>> But there is a lost/BLOCKS.bad file in this partition (7 KB, created more
>>> than a month ago, and it doesn't look like it contains the corrupted
>>> value).
>>>
>>> At the moment I've stopped handoffs using "riak-admin transfer-limit 0".
>>>
>>> Why was the value corrupted? Is there any way to remove it or fix it?
>>>
>>>
>>>
>>
>>
>
>