Corrupted Erlang binary term inside LevelDB

Vladimir Shabanov vshabanoff at gmail.com
Wed Jul 24 21:39:56 EDT 2013


Hello,

Recently I've started expanding my Riak cluster and found that handoffs
were continuously retried for one partition.

Here are the logs from the two nodes:
https://gist.github.com/vshabanov/41282e622479fbe81974

The most interesting parts of the logs are
"Handoff receiver for partition ... exited abnormally after processing
2860338 objects: {{badarg,[{erlang,binary_to_term,..."
and
"bad argument in call to erlang:binary_to_term(<<131,104,...."

Both nodes are running Riak 1.3.2 (the old one was previously running 1.3.1).
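
For reference, the leading bytes in the binary_to_term error can be read
against the Erlang external term format: 131 is the version byte that
starts every encoded term, and 104 is the tag for a small tuple. Below is
a minimal Python sketch that inspects only that header. The helper name
describe_header is hypothetical, and the sample bytes in the usage note
are made up for illustration; they are not taken from the logs.

```python
# Tag values from the Erlang external term format (a partial table).
VERSION_MAGIC = 131  # first byte of every term_to_binary/1 result
TAGS = {
    97: "SMALL_INTEGER_EXT",
    98: "INTEGER_EXT",
    100: "ATOM_EXT",
    104: "SMALL_TUPLE_EXT",
    105: "LARGE_TUPLE_EXT",
    106: "NIL_EXT",
    107: "STRING_EXT",
    108: "LIST_EXT",
    109: "BINARY_EXT",
}


def describe_header(data: bytes) -> str:
    """Describe the first bytes of an encoded term (hypothetical helper).

    This only looks at the version byte, the top-level tag and, for a
    small tuple, the arity byte. It does not validate the payload.
    """
    if len(data) < 2:
        return "too short to be an encoded term"
    if data[0] != VERSION_MAGIC:
        return f"missing version byte 131 (got {data[0]})"
    tag = data[1]
    if tag == 104:
        if len(data) < 3:
            return "SMALL_TUPLE_EXT truncated before arity"
        return f"SMALL_TUPLE_EXT, arity {data[2]}"
    return TAGS.get(tag, f"unknown tag {tag}")
```

For example, describe_header(bytes([131, 104, 2, 97, 1, 97, 2])) reports a
small tuple of arity 2. A header that parses this way says nothing about
the rest of the payload: binary_to_term can still fail with badarg if any
later byte is damaged or the binary is truncated.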


When I printed the corrupted binary, I found that it corresponds to a
single value.

When I tried to "get" it, the read succeeded, but the node holding the
corrupted value logged the same binary_to_term error.

When I tried to delete the corrupted value, I got a timeout.


I'm running machines with ECC memory and the ZFS filesystem (which doesn't
report any checksum failures), so I doubt the data was silently corrupted
on disk.

The LOG file from the corresponding LevelDB partition doesn't show any
errors. But there is a lost/BLOCKS.bad file in this partition (7 KB,
created more than a month ago, and it doesn't appear to contain the
corrupted value).

For now I've stopped handoffs using "riak-admin transfer-limit 0".

Why was the value corrupted? Is there any way to remove it or fix it?