Truncated bit-cask files

Arun Rajagopalan arun.v.rajagopalan at gmail.com
Tue Feb 14 15:58:44 EST 2017


Thanks Matthew. I will try one of those solutions.


On Tue, Feb 14, 2017 at 3:51 PM, Matthew Von-Maszewski <matthewv at basho.com>
wrote:

> Arun,
>
> You are running out of RAM for the leveldb AAE.  There are several ways to
> fix that:
>
> - reduce memory allocated to bitcask
> - more memory per server
> - more servers of same memory
> - reduce the ring size from 64 to 8, and rebuild data within the cluster
> from scratch
> - lie to leveldb and give it a bigger-than-real memory setting in riak.conf:
>         leveldb.maximum_memory=8G
>
>
> The key LOG lines are:
>
> Options.total_leveldb_mem: 2,901,766,963    <-- this is the total memory
> assigned to ALL of leveldb, but
>     only 20% of it goes to AAE vnodes
>
> File cache size: 5833527     <-- the first vnode says, cool, enough memory
> for me
> Block cache size: 7930679  <-- ditto
>
>   ... but as more vnodes start:
>
>  File cache size: 0                <-- things are just not going to work
> well
> Block cache size: 0
>
> There are no actual file system error messages in your LOG files.  That
> supports that the real problem is memory unhappiness.
>
> Matthew
>
>
> On Feb 14, 2017, at 3:34 PM, Arun Rajagopalan <
> arun.v.rajagopalan at gmail.com> wrote:
>
> Hi Matthew, Magnus
>
> I have attached the log files for your review
>
> Thanks
> Arun
>
>
> On Tue, Feb 14, 2017 at 11:55 AM, Matthew Von-Maszewski <
> matthewv at basho.com> wrote:
>
>> Arun,
>>
>> The AAE code uses leveldb for its storage of anti-entropy data, no matter
>> which backend holds the user data.  Therefore the error below suggests
>> corruption within leveldb files (which is not impossible, but becoming
>> really rare except with bad hardware or full disks).
>>
>> Before wiping out the AAE directory, you should copy the LOG file within
>> it.  There are likely more useful error messages within that file ... maybe
>> put the file in Dropbox, or zip it and attach it to a reply for us to review.
>>
>> Matthew
>>
>> On Feb 14, 2017, at 10:42 AM, Magnus Kessler <mkessler at basho.com> wrote:
>>
>> On 14 February 2017 at 14:46, Arun Rajagopalan <
>> arun.v.rajagopalan at gmail.com> wrote:
>>
>>> Hi Magnus
>>>
>>> Riak crashes on startup when I have a truncated bitcask file.
>>>
>>> I think it also crashes when the AAE files are bad. Example below:
>>>
>>> 2017-02-13 21:18:30 =CRASH REPORT====
>>>   crasher:
>>>     initial call: riak_kv_index_hashtree:init/1
>>>     pid: <0.6037.0>
>>>     registered_name: []
>>>     exception exit: {{{badmatch,{error,{db_open,"Corruption: truncated record at end of file"}}},
>>>       [{hashtree,new_segment_store,2,[{file,"src/hashtree.erl"},{line,675}]},
>>>        {hashtree,new,2,[{file,"src/hashtree.erl"},{line,246}]},
>>>        {riak_kv_index_hashtree,do_new_tree,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,610}]},
>>>        {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
>>>        {riak_kv_index_hashtree,init_trees,3,[{file,"src/riak_kv_index_hashtree.erl"},{line,474}]},
>>>        {riak_kv_index_hashtree,init,1,[{file,"src/riak_kv_index_hashtree.erl"},{line,268}]},
>>>        {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
>>>        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},
>>>      [{gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},
>>>       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
>>>     ancestors: [<0.715.0>,riak_core_vnode_sup,riak_core_sup,<0.160.0>]
>>>     messages: []
>>>     links: []
>>>     dictionary: []
>>>     trap_exit: false
>>>     status: running
>>>     heap_size: 1598
>>>     stack_size: 27
>>>     reductions: 889
>>>   neighbours:
>>>
>>>
>>> Regards
>>> Arun
>>>
>>>
>> Hi Arun,
>>
>> The crash log you provided shows that there is a corrupted file in the
>> AAE (anti_entropy) backend. Entries in console.log should have more
>> information about which partition is affected. Please post output from the
>> affected node at around 2017-02-13T21:18:30. As this is AAE data, it is
>> safe to remove the directory named after the affected partition from the
>> anti_entropy directory before restarting the node. You may find that
>> there is more than one affected partition, the next of which will be
>> encountered after the attempted restart only. If this is the case, simply
>> identify the next partition in the same way and remove it, too, until the
>> node starts up successfully again.
>>
>> Is there a reason why the nodes aren't shut down in the regular way?
>>
>> Kind Regards,
>>
>> Magnus
>>
>>
>>
>> --
>> Magnus Kessler
>> Client Services Engineer
>> Basho Technologies Limited
>>
>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>>
>>
>>
> <aaeLOG.tar>
>
>
>
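The LOG check Matthew describes in the quoted message (look at Options.total_leveldb_mem, take the 20% share that goes to AAE vnodes, then watch for vnodes whose file or block cache size drops to 0) could be sketched in Python roughly as follows. This is only an illustration: the function name summarize_log is made up here, the line patterns are taken from the excerpts quoted in this thread rather than from any leveldb specification, and real LOG files may format these lines differently.

```python
import re

# Patterns based on the LOG excerpts quoted in this thread (an assumption
# about the real file format). Commas in the numbers are tolerated.
TOTAL_LINE = re.compile(r"Options\.total_leveldb_mem:\s*([\d,]+)")
CACHE_LINE = re.compile(r"(File|Block) cache size:\s*([\d,]+)")


def _to_int(text):
    """Convert '2,901,766,963' or '5833527' to an int."""
    return int(text.replace(",", ""))


def summarize_log(text):
    """Return total leveldb memory, the 20% AAE share, and zero-size caches.

    The 20% figure comes from the message above ("only 20% of it goes to
    AAE vnodes"); it is not read from the LOG itself.
    """
    total = None
    starved = []  # (line number, "File" or "Block") for every cache size of 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TOTAL_LINE.search(line)
        if match:
            total = _to_int(match.group(1))
            continue
        match = CACHE_LINE.search(line)
        if match and _to_int(match.group(2)) == 0:
            starved.append((lineno, match.group(1)))
    aae_share = None if total is None else total // 5  # 20% of the total
    return {"total": total, "aae_share": aae_share, "starved": starved}


if __name__ == "__main__":
    sample = "\n".join([
        "Options.total_leveldb_mem: 2,901,766,963",
        "File cache size: 5833527",
        "Block cache size: 7930679",
        "File cache size: 0",
        "Block cache size: 0",
    ])
    print(summarize_log(sample))
```

Any entry in the "starved" list would correspond to the "things are just not going to work well" case described above. To try it on a real file, the contents of a vnode's LOG could be read with open(path).read() and passed to summarize_log.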