Troubleshooting riak inserts

Fyodor Yarochkin fyodor.y at armorize.com
Wed Jul 6 05:18:58 EDT 2011


Greetings,

Thanks for the response. The recommendation to simply move the
corrupted buffer file out of the way worked! There was indeed a single
buffer file (buffer.13) that was apparently causing the crash. Once it
was renamed, the node stopped hanging on new data inserts. Thanks for
the interpretation of the issue. As for the 'binary' brackets in the
data dump, I can't tell exactly what that is, as it doesn't directly
match any of the data I am writing (JSON objects that are mainly
composed of string sequences of printable ASCII text). I can share the
corrupted buffer file if that'd be helpful in investigating the root
cause.
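
For anyone who wants to inspect the dump by hand, here is a minimal
Python sketch that decodes the two terms from the byte list in the
stack trace. It is not Riak code: decode_term and try_decode are
hypothetical helper names, and the decoder only handles the handful of
Erlang external term format tags that occur in this dump. FIRST_TERM
and SECOND_TERM are the bytes from the trace, split at the second 131.

```python
import struct

# Bytes copied from the stack trace, split at the second version byte (131).
FIRST_TERM = bytes([
    131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,
    116,101,120,116,109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,
    55,52,93,109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,
    101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,
    1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106])
SECOND_TERM = bytes([
    131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,
    116,101,120,116,109,0,0,0,5,83,69,81,61,49,109,0,0,0,36,97,101,50,98,
    49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,
    50,49,57,98,102,99,51,50,57,51,110,7,1,191,19,13,80,192,166,4,108,0,0,
    0,1,104,2,100,0,1,112,107,0,1,14,106,106])


def _decode(data, pos):
    """Decode one value starting at pos; return (value, next_pos)."""
    tag = data[pos]
    pos += 1
    if tag == 104:  # SMALL_TUPLE_EXT: 1-byte arity, then the elements
        arity = data[pos]
        pos += 1
        items = []
        for _ in range(arity):
            item, pos = _decode(data, pos)
            items.append(item)
        return tuple(items), pos
    if tag == 108:  # LIST_EXT: 4-byte length, elements, then a tail
        (count,) = struct.unpack_from(">I", data, pos)
        pos += 4
        items = []
        for _ in range(count):
            item, pos = _decode(data, pos)
            items.append(item)
        _tail, pos = _decode(data, pos)  # the tail is NIL for proper lists
        return items, pos
    if tag == 106:  # NIL_EXT: the empty list
        return [], pos
    if tag == 109:  # BINARY_EXT: 4-byte length, then raw bytes
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + size > len(data):
            raise ValueError("binary runs past the end of the data")
        return data[pos:pos + size], pos + size
    if tag == 110:  # SMALL_BIG_EXT: size, sign, little-endian digits
        size, sign = data[pos], data[pos + 1]
        pos += 2
        value = int.from_bytes(data[pos:pos + size], "little")
        return (-value if sign else value), pos + size
    if tag == 100:  # ATOM_EXT: 2-byte length, then the atom name
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        return data[pos:pos + size].decode("latin-1"), pos + size
    if tag == 107:  # STRING_EXT: 2-byte length, then a list of small ints
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        return list(data[pos:pos + size]), pos + size
    raise ValueError(f"unsupported or corrupt tag {tag} at offset {pos - 1}")


def decode_term(data):
    """Decode one complete term; data must begin with version byte 131."""
    if not data or data[0] != 131:
        raise ValueError("missing version byte 131")
    return _decode(data, 1)


def try_decode(data):
    """Return the decoded term, or None if the data is truncated or corrupt."""
    try:
        term, _end = decode_term(data)
        return term
    except (ValueError, IndexError, struct.error):
        return None


if __name__ == "__main__":
    print("first term: ", try_decode(FIRST_TERM))
    print("second term:", try_decode(SECOND_TERM))
```

Erlang binaries come back as Python bytes, atoms as str, and tuples as
tuples. With this subset decoder the second term parses cleanly, while
the first one stops inside its trailing property list, which fits the
truncation reading.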

As for the cause of the truncation/corruption: the machine and the
riaksearch node have been up continuously for more than 3 months, with
two occasional crashes, which looked like this in syslog:

 node kernel: [ 2282.474990] beam[2396]: segfault at 24a5d ip 081016a2
sp bfd68600 error 4 in beam[8048000+15c000]
...
I suspect the corrupted segments might have been caused by the beam
segfaults. Anyway, thanks a lot for the pointers to the corrupted
buffer files. This indeed resolved the issue.
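
For reference, the rename can be sketched in a few lines of Python.
This is only an illustration under assumptions: quarantine_buffer is a
hypothetical helper name, the corrupt- prefix follows the naming
suggested earlier in the thread, and the partition directory is
whatever path your merge_index data lives under. Stopping the node
before touching the files is my assumption, not something confirmed in
this thread.

```python
from pathlib import Path


def quarantine_buffer(partition_dir, buffer_name):
    """Rename one buffer file to corrupt-<name> in the same directory.

    The file is only renamed, never deleted, so it stays available for
    later investigation.
    """
    src = Path(partition_dir) / buffer_name
    dst = src.with_name("corrupt-" + src.name)
    if not src.is_file():
        raise FileNotFoundError(f"no such buffer file: {src}")
    if dst.exists():
        # Refuse to overwrite an earlier quarantined copy.
        raise FileExistsError(f"target already exists: {dst}")
    src.rename(dst)
    return dst


# Example call (the path is a placeholder for your partition directory):
# quarantine_buffer("/path/to/merge_index/<partition>", "buffer.13")
```

Moving one file at a time and retrying the insert after each move
keeps the loss of indexed data as small as possible.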

Per Rusty's request, here's detailed information of the platform where
the issue was observed:

    What platform are you running?

Debian 5.3 (Lenny)

    What version of Riak Search are you using?

At the time the problem appeared, riak-search_0.14.2-1_i386.deb was
installed on the system. The node had been upgraded from
riak-search_0.14.0-1_i386.deb (the update procedure was:
riaksearch stop; dpkg -i ..; riaksearch start).

    Did you install Riak Search from our pre-built binaries, or did
    you compile from source?

Pre-built binaries (.deb packages)

    If you compiled from source, what version of Erlang are you running?

Erlang R14A (erts-5.8)

    What interface are you using to index the files? (Solr or KV?)

The data is indexed by the riaksearch precommit hook when it is stored
in the bucket (if that was the question).


I'll send an email regarding the data sampling off-list.





On Tue, Jul 5, 2011 at 9:15 PM, Ryan Zezeski <rzezeski at basho.com> wrote:
> Fyodor,
> I can't tell you exactly what caused this to happen but I can tell you how
> to move past it.  Search uses two data structures to store the index:
> buffers and segments.  A buffer is an in-memory structure backed by a file
> on disk.  Over time, buffers are converted to segments.  All segments live on
> disk but there is an in-memory offset table to perform lookups.  During a
> request if the vnode to handle that request is not already up it will be
> started.  During the vnode's initialization it will read all buffers and
> segment tables into memory.  In your case, each time the vnode is started it
> crashes while trying to read the buffer file.  Looking at the binary in your
> trace it looks like somehow the data became corrupted.  First off, I'm
> confused by the syntax of the binary in your stack trace.  I.e. what's up
> with the brackets surrounding that binary data?  That aside, I see two terms
> in that data, i.e. there are two occurrences of the byte '131' which
> indicates the start of a term.  The second term is valid:
> [{{<<"logs">>,<<"text">>,<<"SEQ=1">>},
>   <<"ae2b12ae-a155-11e0-9e33-00219bfc3293">>,
>   -1309244813808575,
>   [{p,[14]}]}]
> However, the first term seems to have been truncated/corrupted somehow.
>  Why?  I'm not sure. My immediate guess would be that a write failed at some
> point, writing bad data to the buffer file, the vnode crashed, and then when
> it started back up it couldn't read back the buffer file.  The code to read
> the buffer data expects correct data or it will simply crash, as you see.
>  This will cause a perpetual series of crashes until the problem is manually
> resolved.  In this case you can just move your buffer files, for the
> crashing vnodes, one at a time until the problem goes away.  This will cause
> you to lose some of your indexed data.  For example, in your case the
> crashing vnode is for
> partition 433883298582611803841718934712646521460354973696.  You can cd to
> riak_search/data/merge_index/433883298582611803841718934712646521460354973696
> and then mv your buffer.* files to something like corrupt-buffer.*.
> TL;DR - For one reason or another a buffer file became corrupted.  As a
> workaround you can move your buffer files out of the way.
> -Ryan
> On Sat, Jul 2, 2011 at 6:40 AM, Fyodor Yarochkin <fyodor.y at armorize.com>
> wrote:
>>
>> Greetings,
>>
>>  I've been running a single-node riaksearch instance and came
>> across this problem: after inserting roughly 200Mb of data, every
>> subsequent insert (into any bucket) starts to time out with a
>> sequence of error logs that point to a riak_search_vnode_master crash:
>>
>> =SUPERVISOR REPORT==== 2-Jul-2011::06:04:57 ===
>>     Supervisor: {local,riak_search_sup}
>>     Context:    child_terminated
>>     Reason:
>>
>> {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<[131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106,131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,5,83,69,81,61,49,109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,191,19,13,80,192,166,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106]>>]},{mi_buffer,read_value,1},{mi_buffer,open_inner,2},{mi_buffer,new,1},{mi_server,read_buffers,4},{mi_server,read_buf_and_seg,1},{mi_server,init,1},{gen_server,init_it,6}]}}},[{merge_index_backend,start,2},{riak_search_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}}},[{riak_core_vnode_master,get_vnode,2},{riak_core_vnode_master,handle_call,3},{gen_server,handle_msg,5},{proc_lib,init_p_do_apply,3}]}
>>     Offender:
>>
>> [{pid,<0.754.0>},{name,riak_search_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_search_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
>>
>>
>> (the full paste of error dump log is here http://pastebin.com/0Bj5cJAQ)
>>
>> Reads still work and I am slightly confused about the reason for the
>> crash. The availability of RAM is one of the things I suspect here:
>> "mem_total":1059192832,"mem_allocated":893632512,". There is no
>> shortage of disk space or other resources on the system.  I am
>> a bit stuck as to where to start troubleshooting this issue. Any
>> pointers or hints would be greatly appreciated! :)
>>
>>
>> regards,
>> -F
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>



