Troubleshooting riak inserts

Rusty Klophaus rusty at basho.com
Tue Jul 5 09:44:52 EDT 2011


Hi Fyodor,

Following up: to help us troubleshoot, would you mind answering a few
questions about your environment?

   - What platform are you running?
   - What version of Riak Search are you using?
   - Did you install Riak Search from our pre-built binaries, or did you
     compile from source?
   - If you compiled from source, what version of Erlang are you running?
   - Which interface are you using to index the files (Solr or KV)?
   - Are you using the default schema? If not, can you send a copy of your
     schema file?
   - Can you send us a sample of your data, anonymized if necessary?

Best,
Rusty

On Tue, Jul 5, 2011 at 9:15 AM, Ryan Zezeski <rzezeski at basho.com> wrote:

> Fyodor,
>
> I can't tell you exactly what caused this to happen, but I can tell you how
> to move past it.  Search uses two data structures to store the index:
> buffers and segments.  A buffer is an in-memory structure backed by a file
> on disk.  Over time, buffers are converted to segments.  All segments live
> on disk, but there is an in-memory offset table to perform lookups.  If the
> vnode that should handle a request is not already running, it is started,
> and during initialization it reads all buffers and segment tables into
> memory.  In your case, each time the vnode is started it crashes while
> trying to read the buffer file.  Looking at the binary in your trace, it
> looks like the data somehow became corrupted.  First off, I'm confused by
> the syntax of the binary in your stack trace: what's up with the brackets
> surrounding that binary data?  That aside, I see two terms in that data,
> i.e. there are two occurrences of the byte 131, which marks the start of a
> term (131 is the version byte of Erlang's external term format).  The
> second term is valid:
>
> [{{<<"logs">>,<<"text">>,<<"SEQ=1">>},
>   <<"ae2b12ae-a155-11e0-9e33-00219bfc3293">>,
>   -1309244813808575,
>   [{p,[14]}]}]
>
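The reading of the trace above (two terms, each starting with byte 131, only the second one valid) can be checked with a minimal Python sketch of a decoder. It handles only the external term format tags that actually occur in this binary (atoms, small tuples, nil, strings, lists, binaries and small bignums), so it is not a complete implementation; `Atom`, `decode_term` and the underscore helpers are hypothetical names. The bytes in `RAW` are copied from the `binary_to_term` argument in the crash report.

```python
class Atom(str):
    """An Erlang atom, kept distinct from binaries (bytes) and strings (lists)."""


def _take(data, pos, n):
    """Return (n bytes starting at pos, new pos), failing on short input."""
    if pos + n > len(data):
        raise ValueError(f"truncated term at offset {pos}")
    return data[pos:pos + n], pos + n


def _uint(data, pos, n):
    """Read an n-byte big-endian unsigned integer."""
    raw, pos = _take(data, pos, n)
    return int.from_bytes(raw, "big"), pos


def _seq(data, pos, count):
    items = []
    for _ in range(count):
        item, pos = _decode(data, pos)
        items.append(item)
    return items, pos


def _decode(data, pos):
    tag, pos = _uint(data, pos, 1)
    if tag == 100:                     # ATOM_EXT: 2-byte length, Latin-1 name
        n, pos = _uint(data, pos, 2)
        name, pos = _take(data, pos, n)
        return Atom(name.decode("latin-1")), pos
    if tag == 104:                     # SMALL_TUPLE_EXT: 1-byte arity
        arity, pos = _uint(data, pos, 1)
        items, pos = _seq(data, pos, arity)
        return tuple(items), pos
    if tag == 106:                     # NIL_EXT: []
        return [], pos
    if tag == 107:                     # STRING_EXT: list of small integers
        n, pos = _uint(data, pos, 2)
        chars, pos = _take(data, pos, n)
        return list(chars), pos
    if tag == 108:                     # LIST_EXT: 4-byte length, elements, tail
        n, pos = _uint(data, pos, 4)
        items, pos = _seq(data, pos, n)
        tail, pos = _decode(data, pos)
        if tail != []:
            raise ValueError("improper list (not handled in this sketch)")
        return items, pos
    if tag == 109:                     # BINARY_EXT: 4-byte length, raw bytes
        n, pos = _uint(data, pos, 4)
        return _take(data, pos, n)
    if tag == 110:                     # SMALL_BIG_EXT: n, sign, little-endian digits
        n, pos = _uint(data, pos, 1)
        sign, pos = _uint(data, pos, 1)
        digits, pos = _take(data, pos, n)
        value = int.from_bytes(digits, "little")
        return (-value if sign else value), pos
    raise ValueError(f"unknown tag {tag} at offset {pos - 1}")


def decode_term(data):
    """Decode exactly one term; this sketch treats trailing bytes as an error."""
    if data[:1] != b"\x83":            # 131: external term format version byte
        raise ValueError("missing version byte 131")
    term, pos = _decode(data, 1)
    if pos != len(data):
        raise ValueError(f"{len(data) - pos} trailing bytes")
    return term


# Bytes copied from the binary_to_term argument in the crash report.
RAW = bytes([
    131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,
    109,0,0,0,4,116,101,120,116,
    109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,
    109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,
    57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,
    110,7,1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106,
    131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,
    109,0,0,0,4,116,101,120,116,109,0,0,0,5,83,69,81,61,49,
    109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,
    57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,
    110,7,1,191,19,13,80,192,166,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106,
])

if __name__ == "__main__":
    starts = [i for i, b in enumerate(RAW) if b == 131]
    print("byte 131 at offsets:", starts)
    print("second term:", decode_term(RAW[starts[1]:]))
    try:
        decode_term(RAW[:starts[1]])
    except ValueError as err:
        print("first term:", err)
```

With these bytes, 131 occurs at offsets 0 and 110 only. The slice starting at 110 decodes to the term shown above (binaries as Python `bytes`, the atom `p` as a `str` subclass, the string `[14]` as a list of integers), while the first slice fails at offset 107, where a zero byte sits where the decoder expects another term tag.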
> However, the first term seems to have been truncated or corrupted somehow.
> Why?  I'm not sure.  My immediate guess would be that a write failed at some
> point and left bad data in the buffer file; the vnode crashed, and when it
> started back up it couldn't read the buffer file back.  The code that reads
> the buffer data expects well-formed data and simply crashes otherwise, as
> you see.  This causes a perpetual series of crashes until the problem is
> manually resolved.  In this case you can just move the buffer files for the
> crashing vnodes out of the way, one at a time, until the problem goes away.
> This will cause you to lose some of your indexed data.  For example, in
> your case the crashing vnode is for partition
> 433883298582611803841718934712646521460354973696.  You can cd to
> riak_search/data/merge_index/433883298582611803841718934712646521460354973696
> and then mv your buffer.* files to something like corrupt-buffer.*.
>
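The workaround can also be scripted. Below is a minimal Python sketch, assuming it is run from the directory that contains `riak_search/data` (the relative path used in the message); `quarantine_buffers` and its `dry_run` flag are hypothetical. It renames every `buffer.*` file in one partition directory to `corrupt-buffer.*`, refuses to overwrite an earlier `corrupt-` file, and only reports the planned moves unless `dry_run=False`. It does nothing else: it does not stop or restart Riak Search.

```python
from pathlib import Path


def quarantine_buffers(partition_dir, dry_run=True):
    """Plan (and optionally perform) renaming buffer.* to corrupt-buffer.*.

    Hypothetical helper: acts on a single merge_index partition directory and
    returns the list of (old_path, new_path) pairs.
    """
    moves = []
    for path in sorted(Path(partition_dir).glob("buffer.*")):
        target = path.with_name("corrupt-" + path.name)
        if target.exists():
            # Never clobber files set aside by an earlier run.
            raise FileExistsError(str(target))
        moves.append((path, target))
    if not dry_run:
        for path, target in moves:
            path.rename(target)
    return moves


if __name__ == "__main__":
    # Relative path taken from the message above; adjust to your data directory.
    partition = ("riak_search/data/merge_index/"
                 "433883298582611803841718934712646521460354973696")
    for old, new in quarantine_buffers(partition):
        print(f"would move {old.name} -> {new.name}")
```

Renaming rather than deleting keeps the corrupt files around for later inspection, in the spirit of the `mv` suggestion; calling it once per crashing partition matches the one-at-a-time approach.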
> TL;DR - For one reason or another a buffer file became corrupted.  As a
> workaround you can move your buffer files out of the way.
>
> -Ryan
>
> On Sat, Jul 2, 2011 at 6:40 AM, Fyodor Yarochkin <fyodor.y at armorize.com> wrote:
>
>> Greetings,
>>
>> I've been running a single-node riaksearch instance and came across
>> this problem: after inserting roughly 200 MB of data, every subsequent
>> insert (into any bucket) started to time out, with a sequence of error
>> logs that point to a riak_search_vnode_master crash:
>>
>> =SUPERVISOR REPORT==== 2-Jul-2011::06:04:57 ===
>>     Supervisor: {local,riak_search_sup}
>>     Context:    child_terminated
>>     Reason:
>>
>> {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<[131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106,131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,5,83,69,81,61,49,109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,191,19,13,80,192,166,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106]>>]},{mi_buffer,read_value,1},{mi_buffer,open_inner,2},{mi_buffer,new,1},{mi_server,read_buffers,4},{mi_server,read_buf_and_seg,1},{mi_server,init,1},{gen_server,init_it,6}]}}},[{merge_index_backend,start,2},{riak_search_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}}},[{riak_core_vnode_master,get_vnode,2},{riak_core_vnode_master,handle_call,3},{gen_server,handle_msg,5},{proc_lib,init_p_do_apply,3}]}
>>     Offender:
>>
>> [{pid,<0.754.0>},{name,riak_search_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_search_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
>>
>>
>> (the full error log is pasted at http://pastebin.com/0Bj5cJAQ)
>>
>> Reads still work, and I am slightly confused about the reason for the
>> crash.  Available RAM is one of the things I suspect here:
>> "mem_total":1059192832, "mem_allocated":893632512.  There is no
>> shortage of disk space or other resources on the system.  I am a bit
>> stuck as to where to start troubleshooting this issue.  Any pointers or
>> hints would be greatly appreciated! :)
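To put the memory figures in perspective, a little arithmetic helps. This sketch assumes both values are byte counts (their magnitude suggests so); `mem_summary` is a hypothetical helper. It gives roughly 1010 MiB in total, about 852 MiB (84.4%) allocated and about 158 MiB unallocated.

```python
# Figures copied from the stats quoted above (assumed to be byte counts).
MEM_TOTAL = 1059192832
MEM_ALLOCATED = 893632512
MIB = 1024 * 1024


def mem_summary(total, allocated):
    """Hypothetical helper: express total/allocated/free memory in MiB."""
    free = total - allocated
    return {
        "total_mib": total / MIB,
        "allocated_mib": allocated / MIB,
        "free_mib": free / MIB,
        "allocated_pct": 100 * allocated / total,
    }


if __name__ == "__main__":
    for key, value in mem_summary(MEM_TOTAL, MEM_ALLOCATED).items():
        print(f"{key}: {value:.1f}")
```

The numbers alone do not show that memory caused the crash; Ryan's reply traces it to a corrupted buffer file.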
>>
>>
>> regards,
>> -F
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
>