Troubleshooting riak inserts

Ryan Zezeski rzezeski at basho.com
Tue Jul 5 09:15:57 EDT 2011


Fyodor,

I can't tell you exactly what caused this to happen, but I can tell you how
to move past it.  Search uses two data structures to store the index:
buffers and segments.  A buffer is an in-memory structure backed by a file
on disk.  Over time, buffers are converted to segments.  All segments live
on disk, but there is an in-memory offset table used to perform lookups.
During a request, if the vnode that should handle that request is not
already up, it will be started.  During its initialization the vnode reads
all buffers and segment tables into memory.  In your case, each time the
vnode is started it crashes while trying to read the buffer file.  Looking
at the binary in your trace, it looks like the data somehow became
corrupted.  First off, I'm confused by the syntax of the binary in your
stack trace, i.e. what's up with the brackets surrounding that binary data?
That aside, I see two terms in that data, i.e. there are two occurrences of
the byte '131', which indicates the start of a term.  The second term is
valid:

[{{<<"logs">>,<<"text">>,<<"SEQ=1">>},
  <<"ae2b12ae-a155-11e0-9e33-00219bfc3293">>,
  -1309244813808575,
  [{p,[14]}]}]

However, the first term seems to have been truncated or corrupted somehow.
Why?  I'm not sure.  My immediate guess would be that a write failed at some
point and left bad data in the buffer file, the vnode crashed, and then when
it started back up it couldn't read the buffer file back in.  The code that
reads the buffer data expects well-formed data and simply crashes otherwise,
as you see.
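
If you want to double-check that reading of the trace yourself, here is a
minimal Python sketch.  It is not part of Riak or merge_index; the names
TRACE_BYTES, decode_term and decode_external are made up for illustration,
and the decoder only handles the handful of external term format tags that
show up in this trace.  It also assumes every 131 byte starts a new term,
which only holds here because 131 does not appear anywhere else in this
particular payload.

```python
# Bytes copied from the binary_to_term argument in the stack trace.
TRACE_BYTES = bytes([
    # first term
    131,108,0,0,0,2,104,4,104,3,
    109,0,0,0,4,108,111,103,115,
    109,0,0,0,4,116,101,120,116,
    109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,
    109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,
    57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,
    110,7,1,112,21,181,79,192,166,4,
    108,0,0,0,1,104,0,0,0,106,
    # second term
    131,108,0,0,0,1,104,4,104,3,
    109,0,0,0,4,108,111,103,115,
    109,0,0,0,4,116,101,120,116,
    109,0,0,0,5,83,69,81,61,49,
    109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,
    57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,
    110,7,1,191,19,13,80,192,166,4,
    108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106,
])


def decode_term(buf, pos):
    """Decode one term at buf[pos]; return (value, position after it)."""
    tag = buf[pos]
    pos += 1
    if tag == 97:    # SMALL_INTEGER_EXT: one unsigned byte
        return buf[pos], pos + 1
    if tag == 100:   # ATOM_EXT: 2-byte length, then the atom name
        n = int.from_bytes(buf[pos:pos + 2], "big")
        return buf[pos + 2:pos + 2 + n].decode("latin-1"), pos + 2 + n
    if tag == 104:   # SMALL_TUPLE_EXT: 1-byte arity, then the elements
        arity = buf[pos]
        pos += 1
        items = []
        for _ in range(arity):
            item, pos = decode_term(buf, pos)
            items.append(item)
        return tuple(items), pos
    if tag == 106:   # NIL_EXT: the empty list
        return [], pos
    if tag == 107:   # STRING_EXT: 2-byte length, then a list of bytes
        n = int.from_bytes(buf[pos:pos + 2], "big")
        return list(buf[pos + 2:pos + 2 + n]), pos + 2 + n
    if tag == 108:   # LIST_EXT: 4-byte length, elements, then the tail
        n = int.from_bytes(buf[pos:pos + 4], "big")
        pos += 4
        items = []
        for _ in range(n):
            item, pos = decode_term(buf, pos)
            items.append(item)
        tail, pos = decode_term(buf, pos)
        if tail != []:
            raise ValueError("improper list is not handled by this sketch")
        return items, pos
    if tag == 109:   # BINARY_EXT: 4-byte length, then raw bytes
        n = int.from_bytes(buf[pos:pos + 4], "big")
        return bytes(buf[pos + 4:pos + 4 + n]), pos + 4 + n
    if tag == 110:   # SMALL_BIG_EXT: digit count, sign, little-endian digits
        n, sign = buf[pos], buf[pos + 1]
        magnitude = int.from_bytes(buf[pos + 2:pos + 2 + n], "little")
        return (-magnitude if sign else magnitude), pos + 2 + n
    raise ValueError(f"unexpected tag {tag} at offset {pos - 1}")


def decode_external(chunk):
    """Decode a chunk that starts with the version byte 131."""
    if chunk[0] != 131:
        raise ValueError("chunk does not start with version byte 131")
    value, _ = decode_term(chunk, 1)
    return value


# Split the payload at every 131 byte (see the assumption above).
starts = [i for i, b in enumerate(TRACE_BYTES) if b == 131]
chunks = [TRACE_BYTES[s:e]
          for s, e in zip(starts, starts[1:] + [len(TRACE_BYTES)])]

for chunk in chunks:
    try:
        print("ok:", decode_external(chunk))
    except (ValueError, IndexError) as err:
        print("corrupt:", err)
```

With the bytes from the trace, the first chunk is rejected near its end and
the second one decodes to the same term shown above (atoms come back as
Python strings and binaries as bytes).
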
The result is a perpetual series of crashes until the problem is resolved
by hand.  In this case you can just move the buffer files for the crashing
vnodes out of the way, one at a time, until the problem goes away.  This
will cause you to lose some of your indexed data.  For example, in your case
the crashing vnode is for partition
433883298582611803841718934712646521460354973696.  You can cd to
riak_search/data/merge_index/433883298582611803841718934712646521460354973696
and then mv your buffer.* files to something like corrupt-buffer.*.
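
If you would rather script the move than do it by hand, the following is a
rough Python sketch of the same step.  The helper names list_buffers and
set_aside are hypothetical, and I'm assuming nothing else is writing to the
partition directory while you do this.  Files are renamed rather than
deleted, so they can be put back.

```python
from pathlib import Path


def list_buffers(partition_dir):
    """Return the buffer.* files in a merge_index partition directory."""
    return sorted(Path(partition_dir).glob("buffer.*"))


def set_aside(buffer_file):
    """Rename buffer.N to corrupt-buffer.N in place and return the new path."""
    src = Path(buffer_file)
    dst = src.with_name("corrupt-" + src.name)
    if dst.exists():
        # Refuse to overwrite a file that was set aside earlier.
        raise FileExistsError(dst)
    src.rename(dst)
    return dst
```

For example, set_aside(list_buffers(partition_dir)[0]) moves one buffer file
out of the way, where partition_dir is the directory named above.  Try the
vnode again after each file and stop once the crashes go away.
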

TL;DR - For one reason or another a buffer file became corrupted.  As a
workaround you can move your buffer files out of the way.

-Ryan

On Sat, Jul 2, 2011 at 6:40 AM, Fyodor Yarochkin <fyodor.y at armorize.com> wrote:

> Greetings,
>
>  I've been running a single-node riaksearch instance and came
> across this problem: after inserting roughly 200 MB of data, every
> subsequent insert (into any bucket) started to time out with a
> sequence of error logs that point to a riak_search_vnode_master crash:
>
> =SUPERVISOR REPORT==== 2-Jul-2011::06:04:57 ===
>     Supervisor: {local,riak_search_sup}
>     Context:    child_terminated
>     Reason:
>
> {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<[131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106,131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,5,83,69,81,61,49,109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,191,19,13,80,192,166,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106]>>]},{mi_buffer,read_value,1},{mi_buffer,open_inner,2},{mi_buffer,new,1},{mi_server,read_buffers,4},{mi_server,read_buf_and_seg,1},{mi_server,init,1},{gen_server,init_it,6}]}}},[{merge_index_backend,start,2},{riak_search_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}}},[{riak_core_vnode_master,get_vnode,2},{riak_core_vnode_master,handle_call,3},{gen_server,handle_msg,5},{proc_lib,init_p_do_apply,3}]}
>     Offender:
>
> [{pid,<0.754.0>},{name,riak_search_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_search_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
>
>
> (the full paste of error dump log is here http://pastebin.com/0Bj5cJAQ)
>
> Reads still work and I am slightly confused about the reason for the
> crash.  Available RAM is one of the things I suspect here:
> "mem_total":1059192832,"mem_allocated":893632512,". There is no
> shortage of disk space or other resources on the system.  I am
> a bit stuck as to where to start troubleshooting this issue. Any
> pointers or hints would be greatly appreciated! :)
>
>
> regards,
> -F