riak-search corruption causing multiple nodes to crash

Robby Grossman robby at freerobby.com
Wed Aug 29 11:03:37 EDT 2012


We have a Riak cluster on EC2 (large instances, ephemeral storage, normally about 5 nodes though currently 7 after some shuffling) that has seen multiple nodes go down over the last week due to corrupted merge_index data. Some boxes go down more often than others, but there's no predictable pattern and it seems any box can be affected. The errors look similar to what I read about in this thread:

http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-July/008933.html

except that it's occurring on multiple nodes, which prevents us from doing repairs from adjacent nodes.

I've tried a few things:

- Stopping riak on a single box, clearing out merge_index/, and restarting riak. This works for several hours to about a day, but the index eventually becomes corrupt again.
- Stopping all nodes, clearing out all merge_index folders, and restarting all nodes. As above, this works for several hours, but eventually we see corrupted merge indexes again. Obviously this also loses all past index data, so even if it worked it wouldn't be a suitable solution; I just needed to stop the nodes from going down.
- Using Ryan Zezeski's script to detect bad MI files: https://gist.github.com/3250870. There are too many bad files for this to be a sustainable ongoing effort, though.
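
For the "clear out merge_index/" step, a variation that keeps the corrupt files around for later inspection is to move the directory aside rather than delete it. The following is only a minimal sketch under assumptions: the function name quarantine_merge_index and the "merge_index.corrupt-<stamp>" naming are made up for illustration, the data root (for example /var/lib/riak) depends on the install, and the node is assumed to be stopped before the call and started again afterwards.

```python
import time
from pathlib import Path


def quarantine_merge_index(data_root, stamp=None):
    """Move <data_root>/merge_index aside and leave an empty one in its place.

    Assumes riak is already stopped on this box. Returns the path of the
    quarantined copy, or None if there was no merge_index directory.
    """
    src = Path(data_root) / "merge_index"
    if not src.is_dir():
        return None
    # Timestamp suffix so repeated incidents do not overwrite each other.
    stamp = stamp or time.strftime("%Y%m%dT%H%M%S")
    dst = src.with_name(f"merge_index.corrupt-{stamp}")
    src.rename(dst)
    # Recreate an empty directory, matching the "clear out" step above.
    # Assumption: the script runs as the user that owns the riak data,
    # so the new directory gets the right ownership.
    src.mkdir()
    return dst
```

The quarantined copies still take disk space, so on small ephemeral volumes they would need to be copied elsewhere or pruned.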

I spoke to Tom Santero, who hypothesized that something in the underlying EC2 infrastructure could be causing the corruption. Moving off EC2 isn't an option for us right now, but what I am doing (as I write this) is widening the cluster to span three availability zones (all in US-East). My thinking is that if we continue to see problems but they're all in one zone, that would support Tom's hypothesis (though seeing them in every zone would not disprove it).
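
To make the zone comparison concrete, the crashes can be tallied per availability zone and divided by the number of nodes in each zone, so that a zone holding more nodes does not look worse just for that reason. This is a rough sketch only: the function name crash_rate_by_zone, the node names, and the sample numbers are all hypothetical, not data from our cluster.

```python
from collections import Counter


def crash_rate_by_zone(node_zone, crashes):
    """Return {zone: crashes per node} for every zone that has nodes.

    node_zone: dict mapping node name -> availability zone
    crashes:   iterable of node names, one entry per observed crash
    """
    nodes_per_zone = Counter(node_zone.values())
    crashes_per_zone = Counter(node_zone[node] for node in crashes)
    return {
        zone: crashes_per_zone[zone] / nodes_per_zone[zone]
        for zone in nodes_per_zone
    }


# Hypothetical example: four nodes over three zones, four crashes.
example_nodes = {
    "riak1": "us-east-1a",
    "riak2": "us-east-1a",
    "riak3": "us-east-1b",
    "riak4": "us-east-1c",
}
example_crashes = ["riak1", "riak2", "riak1", "riak3"]
print(crash_rate_by_zone(example_nodes, example_crashes))
```

With only a handful of nodes and crashes the rates will be noisy, so a lopsided result would be a hint rather than proof.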

Below is some sample error.log/crash.log output. I'd be appreciative if anybody has thoughts as to what is causing these problems, or further tests I can run to diagnose/troubleshoot.

Thanks in advance,
Robby

error.log:
2012-08-29 04:50:01.829 [error] <0.1909.0> CRASH REPORT Process <0.1909.0> with 0 neighbours exited with reason: bad argument in call to erlang:binary_to_term(<<131,108,0,0,0,4,104,4,
104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,...>>) in mi_buffer:read_value/2 line 162 in gen_server:init_it/6 line 328
2012-08-29 04:50:01.833 [error] <0.1908.0> CRASH REPORT Process <0.1908.0> with 0 neighbours exited with reason: no match of right hand value {error,{badarg,[{erlang,binary_to_term,[<
<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98
,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,
115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}} in merge_index_backend:start/2 line 47 in gen_fsm:init_it/6 line 379
2012-08-29 04:50:01.836 [error] <0.154.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1908.0> exit with reason no matc
h of right hand value {error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0
,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,
0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,...>>],...},...]}} in merge_index_
backend:start/2 line 47 in context child_terminated
2012-08-29 04:50:01.839 [error] <0.1906.0> gen_server riak_core_vnode_manager terminated with reason: no match of right hand value {error,{{badmatch,{error,{badarg,[{erlang,binary_to_
term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,10
1,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,1
10,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in riak_core_vnode_manager:get_vnode/3 line 489
2012-08-29 04:50:01.858 [error] <0.1906.0> CRASH REPORT Process riak_core_vnode_manager with 0 neighbours exited with reason: no match of right hand value {error,{{badmatch,{error,{ba
darg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100
,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104
,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in riak_core_vnode_manager:get_vnode/3 line 489 in gen_serve
r:terminate/6 line 747
2012-08-29 04:50:01.881 [error] <0.152.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1906.0> exit with reason n
o match of right hand value {error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110
,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,
4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,...>>],...},...]}}},...}} in r
iak_core_vnode_manager:get_vnode/3 line 489 in context child_terminated


crash.log:
2012-08-29 04:48:19 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_manager:init/1
    pid: <0.3501.0>
    registered_name: riak_core_vnode_manager
    exception exit: {{{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<131,108,0,0,0,4,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95
,110,111,116,101,115,109,0,0,0,2,115,111,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,
200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,39,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,10,108,105,110,107,95,110,111,116,101,115,109,0,0,0,16,83,116,97,121,45,97,116
,45,72,111,109,101,45,77,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,10
8,0,0,0,1,104,2,100,0,1,112,107,0,1,31,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,109,0,0,0,2,104,49,109,0,0,0,3,99,111,109,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,
100,99,98,101,55,102,100,102,52,54,51,54,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,0,106,104,4,104,3,109,0,0,0,5,108,105,110,107,115,1
09,0,0,0,9,116,105,109,101,115,116,97,109,112,109,0,0,0,10,49,51,52,54,50,48,52,51,53,52,109,0,0,0,32,55,100,51,97,55,48,50,54,56,101,56,98,52,100,99,98,101,55,102,100,102,52,54,51,54
,98,50,50,102,49,56,97,110,7,1,238,11,38,162,93,200,4,108,0,0,0,1,104,2,100,0,1,112,107,0,0,0,1,43>>],[]},{mi_buffer,read_value,2,[{file,"src/mi_buffer.erl"},{line,162}]},{mi_buffer,o
pen_inner,3,[{file,"src/mi_buffer.erl"},{line,70}]},{mi_buffer,new,1,[{file,"src/mi_buffer.erl"},{line,62}]},{mi_server,read_buffers,4,[{file,"src/mi_server.erl"},{line,613}]},{mi_ser
ver,read_buf_and_seg,1,[{file,"src/mi_server.erl"},{line,585}]},{mi_server,init,1,[{file,"src/mi_server.erl"},{line,143}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]}]
}}},[{merge_index_backend,start,2,[{file,"src/merge_index_backend.erl"},{line,47}]},{riak_search_vnode,init,1,[{file,"src/riak_search_vnode.erl"},{line,135}]},{riak_core_vnode,init,1,
[{file,"src/riak_core_vnode.erl"},{line,123}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}},[{riak_core_vn
ode_manager,get_vnode,3,[{file,"src/riak_core_vnode_manager.erl"},{line,489}]},{riak_core_vnode_manager,maybe_trigger_handoff,3,[{file,"src/riak_core_vnode_manager.erl"},{line,613}]},
{riak_core_vnode_manager,'-trigger_ownership_handoff/3-lc$^2/1-2-',2,[{file,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,trigger_ownership_handoff,3,[{file
,"src/riak_core_vnode_manager.erl"},{line,448}]},{riak_core_vnode_manager,handle_cast,2,[{file,"src/riak_core_vnode_manager.erl"},{line,378}]},{gen_server,handle_msg,5,[{file,"gen_ser
ver.erl"},{line,607}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,747}]},{proc_lib,init_p_do_apply,3,[{f
ile,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_core_sup,<0.150.0>]
    messages: []
    links: [<0.151.0>]
    dictionary: []
    trap_exit: false
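
The byte list in the crash report is a term in Erlang's external term format (it starts with the version byte 131), and that is what binary_to_term is rejecting. As a rough way to look inside it without an Erlang shell, here is a minimal Python sketch of a decoder that handles only the tags seen in this log. The names decode and TermDecodeError are made up for illustration, and returning atoms as ("atom", name) tuples is just a convenience of this sketch. THIRD_ENTRY and LAST_ENTRY are the third and fourth list elements copied from the crash.log binary above with the line wrapping removed.

```python
import struct


class TermDecodeError(ValueError):
    """Raised when the bytes are not a term this sketch can decode."""


def _take(buf, pos, n):
    # Return n bytes starting at pos, or fail if the buffer is too short.
    if pos + n > len(buf):
        raise TermDecodeError(f"truncated at offset {pos}")
    return buf[pos:pos + n], pos + n


def _seq(buf, pos, count):
    # Decode `count` consecutive terms.
    items = []
    for _ in range(count):
        item, pos = decode(buf, pos)
        items.append(item)
    return items, pos


def decode(buf, pos=0):
    """Decode one term at pos (no leading 131); return (value, next_pos)."""
    head, pos = _take(buf, pos, 1)
    tag = head[0]
    if tag == 100:  # ATOM_EXT: 2-byte length, then the name
        raw, pos = _take(buf, pos, 2)
        name, pos = _take(buf, pos, struct.unpack(">H", raw)[0])
        return ("atom", name.decode("latin-1")), pos
    if tag == 104:  # SMALL_TUPLE_EXT: 1-byte arity, then the elements
        raw, pos = _take(buf, pos, 1)
        items, pos = _seq(buf, pos, raw[0])
        return tuple(items), pos
    if tag == 106:  # NIL_EXT: the empty list
        return [], pos
    if tag == 107:  # STRING_EXT: 2-byte length, then one byte per char
        raw, pos = _take(buf, pos, 2)
        chars, pos = _take(buf, pos, struct.unpack(">H", raw)[0])
        return list(chars), pos
    if tag == 108:  # LIST_EXT: 4-byte length, elements, then the tail
        raw, pos = _take(buf, pos, 4)
        items, pos = _seq(buf, pos, struct.unpack(">I", raw)[0])
        tail, pos = decode(buf, pos)
        if tail != []:
            raise TermDecodeError("improper list (not handled here)")
        return items, pos
    if tag == 109:  # BINARY_EXT: 4-byte length, then the bytes
        raw, pos = _take(buf, pos, 4)
        data, pos = _take(buf, pos, struct.unpack(">I", raw)[0])
        return bytes(data), pos
    if tag == 110:  # SMALL_BIG_EXT: count, sign, little-endian digits
        raw, pos = _take(buf, pos, 2)
        digits, pos = _take(buf, pos, raw[0])
        value = int.from_bytes(digits, "little")
        return (-value if raw[1] else value), pos
    raise TermDecodeError(f"unknown tag {tag} at offset {pos - 1}")


# Bytes shared by both entries: the 32-character binary and the big integer.
_COMMON = [109, 0, 0, 0, 32, 55, 100, 51, 97, 55, 48, 50, 54, 56, 101, 56,
           98, 52, 100, 99, 98, 101, 55, 102, 100, 102, 52, 54, 51, 54, 98,
           50, 50, 102, 49, 56, 97, 110, 7, 1, 238, 11, 38, 162, 93, 200, 4,
           108, 0, 0, 0, 1, 104, 2, 100, 0, 1, 112]
_LINKS = [104, 4, 104, 3, 109, 0, 0, 0, 5, 108, 105, 110, 107, 115]

THIRD_ENTRY = bytes(_LINKS + [109, 0, 0, 0, 2, 104, 49,
                              109, 0, 0, 0, 3, 99, 111, 109]
                    + _COMMON + [107, 0, 1, 0, 106])
LAST_ENTRY = bytes(_LINKS + [109, 0, 0, 0, 9, 116, 105, 109, 101, 115, 116,
                             97, 109, 112, 109, 0, 0, 0, 10, 49, 51, 52, 54,
                             50, 48, 52, 51, 53, 52]
                   + _COMMON + [107, 0, 0, 0, 1, 43])
```

Under this sketch, decode(THIRD_ENTRY) returns a 4-tuple whose first element is (b"links", b"h1", b"com"), while decode(LAST_ENTRY) raises TermDecodeError: its property list ends in 107,0,0,0,1,43 where the earlier entries end in 107,0,1,N,106. This only shows where decoding stops; it says nothing about how the bytes got that way.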



-- 
Robby Grossman
@freerobby (http://twitter.com/freerobby)
http://rob.by

