Upgrade to 1.1.2 problems

Joseph Blomstedt joe at basho.com
Wed Apr 25 19:38:54 EDT 2012


Phil,

Given the following lines,

>         {reason,
>          {{badmatch,{'EXIT',noproc}},
>           [{riak_core_vnode_proxy,call,2},

I'm inclined to think this is an issue with the new vnode routing
layer introduced in Riak 1.1. On your Riak 1.1.2 node, make sure your
app.config does not contain {legacy_vnode_routing, false} in the
riak_core section. If it does, either delete the line or change it to
true. If this was the issue, you should set it back to false after
upgrading the rest of your cluster.
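For reference, here is a minimal sketch of the relevant app.config entry on the 1.1.2 node during the rolling upgrade. It assumes the standard app.config layout, which is a list of {Application, Settings} tuples in Erlang term syntax. Every other setting in the file is left out here and should stay as it is:

```erlang
%% Sketch of app.config on the upgraded 1.1.2 node (Erlang term syntax).
%% Assumption: all other riak_core settings and all other application
%% sections (riak_kv, etc.) are omitted for brevity; keep the ones
%% already present in your file.
[
 {riak_core, [
   %% Keep the legacy (pre-1.1) vnode routing while any node in the
   %% cluster still runs 1.0.x. Once every node runs 1.1.x, change this
   %% to false. The file is read at node startup, so each node needs a
   %% restart to pick up the change.
   {legacy_vnode_routing, true}
 ]}
].
```

To check what a running node actually loaded, you can attach to it with `riak attach` and evaluate `application:get_env(riak_core, legacy_vnode_routing).` This returns `{ok, Value}` if the key is set in the application environment and `undefined` if it is absent.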

Also, if this turns out to be the issue, could you please let me know
how you performed the upgrade? Our official packages are designed to
retain your old app.config file when installed over an existing Riak
installation, and in that case this line would have been missing
because it was not in the 1.0.3 config. It would be useful to know
whether you used an official package and the app.config was
overwritten, or whether you set up your app.config manually and simply
missed this option.

On the plus side, the next release of Riak should include built-in
capability negotiation, which should remove the need for users to
manually manage the legacy_vnode_routing, mapred_system,
listkeys_backpressure, and similar settings during an upgrade.

-Joe

On Wed, Apr 25, 2012 at 2:05 PM, Phil Sorber <phil at omniti.com> wrote:
> We have a 4-node Riak 1.0.0 cluster running in production that we want
> to upgrade to 1.1.2. We set up a test environment that mimics the
> production one as closely as we possibly can with EC2 hosts. Our first
> attempt to jump from 1.0.0 -> 1.1.2 failed, even though we took into
> account the mapred_system and listkeys_backpressure issues. We then
> tried 1.0.0 -> 1.0.3, since that involves only the mapred_system
> issue, and that upgrade worked. When we then upgraded 1.0.3 -> 1.1.2,
> we had similar problems. Details below.
>
> --
> # riak-admin transfers
> Attempting to restart script through sudo -u riak
> 'riak@50.16.31.226' waiting to handoff 14 partitions
> --
>
> Sometimes this showed as many as 48 transfers, always from the node
> that we upgraded. Eventually it showed no transfers left. The
> upgrade from 1.0.0 -> 1.0.3 didn't do this.
>
> We tested a link-walking query similar to the one we run in
> production. On 2 of the 3 nodes still running 1.0.3 it worked fine. On
> the 3rd node, this happened:
>
> Curl run on 1.0.3 node:
>
> --
> curl -v http://localhost:8098/riak/email_address/riakupgrade@gmail.com/_,_,1
> * About to connect() to localhost port 8098 (#0)
> *   Trying 127.0.0.1... connected
> * Connected to localhost (127.0.0.1) port 8098 (#0)
>> GET /riak/email_address/riakupgrade@gmail.com/_,_,1 HTTP/1.1
>> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
>> Host: localhost:8098
>> Accept: */*
>>
> < HTTP/1.1 500 Internal Server Error
> < Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
> < Expires: Wed, 25 Apr 2012 20:55:30 GMT
> < Date: Wed, 25 Apr 2012 20:45:30 GMT
> < Content-Type: text/html
> < Content-Length: 2068
> <
> <html><head><title>500 Internal Server
> Error</title></head><body><h1>Internal Server Error</h1>The server
> encountered an error while processing this request:<br><pre>{error,
>  {error,
>  {badmatch,
>   {eoi,[],
>    [{{reduce,0},
>      {trace,
>       [error],
>       {error,
>        [{module,riak_kv_w_reduce},
>         {partition,1438665674247607560106752257205091097473808596992},
>         {details,
>          [{fitting,
>            {fitting,<0.848.0>,#Ref<0.0.0.5472>,
>             #Fun<riak_kv_mrc_pipe.3.19126064>,1}},
>           {name,{reduce,0}},
>           {module,riak_kv_w_reduce},
>           {arg,{rct,#Fun<riak_kv_mapreduce.reduce_set_union.2>,none}},
>           {output,
>            {fitting,<0.847.0>,#Ref<0.0.0.5472>,
>             #Fun<riak_kv_mrc_pipe.1.120571329>,
>             #Fun<riak_kv_mrc_pipe.2.112900629>}},
>           {options,
>            [{sink,{fitting,<0.119.0>,#Ref<0.0.0.5472>,sink,undefined}},
>             {log,sink},
>             {trace,
>              {set,1,16,16,8,80,48,
>               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>               {{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}]},
>           {q_limit,64}]},
>         {reason,
>          {{badmatch,{'EXIT',noproc}},
>           [{riak_core_vnode_proxy,call,2},
>            {riak_pipe_vnode,queue_work_send,4},
>            {riak_pipe_vnode,queue_work_erracc,6},
>            {riak_kv_w_reduce,'-done/1-lc$^0/1-0-',3},
>            {riak_kv_w_reduce,done,1},
>            {riak_pipe_vnode_worker,wait_for_input,2},
>            {gen_fsm,handle_msg,7},
>            {proc_lib,init_p_do_apply,3}]}},
>         {state,{working,done}}]}}}]}},
>  [{riak_kv_mrc_pipe,collect_outputs,3},
>   {riak_kv_wm_link_walker,execute_segment,3},
>   {riak_kv_wm_link_walker,execute_query,3},
>   {riak_kv_wm_link_walker,to_multipart_mixed,2},
>   {webmachine_resource,resource_call,3},
>   {webmachine_resource,do,3},
>   {webmachine_decision_core,resource_call,1},
>   {webmachine_decision_core,decision,1}]}}</pre><P><HR><ADDRESS>mochiweb+webmachine
> web server</AD* Connection #0 to host localhost left intact
> * Closing connection #0
> --
>
> After that, this appeared in the error.log on that node:
>
> error.log on 1.0.3 node:
>
> --
> 2012-04-25 20:45:30.373 [error] <0.119.0> webmachine error:
> path="/riak/email_address/riakupgrade@gmail.com/_,_,1"
> {error,{error,{badmatch,{eoi,[],[{{reduce,0},{trace,[error],{error,[{module,riak_kv_w_reduce},{partition,1438665674247607560106752257205091097473808596992},{details,[{fitting,{fitting,<0.848.0>,#Ref<0.0.0.5472>,#Fun<riak_kv_mrc_pipe.3.19126064>,1}},{name,{reduce,0}},{module,riak_kv_w_reduce},{arg,{rct,#Fun<riak_kv_mapreduce.reduce_set_union.2>,none}},{output,{fitting,<0.847.0>,#Ref<0.0.0.5472>,#Fun<riak_kv_mrc_pipe.1.120571329>,#Fun<riak_kv_mrc_pipe.2.112900629>}},{options,[{sink,{fitting,<0.119.0>,#Ref<0.0.0.5472>,sink,undefined}},{log,sink},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}]},{q_limit,64}]},{reason,{{badmatch,{'EXIT',noproc}},[{riak_core_vnode_proxy,call,2},{riak_pipe_vnode,queue_work_send,4},{riak_pipe_vnode,queue_work_erracc,6},{riak_kv_w_reduce,'-done/1-lc$^0/1-0-',3},{riak_kv_w_reduce,done,1},{riak_pipe_vnode_worker,wait_for_input,2},{gen_fsm,handle_msg,7},{proc_lib,init_p_do_apply,3}]}},{state,{working,done}}]}}}]}},[{riak_kv_mrc_pipe,collect_outputs,3},{riak_kv_wm_link_walker,execute_segment,3},{riak_kv_wm_link_walker,execute_query,3},{riak_kv_wm_link_walker,to_multipart_mixed,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3},{webmachine_decision_core,resource_call,1},{webmachine_decision_core,decision,1}]}}
> --
>
> And this was in the error.log on the 1.1.2 node:
>
> error.log on 1.1.2 node:
>
> --
> 2012-04-25 20:45:30.234 [error] <0.1710.0> gen_fsm <0.1710.0> in state
> wait_for_input terminated with reason: no match of right hand value
> {'EXIT',noproc} in riak_core_vnode_proxy:call/2
> 2012-04-25 20:45:30.243 [error] <0.1710.0> CRASH REPORT Process
> <0.1710.0> with 0 neighbours crashed with reason: no match of right
> hand value {'EXIT',noproc} in riak_core_vnode_proxy:call/2
> 2012-04-25 20:45:30.245 [error] <0.331.0> Supervisor
> riak_pipe_vnode_worker_sup had child undefined started with
> {riak_pipe_vnode_worker,start_link,undefined} at <0.1710.0> exit with
> reason no match of right hand value {'EXIT',noproc} in
> riak_core_vnode_proxy:call/2 in context child_terminated
> --
>
> Running the same curl on the 1.1.2 node returned a 404 after a
> long timeout:
>
> Curl run on the 1.1.2 node:
>
> --
> curl -v http://localhost:8098/riak/email_address/riakupgrade@gmail.com/_,_,1
> * About to connect() to localhost port 8098 (#0)
> *   Trying 127.0.0.1... connected
> * Connected to localhost (127.0.0.1) port 8098 (#0)
>> GET /riak/email_address/riakupgrade@gmail.com/_,_,1 HTTP/1.1
>> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
>> Host: localhost:8098
>> Accept: */*
>>
> < HTTP/1.1 404 Object Not Found
> < Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
> < Date: Wed, 25 Apr 2012 20:49:40 GMT
> < Content-Type: text/html
> < Content-Length: 193
> <
> * Connection #0 to host localhost left intact
> * Closing connection #0
> <HTML><HEAD><TITLE>404 Not Found</TITLE></HEAD><BODY><H1>Not
> Found</H1>The requested document was not found on this
> server.<P><HR><ADDRESS>mochiweb+webmachine web
> server</ADDRESS></BODY></HTML>
> --
>
> There was nothing more in any of the error logs after this. Fetching
> keys directly also fails on this node. Riak is installed from the
> Debian packages. I can send you whatever other info is necessary. We
> tried many different combinations, but I think this one is the most
> correct and produced the most useful error messages. Any help is
> appreciated.
>
> Thanks.
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



-- 
Joseph Blomstedt <joe at basho.com>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/



