bitcask issue - write_locked

Konstantin Kalin konstantin.kalin at gmail.com
Thu Oct 31 10:42:30 EDT 2013


We use riak 1.2.1 in production. Recently we had two incidents when a Riak
node (out of 27 nodes) was in limbo state: riak-core worked fine but
bitcask backend was constantly crashing (see below). Both times it happened
immediately after Riak was restarted due to Linux reboot.

I looked at the source code of bitcask and looks like it happens because
there was a stuck alive process that held the lock.

Major issue that this node causes timeouts on others and as result Java
client got timeout error and bad-message code. Thus it was impossible to do
any update for data that lands to that node as a primary replica (It's my
guess). For reads there was significant delay ~60 seconds. We have W=3, R=2.

Is it known issue ? What I can do besides of upgrading to latest version
and hoping it will solve the issue :)

Thank you,
Konstantin.

Bad Riak node.
==================
2013-10-30 21:40:27.000 [info] <0.7.0> Application erlydtl started on node '
riak at dirapp81.mlp.marquee.net'
2013-10-30 21:40:45.229 [info] <0.407.0>@riak_core:wait_for_service:439
Wait complete for service riak_kv (18 seconds)
2013-10-30 21:40:45.473 [error] <0.798.0> gen_fsm <0.798.0> in state active
terminated with reason: bad return value:
{error,{write_locked,locked,"/var/lib/riak/bitcask/1324485858831130769622089379649131486563188867072"}}
2013-10-30 21:40:45.485 [error] <0.798.0> CRASH REPORT Process <0.798.0>
with 1 neighbours exited with reason: bad return value:
{error,{write_locked,locked,"/var/lib/riak/bitcask/1324485858831130769622089379649131486563188867072"}}
in gen_fsm:terminate/7 line 611
2013-10-30 21:40:45.486 [error] <0.140.0> Supervisor riak_core_vnode_sup
had child undefined started with {riak_core_vnode,start_link,undefined} at
<0.798.0> exit with reason bad return value:
{error,{write_locked,locked,"/var/lib/riak/bitcask/1324485858831130769622089379649131486563188867072"}}
in context child_terminated
2013-10-30 21:40:45.487 [error] <0.845.0> gen_fsm <0.845.0> in state ready
terminated with reason: bad return value:
{error,{write_locked,locked,"/var/lib/riak/bitcask/1324485858831130769622089379649131486563188867072"}}
==================


Another good Riak node:
==================
2013-10-30 21:37:36.348 [error] <0.15018.5661>@riak_kv_put_fsm:prepare:220
Unable to forward put for {<<"userSessions">>,<<"123">>} to '
riak at dirapp81.mlp.marquee.net' - nodedown
2013-10-30 21:37:36.348 [error] <0.14950.5661>@riak_kv_put_fsm:prepare:220
Unable to forward put for {<<"userSessions">>,<<"123">>} to '
riak at dirapp81.mlp.marquee.net' - nodedown
2013-10-30 21:41:52.313 [error]
<0.16728.5661>@riak_api_pb_server:handle_info:170 Unrecognized message
{11673037,{error,timeout}}
2013-10-30 21:41:52.313 [error]
<0.16733.5661>@riak_api_pb_server:handle_info:170 Unrecognized message
{105876903,{error,timeout}}
==================


One more good Riak node (that temporary held vnode of bad Riak node during
it's down):
==================
2013-10-30 21:37:37.345 [error] <0.3663.0> ** Node '
riak at dirapp81.mlp.marquee.net' not responding **
** Removing (timedout) connection **

2013-10-30 21:37:37.345 [error] <0.30864.17>@riak_kv_put_fsm:prepare:220
Unable to forward put for {<<"userSessions">>,<<"5457016">>} to '
riak at dirapp81.mlp.marquee.net' - nodedown
2013-10-30 21:41:44.358 [info]
<0.151.0>@riak_core_handoff_manager:handle_info:271 An outbound handoff of
partition riak_kv_vnode 1438665674247607560106752257205091097473808596992
was terminated for reason: {shutdown,max_concurrency}
2013-10-30 21:42:44.361 [info]
<0.151.0>@riak_core_handoff_manager:handle_info:271 An outbound handoff of
partition riak_kv_vnode 1438665674247607560106752257205091097473808596992
was terminated for reason: {shutdown,max_concurrency}
2013-10-30 21:44:30.164 [info]
<0.151.0>@riak_core_handoff_manager:handle_info:271 An outbound handoff of
partition riak_kv_vnode 1438665674247607560106752257205091097473808596992
was terminated for reason: {shutdown,max_concurrency}
2013-10-30 21:45:30.384 [info]
<0.2262.18>@riak_core_handoff_sender:start_fold:126 Starting hinted_handoff
transfer of riak_kv_vnode from 'riak at dirapp41.mlp.marquee.net'
1438665674247607560106752257205091097473808596992 to '
riak at dirapp81.mlp.marquee.net'
1438665674247607560106752257205091097473808596992
2013-10-30 21:45:43.917 [error]
<0.2262.18>@riak_core_handoff_sender:start_fold:215 hinted_handoff transfer
of riak_kv_vnode from 'riak at dirapp41.mlp.marquee.net'
1438665674247607560106752257205091097473808596992 to '
riak at dirapp81.mlp.marquee.net'
1438665674247607560106752257205091097473808596992 failed because of
error:{case_clause,{error,closed}}
[{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,174}]}]
2013-10-30 21:46:30.382 [error]
<0.969.18>@riak_api_pb_server:handle_info:170 Unrecognized message
{67827933,{error,timeout}}
2013-10-30 21:47:30.167 [info]
<0.151.0>@riak_core_handoff_manager:handle_info:271 An outbound handoff of
partition riak_kv_vnode 1438665674247607560106752257205091097473808596992
was terminated for reason: {shutdown,max_concurrency}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20131031/18ccdef8/attachment.html>


More information about the riak-users mailing list