Whole cluster times out if one node is gone

Jay Adkisson j4yferd at gmail.com
Mon Nov 29 12:39:29 EST 2010


Hey Dan/Sean,

Thanks for the response.  sasl-error.log on node A is completely empty, and
I see this pattern in erlang.log:

===== ALIVE Tue Nov 23 12:46:57 PST 2010

===== Tue Nov 23 12:57:36 PST 2010

=ERROR REPORT==== 23-Nov-2010::12:57:36 ===
** Node 'riak@<node D>' not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Nov-2010::12:58:41 ===
Starting handoff of partition riak_kv_vnode
251195593916248939066258330623111144003363405824 to 'riak@<node D>'

=INFO REPORT==== 23-Nov-2010::12:58:41 ===
Handoff of partition riak_kv_vnode
251195593916248939066258330623111144003363405824 to 'riak@<node D>'
completed: sent 1 objects in 0.02 seconds

=INFO REPORT==== 23-Nov-2010::12:59:18 ===
Starting handoff of partition riak_kv_vnode
707914855582156101004909840846949587645842325504 to 'riak@<node D>'

=INFO REPORT==== 23-Nov-2010::12:59:18 ===
Handoff of partition riak_kv_vnode
707914855582156101004909840846949587645842325504 to 'riak@<node D>'
completed: sent 5 objects in 0.03 seconds

=INFO REPORT==== 23-Nov-2010::12:59:20 ===
Starting handoff of partition riak_kv_vnode
525227150915793236229449236757414210188850757632 to 'riak@<node D>'

<handoffs, etc...>

This is my testing process: I'm doing an initial load into Riak of small
image files between 1 KB and 150 KB, throttled to two images per second,
with W=1.  In a separate terminal, I'm running a wget every second against
node A for one particular image I already know to be in the cluster, again
with R=1.  I'm using R=1 and W=1 because I figured that would reduce the
chance of timing out, and with my data pattern, nothing I write to the
cluster will ever change, so I really don't need to wait for a quorum.
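
For concreteness, here's roughly the shape of the harness (a sketch, not
my exact script -- the host, bucket, and key names are placeholders):

    #!/usr/bin/env python
    # Sketch of the test above: throttled PUTs with w=1, plus a
    # once-per-second GET of a known key with r=1 (Python 2 / urllib2,
    # against Riak's /riak/<bucket>/<key> HTTP interface).
    import time
    import urllib2

    RIAK = "http://nodeA:8098/riak/images"  # node A's HTTP endpoint

    def put_image(key, data):
        # w=1: succeed as soon as one vnode acknowledges the write
        req = urllib2.Request("%s/%s?w=1" % (RIAK, key), data,
                              {"Content-Type": "image/jpeg"})
        req.get_method = lambda: "PUT"
        urllib2.urlopen(req)

    def get_image(key):
        # r=1: return as soon as one vnode answers
        return urllib2.urlopen("%s/%s?r=1" % (RIAK, key)).read()

    if __name__ == "__main__":
        # This is the read that hangs for ~1 minute after node D dies.
        while True:
            get_image("known-image.jpg")
            time.sleep(1)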

In response to Sean,

> 1) Riak detects node outage the same way any Erlang system does - when a
> message fails to deliver, or the heartbeat maintained by epmd fails.  The
> default timeout in epmd is 1 minute, which is probably why you're seeing it
> take 1 minute to be detected.
>
Thanks, this is enlightening.

> 2) If it takes too long (the vnode is overloaded, perhaps, or is just
> starting up as a hint partition) to retrieve from any node, the request can
> time out.
>
That makes sense, but I still wonder why it happens even when the quorum
has already been met by the machines that are responding normally.


> 3) You could probably configure epmd to timeout sooner, but then you become
> more vulnerable to temporary partitions. YMMV
>
I may try that - it might be a good fit with my data pattern.
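
(For anyone following along: if I'm reading the Erlang docs right, the
heartbeat in question is actually governed by the kernel's net_ticktime
parameter, which defaults to 60 seconds -- a dead node is normally
detected somewhere between 45 and 75 seconds.  Assuming that's the right
knob, shortening it would look something like this in etc/vm.args,
though I haven't tried it yet:

    ## Consider unresponsive nodes down after ~15s instead of ~60s
    -kernel net_ticktime 15

All nodes in the cluster need the same value, or they'll start dropping
connections to each other spuriously.)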

Thanks again,
--Jay


On Mon, Nov 29, 2010 at 4:44 AM, David Smith <dizzyd at basho.com> wrote:

> On Tue, Nov 23, 2010 at 3:33 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
> > (many profuse apologies to Dan - hit "reply" instead of "reply all")
> > Alrighty, I've done a little more digging.  When I throttle the writes
> > heavily (2/sec) and set R and W to 1 all around, the cluster works just
> fine
> > after I restart the node for about 15-20 seconds.  Then the read request
> > hangs for about a minute, until node D disappears from connected_nodes in
> > riak-admin status, at which point it returns the desired value (although
> > sometimes I get a 503):
>
> Are you seeing any error messages in log/erlang.log.* or
> log/sasl-error.log?
>
> Can you expound on your use case a little -- are you doing a large
> insert, or just a random read/write mix? Did you pre-populate the
> dataset? Why are you using r=1, instead of relying on quorum for
> reads?
>
> How are you running the riak-admin status to measure the 15-20 seconds?
>
> Thanks.
>
> D.
>