Whole cluster times out if one node is gone

Sean Cribbs sean at basho.com
Sat Nov 27 18:21:30 EST 2010


1) Riak detects a node outage the same way any Erlang system does - when a message fails to deliver, or when the heartbeat maintained by the Erlang distribution layer (net_kernel's "tick") fails.  The default net_ticktime is 60 seconds, which is probably why you're seeing it take about a minute for the outage to be detected.
2) If it takes too long to retrieve from any node (the vnode may be overloaded, or may just be starting up as a fallback partition for hinted handoff), the request can time out.
3) You could configure the tick to fire sooner (a sketch follows), but then you become more vulnerable to transient network partitions. YMMV
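
For example - and this is just a sketch, assuming a stock install where the Erlang VM flags live in etc/vm.args (adjust the path for your packaging) - you could lower the tick time and restart each node:

## etc/vm.args
## Hypothetical tuning: lower the distribution heartbeat from the
## 60-second default so a dead peer is noticed in roughly 20-25
## seconds instead of a minute or more.
## net_ticktime must be the same on every node in the cluster.
-kernel net_ticktime 20

The shorter the tick, the more likely a brief network hiccup gets treated as a node death, so test this under your real network conditions first.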

Sean Cribbs <sean at basho.com>
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On Nov 27, 2010, at 3:21 PM, Jay Adkisson wrote:

> Neville, I'm not sure what you mean.  The network gear is all functional - otherwise I wouldn't be able to interact with the machines at all (they're at our colo).  But as far as I understand, if I hard-reboot a box (or, in a real-world scenario, the PDU fails), the switch will happily continue forwarding packets into nothingness, causing HTTP requests to hang until they time out.  From what Dan said, I would expect Riak to handle that sort of situation intelligently.  I guess my remaining questions are:
> 
> * How does Riak detect that a node is down, and what could cause that to take a full minute?
> * When N=3, what about a single node failure could cause a read with R=1 to time out?
> * Is there a way to configure how quickly nodes are assumed dead?  I'm thinking of something like a "timeout" config option.  (How I checked the current value is sketched below.)
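> 
> (For reference - just a sketch, assuming a stock install where `riak attach` drops you into the node's live Erlang shell; the node name is a placeholder:
> 
> $ riak attach
> (riak@nodeA)1> net_kernel:get_net_ticktime().
> 60
> 
> 60 seconds is the Erlang default, which would line up with the one-minute detection delay I'm seeing.)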
> 
> Peace,
> --Jay
> 
> On Tue, Nov 23, 2010 at 2:55 PM, Neville Burnell <neville.burnell at gmail.com> wrote:
> Just a thought ... have you verified your switch, cables, nics, etc
> 
> 
> On 24 November 2010 09:33, Jay Adkisson <j4yferd at gmail.com> wrote:
> (many profuse apologies to Dan - hit "reply" instead of "reply all")
> 
> Alrighty, I've done a little more digging.  When I throttle the writes heavily (2/sec), set R and W to 1 all around, and restart the node, the cluster works just fine for about 15-20 seconds.  Then the read request hangs for about a minute, until node D disappears from connected_nodes in riak-admin status, at which point it returns the desired value (although sometimes I get a 503):
> 
> --2010-11-23 13:01:28--  http://<node A>:8098/riak/<bucket>/<key>?r=1
> Resolving <node A>... <ip addr>
> Connecting to <node A>|<ip addr>|:8098... connected.
> HTTP request sent, awaiting response... <hang...> 200 OK
> Length: 3684 (3.6K) [image/jpeg]
> Saving to: `<key>?r=1'
> 
> 100%[======================================>] 3,684       --.-K/s   in 0s
> 
> 2010-11-23 13:02:21 (49.5 MB/s) - `<key>?r=1' saved [3684/3684]
> 
> --2010-11-23 13:02:23--  http://<node A>:8098/riak/<bucket>/<key>?r=1
> Resolving <node A>... <ip addr>
> Connecting to <node A>|<ip addr>|:8098... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 3684 (3.6K) [image/jpeg]
> Saving to: `<key>?r=1'
> 
> 100%[======================================>] 3,684       --.-K/s   in 0s
> 
> 2010-11-23 13:02:23 (220 MB/s) - `<key>?r=1' saved [3684/3684]
> 
> Afterwards, node D comes back up and re-joins the cluster seamlessly.
> 
> Any insights?  
> 
> --Jay
> 
> On Mon, Nov 22, 2010 at 5:59 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
> Hey Dan,
> 
> Thanks for the response!  I tried it again while watching `riak-admin status` - basically, it takes about 30 seconds after node C goes down before Riak realizes it's gone.  During that time, if I'm writing to the cluster at all (I throttled it to 2 writes per second for testing), both writes and reads hang for a long time, and sometimes time out.
> 
> I'm using Ripple to do the writes, and wget to test reads, all against node A for now, since I know it'll be up.  I'm using the default R and W options for now (the raw HTTP equivalents, including how to pin R and W per request, are sketched below).
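> 
> (For completeness: the reads are plain wget against the HTTP API, and the Ripple write path boils down to an HTTP PUT - the curl line below is just a rough sketch of that equivalent request, with node, bucket, key, and file as placeholders:
> 
> # read with an explicit read-quorum of 1
> wget "http://<node A>:8098/riak/<bucket>/<key>?r=1"
> 
> # write with an explicit write-quorum of 1
> curl -X PUT -H "Content-Type: image/jpeg" \
>      --data-binary @photo.jpg \
>      "http://<node A>:8098/riak/<bucket>/<key>?w=1"
> )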
> 
> Thanks for the help and clarification around ringready.
> 
> --Jay
> 
> 
> On Mon, Nov 22, 2010 at 5:15 PM, Dan Reverri <dan at basho.com> wrote:
> Your HTTP calls should not be timing out. A few questions to narrow this down:
> 
> * Are you sending requests directly to the Riak node, or through a load balancer?
> * How much load are you placing on node A? Is it a write-only load, or are there reads as well?
> * Can you confirm that "all" requests time out, or is it a large subset of the requests?
> * How large are the objects being written?
> * Are you setting R and W in the request?
> * Are you using a particular client (Ruby, Python, etc.)?
> * Can you provide the output of "riak-admin status" from node A? (A quick filter for the relevant lines is sketched below.)
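> 
> (On that last point, a quick way to pull just the cluster-membership lines - assuming riak-admin is on your PATH; exact field names may vary by release:
> 
> riak-admin status | egrep 'nodename|connected_nodes|ring_members'
> )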
> 
> Regarding the ringready command: that is behaving as I would expect, given that a node is down.
> 
> Thanks,
> Dan
> 
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> dan at basho.com
> 
> 
> On Mon, Nov 22, 2010 at 4:55 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
> Hey all,
> 
> Here's what I'm seeing: I have four nodes A, B, C, and D.  I'm loading lots of data into node A, and it's being distributed evenly across the nodes.  If I physically reboot node D, all my HTTP calls time out, and `riak-admin ringready` complains that not all nodes are up (a rough sketch of the test is below).  Is this intended behavior?  Is there a configuration option I can set so it fails more gracefully?
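> 
> (Concretely, the test loop is roughly the following - host and bucket names are placeholders, and the physical power-cycle is approximated here with a forced reboot:
> 
> # on a client box: continuous writes into node A via the HTTP API
> while true; do
>   curl -s -X PUT -H "Content-Type: text/plain" \
>        --data "hello" "http://<node A>:8098/riak/<bucket>/test-$RANDOM"
>   sleep 0.5
> done
> 
> # on node D: simulate the failure
> reboot -f
> )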
> 
> --Jay
> 
