Whole cluster times out if one node is gone

Jay Adkisson j4yferd at gmail.com
Tue Nov 23 17:33:47 EST 2010


(many profuse apologies to Dan - hit "reply" instead of "reply all")

Alrighty, I've done a little more digging.  When I throttle the writes
heavily (2/sec) and set R and W to 1 all around, the cluster works just fine
for about 15-20 seconds after I restart the node.  Then the read request
hangs for about a minute, until node D disappears from connected_nodes in
riak-admin status, at which point it returns the desired value (although
sometimes I get a 503):

--2010-11-23 13:*01:28*--  http://<node A>:8098/riak/<bucket>/<key>?r=1
Resolving <node A>... <ip addr>
Connecting to <node A>|<ip addr>|:8098... connected.
HTTP request sent, awaiting response... *<hang...>* 200 OK
Length: 3684 (3.6K) [image/jpeg]
Saving to: `<key>?r=1'

100%[======================================>] 3,684       --.-K/s   in 0s

2010-11-23 13:*02:21* (49.5 MB/s) - `<key>?r=1' saved [3684/3684]

--2010-11-23 13:02:23--  http://<node A>:8098/riak/<bucket>/<key>?r=1
Resolving <node A>... <ip addr>
Connecting to <node A>|<ip addr>|:8098... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3684 (3.6K) [image/jpeg]
Saving to: `<key>?r=1'

100%[======================================>] 3,684       --.-K/s   in 0s

2010-11-23 13:02:23 (220 MB/s) - `<key>?r=1' saved [3684/3684]

Afterwards, node D comes back up and re-joins the cluster seamlessly.
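
For anyone who wants to reproduce the timing without eyeballing wget
timestamps (the first fetch above took roughly 53 seconds, 13:01:28 to
13:02:21), here is a rough Python sketch of a read probe. It only assumes
the same HTTP URL shape as the wget calls (port 8098, /riak/<bucket>/<key>,
?r=1). The host, bucket, and key in the usage comment are placeholders, and
the helper names (riak_url, timed_read, probe) are made up for this
example; they are not part of any Riak client.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def riak_url(host, bucket, key, r=1, port=8098):
    """Build a read URL shaped like the wget calls above, with an explicit R."""
    path = "/riak/%s/%s" % (urllib.parse.quote(bucket, safe=""),
                            urllib.parse.quote(key, safe=""))
    return "http://%s:%d%s?r=%d" % (host, port, path, r)


def http_get(url, timeout):
    """Return the HTTP status code; network failures propagate as OSError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses, e.g. the occasional 503


def timed_read(url, timeout=90.0, fetch=http_get, clock=time.monotonic):
    """Fetch url once and return (status or None, elapsed seconds)."""
    start = clock()
    try:
        status = fetch(url, timeout)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        status = None
    return status, clock() - start


def probe(url, count=60, interval=1.0, timeout=90.0):
    """Read the same key repeatedly and log how long each request takes."""
    for _ in range(count):
        status, elapsed = timed_read(url, timeout)
        print("%s status=%s elapsed=%.2fs"
              % (time.strftime("%H:%M:%S"), status, elapsed))
        time.sleep(interval)


# Usage (placeholder host, bucket, and key):
#   probe(riak_url("node-a.example", "images", "some-key"))
```

Running probe() against node A while rebooting node D should show the
elapsed time jump during the hang and drop back once the node leaves
connected_nodes. The 90 second client timeout is an arbitrary choice, picked
only to be longer than the hang I saw.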

Any insights?

--Jay

On Mon, Nov 22, 2010 at 5:59 PM, Jay Adkisson <j4yferd at gmail.com> wrote:

> Hey Dan,
>
> Thanks for the response!  I tried it again while watching `riak-admin
> status` - basically, it takes about 30 seconds of node C being down before
> riak realizes it's gone.  During that time, if I'm writing to the cluster at
> all (I throttled it to 2 writes per second for testing), both writes and
> reads hang indefinitely, and sometimes time out.
>
> I'm using Ripple to do the writes, and wget to test reads, all on node A
> for now, since I know it'll be up.  I'm using the default R and W options
> for now.
>
> Thanks for the help and clarification around ringready.
>
> --Jay
>
>
> On Mon, Nov 22, 2010 at 5:15 PM, Dan Reverri <dan at basho.com> wrote:
>
>> Your HTTP calls should not be timing out. Are you sending requests
>> directly to the Riak node or are you using a load balancer? How much load
>> are you placing on node A? Is it a write only load or are there reads as
>> well? Can you confirm "all" requests time out or is it a large subset of the
>> requests? How large are the objects being written? Are you setting R and W
>> in the request? Are you using a particular client (Ruby, Python, etc.)? Can
>> you provide the output of "riak-admin status" from node A?
>>
>> Regarding the ringready command; that is behaving as I would expect
>> considering a node is down.
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> dan at basho.com
>>
>>
>> On Mon, Nov 22, 2010 at 4:55 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Here's what I'm seeing: I have four nodes A, B, C, and D.  I'm loading
>>> lots of data into node A, which is being distributed evenly across the
>>> nodes.  If I physically reboot node D, all my HTTP calls time out, and
>>> `riak-admin ringready` complains that not all nodes are up.  Is this
>>> intended behavior?  Is there a configuration option I can set so it fails
>>> more gracefully?
>>>
>>> --Jay
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>