Whole cluster times out if one node is gone

Neville Burnell neville.burnell at gmail.com
Tue Nov 23 17:55:36 EST 2010


Just a thought: have you verified your switch, cables, NICs, etc.?

On 24 November 2010 09:33, Jay Adkisson <j4yferd at gmail.com> wrote:

> (many profuse apologies to Dan - hit "reply" instead of "reply all")
>
> Alrighty, I've done a little more digging.  When I throttle the writes
> heavily (2/sec) and set R and W to 1 all around, the cluster works just fine
> for about 15-20 seconds after I restart the node.  Then the read request
> hangs for about a minute, until node D disappears from connected_nodes in
> riak-admin status, at which point it returns the desired value (although
> sometimes I get a 503):
>
> --2010-11-23 13:*01:28*--  http://<node A>:8098/riak/<bucket>/<key>?r=1
> Resolving <node A>... <ip addr>
> Connecting to <node A>|<ip addr>|:8098... connected.
> HTTP request sent, awaiting response... *<hang...>* 200 OK
> Length: 3684 (3.6K) [image/jpeg]
> Saving to: `<key>?r=1'
>
> 100%[======================================>] 3,684       --.-K/s   in 0s
>
> 2010-11-23 13:*02:21* (49.5 MB/s) - `<key>?r=1' saved [3684/3684]
>
> --2010-11-23 13:02:23--  http://<node A>:8098/riak/<bucket>/<key>?r=1
> Resolving <node A>... <ip addr>
> Connecting to <node A>|<ip addr>|:8098... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 3684 (3.6K) [image/jpeg]
> Saving to: `<key>?r=1'
>
> 100%[======================================>] 3,684       --.-K/s   in 0s
>
> 2010-11-23 13:02:23 (220 MB/s) - `<key>?r=1' saved [3684/3684]
>
> Afterwards, node D comes back up and re-joins the cluster seamlessly.
>
> Any insights?
>
> --Jay
>
> On Mon, Nov 22, 2010 at 5:59 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
>
>> Hey Dan,
>>
>> Thanks for the response!  I tried it again while watching `riak-admin
>> status` - basically, it takes about 30 seconds of node C being down before
>> Riak realizes it's gone.  During that time, if I'm writing to the cluster at
>> all (I throttled it to 2 writes per second for testing), both writes and
>> reads hang indefinitely, and sometimes time out.
>>
>> I'm using Ripple to do the writes, and wget to test reads, all on node A
>> for now, since I know it'll be up.  I'm using the default R and W options
>> for now.
>>
>> Thanks for the help and clarification around ringready.
>>
>> --Jay
>>
>>
>> On Mon, Nov 22, 2010 at 5:15 PM, Dan Reverri <dan at basho.com> wrote:
>>
>>> Your HTTP calls should not be timing out. Are you sending requests
>>> directly to the Riak node or are you using a load balancer? How much load
>>> are you placing on node A? Is it a write only load or are there reads as
>>> well? Can you confirm "all" requests time out or is it a large subset of the
>>> requests? How large are the objects being written? Are you setting R and W
>>> in the request? Are you using a particular client (Ruby, Python, etc.)? Can
>>> you provide the output of "riak-admin status" from node A?
>>>
>>> Regarding the ringready command: that is behaving as I would expect,
>>> considering a node is down.
>>>
>>> Thanks,
>>> Dan
>>>
>>> Daniel Reverri
>>> Developer Advocate
>>> Basho Technologies, Inc.
>>> dan at basho.com
>>>
>>>
>>> On Mon, Nov 22, 2010 at 4:55 PM, Jay Adkisson <j4yferd at gmail.com> wrote:
>>>
>>>> Hey all,
>>>>
>>>> Here's what I'm seeing: I have four nodes A, B, C, and D.  I'm loading
>>>> lots of data into node A, which is being distributed evenly across the
>>>> nodes.  If I physically reboot node D, all my HTTP calls time out, and
>>>> `riak-admin ringready` complains that not all nodes are up.  Is this
>>>> intended behavior?  Is there a configuration option I can set so it fails
>>>> more gracefully?
>>>>
>>>> --Jay
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users at lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>
>>
>
>
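
To pin down exactly when the hang starts and stops, a small probe along these lines could stand in for the manual wget runs quoted above. This is only a sketch: the helper names (riak_url, timed_get) and the 5-second client timeout are my own assumptions, and the URL shape is simply the one from the wget output (port 8098, /riak/<bucket>/<key>?r=1).

```python
import time
import urllib.error
import urllib.request
from urllib.parse import quote


def riak_url(host, bucket, key, r=1, port=8098):
    """Build a read URL shaped like the one in the wget output above.

    Hypothetical helper: host, bucket and key are placeholders.
    """
    return (
        f"http://{host}:{port}/riak/"
        f"{quote(bucket, safe='')}/{quote(key, safe='')}?r={r}"
    )


def timed_get(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch url once and return (outcome, elapsed_seconds).

    outcome is the HTTP status code (including error statuses such as 503),
    or a string starting with "error:" for timeouts and connection failures.
    The 5 s timeout is an assumed value, not a Riak default.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()
            outcome = resp.status
    except urllib.error.HTTPError as err:
        outcome = err.code  # server answered, but with an error status
    except OSError as err:  # covers URLError and socket timeouts
        outcome = f"error: {err}"
    return outcome, time.monotonic() - start
```

Calling timed_get(riak_url("<node A>", "<bucket>", "<key>")) in a loop, with a short sleep between calls and a timestamp printed next to each result, while node D is rebooted should show how long reads stall and whether the stall ends when D drops out of connected_nodes. Running the same loop against nodes B and C would also help tell a cluster-wide stall from something specific to node A or its network path.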