Java Riak client can't handle a Riak node failure?

Vanessa Williams vanessa.williams at thoughtwire.ca
Wed Oct 7 19:56:26 EDT 2015


Hi Dmitri, what would be the benefit of r=2, exactly? It isn't necessary to
trigger read-repair, is it? If it's important I'd rather try it sooner than
later...

Regards,
Vanessa



On Wed, Oct 7, 2015 at 4:02 PM, Dmitri Zagidulin <dzagidulin at basho.com>
wrote:

> Glad you sorted it out!
>
> (I do want to encourage you to bump your R setting to at least 2, though.
> Run some tests -- I think you'll find that the difference in speed will not
> be noticeable, but you do get a lot more data resilience with 2.)
>
> On Wed, Oct 7, 2015 at 6:24 PM, Vanessa Williams <
> vanessa.williams at thoughtwire.ca> wrote:
>
>> Hi Dmitri, well...we solved our problem to our satisfaction but it turned
>> out to be something unexpected.
>>
>> The keys were two properties mentioned in a blog post on "configuring
>> Riak’s oft-subtle behavioral characteristics":
>> http://basho.com/posts/technical/riaks-config-behaviors-part-4/
>>
>> notfound_ok=false
>> basic_quorum=true
>>
>> The 2nd one just makes things a little faster, but the first one is the
>> one whose default value of true was killing us.
>>
>> With r=1 and notfound_ok=true (the default), if the first node to respond
>> didn't find the requested key, the authoritative answer was "this key is
>> not found". Not what we were expecting at all.
>>
>> With the changed settings, it will wait for a quorum of responses and
>> only if *no one* finds the key will "not found" be returned. Perfect.
>> (Without basic_quorum=true it would wait for all responses, not ideal.)
>>
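
The read behaviour described above can be sketched as a small model. This is only an illustration under stated assumptions, not Riak's actual implementation: the class name, the `resolve` method, and the `Outcome` values are hypothetical, and replica replies are reduced to true (key found) or false ("not found"). The quorum is assumed to be a simple majority of n_val.

```java
// Simplified model of how a read is resolved; NOT Riak's real code.
class ReadQuorumSketch {
    enum Outcome { FOUND, NOT_FOUND, UNSATISFIED, WAITING }

    /**
     * Resolves a read from replica replies given in arrival order.
     * Each reply is true (replica has the key) or false ("not found").
     */
    static Outcome resolve(int nVal, int r, boolean notfoundOk,
                           boolean basicQuorum, boolean... replies) {
        int found = 0;
        int notFound = 0;
        int quorum = nVal / 2 + 1; // assumed: simple majority of the N replicas
        for (boolean hit : replies) {
            if (hit) { found++; } else { notFound++; }

            // notfound_ok=true: a "not found" reply counts toward R.
            int counted = notfoundOk ? found + notFound : found;
            if (counted >= r) {
                return found > 0 ? Outcome.FOUND : Outcome.NOT_FOUND;
            }
            if (!notfoundOk) {
                // basic_quorum=true: give up once a majority says "not found".
                if (basicQuorum && notFound >= quorum) {
                    return Outcome.NOT_FOUND;
                }
                // Otherwise give up only when R can no longer be reached.
                int remaining = nVal - found - notFound;
                if (found + remaining < r) {
                    return found == 0 ? Outcome.NOT_FOUND : Outcome.UNSATISFIED;
                }
            }
        }
        return Outcome.WAITING; // not enough replies yet to decide
    }

    public static void main(String[] args) {
        // n_val=3, r=1, notfound_ok=true: a single "not found" reply settles the read.
        System.out.println(resolve(3, 1, true, false, false));
        // notfound_ok=false, basic_quorum=true: a later hit still wins.
        System.out.println(resolve(3, 1, false, true, false, true));
        // notfound_ok=false, basic_quorum=false: two misses are not yet decisive.
        System.out.println(resolve(3, 1, false, false, false, false));
    }
}
```

In this model, the first call reproduces the surprising case (one empty replica answers first and the read reports "not found"), while the other two show why notfound_ok=false avoids it and why basic_quorum=true lets a genuine miss return sooner.
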
>> Now there is only one snag, which is that if the Riak node the client
>> connects to goes down, there will be no communication and we have a
>> problem. This is easily solvable with a load-balancer, though for
>> complicated reasons we actually don't need to do that right now. It's just
>> acceptable for us temporarily. Later, we'll get the load-balancer working
>> and even that won't be a problem.
>>
>> I *think* we're ok now. Thanks for your help!
>>
>> Regards,
>> Vanessa
>>
>>
>>
>> On Wed, Oct 7, 2015 at 9:33 AM, Dmitri Zagidulin <dzagidulin at basho.com>
>> wrote:
>>
>>> Yeah, definitely find out what the sysadmin's experience was, with the
>>> load balancer. It could have just been a wrong configuration or something.
>>>
>>> And yes, that's the documentation page I recommend -
>>> http://docs.basho.com/riak/latest/ops/advanced/configs/load-balancing-proxy/
>>> Just set up HAProxy, and point your Java clients to its IP.
>>>
>>> The drawbacks to load-balancing on the java client side (yes, the
>>> cluster object) instead of a standalone load balancer like HAProxy, are the
>>> following:
>>>
>>> 1) Adding a node means code changes (or at the very least, config file
>>> changes) rolled out to all your clients, which turns out to be a pretty
>>> serious hassle. Instead, HAProxy allows you to add or remove nodes without
>>> changing any Java code or config files.
>>>
>>> 2) Performance. We've run many tests to compare performance, and
>>> client-side load balancing results in significantly lower throughput than
>>> you'd have using haproxy (or nginx). (Specifically, you actually want to
>>> use the 'leastconn' load balancing algorithm with HAProxy, instead of round
>>> robin).
>>>
>>> 3) The health check on the client side (so that the Java load balancer
>>> can tell when a remote node is down) is much less intelligent than what a
>>> dedicated load balancer would provide. With something like HAProxy, you
>>> should be able to take down nodes with no ill effects for the client code.
>>>
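
To make the 'leastconn' idea in point 2 concrete, here is a minimal sketch of least-connections selection. It is an illustration only, not HAProxy's implementation; the class name, its methods, and the node names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of least-connections selection; not HAProxy's implementation.
class LeastConnSketch {
    // Active connection count per backend node (node names are hypothetical).
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastConnSketch(String... nodes) {
        for (String node : nodes) {
            active.put(node, 0);
        }
    }

    /** Picks the node with the fewest active connections and counts one more on it. */
    String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            if (best == null || e.getValue() < active.get(best)) {
                best = e.getKey();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no nodes configured");
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    /** Marks one connection to the node as closed. */
    void release(String node) {
        active.computeIfPresent(node, (k, v) -> Math.max(0, v - 1));
    }

    public static void main(String[] args) {
        LeastConnSketch lb = new LeastConnSketch("riak1", "riak2", "riak3");
        String a = lb.acquire();          // riak1
        String b = lb.acquire();          // riak2
        lb.release(a);                    // the riak1 request finishes quickly
        System.out.println(lb.acquire()); // riak1 again; round robin would pick riak3
    }
}
```

The point of the design is that a node still busy with slow requests is passed over, whereas round robin keeps sending it work regardless of how loaded it is.
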
>>> Now, if you load balance on the client side and you take a node down,
>>> it's not supposed to stop working completely. (I'm not sure why it's
>>> failing for you, we can investigate, but it'll be easier to just use a load
>>> balancer). It should throw an error or two, but then start working again
>>> (on the retry).
>>>
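
The "error or two, then working again on the retry" behaviour can be sketched as follows. This is a generic illustration and not the Riak Java client's actual retry logic; the class, the `callWithFailover` helper, and the node addresses are hypothetical, and the node operation is stubbed with a lambda instead of a real client call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch of client-side failover; not the Riak Java client's actual retry logic.
class FailoverSketch {
    /**
     * Runs the operation against each node in turn and returns the first
     * successful result. Throws IllegalStateException, wrapping the last
     * failure, if every node fails.
     */
    static <T> T callWithFailover(List<String> nodes, Function<String, T> op) {
        RuntimeException last = null;
        for (String node : nodes) {
            try {
                return op.apply(node);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next node
            }
        }
        throw new IllegalStateException("all nodes failed", last);
    }

    public static void main(String[] args) {
        // Hypothetical node addresses; riak1 is simulated as being down.
        List<String> nodes = Arrays.asList("riak1:8087", "riak2:8087");
        String value = callWithFailover(nodes, node -> {
            if (node.startsWith("riak1")) {
                throw new IllegalStateException("Connection refused");
            }
            return "value from " + node;
        });
        System.out.println(value);
    }
}
```

A client that knows only one node address has nothing to fail over to, which matches the "Connection refused" symptom when that single node goes down.
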
>>> Dmitri
>>>
>>> On Wed, Oct 7, 2015 at 2:45 PM, Vanessa Williams <
>>> vanessa.williams at thoughtwire.ca> wrote:
>>>
>>>> Hi Dmitri, thanks for the quick reply.
>>>>
>>>> It was actually our sysadmin who tried the load balancer approach and
>>>> had no success, late last evening. However I haven't discussed the gory
>>>> details with him yet. The failure he saw was at the application level (i.e.
>>>> failure to read a key), but I don't know a) how he set up the LB or b) what
>>>> the Java exception was, if any. I'll find that out in an hour or two and
>>>> report back.
>>>>
>>>> I did find this article just now:
>>>>
>>>>
>>>> http://docs.basho.com/riak/latest/ops/advanced/configs/load-balancing-proxy/
>>>>
>>>> So I suppose we'll give those suggestions a try this morning.
>>>>
>>>> What is the drawback to having the client connect to all 4 nodes (the
>>>> cluster client, I assume you mean?) My understanding from reading articles
>>>> I've found is that one of the nodes going away causes that client to fail
>>>> as well. Is that what you mean, or are there other drawbacks as well?
>>>>
>>>> If there's anything else you can recommend, or links other than the one
>>>> above you can point me to, it would be much appreciated. We expect both
>>>> node failure and deliberate node removal for upgrade, repair, replacement,
>>>> etc.
>>>>
>>>> Regards,
>>>> Vanessa
>>>>
>>>> On Wed, Oct 7, 2015 at 8:29 AM, Dmitri Zagidulin <dzagidulin at basho.com>
>>>> wrote:
>>>>
>>>>> Hi Vanessa,
>>>>>
>>>>> Riak is definitely meant to run behind a load balancer. (Or, in the
>>>>> worst case, to be load-balanced on the client side. That is, all clients
>>>>> connect to all 4 nodes).
>>>>>
>>>>> When you say "we did try putting all 4 Riak nodes behind a
>>>>> load-balancer and pointing the clients at it, but it didn't help." -- what
>>>>> do you mean exactly, by "it didn't help"? What happened when you tried
>>>>> using the load balancer?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 7, 2015 at 1:57 PM, Vanessa Williams <
>>>>> vanessa.williams at thoughtwire.ca> wrote:
>>>>>
>>>>>> Hi all, we are still (for a while longer) using Riak 1.4 and the
>>>>>> matching Java client. The client(s) connect to one node in the cluster
>>>>>> (since that's all it can do in this client version). The cluster itself has
>>>>>> 4 nodes (sorry, we can't use 5 in this scenario). There are 2 separate
>>>>>> clients.
>>>>>>
>>>>>> We've tried both n_val=3 and n_val=4. We achieve
>>>>>> consistency-by-writes by setting w=all. Therefore, we only require one
>>>>>> successful read (r=1).
>>>>>>
>>>>>> When all nodes are up, everything is fine. If one node fails, the
>>>>>> clients can no longer read any keys at all. There's an exception like this:
>>>>>>
>>>>>> com.basho.riak.client.RiakRetryFailedException:
>>>>>> java.net.ConnectException: Connection refused
>>>>>>
>>>>>> Now, it isn't possible that Riak can't operate when one node fails,
>>>>>> so we're clearly missing something here.
>>>>>>
>>>>>> Note: we did try putting all 4 Riak nodes behind a load-balancer and
>>>>>> pointing the clients at it, but it didn't help.
>>>>>>
>>>>>> Riak is a high-availability key-value store, so... why are we failing
>>>>>> to achieve high-availability? Any suggestions greatly appreciated, and if
>>>>>> more info is required I'll do my best to provide it.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Vanessa
>>>>>>
>>>>>> --
>>>>>> Vanessa Williams
>>>>>> ThoughtWire Corporation
>>>>>> http://www.thoughtwire.com
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> riak-users mailing list
>>>>>> riak-users at lists.basho.com
>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>