Java Riak client can't handle a Riak node failure?

Vanessa Williams vanessa.williams at thoughtwire.ca
Mon Feb 22 10:48:04 EST 2016


See inline:

On Mon, Feb 22, 2016 at 10:31 AM, Alex Moore <amoore at basho.com> wrote:

> Hi Vanessa,
>
> You might have a problem with your delete function (depending on its
> return value).
> What does the return value of the delete() function indicate?  Right now
> if an object existed, and was deleted, the function will return true, and
> will only return false if the object didn't exist or only consisted of
> tombstones.
>


That's the correct behaviour: it should return true iff a value was
actually deleted.


> If you never look at the object value returned by your fetchValue(key)
> function, another potential optimization you could make is to only return
> the HEAD / metadata:
>
> FetchValue fv = new FetchValue.Builder(
>         new Location(new Namespace("some_bucket"), key))
>     .withOption(FetchValue.Option.HEAD, true)
>     .build();
>
> This would be more efficient because Riak won't have to send you the
> values over the wire, if you only need the metadata.
>
>
Thanks, I'll clean that up.


> If you do write this up somewhere, share the link! :)
>

Will do!

Regards,
Vanessa


>
> Thanks,
> Alex
>
>
> On Mon, Feb 22, 2016 at 6:23 AM, Vanessa Williams <
> vanessa.williams at thoughtwire.ca> wrote:
>
>> Hi Dmitri, this thread is old, but I read this part of your answer
>> carefully:
>>
>> You can use the following strategies to prevent stale values, in
>>> increasing order of security/preference:
>>> 1) Use timestamps (and not pass in vector clocks/causal context). This
>>> is ok if you're not editing objects, or you're ok with a bit of risk of
>>> stale values.
>>> 2) Use causal context correctly (which means, read-before-you-write --
>>> in fact, the Update operation in the java client does this for you, I
>>> think). And if Riak can't determine which version is correct, it will fall
>>> back on timestamps.
>>> 3) Turn on siblings, for that bucket or bucket type.  That way, Riak
>>> will still try to use causal context to decide the right value. But if it
>>> can't decide, it will store BOTH values, and give them back to you on the
>>> next read, so that your application can decide which is the correct one.
>>
>>
>> I decided on strategy #2. What I am hoping for is some validation that
>> the code we use to "get", "put", and "delete" is correct in that context,
>> or if it could be simplified in some cases. Note: we are using delete-mode
>> "immediate" and no siblings.
>>
>> In their shortest possible forms, here are the three methods I'd like
>> some feedback on (note, they're being used in production and haven't caused
>> any problems yet, however we have very few writes in production so the lack
>> of problems doesn't support the conclusion that the implementation is
>> correct.) Note all argument-checking, exception-handling, and logging
>> removed for clarity. *I'm mostly concerned about correct use of causal
>> context and response.isNotFound and response.hasValues. *Is there
>> anything I could/should have left out?
>>
>>     public RiakClient(String name, com.basho.riak.client.api.RiakClient client)
>>     {
>>         this.name = name;
>>         this.namespace = new Namespace(name);
>>         this.client = client;
>>     }
>>
>>     public byte[] get(String key)
>>         throws ExecutionException, InterruptedException
>>     {
>>         FetchValue.Response response = fetchValue(key);
>>         if (!response.isNotFound())
>>         {
>>             RiakObject riakObject = response.getValue(RiakObject.class);
>>             return riakObject.getValue().getValue();
>>         }
>>         return null;
>>     }
>>
>>     public void put(String key, byte[] value)
>>         throws ExecutionException, InterruptedException
>>     {
>>         // fetch in order to get the causal context
>>         FetchValue.Response response = fetchValue(key);
>>         RiakObject storeObject = new RiakObject()
>>             .setValue(BinaryValue.create(value))
>>             .setContentType("binary/octet-stream");
>>         StoreValue.Builder builder = new StoreValue.Builder(storeObject)
>>             .withLocation(new Location(namespace, key));
>>         if (response.getVectorClock() != null) {
>>             builder = builder.withVectorClock(response.getVectorClock());
>>         }
>>         StoreValue storeValue = builder.build();
>>         client.execute(storeValue);
>>     }
>>
>>     public boolean delete(String key)
>>         throws ExecutionException, InterruptedException
>>     {
>>         // fetch in order to get the causal context
>>         FetchValue.Response response = fetchValue(key);
>>         if (!response.isNotFound())
>>         {
>>             DeleteValue deleteValue =
>>                 new DeleteValue.Builder(new Location(namespace, key)).build();
>>             client.execute(deleteValue);
>>         }
>>         return !response.isNotFound() || !response.hasValues();
>>     }
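[Editor's note] The fetchValue(key) helper referenced above isn't shown in the message. A minimal sketch of what it presumably looks like (this is an assumption, not Vanessa's actual code; it requires a 2.x riak-client on the classpath and a reachable cluster, so it is illustrative only):

```java
// Hypothetical sketch of the fetchValue(key) helper referenced above.
// Assumes it simply executes a plain FetchValue against this.namespace.
private FetchValue.Response fetchValue(String key)
        throws ExecutionException, InterruptedException
{
    FetchValue fetch = new FetchValue.Builder(new Location(namespace, key))
        .build();
    return client.execute(fetch);
}
```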
>>
>>
>> Any comments much appreciated. I want to provide a minimally correct
>> example of simple client code somewhere (GitHub, blog post, something...)
>> so I don't want to post this without review.
>>
>> Thanks,
>> Vanessa
>>
>> ThoughtWire Corporation
>> http://www.thoughtwire.com
>>
>>
>>
>>
>> On Thu, Oct 8, 2015 at 8:45 AM, Dmitri Zagidulin <dzagidulin at basho.com>
>> wrote:
>>
>>> Hi Vanessa,
>>>
>>> The thing to keep in mind about read repair is -- it happens
>>> asynchronously on every GET, but /after/ the results are returned to the
>>> client.
>>>
>>> So, when you issue a GET with r=1, the coordinating node only waits for
>>> 1 of the replicas before responding to the client with a success, and only
>>> afterwards triggers read-repair.
>>>
>>> It's true that with notfound_ok=false, it'll wait for the first
>>> non-missing replica before responding. But if you edit or update your
>>> objects at all, an R=1 still gives you a risk of stale values being
>>> returned.
>>>
>>> For example, say you write an object with value A.  And let's say your 3
>>> replicas now look like this:
>>>
>>> replica 1: A,  replica 2: A, replica 3: notfound/missing
>>>
>>> A read with an R=1 and notfound_ok=false is just fine, here. (Chances
>>> are, the notfound replica will arrive first, but the notfound_ok setting
>>> will force the coordinator to wait for the first non-empty value, A, and
>>> return it to the client. And then trigger read-repair).
>>>
>>> But what happens if you edit that same object, and give it a new value,
>>> B?  So, now, there's a chance that your replicas will look like this:
>>>
>>> replica 1: A, replica 2: B, replica 3: B.
>>>
>>> So now if you do a read with an R=1, there's a chance that replica 1,
>>> with the old value of A, will arrive first, and that's the response that
>>> will be returned to the client.
>>>
>>> Whereas, using R=2 eliminates that risk -- well, at least decreases it.
>>> You still have the issue of -- how does Riak decide whether A or B is the
>>> correct value? Are you using causal context/vclocks correctly? (That is,
>>> reading the object before you update, to get the correct causal context?)
>>> Or are you relying on timestamps? (This is an ok strategy, provided that
>>> the edits are sufficiently far apart in time, and you don't have many
>>> concurrent edits, AND you're ok with the small risk of occasionally the
>>> timestamp being wrong). You can use the following strategies to prevent
>>> stale values, in increasing order of security/preference:
>>>
>>> 1) Use timestamps (and not pass in vector clocks/causal context). This
>>> is ok if you're not editing objects, or you're ok with a bit of risk of
>>> stale values.
>>>
>>> 2) Use causal context correctly (which means, read-before-you-write --
>>> in fact, the Update operation in the java client does this for you, I
>>> think). And if Riak can't determine which version is correct, it will fall
>>> back on timestamps.
>>>
>>> 3) Turn on siblings, for that bucket or bucket type.  That way, Riak
>>> will still try to use causal context to decide the right value. But if it
>>> can't decide, it will store BOTH values, and give them back to you on the
>>> next read, so that your application can decide which is the correct one.
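[Editor's note] Strategy #2's read-before-you-write can also be delegated to the client's UpdateValue command, which fetches the current object, applies your modification, and stores the result with the fetched causal context in one operation. A hedged sketch (builder and class names are from the 2.x Java client as I recall them, and it needs a running cluster to try):

```java
// Sketch: read-modify-write with causal context handled by the client.
// UpdateValue fetches the object (and its vector clock), applies the
// Update function, then stores the result with that causal context.
Location loc = new Location(new Namespace("some_bucket"), "some_key");
UpdateValue update = new UpdateValue.Builder(loc)
    .withUpdate(new UpdateValue.Update<RiakObject>() {
        @Override
        public RiakObject apply(RiakObject original) {
            original.setValue(BinaryValue.create("new value"));
            return original;
        }
    })
    .build();
client.execute(update);
```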
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 8, 2015 at 1:56 AM, Vanessa Williams <
>>> vanessa.williams at thoughtwire.ca> wrote:
>>>
>>>> Hi Dmitri, what would be the benefit of r=2, exactly? It isn't
>>>> necessary to trigger read-repair, is it? If it's important I'd rather try
>>>> it sooner than later...
>>>>
>>>> Regards,
>>>> Vanessa
>>>>
>>>>
>>>>
>>>> On Wed, Oct 7, 2015 at 4:02 PM, Dmitri Zagidulin <dzagidulin at basho.com>
>>>> wrote:
>>>>
>>>>> Glad you sorted it out!
>>>>>
>>>>> (I do want to encourage you to bump your R setting to at least 2,
>>>>> though. Run some tests -- I think you'll find that the difference in speed
>>>>> will not be noticeable, but you do get a lot more data resilience with 2.)
>>>>>
>>>>> On Wed, Oct 7, 2015 at 6:24 PM, Vanessa Williams <
>>>>> vanessa.williams at thoughtwire.ca> wrote:
>>>>>
>>>>>> Hi Dmitri, well...we solved our problem to our satisfaction but it
>>>>>> turned out to be something unexpected.
>>>>>>
>>>>>> The key turned out to be two properties mentioned in a blog post on
>>>>>> "configuring Riak’s oft-subtle behavioral characteristics":
>>>>>> http://basho.com/posts/technical/riaks-config-behaviors-part-4/
>>>>>>
>>>>>> notfound_ok= false
>>>>>> basic_quorum=true
>>>>>>
>>>>>> The second one just makes things a little faster, but the first one is
>>>>>> the one whose default value of true was killing us.
>>>>>>
>>>>>> With r=1 and notfound_ok=true (the default), if the first node to
>>>>>> respond didn't find the requested key, that became the authoritative
>>>>>> answer: "this key is not found". Not what we were expecting at all.
>>>>>>
>>>>>> With the changed settings, it will wait for a quorum of responses, and
>>>>>> only if *no one* finds the key will "not found" be returned. Perfect.
>>>>>> (Without basic_quorum it would wait for all responses, which is not ideal.)
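[Editor's note] In the 2.x Java client these properties can also be overridden per request instead of (or in addition to) in the bucket properties. A sketch, assuming the option names are as I remember them and a live cluster to run against:

```java
// Sketch: overriding quorum behaviour per request rather than per bucket.
// NOTFOUND_OK=false: one vnode's "not found" is not authoritative;
// BASIC_QUORUM=true: give up once a quorum of vnodes reports not found.
FetchValue fetch = new FetchValue.Builder(new Location(namespace, key))
    .withOption(FetchValue.Option.R, new Quorum(2))
    .withOption(FetchValue.Option.NOTFOUND_OK, false)
    .withOption(FetchValue.Option.BASIC_QUORUM, true)
    .build();
FetchValue.Response response = client.execute(fetch);
```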
>>>>>>
>>>>>> Now there is only one snag, which is that if the Riak node the client
>>>>>> connects to goes down, there will be no communication and we have a
>>>>>> problem. This is easily solvable with a load-balancer, though for
>>>>>> complicated reasons we actually don't need to do that right now. It's just
>>>>>> acceptable for us temporarily. Later, we'll get the load-balancer working
>>>>>> and even that won't be a problem.
>>>>>>
>>>>>> I *think* we're ok now. Thanks for your help!
>>>>>>
>>>>>> Regards,
>>>>>> Vanessa
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 7, 2015 at 9:33 AM, Dmitri Zagidulin <
>>>>>> dzagidulin at basho.com> wrote:
>>>>>>
>>>>>>> Yeah, definitely find out what the sysadmin's experience was, with
>>>>>>> the load balancer. It could have just been a wrong configuration or
>>>>>>> something.
>>>>>>>
>>>>>>> And yes, that's the documentation page I recommend -
>>>>>>> http://docs.basho.com/riak/latest/ops/advanced/configs/load-balancing-proxy/
>>>>>>> Just set up HAProxy, and point your Java clients to its IP.
>>>>>>>
>>>>>>> The drawbacks to load-balancing on the java client side (yes, the
>>>>>>> cluster object) instead of a standalone load balancer like HAProxy, are the
>>>>>>> following:
>>>>>>>
>>>>>>> 1) Adding a node means code changes (or at the very least, config-file
>>>>>>> changes) rolled out to all your clients, which turns out to be a pretty
>>>>>>> serious hassle. HAProxy, by contrast, lets you add or remove nodes without
>>>>>>> changing any Java code or config files.
>>>>>>>
>>>>>>> 2) Performance. We've run many tests to compare performance, and
>>>>>>> client-side load balancing results in significantly lower throughput than
>>>>>>> you'd have using haproxy (or nginx). (Specifically, you actually want to
>>>>>>> use the 'leastconn' load balancing algorithm with HAProxy, instead of round
>>>>>>> robin).
>>>>>>>
>>>>>>> 3) The health check on the client side (so that the java load
>>>>>>> balancer can tell when a remote node is down) is much less intelligent than
>>>>>>> a dedicated load balancer would provide. With something like HAProxy, you
>>>>>>> should be able to take down nodes with no ill effects for the client code.
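[Editor's note] For reference, a minimal HAProxy section along these lines for Riak's protocol buffers interface might look like the following. Addresses, names, and the port are placeholders, not a vetted config; see the Basho load-balancing docs page linked earlier in the thread for a tested example:

```
# Sketch of an HAProxy listener for Riak's protocol buffers interface.
listen riak_pb
    bind 0.0.0.0:8087
    mode tcp
    balance leastconn        # preferred over roundrobin for Riak
    option tcp-check
    server riak1 10.0.0.1:8087 check
    server riak2 10.0.0.2:8087 check
    server riak3 10.0.0.3:8087 check
    server riak4 10.0.0.4:8087 check
```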
>>>>>>>
>>>>>>> Now, if you load balance on the client side and you take a node down,
>>>>>>> it's not supposed to stop working completely. It should throw an error or
>>>>>>> two, but then start working again on the retry. (I'm not sure why it's
>>>>>>> failing for you; we can investigate, but it'll be easier to just use a
>>>>>>> load balancer.)
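[Editor's note] The client-side option being discussed, in the 2.x Java client, is the cluster object: a list of RiakNodes wrapped in a RiakCluster. A sketch from memory (node addresses are placeholders, and it needs live nodes to actually run):

```java
// Sketch: client-side balancing across all four nodes with RiakCluster.
// The client spreads operations over the healthy nodes in the list.
List<RiakNode> nodes = new ArrayList<>();
for (String host : new String[] {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}) {
    nodes.add(new RiakNode.Builder()
        .withRemoteAddress(host)
        .withRemotePort(8087)
        .build());
}
RiakCluster cluster = new RiakCluster.Builder(nodes).build();
cluster.start();
RiakClient client = new RiakClient(cluster);
```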
>>>>>>>
>>>>>>> Dmitri
>>>>>>>
>>>>>>> On Wed, Oct 7, 2015 at 2:45 PM, Vanessa Williams <
>>>>>>> vanessa.williams at thoughtwire.ca> wrote:
>>>>>>>
>>>>>>>> Hi Dmitri, thanks for the quick reply.
>>>>>>>>
>>>>>>>> It was actually our sysadmin who tried the load balancer approach
>>>>>>>> and had no success, late last evening. However I haven't discussed the gory
>>>>>>>> details with him yet. The failure he saw was at the application level (i.e.
>>>>>>>> failure to read a key), but I don't know a) how he set up the LB or b) what
>>>>>>>> the Java exception was, if any. I'll find that out in an hour or two and
>>>>>>>> report back.
>>>>>>>>
>>>>>>>> I did find this article just now:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://docs.basho.com/riak/latest/ops/advanced/configs/load-balancing-proxy/
>>>>>>>>
>>>>>>>> So I suppose we'll give those suggestions a try this morning.
>>>>>>>>
>>>>>>>> What is the drawback to having the client connect to all 4 nodes
>>>>>>>> (the cluster client, I assume you mean?) My understanding from reading
>>>>>>>> articles I've found is that one of the nodes going away causes that client
>>>>>>>> to fail as well. Is that what you mean, or are there other drawbacks as
>>>>>>>> well?
>>>>>>>>
>>>>>>>> If there's anything else you can recommend, or links other than the
>>>>>>>> one above you can point me to, it would be much appreciated. We expect both
>>>>>>>> node failure and deliberate node removal for upgrade, repair, replacement,
>>>>>>>> etc.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vanessa
>>>>>>>>
>>>>>>>> On Wed, Oct 7, 2015 at 8:29 AM, Dmitri Zagidulin <
>>>>>>>> dzagidulin at basho.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vanessa,
>>>>>>>>>
>>>>>>>>> Riak is definitely meant to run behind a load balancer. (Or, at
>>>>>>>>> the worst case, to be load-balanced on the client side. That is, all
>>>>>>>>> clients connect to all 4 nodes).
>>>>>>>>>
>>>>>>>>> When you say "we did try putting all 4 Riak nodes behind a
>>>>>>>>> load-balancer and pointing the clients at it, but it didn't help." -- what
>>>>>>>>> do you mean exactly, by "it didn't help"? What happened when you tried
>>>>>>>>> using the load balancer?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Oct 7, 2015 at 1:57 PM, Vanessa Williams <
>>>>>>>>> vanessa.williams at thoughtwire.ca> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all, we are still (for a while longer) using Riak 1.4 and the
>>>>>>>>>> matching Java client. The client(s) connect to one node in the cluster
>>>>>>>>>> (since that's all it can do in this client version). The cluster itself has
>>>>>>>>>> 4 nodes (sorry, we can't use 5 in this scenario). There are 2 separate
>>>>>>>>>> clients.
>>>>>>>>>>
>>>>>>>>>> We've tried both n_val = 3 and n_val=4. We achieve
>>>>>>>>>> consistency-by-writes by setting w=all. Therefore, we only require one
>>>>>>>>>> successful read (r=1).
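[Editor's note] A side note on these numbers: r + w > n is what guarantees a read quorum overlaps the latest write quorum, and w=all buys r=1 at the cost of write availability. A small self-contained check of that arithmetic (plain Java, no Riak involved):

```java
public class QuorumArithmetic {
    // A read and a write quorum overlap on at least one replica iff r + w > n.
    static boolean readSeesLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    // An operation needing q acks can only succeed if q replicas are reachable.
    static boolean succeeds(int quorum, int reachable) {
        return reachable >= quorum;
    }

    public static void main(String[] args) {
        int n = 4, r = 1, w = n; // n_val=4, r=1, w=all

        System.out.println(readSeesLatestWrite(n, r, w)); // true: 1 + 4 > 4
        // The trade-off when one of the four nodes is down:
        System.out.println(succeeds(r, n - 1));           // true: reads still work
        System.out.println(succeeds(w, n - 1));           // false: w=all writes fail
    }
}
```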
>>>>>>>>>>
>>>>>>>>>> When all nodes are up, everything is fine. If one node fails, the
>>>>>>>>>> clients can no longer read any keys at all. There's an exception like this:
>>>>>>>>>>
>>>>>>>>>> com.basho.riak.client.RiakRetryFailedException:
>>>>>>>>>> java.net.ConnectException: Connection refused
>>>>>>>>>>
>>>>>>>>>> Now, it isn't possible that Riak can't operate when one node
>>>>>>>>>> fails, so we're clearly missing something here.
>>>>>>>>>>
>>>>>>>>>> Note: we did try putting all 4 Riak nodes behind a load-balancer
>>>>>>>>>> and pointing the clients at it, but it didn't help.
>>>>>>>>>>
>>>>>>>>>> Riak is a high-availability key-value store, so... why are we
>>>>>>>>>> failing to achieve high-availability? Any suggestions greatly appreciated,
>>>>>>>>>> and if more info is required I'll do my best to provide it.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Vanessa
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vanessa Williams
>>>>>>>>>> ThoughtWire Corporation
>>>>>>>>>> http://www.thoughtwire.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> riak-users mailing list
>>>>>>>>>> riak-users at lists.basho.com
>>>>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>