Siblings on first write to a key
drohrer at basho.com
Tue Apr 18 08:55:58 EDT 2017
This sounds like an issue our Riak CS team ran into quite a while ago, which involved “slow nodes” and coordination retry. Take a look at https://github.com/basho/riak_kv/issues/1188 and see if it makes sense to you, but it certainly sounds like what’s happening.
The issue arises when one node in the preflist is down or slow and you write to a node _not in the preflist_, at which point the following happens (it is better formatted in the issue linked above, btw):
client  node-A                       node-R              node-S
        = P, Q and R
        Redirect to R -------------> [frozen]
        | 3 sec timeout
        Compute new PL excluding R
        = P, Q and S
        Redirect to S ---------------------------------> Compute PL without
        |                                                any knowledge about R (at this point)
        |                                                = P, Q and R
        |                                                Redirect to R ---+
        |                                                |                |
        |                            [what happens?] <---|----------------+
        |                                                | 3 sec timeout
        |                                                Compute new PL excluding R
        |                                                = P, Q and S
        |                                                I'm coordinator this time
        |                                                Execute put
        V 3 sec timeout
        Compute new PL again
So, it’s possible for a slow/down node (node R in this case) to eventually cause two _other nodes_ to each write a sibling, even on a new key. In fact, depending on the number of nodes in the system and your luck, you could end up writing more than one sibling on a fresh write in this case. Your comment about a network issue potentially being a factor, together with the 3-second timing you noted (the default for the failure timeout), makes it more likely that this was, in fact, the issue.
A fix for this issue has been worked on and tested, but is not yet incorporated into a version of Riak for distribution. You can, however, disable the coordinator retry logic as noted in the issue I referenced above, or, if your cluster is running slowly in general, increase the timeout by setting `put_coordinator_failure_timeout` in the `riak_kv` section of your `advanced.config` file (see http://docs.basho.com/riak/kv/2.2.3/configuring/reference/#advanced-configuration for the general format of `advanced.config` if you’re not familiar with it).
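As a rough illustration, raising the timeout in `advanced.config` might look like the fragment below. This is only a sketch: the value 10000 is an arbitrary example, and it assumes the timeout is expressed in milliseconds (which would match the 3-second default mentioned above). If the file already has a `riak_kv` section, the entry would go inside that section rather than in a second one.

```erlang
%% advanced.config (sketch only; the value below is an illustrative assumption)
[
 {riak_kv, [
   %% Assumed to be in milliseconds; the default discussed above is 3 seconds.
   {put_coordinator_failure_timeout, 10000}
 ]}
].
```

A change like this would normally need to be applied on every node, followed by a restart of each node for it to take effect.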
Hope this helps.
On 4/18/17, 8:28 AM, "riak-users on behalf of Daniel Abrahamsson" <riak-users-bounces at lists.basho.com on behalf of hamsson at gmail.com> wrote:
This cluster has been running in production for a few months. Key
generation is based on flake (https://github.com/boundary/flake); we
have never experienced a collision in the 3+ years we have been using
it heavily in production. However, I will look into that possibility.
I just noticed that one of the Riak nodes logged this at the time:
2017-04-13 17:42:40.567 [error]
<0.3624.28>@riak_api_pb_server:handle_info:331 Unrecognized message
(actual value removed).
I also have another example (from the same cluster) where there is a
*single* writer to a key, but after a few writes/updates, it also got
a sibling error. Also at that time, the write+read took significantly
longer than normal. I'll check if we had any "unrecognized messages"
in the Riak logs at that time as well.
To answer your second question, we are talking to the riak cluster
over protocol buffers, using the official Erlang client.
On Tue, Apr 18, 2017 at 1:51 PM, Magnus Kessler <mkessler at basho.com> wrote:
> On 18 April 2017 at 08:20, Daniel Abrahamsson <hamsson at gmail.com> wrote:
>> I've run into a case where I got a sibling error/response on the first
>> ever write to a key. I would like to understand how this could happen.
>> Normally when you get siblings, it is because you have written a value
>> with an out-of-date vclock. But since this is the first write, there
>> is no vclock. Could someone shed some light on this for me?
>> It is worth mentioning that it took 3 seconds for Riak to deliver
>> the response, so it is possible there was some kind of network issue
>> at the time.
>> Here are some details about my setup:
>> Number of nodes: 8.
>> n_val: 5
>> write options: pw: 3 (quorum), return_body
>> Daniel Abrahamsson
> Hi Daniel,
> Please let me know if all nodes in this cluster were set up completely
> fresh, with empty backend directories, or if any of them had been used
> before for a Riak installation. If the latter is the case, it may be that
> the key in question had already been used once before. Cluster nodes pick up
> data from pre-existing backends.
> How do you access the key for read and write operations?
> Kind Regards,
> Magnus Kessler
> Client Services Engineer
> Basho Technologies Limited
> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
riak-users mailing list
riak-users at lists.basho.com