Siblings on first write to a key

Daniel Abrahamsson hamsson at gmail.com
Tue Apr 18 10:11:54 EDT 2017


Hi Douglas,

That seems to be a good candidate for an explanation. Thank you very
much for the explanation and link. I'll dig into it.

As promised, I checked whether the logs also contained "unrecognized
message" entries in the second case I mentioned, and they did.




On Tue, Apr 18, 2017 at 2:55 PM, Douglas Rohrer <drohrer at basho.com> wrote:
> This sounds like an issue our Riak CS team ran into quite a while ago, which involved “slow nodes” and coordination retry. Take a look at https://github.com/basho/riak_kv/issues/1188 and see if it makes sense to you, but it certainly sounds like what’s happening.
>
> The basic flow of the issue comes when one node in the preflist is down, and you write to a node _not in the preflist_, at which point the following happens (better formatted in the issue above, btw):
>
> client        node-A              node-R         node-S
>    ---(Put)-->
>              Compute PL
>                = P, Q and R
>              Redirect to R --->  [frozen]
>              |
>              | 3 sec timeout
>              V
>              Compute new PL excluding R
>                = P, Q and S
>              Redirect to S --------------------> Compute PL without
>              |                                     any knowledge about R (at this point)
>              |                                     = P, Q and R
>              |                                   Redirect to R  ---+
>              |                                   |                 |
>              |                 [what happens?] <-|-----------------+
>              |                                   | 3 sec timeout
>              |                                   V
>              |                                   Compute new PL excluding R
>              |                                     = P, Q and S
>              |                                   I'm coordinator this time
>              |                                   Execute put
>              V 3 sec timeout
>              Compute new PL again
>                [continues]
>
> So, it’s possible for a slow/down node (node R in this case) to eventually cause two _other nodes_ to each write a sibling, even on a new key. In fact, depending on the number of nodes in the system and your luck, you could end up writing more than one sibling on a fresh write in this case. Given your comment about a network issue potentially being a factor, and the 3-second timing you noted (the default for the failure timeout), this increases the likelihood that this was, in fact, the issue.
>
> A fix for this issue has been worked on and tested, but is not yet incorporated into a version of Riak for distribution. You can, however, disable the coordinator retry logic as noted in the issue I referenced above, or, if your cluster is running slowly in general, increase the timeout by setting `put_coordinator_failure_timeout` under `riak_kv` in your `advanced.config` file (see http://docs.basho.com/riak/kv/2.2.3/configuring/reference/#advanced-configuration for the general format of `advanced.config` if you’re not familiar with it).
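>
> As a rough illustration only, an `advanced.config` entry raising that timeout might look like the sketch below. It assumes the value is given in milliseconds (consistent with the 3-second default mentioned above); the figure 10000 is an arbitrary example, not a recommendation, and any other sections already in your file would sit alongside `riak_kv` in the same outer list.
>
> ```erlang
> %% advanced.config (sketch; the timeout value is an assumed example)
> [
>  {riak_kv, [
>    %% Assumed to be in milliseconds; the default corresponds to 3 seconds.
>    {put_coordinator_failure_timeout, 10000}
>  ]}
> ].
> ```
>
> The setting for disabling the coordinator retry entirely is described in the GitHub issue referenced above and is not reproduced here.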
>
> Hope this helps.
>
> Doug Rohrer
>
>
> On 4/18/17, 8:28 AM, "riak-users on behalf of Daniel Abrahamsson" <riak-users-bounces at lists.basho.com on behalf of hamsson at gmail.com> wrote:
>
>     Hi Magnus,
>
>     This cluster has been running in production for a few months. Key
>     generation is based on flake (https://github.com/boundary/flake); we
>     have never experienced a collision in the 3+ years we have been using
>     it heavily in production. However, I will look into that possibility
>     as well.
>
>     I just noticed that one of the Riak nodes logged this at the time:
>
>     2017-04-13 17:42:40.567 [error]
>     <0.3624.28>@riak_api_pb_server:handle_info:331 Unrecognized message
>     {30320806,{ok,{r_object,<<"session">>,<<".12011742tWzDvu8mk5WAdfYihfV_T3DcnJ5VDyXC0c">>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"X-Riak-VTag">>,53,114,86,115,108,71,120,112,73,55,108,118,114,100,105,114,107,104,50,66,105,119]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1492,105357,453143}]],[],[]}}},<<...
>     (actual value removed).
>
>     I also have another example (from the same cluster) where there is a
>     *single* writer to a key, but after a few writes/updates, it also got
>     a sibling error. Also at that time, the write+read took significantly
>     longer than normal. I'll check if we had any "unrecognized messages"
>     in the Riak logs at that time as well.
>
>     To answer your second question, we are talking to the Riak cluster
>     over protocol buffers, using the official Erlang client.
>
>     //Daniel
>
>     On Tue, Apr 18, 2017 at 1:51 PM, Magnus Kessler <mkessler at basho.com> wrote:
>     > On 18 April 2017 at 08:20, Daniel Abrahamsson <hamsson at gmail.com> wrote:
>     >>
>     >> I've run into a case where I got a sibling error/response on the first
>     >> ever write to a key. I would like to understand how this could happen.
>     >> Normally when you get siblings, it is because you have written a value
>     >> with an out-of-date vclock. But since this is the first write, there
>     >> is no vclock. Could someone shed some light on this for me?
>     >>
>     >> It is worth mentioning that it took 3 seconds for Riak to deliver
>     >> the response, so it is possible there was some kind of network issue
>     >> at the time.
>     >>
>     >> Here are some details about my setup:
>     >> Number of nodes: 8.
>     >> n_val: 5
>     >> write options: pw: 3 (quorum), return_body
>     >>
>     >> Regards,
>     >> Daniel Abrahamsson
>     >>
>     >
>     >
>     > Hi Daniel,
>     >
>     > Please let me know if all nodes in this cluster were set up completely
>     > fresh, with empty backend directories, or if any of them had been used
>     > before for a Riak installation. If the latter is the case, it may be that
>     > the key in question had already been used once before. Cluster nodes pick up
>     > data from pre-existing backends.
>     >
>     > How do you access the key for read and write operations?
>     >
>     > Kind Regards,
>     >
>     > Magnus
>     >
>     >
>     > Magnus Kessler
>     > Client Services Engineer
>     > Basho Technologies Limited
>     >
>     > Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>
>     _______________________________________________
>     riak-users mailing list
>     riak-users at lists.basho.com
>     http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>



