Riak CS race condition at start-up (was: Riak-CS issues when Riak endpoint fails-over to new server)

Toby Corkindale toby at dryft.net
Thu Jan 19 20:38:08 EST 2017


Hi guys,
I've switched our configuration around so that Riak CS now talks to
127.0.0.1:8087 directly, instead of going via haproxy.

We have immediately re-encountered the problems that caused us to move to
haproxy.
On start-up, Riak takes slightly longer than Riak CS to become ready, so
Riak CS logs the following and then exits.
Restarting Riak CS again (by now about 15 seconds after Riak started) results
in a successful start-up, but it's really annoying for our ops team to have
to remember to do this after restarting Riak or rebooting a machine.

How do other people avoid this issue in production?

```
2017-01-20 12:23:12.937 [warning]
<0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.users
bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning]
<0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.access
bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning]
<0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.storage
bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning]
<0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.buckets
bucket settings (disconnected).
2017-01-20 12:23:12.937 [error] <0.150.0>@riak_cs_app:sanity_check:125
Could not verify bucket properties. Error was disconnected.
2017-01-20 12:23:12.938 [error] <0.149.0> CRASH REPORT Process <0.149.0>
with 0 neighbours exited with reason:
{error_verifying_props,{riak_cs_app,start,[normal,[]]}} in
application_master:init/4 line 133
2017-01-20 12:23:12.938 [info] <0.7.0> Application riak_cs exited with
reason: {error_verifying_props,{riak_cs_app,start,[normal,[]]}}
```
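One workaround we're considering is a pre-start gate that polls Riak's PBC endpoint before riak-cs is launched. A minimal sketch (the 127.0.0.1:8087 endpoint matches the config above; the timeout and polling interval are arbitrary choices, not recommendations):

```python
# Sketch of a pre-start gate for riak-cs: block until Riak KV's protocol
# buffers endpoint accepts TCP connections, instead of letting riak-cs
# crash on boot because KV isn't ready yet.
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means Riak's PBC listener is up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused / timed out: KV not ready yet, wait and retry.
            time.sleep(interval)
    return False
```

An init script could call `wait_for_port("127.0.0.1", 8087)` and only invoke `riak-cs start` once it returns True, so the crash-and-manual-restart dance goes away.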

On Fri, 6 Jan 2017 at 12:33 Toby Corkindale <toby at dryft.net> wrote:

> Hi Shaun,
> We've been running Riak CS since its early days, so it's possible best
> practice has changed... but I'm sure at some point it was suggested to put
> haproxy between CS and KV to guard against the start-up race condition,
> individual KV losses, and brief KV restarts.
> I'm sure we used to have continual issues with CS being dead on nodes
> before we moved to the haproxy solution. That was probably on Debian
> Squeeze though; these days we're on Ubuntu LTS, so if CS is launched from
> Upstart it can at least retry starting, whereas on old-school init systems
> it gets just one attempt and then dies.
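> Along those lines, an Upstart job can both wait for the local Riak and
> respawn on failure; a hypothetical fragment (the job name, event, and
> respawn limits are assumptions, not our actual config):
>
> ```
> # /etc/init/riak-cs.conf (sketch)
> start on started riak
> respawn
> respawn limit 10 5
> ```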
>
> Moving on though--
>
> Even if it's not the majority practice, shouldn't CS still be able to
> withstand dropping and reconnecting its protocol buffer TCP connections?
>
> CS still has a problem handling the case where its long-standing idle PBC
> connections to KV get reset, regardless of whether that's because the
> local KV process was restarted or because we've failed over to a new
> haproxy.
> The errors get pushed back to the S3 client software, but even if they
> retry, they get repeated errors because, I think, CS has such a large pool
> of PBC connections. You have to work through a large portion of this pool
> before you finally get to one that's reconnected and is good.
>
> In our case, the pool size is multiplied by the number of CS instances, so
> quite a large number.
> Most client software has retry limits built in, at much lower values.
>
> While it will come good eventually, there's a significant period of time
> where everything fails, all our monitoring goes red, etc. which we'd like
> to avoid!
>
> I'm surprised this problem doesn't come up more for other users; I don't
> feel like we're running at a large scale... but maybe we're using a more
> dynamic architecture than the major users.
>
> Toby
>
> On Thu, 5 Jan 2017 at 20:04 Shaun McVey <smcvey at basho.com> wrote:
>
> Hi Toby,
>
> > I thought that it was recommended AGAINST talking to a co-located Riak
> on the same host?
>
> I'm not sure where you heard that from, but that's not the case.  We do
> discourage running other parts of an application on the same hosts, such as
> client front-ends, for example.  Off the top of my head (which means
> there's probably an exception to the rule), all our customers have nodes
> set up in the way Magnus described: one CS instance talking locally to its
> KV instance directly on the same node.  The load balancing comes between
> the CS node and the client.
>
> Riak shouldn't take a particularly long time to start at all.  We have
> customers that have terabytes of data per node and a KV node can be
> restarted in just a minute or two.  As long as you have valid bitcask hint
> files in place (which requires a proper shutdown beforehand), then a node
> should come up quickly.  If you have nodes that you feel are taking a
> particularly long time to start up, that may be a symptom of another issue
> unrelated to this discussion.
>
> If, for any reason, you need to shut down KV, you would then just remove
> the CS node from the HAproxy configuration so it doesn't deal with any
> requests.  The other CS nodes then take the additional load.  There
> shouldn't be a need to restart CS if you remove it from the load balancer.
> Having said that, you shouldn't have to worry about restarting CS as far as
> I'm aware.  You might see failures if KV is down, but once it's up and
> running again, CS will continue to deal with new requests without
> problems.  Any failures to connect to its KV node should be passed to the
> client/front-end, which should have all the proper logic for re-attempts or
> error reporting.
>
> > I'm surprised more people with highly-available Riak CS installations
> haven't hit the same issues.
>
> As I mentioned, our customers go with the setup Magnus described.  I can't
> speak for setups like yours as I've not seen them in the wild.
>
> Kind Regards,
> Shaun
>
> On Wed, Jan 4, 2017 at 11:26 PM, Toby Corkindale <toby at dryft.net> wrote:
>
> Hi Magnus,
> I thought that it was recommended AGAINST talking to a co-located Riak on
> the same host?
> The reason being, the local Riak will take longer to start up than Riak
> CS, once you have a sizeable amount of data. This means Riak CS starts up,
> fails to connect to Riak, and exits.
> You also end up in a situation where you must always restart Riak CS if
> you restart the co-located Riak. (Otherwise the Riak CS PBC connections
> suffer the same problem as I described in my earlier email, where Riak CS
> doesn't realise it needs to reconnect them and returns errors).
>
> Putting haproxy between Riak CS and Riak solved the problem of needing the
> local Riak to be started first.
> But it seems we were just putting off the core problem rather than
> solving it, i.e. that Riak CS doesn't understand it needs to reconnect
> and retry.
>
> I'm surprised more people with highly-available Riak CS installations
> haven't hit the same issues.
>
> Toby
>
> On Wed, 4 Jan 2017 at 21:42 Magnus Kessler <mkessler at basho.com> wrote:
>
> Hi Toby,
>
> As far as I know, Riak CS has none of the more advanced retry capabilities
> that Riak KV has. However, in the design of CS there seems to be an
> assumption that a CS instance will talk to a co-located KV node on the same
> host. To achieve high availability, in CS deployments HAProxy is often
> deployed in front of the CS nodes. Could you please let me know if this is
> an option for your setup?
>
> Kind Regards,
>
> Magnus
>
>
> On 4 January 2017 at 01:04, Toby Corkindale <toby at dryft.net> wrote:
>
> Hello all,
> Now that we're all back from the end-of-year holidays, I'd like to bump
> this question up.
> I feel like this has been a long-standing problem with Riak CS not
> handling dropped TCP connections.
> Last time the cause was haproxy dropping idle TCP connections after too
> long, but we solved that at the haproxy end.
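> For anyone hitting the same thing: that haproxy-side fix amounts to
> raising the idle timeouts on the PBC listener. A sketch (the values and
> addresses are illustrative, not our production settings):
>
> ```
> listen riak_pbc
>     bind *:8087
>     mode tcp
>     timeout client 24h
>     timeout server 24h
>     server riak1 10.0.0.1:8087 check
> ```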
>
> This time, it's harder -- we're failing over to a different Riak backend,
> so the TCP connections between Riak CS and Riak PBC *have* to go down, but
> Riak CS just doesn't handle it well at all.
>
> Is there a trick to configuring it better?
>
> Thanks
> Toby
>
>
> On Thu, 22 Dec 2016 at 16:48 Toby Corkindale <toby at dryft.net> wrote:
>
> Hi,
> We've been seeing some issues with Riak CS for a while in a specific
> situation. Maybe you can advise if we're doing something wrong?
>
> Our setup has redundant haproxy instances in front of a cluster of riak
> nodes, for both HTTP and PBC. The haproxy instances share a floating IP
> address.
> Only one node holds the IP, but if it goes down, another takes it up.
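> (A floating IP like this is typically managed with VRRP, e.g. via
> keepalived; a minimal sketch with every value a placeholder, not our
> actual config:
>
> ```
> vrrp_instance haproxy_vip {
>     state BACKUP
>     interface eth0
>     virtual_router_id 51
>     priority 100
>     virtual_ipaddress {
>         10.0.0.100/24
>     }
> }
> ```
>
> Whichever haproxy node wins the VRRP election holds the address.)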
>
> Our Riak CS nodes are configured to talk to the haproxy on that floating
> IP.
>
> The problem occurs if the floating IP moves from one haproxy to another.
>
> Suddenly we see a flurry of errors in riak-cs log files.
>
> This is presumably because it was holding open TCP connections, and the
> new haproxy instance doesn't know anything about them, so they get a TCP
> reset and are shut down.
>
> The problem is that riak-cs doesn't reconnect and retry immediately;
> instead it just throws a 503 error back to the client, who then retries,
> but Riak CS has a pool of a couple of hundred connections to cycle
> through, all of which throw the error!
>
> Does this sound like it is a likely description of the fault?
> Do you have any ways to mitigate this issue in Riak CS when using TCP load
> balancing above Riak PBC?
>
> Toby
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
>
> --
> Magnus Kessler
> Client Services Engineer
> Basho Technologies Limited
>
> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>
>
>
>
>

