Riak-CS issues when Riak endpoint fails-over to new server

Shaun McVey smcvey at basho.com
Thu Jan 5 04:04:39 EST 2017


Hi Toby,

> I thought that it was recommended AGAINST talking to a co-located Riak on
> the same host?

I'm not sure where you heard that, but that's not the case.  We do
discourage running other parts of an application on the same hosts, such as
client front-ends.  Off the top of my head (which means there's probably an
exception to the rule), all our customers have nodes set up in the way
Magnus described: one CS instance talking directly to the KV instance on
the same node.  The load balancing sits between the CS node and the client.

Riak shouldn't take a particularly long time to start at all.  We have
customers that have terabytes of data per node and a KV node can be
restarted in just a minute or two.  As long as you have valid bitcask hint
files in place (which requires a proper shutdown beforehand), then a node
should come up quickly.  If you have nodes that you feel are taking a
particularly long time to start up, that may be a symptom of another issue
unrelated to this discussion.
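
For the start-up ordering concern raised in the quoted message (CS starting
before its local KV node is accepting connections), one possible approach is
to have whatever launches CS wait for the KV protocol buffers port first.
The following is only a minimal sketch under stated assumptions: the helper
name wait_for_port is hypothetical, and the host and port (127.0.0.1:8087,
the usual default PB listener) must match your own configuration.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made before `timeout` seconds
    have elapsed. This only checks that something is listening on the port,
    not that Riak KV is fully ready to serve requests.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Assumed address of the local KV protocol buffers listener.
    ready = wait_for_port("127.0.0.1", 8087, timeout=120.0)
    print("KV port is accepting connections" if ready else "timed out waiting for KV")
```

A wrapper script could call this before starting CS and give up with an
error if it returns False.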

If, for any reason, you need to shut down KV, you would just remove the CS
node from the HAProxy configuration so it doesn't receive any requests.
The other CS nodes then take the additional load.  There shouldn't be a
need to restart CS if you remove it from the load balancer.  Having said
that, as far as I'm aware you shouldn't have to worry about restarting CS
at all.  You might see failures while KV is down, but once it's up and
running again, CS will continue to deal with new requests without
problems.  Any failures to connect to its KV node should be passed to the
client/front-end, which should have all the proper logic for re-attempts or
error reporting.
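
To illustrate the kind of re-attempt logic meant here, the following is a
rough sketch of a client-side retry with exponential backoff.  It is not
tied to any particular S3 or Riak CS client library: the ServiceUnavailable
exception, the with_retries helper and the request callable are all
hypothetical names, and you would map them onto whatever your client raises
for a 503 or a dropped connection.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error a client might raise for an HTTP 503 response."""


def with_retries(request, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request() and retry on transient failures.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. The last failure is re-raised so the caller can
    report it.
    """
    for attempt in range(attempts):
        try:
            return request()
        except (ServiceUnavailable, ConnectionError):
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def fake_request():
        # Stand-in for a real request: fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ServiceUnavailable("503 from CS")
        return "ok"

    print(with_retries(fake_request, base_delay=0.1))
```

The number of attempts and the delays are arbitrary here and would need
tuning to how long a KV restart or a failover actually takes in your
environment.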

> I'm surprised more people with highly-available Riak CS installations
> haven't hit the same issues.

As I mentioned, our customers go with the setup Magnus described.  I can't
speak for setups like yours as I've not seen them in the wild.

Kind Regards,
Shaun

On Wed, Jan 4, 2017 at 11:26 PM, Toby Corkindale <toby at dryft.net> wrote:

> Hi Magnus,
> I thought that it was recommended AGAINST talking to a co-located Riak on
> the same host?
> The reason being, the local Riak will take longer to start up than Riak
> CS, once you have a sizeable amount of data. This means Riak CS starts up,
> fails to connect to Riak, and exits.
> You also end up in a situation where you must always restart Riak CS if
> you restart the co-located Riak. (Otherwise the Riak CS PBC connections
> suffer the same problem as I described in my earlier email, where Riak CS
> doesn't realise it needs to reconnect them and returns errors).
>
> Putting haproxy between Riak CS and Riak solved the problem of needing the
> local Riak to be started first.
> But it seems we were just putting the core problem off, rather than
> solving it, i.e. that Riak CS doesn't understand it needs to re-connect and
> retry.
>
> I'm surprised more people with highly-available Riak CS installations
> haven't hit the same issues.
>
> Toby
>
> On Wed, 4 Jan 2017 at 21:42 Magnus Kessler <mkessler at basho.com> wrote:
>
>> Hi Toby,
>>
>> As far as I know Riak CS has none of the more advanced retry capabilities
>> that Riak KV has. However, in the design of CS there seems to be an
>> assumption that a CS instance will talk to a co-located KV node on the same
>> host. To achieve high availability, in CS deployments HAProxy is often
>> deployed in front of the CS nodes. Could you please let me know if this is
>> an option for your setup?
>>
>> Kind Regards,
>>
>> Magnus
>>
>>
>> On 4 January 2017 at 01:04, Toby Corkindale <toby at dryft.net> wrote:
>>
>> Hello all,
>> Now that we're all back from the end-of-year holidays, I'd like to bump
>> this question up.
>> I feel like this has been a long-standing problem with Riak CS not
>> handling dropped TCP connections.
>> Last time the cause was haproxy dropping idle TCP connections after too
>> long, but we solved that at the haproxy end.
>>
>> This time, it's harder -- we're failing over to a different Riak backend,
>> so the TCP connections between Riak CS and Riak PBC *have* to go down, but
>> Riak CS just doesn't handle it well at all.
>>
>> Is there a trick to configuring it better?
>>
>> Thanks
>> Toby
>>
>>
>> On Thu, 22 Dec 2016 at 16:48 Toby Corkindale <toby at dryft.net> wrote:
>>
>> Hi,
>> We've been seeing some issues with Riak CS for a while in a specific
>> situation. Maybe you can advise if we're doing something wrong?
>>
>> Our setup has redundant haproxy instances in front of a cluster of riak
>> nodes, for both HTTP and PBC. The haproxy instances share a floating IP
>> address.
>> Only one node holds the IP, but if it goes down, another takes it up.
>>
>> Our Riak CS nodes are configured to talk to the haproxy on that floating
>> IP.
>>
>> The problem occurs if the floating IP moves from one haproxy to another.
>>
>> Suddenly we see a flurry of errors in riak-cs log files.
>>
>> This is presumably because it was holding open TCP connections, and the
>> new haproxy instance doesn't know anything about them, so they get a TCP
>> RST and are shut down.
>>
>> The problem is that riak-cs doesn't try to reconnect and retry
>> immediately; instead it just throws a 503 error back to the client, who
>> then retries. But Riak-CS has a pool of a couple of hundred connections to
>> cycle through, all of which throw the error!
>>
>> Does this sound like it is a likely description of the fault?
>> Do you have any ways to mitigate this issue in Riak CS when using TCP
>> load balancing above Riak PBC?
>>
>> Toby
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>> --
>> Magnus Kessler
>> Client Services Engineer
>> Basho Technologies Limited
>>
>> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
>>
>