riak TS max concurrent queries + overload error

Cian Synnott cian at emauton.org
Thu Jul 28 06:15:14 EDT 2016

On Thu, Jul 28, 2016 at 6:10 AM,  <Chris.Johnson at vaisala.com> wrote:
> Thank you! I should've mentioned in my initial email that I thought we were experiencing the same bug you called out (in fact, the 2nd comment on that GitHub issue is from me).

Aha, cool. :o)

> So, what I'm really curious about is whether or not the original "overload" error is happening because we're hitting the limit on TS max concurrent queries or if Riak is actually "overloaded" and we shouldn't increase the configuration value for max concurrent queries.

I looked into this when examining the bug, and it *is* triggered by
hitting the max concurrent queries limit, which as you've noted is set
nervous-alpha-software low by default. A plain `overload` is a little
unhelpful because the same error is used deeper within Riak KV too,
but in this case I'm confident you're hitting the one in Riak TS's
query path.

> I'd like to know whether or not I should expect a certain value for max concurrent queries to be stable and performant for some given hardware specs. This is an experiment that we will probably run in house to determine a good value, but it would be great to know what range is expected to perform well.

I don't think there is a range expected to perform well, yet. The PBC
server just dying on overload suggests it hasn't really been
load-tested much at Basho, so sharing whatever you come up with on the
list would be good. :o)

> Also, I have no idea if the max concurrent queries setting includes subqueries over multiple quanta. For instance, if I have 4 TS queries hitting a Riak node configured for 12 max queries and each query spans 3 - 4 quanta, should I expect an "overload" error?

No, max concurrent queries does not count subqueries.

Digging around in the code, the max subqueries configuration is used
in the query compiler, and the error message in that case is

which I'm not sure is plumbed properly back through the PBC server's
error responses, and the code is a little more twisty than I have time
to check right now.

If I understand the code correctly, overload due to max concurrent
queries is hit when there are more than 3 queries waiting to be served
by the query FSMs, which are started around here:

So, `timeseries_max_concurrent_queries` gives us the number of query
FSMs per node. There's a short, static overflow queue of queries, and
if the FSMs can't keep up, you get the overload message.
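
To make that concrete, here is a toy Python model of the behaviour as
I've described it. It is a sketch of my reading, not the Riak TS code:
the class and method names are made up for illustration, and the
overflow limit of 3 is just the figure mentioned above.

```python
from collections import deque


class QueryNode:
    """Toy model of one node's query path (hypothetical, not Riak code).

    A fixed pool of workers stands in for the query FSMs, and a short,
    fixed-size queue stands in for the static overflow queue.
    """

    def __init__(self, max_concurrent_queries, overflow_limit=3):
        self.max_workers = max_concurrent_queries
        self.overflow_limit = overflow_limit  # assumed value, see lead-in
        self.busy = 0                         # workers currently serving a query
        self.waiting = deque()                # queries waiting for a free worker

    def submit(self, query):
        """Return 'running', 'queued' or 'overload' for a new query."""
        if self.busy < self.max_workers:
            self.busy += 1
            return "running"
        if len(self.waiting) < self.overflow_limit:
            self.waiting.append(query)
            return "queued"
        return "overload"

    def finish_one(self):
        """One worker finishes; it takes the next waiting query, if any."""
        if self.busy == 0:
            return
        if self.waiting:
            self.waiting.popleft()  # worker stays busy with the dequeued query
        else:
            self.busy -= 1


node = QueryNode(max_concurrent_queries=3)
results = [node.submit(f"q{i}") for i in range(8)]
print(results)
# ['running', 'running', 'running', 'queued', 'queued', 'queued',
#  'overload', 'overload']
```

In this model, raising `max_concurrent_queries` only adds workers; the
overflow queue stays the same length, so a burst still gets `overload`
once both are full.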

I don't know why the default number of query FSMs running per node is
so low. Perhaps early customers were using it purely interactively, at
a command prompt? In any case, try setting it a lot higher and see how
you get on.

