Solr search response time spikes

sean mcevoy sean.mcevoy at gmail.com
Fri Jun 23 08:48:10 EDT 2017


Hi Fred,

Thanks for taking the time!
Yes, I noticed that unbalance yesterday when writing, looked into it after
sending and found our config is corrupt with one node ommitted and another
in there twice.
But, with such low traffic levels and the spikes being on the non-favoured
node I'm not currently ranking that as a likely factor.


Another interesting case from last night, this sample was taken at
2017-6-23 06:04:09

Riak node 1
"search_query_throughput_one": 27
"search_query_latency_max": 10417

Riak node 2
"search_query_throughput_one": 49
"search_query_latency_max": 8952

Riak node 3
"search_query_throughput_one": 18
"search_query_throughput_count": 2507
"search_query_latency_min": 1757
"search_query_latency_median": 14775
"search_query_latency_mean": 5628361
"search_query_latency_max": 18298854
"search_query_latency_999": 18298854
"search_query_latency_99": 18298854
"search_query_latency_95": 16539782

Riak node 4
"search_query_throughput_one": 25
"search_query_latency_max": 10217


Brushing up my maths and focussing on node 3, from the 99 & 95% figures we
can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in
total.
And from the request count for the minute & the mean we can tell that in
total these 18 requests spent a total of 101,310 ms being processed.
>From the median & min we know the 9 quickest took between 18 & 265 ms in
total.
This leaves in the region of 66 sec for the other 7 requests, enough for
all 7 to have timed out.


Cross referencing with our application logs I can see:

On application node 1 at 2017-06-23 06:03:17 we had 3 search request
timeouts to index A with 3 different filters, one field of which, lets call
it field X, had the same value.
We immediately retried these and at 2017-06-23 06:03:19 2 of those timed
out again and were retried again.
They all succeeded on this retry, so this suggests that the same requests
sent to other riak nodes was fine, but to this riak node at this time was a
problem.

On application node 2 at:
2017-06-23 06:03:27
2017-06-23 06:03:29
2017-06-23 06:03:31
2017-06-23 06:03:33

we had 4 more timeouts on search requests to index A, these requests had 2
different filters but in both cases field X had the same value as in the
previous example.


So these application logs show 9 riak timeouts, that must correlate with
the riak stats.
I can't definitively say that no other search requests went to this riak
node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it
had a problem for 18 seconds, which is quiet a big window.


The index that all these requests were directed at currently has 490K
entries with 8 different fields defined in each. The corresponding riak
bucket has allow_mult = false, if that's relevant.

We see a similar pattern on our test system, I'm going to setup a test to
repeatedly do searches and see if I can trigger this consistently. Will let
ye know if anything interesting comes out of it.

I know it's relatively new to the product, do we know is riak solr used
much in production systems?
I assume no one else has seen these spikes?

//Sean.


On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin <fred at dushin.net> wrote:

> It's pretty strange that you are seeing no search latency measurements on
> node 5.  Are you sure your round robining is working?  Are you favoring
> node 1?
>
> In general, I don't think which node you hit for query should make a
> difference, but I'd have to stare at the code some to be sure.  In essence,
> all the node that services the query does is convert the query into a
> sharded Solr query based on a coverage plan, which changes every minute or
> so, and then runs the sharded query on the local Solr node.  The Solr node
> then distributes the query to the rest of the nodes in the cluster, but
> that's all Solr comms -- Riak is out of the picture, by then.
>
> Now, if you have a lot of sharded queries accumulating on one node, that
> might make a difference to Solr.  I am not a Solr expert, and I don't even
> play one on TV.  But maybe the fact that you are not hitting node 5 is
> relevant for that reason?
>
> Can you do more analysis on your client, to make sure you are not favoring
> node 1?
>
> -Fred
>
> > On Jun 22, 2017, at 10:20 AM, sean mcevoy <sean.mcevoy at gmail.com> wrote:
> >
> > Hi List,
> >
> > We have a standard riak cluster with 5 nodes and at the minute the
> traffic levels are fairly low. Each of our application nodes has 25 client
> connections, 5 to each riak node which get selected in a round robin.
> >
> > Our application level requests involve multiple riak requests so our
> traffic tends to make requests in small bursts. Everything works fine for
> KV gets, puts & deletes but we're seeing timeouts & weird response time
> spikes on solr search operations.
> >
> > In the past 36 hours (the only period I have riak stats for) I see one
> response time of 38.8 seconds, 3 hours earlier a response time of 20.8
> seconds, and the third biggest spike is an acceptable 3.5 seconds.
> >
> > See below all search_query stats for the minute of the 38 sec sample. In
> the application request we made 5 riak search requests to the same index in
> parallel, which happens for each request of this type and normally doesn't
> have an issue. But in this case all 5 timed out, and one timed out again on
> retry with the other 4 succeeding.
> >
> > Anyone ever seen anything like this before? Is there any known deadlock
> in solr that I might hit if I make the same request on another connection
> before the first has completed? This is what we do when our riak client
> times out after 2 seconds and immediately retries.
> >
> > Any advice or pointers welcomed.
> > Thanks,
> > //Sean.
> >
> >
> > Riak node 1
> > search_query_throughput_one: 14
> > search_query_throughput_count: 259
> > search_query_latency_min: 2776
> > search_query_latency_median: 69411
> > search_query_latency_mean: 4900973
> > search_query_latency_max: 38887902
> > search_query_latency_999: 38887902
> > search_query_latency_99: 38887902
> > search_query_latency_95: 2046215
> > search_query_fail_one: 0
> > search_query_fail_count: 0
> >
> > Riak node 2
> > search_query_throughput_one: 22
> > search_query_throughput_count: 564
> > search_query_latency_min: 4006
> > search_query_latency_median: 8800
> > search_query_latency_mean: 11834
> > search_query_latency_max: 25509
> > search_query_latency_999: 25509
> > search_query_latency_99: 25509
> > search_query_latency_95: 24035
> > search_query_fail_one: 0
> > search_query_fail_count: 0
> >
> > Riak node 3
> > search_query_throughput_one: 6
> > search_query_throughput_count: 298
> > search_query_latency_min: 3200
> > search_query_latency_median: 15391
> > search_query_latency_mean: 18062
> > search_query_latency_max: 31759
> > search_query_latency_999: 31759
> > search_query_latency_99: 31759
> > search_query_latency_95: 31759
> > search_query_fail_one: 0
> > search_query_fail_count: 0
> >
> > Riak node 4
> > search_query_throughput_one: 8
> > search_query_throughput_count: 334
> > search_query_latency_min: 2404
> > search_query_latency_median: 7230
> > search_query_latency_mean: 10211
> > search_query_latency_max: 22502
> > search_query_latency_999: 22502
> > search_query_latency_99: 22502
> > search_query_latency_95: 22502
> > search_query_fail_one: 0
> > search_query_fail_count: 0
> >
> > Riak node 5
> > search_query_throughput_one: 0
> > search_query_throughput_count: 0
> > search_query_latency_min: 0
> > search_query_latency_median: 0
> > search_query_latency_mean: 0
> > search_query_latency_max: 0
> > search_query_latency_999: 0
> > search_query_latency_99: 0
> > search_query_latency_95: 0
> > search_query_fail_one: 0
> > search_query_fail_count: 0
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20170623/e9393a5f/attachment-0002.html>


More information about the riak-users mailing list