Solr search response time spikes

sean mcevoy sean.mcevoy at gmail.com
Mon Jul 3 06:01:25 EDT 2017


Hi List, Fred,

After a week of going cross-eyed looking at stats & trying to engineer a
test case to make this happen in the test env I think I've made a
breakthrough.

We have a low but steady level of riak traffic but our application level
actions that result in solr reads are actually fairly infrequent. And when
one of these actions occur it results in multiple parallel reads to our
solr indexes.

What I've observed is that our timeouts are most easily reproduced after a
period of inactivity. And once I see a timeout after 2 seconds I kick off
multiple other reads to random keys and observe that some return instantly
while others can take several seconds, but then all return at the same time.

It's almost as if some shards in the java VM have gone to sleep due to
inactivity and we see a cluster of timeouts when we try to read from it.

I'm setting up a "pinger" script in our prod env to keep these awake and
see if our observed timeout rate reduces.

If this is actually our problem are there any JVM config options we can use
to keep the index active all the time?

//Sean.

On Fri, Jun 23, 2017 at 1:48 PM, sean mcevoy <sean.mcevoy at gmail.com> wrote:

> Hi Fred,
>
> Thanks for taking the time!
> Yes, I noticed that unbalance yesterday when writing, looked into it after
> sending and found our config is corrupt with one node ommitted and another
> in there twice.
> But, with such low traffic levels and the spikes being on the non-favoured
> node I'm not currently ranking that as a likely factor.
>
>
> Another interesting case from last night, this sample was taken at
> 2017-6-23 06:04:09
>
> Riak node 1
> "search_query_throughput_one": 27
> "search_query_latency_max": 10417
>
> Riak node 2
> "search_query_throughput_one": 49
> "search_query_latency_max": 8952
>
> Riak node 3
> "search_query_throughput_one": 18
> "search_query_throughput_count": 2507
> "search_query_latency_min": 1757
> "search_query_latency_median": 14775
> "search_query_latency_mean": 5628361
> "search_query_latency_max": 18298854
> "search_query_latency_999": 18298854
> "search_query_latency_99": 18298854
> "search_query_latency_95": 16539782
>
> Riak node 4
> "search_query_throughput_one": 25
> "search_query_latency_max": 10217
>
>
> Brushing up my maths and focussing on node 3, from the 99 & 95% figures we
> can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in
> total.
> And from the request count for the minute & the mean we can tell that in
> total these 18 requests spent a total of 101,310 ms being processed.
> From the median & min we know the 9 quickest took between 18 & 265 ms in
> total.
> This leaves in the region of 66 sec for the other 7 requests, enough for
> all 7 to have timed out.
>
>
> Cross referencing with our application logs I can see:
>
> On application node 1 at 2017-06-23 06:03:17 we had 3 search request
> timeouts to index A with 3 different filters, one field of which, lets call
> it field X, had the same value.
> We immediately retried these and at 2017-06-23 06:03:19 2 of those timed
> out again and were retried again.
> They all succeeded on this retry, so this suggests that the same requests
> sent to other riak nodes was fine, but to this riak node at this time was a
> problem.
>
> On application node 2 at:
> 2017-06-23 06:03:27
> 2017-06-23 06:03:29
> 2017-06-23 06:03:31
> 2017-06-23 06:03:33
>
> we had 4 more timeouts on search requests to index A, these requests had 2
> different filters but in both cases field X had the same value as in the
> previous example.
>
>
> So these application logs show 9 riak timeouts, that must correlate with
> the riak stats.
> I can't definitively say that no other search requests went to this riak
> node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it
> had a problem for 18 seconds, which is quiet a big window.
>
>
> The index that all these requests were directed at currently has 490K
> entries with 8 different fields defined in each. The corresponding riak
> bucket has allow_mult = false, if that's relevant.
>
> We see a similar pattern on our test system, I'm going to setup a test to
> repeatedly do searches and see if I can trigger this consistently. Will let
> ye know if anything interesting comes out of it.
>
> I know it's relatively new to the product, do we know is riak solr used
> much in production systems?
> I assume no one else has seen these spikes?
>
> //Sean.
>
>
> On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin <fred at dushin.net> wrote:
>
>> It's pretty strange that you are seeing no search latency measurements on
>> node 5.  Are you sure your round robining is working?  Are you favoring
>> node 1?
>>
>> In general, I don't think which node you hit for query should make a
>> difference, but I'd have to stare at the code some to be sure.  In essence,
>> all the node that services the query does is convert the query into a
>> sharded Solr query based on a coverage plan, which changes every minute or
>> so, and then runs the sharded query on the local Solr node.  The Solr node
>> then distributes the query to the rest of the nodes in the cluster, but
>> that's all Solr comms -- Riak is out of the picture, by then.
>>
>> Now, if you have a lot of sharded queries accumulating on one node, that
>> might make a difference to Solr.  I am not a Solr expert, and I don't even
>> play one on TV.  But maybe the fact that you are not hitting node 5 is
>> relevant for that reason?
>>
>> Can you do more analysis on your client, to make sure you are not
>> favoring node 1?
>>
>> -Fred
>>
>> > On Jun 22, 2017, at 10:20 AM, sean mcevoy <sean.mcevoy at gmail.com>
>> wrote:
>> >
>> > Hi List,
>> >
>> > We have a standard riak cluster with 5 nodes and at the minute the
>> traffic levels are fairly low. Each of our application nodes has 25 client
>> connections, 5 to each riak node which get selected in a round robin.
>> >
>> > Our application level requests involve multiple riak requests so our
>> traffic tends to make requests in small bursts. Everything works fine for
>> KV gets, puts & deletes but we're seeing timeouts & weird response time
>> spikes on solr search operations.
>> >
>> > In the past 36 hours (the only period I have riak stats for) I see one
>> response time of 38.8 seconds, 3 hours earlier a response time of 20.8
>> seconds, and the third biggest spike is an acceptable 3.5 seconds.
>> >
>> > See below all search_query stats for the minute of the 38 sec sample.
>> In the application request we made 5 riak search requests to the same index
>> in parallel, which happens for each request of this type and normally
>> doesn't have an issue. But in this case all 5 timed out, and one timed out
>> again on retry with the other 4 succeeding.
>> >
>> > Anyone ever seen anything like this before? Is there any known deadlock
>> in solr that I might hit if I make the same request on another connection
>> before the first has completed? This is what we do when our riak client
>> times out after 2 seconds and immediately retries.
>> >
>> > Any advice or pointers welcomed.
>> > Thanks,
>> > //Sean.
>> >
>> >
>> > Riak node 1
>> > search_query_throughput_one: 14
>> > search_query_throughput_count: 259
>> > search_query_latency_min: 2776
>> > search_query_latency_median: 69411
>> > search_query_latency_mean: 4900973
>> > search_query_latency_max: 38887902
>> > search_query_latency_999: 38887902
>> > search_query_latency_99: 38887902
>> > search_query_latency_95: 2046215
>> > search_query_fail_one: 0
>> > search_query_fail_count: 0
>> >
>> > Riak node 2
>> > search_query_throughput_one: 22
>> > search_query_throughput_count: 564
>> > search_query_latency_min: 4006
>> > search_query_latency_median: 8800
>> > search_query_latency_mean: 11834
>> > search_query_latency_max: 25509
>> > search_query_latency_999: 25509
>> > search_query_latency_99: 25509
>> > search_query_latency_95: 24035
>> > search_query_fail_one: 0
>> > search_query_fail_count: 0
>> >
>> > Riak node 3
>> > search_query_throughput_one: 6
>> > search_query_throughput_count: 298
>> > search_query_latency_min: 3200
>> > search_query_latency_median: 15391
>> > search_query_latency_mean: 18062
>> > search_query_latency_max: 31759
>> > search_query_latency_999: 31759
>> > search_query_latency_99: 31759
>> > search_query_latency_95: 31759
>> > search_query_fail_one: 0
>> > search_query_fail_count: 0
>> >
>> > Riak node 4
>> > search_query_throughput_one: 8
>> > search_query_throughput_count: 334
>> > search_query_latency_min: 2404
>> > search_query_latency_median: 7230
>> > search_query_latency_mean: 10211
>> > search_query_latency_max: 22502
>> > search_query_latency_999: 22502
>> > search_query_latency_99: 22502
>> > search_query_latency_95: 22502
>> > search_query_fail_one: 0
>> > search_query_fail_count: 0
>> >
>> > Riak node 5
>> > search_query_throughput_one: 0
>> > search_query_throughput_count: 0
>> > search_query_latency_min: 0
>> > search_query_latency_median: 0
>> > search_query_latency_mean: 0
>> > search_query_latency_max: 0
>> > search_query_latency_999: 0
>> > search_query_latency_99: 0
>> > search_query_latency_95: 0
>> > search_query_fail_one: 0
>> > search_query_fail_count: 0
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > riak-users at lists.basho.com
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20170703/aff43460/attachment-0002.html>


More information about the riak-users mailing list