Yokozuna indexes too slow

Fred Dushin fdushin at basho.com
Fri Oct 2 10:19:25 EDT 2015


Hi Ilyas,

So, from those stats it looks like it's taking around 3 seconds on average to write a message into Solr (search_index_latency_mean, in microseconds), and the max value is almost 27 seconds.  So there is definitely something wrong with the way Solr is behaving, or possibly with the way you are indexing your data, though from your description everything looks fine -- 200 bytes of JSON, which I assume contains fields like username, device id, IP address, etc., looking at your schema.  You should definitely be able to get better performance out of Solr.  That said, those stats are taken in Yokozuna, so I wouldn't rule out something going on in Riak, at least not without knowing more.

One place you might start looking is the JVM stats, via JConsole for example, to see if there is anything suspicious going on with the JVM.  You should at least be able to get stats from the garbage collector, heap size, etc.

I would also recommend using tools on your Debian machine like top, vmstat, and iostat to get a picture of how much time is being spent in Solr vs Riak.  It would be interesting to see what the CPU and I/O behavior of Riak and Java are in this case.

While you don't necessarily need collectd to track stats, I have some collectd/python scripts for gathering data about Riak and the JVM.  Please feel free to pilfer/use at your discretion (the proc man page is helpful here).  All they do is scrape the /proc file system to get stats about CPU time, I/O, etc.  There is also some collectd config for collecting stats via JMX from the JVM, in case you are interested in that.
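To give a flavor of what those scripts do (this is a minimal sketch, not the collectd scripts themselves), here is a small Python example that pulls per-process CPU counters out of /proc.  The field positions follow the proc(5) man page; the helper name is just for illustration:

```python
# Minimal sketch of scraping per-process CPU stats from /proc,
# in the spirit of the collectd scripts linked below (not the real thing).
# Field layout follows the proc(5) man page for /proc/<pid>/stat.
import os

def parse_proc_stat(stat_line):
    """Extract utime and stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The process name (field 2) is in parens and can contain spaces,
    # so split after the closing paren rather than naively on whitespace.
    rest = stat_line.rsplit(')', 1)[1].split()
    # After the paren, rest[0] is the state (field 3), so utime/stime
    # (fields 14 and 15 overall) land at rest[11] and rest[12].
    utime, stime = int(rest[11]), int(rest[12])
    return utime, stime

if __name__ == "__main__":
    if os.path.exists("/proc/self/stat"):  # Linux only
        with open("/proc/self/stat") as f:
            utime, stime = parse_proc_stat(f.read())
        print("utime=%d stime=%d (clock ticks)" % (utime, stime))
```

A collectd plugin would just run something like this on an interval for the beam.smp and Solr JVM pids and emit the deltas.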

https://github.com/fadushin/riak_puppet_stuff/tree/master/modules/riak_node/files/collectd

Yokozuna uses ibrowse to connect (via HTTP) to Solr, and there is a way to set the ibrowse connection pool to something larger than 10.  You might be able to get better throughput that way, but I would first try to sort out why the latency is so bad.

To change the size of the ibrowse connection pool, attach to Riak, and set the ibrowse default_max_sessions environment variable to something greater than 10, e.g., 100.

prompt$ riak attach
Remote Shell: Use "Ctrl-C a" to quit. q() or init:stop() will terminate the riak node.
Erlang R16B02_basho8 (erts-5.10.3) [source] [64-bit] [async-threads:10] [kernel-poll:false] [frame-pointer]

Eshell V5.10.3  (abort with ^G)
(riak@192.168.1.202)1> rpc:multicall(application, get_env, [ibrowse, default_max_sessions]).
{[undefined,undefined,undefined,undefined,undefined],[]}
(riak@192.168.1.202)2> rpc:multicall(application, set_env, [ibrowse, default_max_sessions, 100]).
{[ok,ok,ok,ok,ok],[]}
(riak@192.168.1.202)3> rpc:multicall(application, get_env, [ibrowse, default_max_sessions]).
{[{ok,100},{ok,100},{ok,100},{ok,100},{ok,100}],[]}
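One caveat: an environment variable set via riak attach like this only lasts until the node restarts.  If the larger pool helps, you can persist it across restarts via app env overrides in Riak's advanced.config (assuming Riak 2.x; the path varies by install, commonly /etc/riak/advanced.config):

```erlang
%% advanced.config -- application env overrides applied at boot.
[
  {ibrowse, [
    %% Raise ibrowse's per-destination connection pool from the default of 10.
    {default_max_sessions, 100}
  ]}
].
```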

One other thought -- does any of your JSON contain internationalized data, and if so, how is it encoded (e.g., UTF-8, UTF-16, ISO-8859-1)?  Your etop listing didn't suggest anything out of sorts with the extractors, but we might want to get a handle on what is going on there as well.
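If you want a quick way to check, a small Python sketch along these lines will tell you whether a payload is clean UTF-8 before it ever reaches the extractor (the sample documents here are made up for illustration):

```python
# -*- coding: utf-8 -*-
# Quick check: is this byte payload valid UTF-8?  Solr generally expects
# UTF-8, so a payload that is actually ISO-8859-1 or UTF-16 can cause
# extraction or indexing trouble.

def is_valid_utf8(payload):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    utf8_doc = u'{"username": "m\u00fcller"}'.encode("utf-8")
    latin1_doc = u'{"username": "m\u00fcller"}'.encode("iso-8859-1")
    print(is_valid_utf8(utf8_doc))    # the UTF-8 encoding decodes cleanly
    print(is_valid_utf8(latin1_doc))  # Latin-1 0xFC is not valid UTF-8
```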

Since you quoted issue 320, are the Java errors in your logs associated with broken pipes on the Solr server?  That would suggest we are getting timeouts on the client (Yokozuna) side, and connections are getting closed before the server can write a response.  But I think the default timeout is 60 seconds, so you shouldn't be hitting that, judging from your stats -- though those stats are taken from a relatively small time window.

I hope that helps you diagnose where the bottleneck is.  Keep us posted.

-Fred

> On Oct 2, 2015, at 2:23 AM, ilyas <i.sergeev at keepsolid.com> wrote:
> 
> 
> It looks like this issue:
> 
> https://github.com/basho/yokozuna/issues/320
> 
> I tried setting
> maxThreads to 150
> Acceptors to 10
> lowResourcesMaxIdleTime to 50000
> in /usr/lib/riak/lib/yokozuna/priv/solr/etc/jetty.xml as recommended in https://github.com/basho/yokozuna/issues/330
> 
> but it had no effect
> 
> On 10/01/2015 11:53 PM, Fred Dushin wrote:
>> Is there any more information in these logs that you can share?  For example, is this the only entry with this exception?  Or are there more?  Are there any associated stack traces?  An EOF exception can come from many different scenarios.
>> 
>> Is there anything in the Riak console.log that looks suspicious?
>> 
>> Finally, you might want to take a look at what is going on inside of riak when you get into this state (slow writes to Solr), by looking at Riak stats.
>> 
>> You can get to Riak stats via curl, e.g.,
>> 
>> 	curl http://localhost:8098/stats | python -m json.tool
> ok, output is attached
> 
>> Stats you might want to pay special attention to:
>> 
>> riak_kv_vnodeq (min, max, median, etc.) -- the aggregate length of the vnode queues.  Long vnode queues may mean your vnodes are blocked waiting on Solr.
>> vnode_put_fsm_time (mean, median, percentile, etc.) -- the average amount of time spent waiting for a vnode put to complete.  Long times may also be indicative of waits writing into Solr.
> "riak_kv_vnodeq_max": 0,
> "riak_kv_vnodeq_mean": 0.0,
> "riak_kv_vnodeq_median": 0,
> "riak_kv_vnodeq_min": 0,
> "riak_kv_vnodeq_total": 0,
> 
> 
> <riak_stats.txt>


