High CPU on a single node in production

Fred Dushin fdushin at basho.com
Wed Jan 6 16:04:24 EST 2016


Hi Josh,

Sorry for not getting back sooner.

I am not entirely sure what is going on with your handoffs.  It could be that you have overloaded Solr with handoff activity, and that is causing vnodes to become unresponsive.  We are actively working on a fix for this, which allows vnodes to continue their work even if Solr is taking its time with ingest.  The fix also includes batching, which aggregates insertion (and deletion) operations into Solr to smooth out some of the bumps, while at the same time being a better Solr citizen.

The hundreds of close_trees entries you see in the log seem to be due to the fact that your YZ AAE trees are in need of being rebuilt.  Can it be that you have hit the magic 7 day grace period on AAE tree expiry?  The index failures you see in the logs seem to be because the yz_entropy_mgr has been shut down.  You are seeing this during a riak stop, correct?  Is the system under high indexing load at the time?  That could account for the log messages, as index operations may be coming in while yokozuna is being shut down.

Regarding ring resize, please have a look at https://github.com/basho/yokozuna/issues/279.  I do not believe these issues have been rectified, so the official line is what you see in the documentation.  You can, of course, reindex your data after a ring resize, but that is not acceptable in a production scenario, if you have an SLA around search availability.

Hope that helps, and let us know if you have any more information about what might be consuming CPU on your nodes.  I would keep a close eye on the vnode queue lengths in the Riak stats (riak_kv_vnodeq_(min|max|avg|mean|etc)).  If your vnode queues start getting deep, then vnodes are likely being blocked by Solr.
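
As a rough illustration of watching those queue stats, here is a minimal sketch that pulls out the riak_kv_vnodeq_* entries from a node's stats and flags the ones above a threshold.  It assumes the node serves its stats as JSON over the HTTP /stats endpoint; the helper names, the threshold, and the sample values are made up for illustration and are not part of Riak.

```python
import json
from urllib.request import urlopen

VNODEQ_PREFIX = "riak_kv_vnodeq_"


def fetch_stats(url):
    """Fetch and parse a node's stats document (assumed to be JSON)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def vnode_queue_stats(stats):
    """Keep only the riak_kv_vnodeq_* entries from a parsed stats document."""
    return {k: v for k, v in stats.items() if k.startswith(VNODEQ_PREFIX)}


def deep_queues(stats, threshold):
    """Return the vnode queue stats whose numeric value exceeds threshold."""
    return {
        k: v
        for k, v in vnode_queue_stats(stats).items()
        if isinstance(v, (int, float)) and v > threshold
    }


# Made-up sample values, only to show the shape of the check.
sample = {
    "riak_kv_vnodeq_mean": 3,
    "riak_kv_vnodeq_max": 250,
    "riak_kv_vnodeq_total": 1200,
    "node_gets": 20000,
}
flagged = deep_queues(sample, threshold=100)
```

Against a live node you would pass the result of fetch_stats("http://<node>:8098/stats") instead of the sample (host and port depend on your configuration), and pick a threshold that matches what is normal for your cluster.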

-Fred

> On Jan 6, 2016, at 1:03 PM, Josh Yudaken <josh at smyte.com> wrote:
> 
> Hi Luke,
> 
> We're planning on having a rather large cluster rather soon, which was
> the reason for the large ring size. Your documentation indicates ring
> resize is *not* possible with search 2.0 [1], although an issue I
> found on github indicated it might be now? [2]
> 
> If the situation is resolved we might be open to resizing our ring
> now, but given the trouble we're seeing with normal handoffs that
> seems like a bad idea. Is the 4x ring size expected to completely
> break Riak like we're seeing in production, or just a bit of extra
> strain/latency?
> 
> I've been through the tuning list multiple times, and haven't seen any
> changes. I migrated the machine seeing issues to a new host, and now
> the new host is seeing similar problems. Here's a screenshot of `htop`
> just before I stopped the node in order to bring our site back up [3].
> 
> Regards,
> Josh
> 
> [1] http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/#Feature-Incompatibility
> [2] https://github.com/basho/basho_docs/issues/1742
> [3] https://slack-files.com/T031MU137-F0HRU4E94-c3ab1e776e
> 
> On Wed, Jan 6, 2016 at 6:39 AM, Luke Bakken <lbakken at basho.com> wrote:
>> Hi Josh,
>> 
>> 1024 is too large of a ring size for 10 nodes. If it's possible to
>> rebuild your cluster using a ring size of 128 or 256 that would be
>> ideal (http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions).
>> Ring resizing is possible as well
>> (http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/).
>> 
>> Have all of our recommended performance tunings been applied to every
>> node in this cluster?
>> (http://docs.basho.com/riak/latest/ops/tuning/linux/) - these can have
>> a dramatic effect on cluster performance.
>> 
>> --
>> Luke Bakken
>> Engineer
>> lbakken at basho.com
>> 
>> On Tue, Jan 5, 2016 at 10:52 AM, Josh Yudaken <josh at smyte.com> wrote:
>>> Hi,
>>> 
>>> We're attempting to use Riak as our primary key-value and search
>>> database for an analytics-typed solution to blocking spam/fraud.
>>> 
>>> As we expect to eventually be handling a huge amount of data, I
>>> started with a ring size of 1024. We currently have 10 nodes on Google
>>> Cloud n1-standard-16 instances [ 16 cores, 60gb RAM, 720gb local ssd.
>>> ]. Disks are at about 60% usage [ roughly 175gb leveldb, 16gb yz, 45gb
>>> anti_entropy, 6gb yz_anti_entropy ], and request wise we're at about
>>> 20k/min get, 4k/min set. Load average is usually around 6.
>>> 
>>> I'm assuming most of the issues we're seeing are Yokozuna related, but
>>> we're seeing a ton of tcp timeouts during handoffs, very slow get/set
>>> queries, and a slew of other errors.
>>> 
>>> Right now I'm trying to debug an issue where one of the 10 nodes
>>> pegged all the cpu cores. Mostly with the `beam` process.
>>> 
>>> # riak-admin top
>>> Output server crashed: connection_lost
>>> 
>>> With few other options (as it was causing slow queries across the
>>> cluster) I stopped the server and saw hundreds of the following
>>> (interesting) messages in the log:
>>> 
>>> 2016-01-05 18:28:28.573 [info]
>>> <0.4958.0>@yz_index_hashtree:close_trees:557 Deliberately marking YZ
>>> hashtree {1458647141945490998441568260777384029383167049728,3} for
>>> full rebuild on next restart
>>> 
>>> As well as a ton of (I think related?):
>>> 2016-01-05 18:28:31.153 [error] <0.5982.0>@yz_kv:index_internal:237
>>> failed to index object
>>> {{<<"features">>,<<"features">>},<<"0NKqMtj3O6_">>} with error
>>> {noproc,{gen_server,call,[yz_entropy_mgr,{get_tree,1120389438774178506630754486017853682060456099840},infinity]}}
>>> because [{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},{yz_kv,get_and_set_tree,1,[{file,"src/yz_kv.erl"},{line,452}]},{yz_kv,update_hashtree,4,[{file,"src/yz_kv.erl"},{line,340}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,295}]},{yz_kv,index_internal,5,[{file,"src/yz_kv.erl"},{line,224}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1619}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1607}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1398}]}]
>>> 
>>> For reference the TCP timeout error looks like:
>>> 
>>> 2016-01-01 01:09:50.522 [error]
>>> <0.8430.6>@riak_core_handoff_sender:start_fold:272 hinted transfer of
>>> riak_kv_vnode from 'riak at riak25-2.c.authbox-api.internal'
>>> 185542200051774784537577176028434367729757061120 to
>>> 'riak at riak27-2.c.authbox-api.internal'
>>> 185542200051774784537577176028434367729757061120 failed because of TCP
>>> recv timeout
>>> 
>>> Any suggestions about where to look?
>>> 
>>> Regards,
>>> Josh
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
