High CPU on a single node in production

Josh Yudaken josh at smyte.com
Wed Jan 6 13:03:46 EST 2016


Hi Luke,

We're planning to grow to a rather large cluster soon, which was the
reason for the large ring size. Your documentation indicates ring
resizing is *not* possible with Search 2.0 [1], although an issue I
found on GitHub suggests it might be supported now? [2]

If that limitation has been lifted we might be open to resizing our
ring now, but given the trouble we're seeing with ordinary handoffs,
that seems like a bad idea. Is a 4x-oversized ring expected to break
Riak outright, like we're seeing in production, or just to add some
extra strain/latency?
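
For reference, if resizing is supported now, my understanding from the
docs is that it's staged and committed like any other cluster change.
A sketch (the target size of 256 is just an example, and I haven't
tried this against a cluster with Search enabled):

# riak-admin cluster resize-ring 256
# riak-admin cluster plan
# riak-admin cluster commit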

I've been through the tuning list multiple times and haven't found
anything left to change. I migrated the machine seeing issues to a new
host, and now the new host is seeing similar problems. Here's a
screenshot of `htop` taken just before I stopped the node to bring our
site back up [3].
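
For what it's worth, here's roughly how I've been spot-checking the
recommended settings on each host (a sketch; "sda" stands in for
whatever device backs the data volume, and the target values are the
ones I remember from the tuning page):

# sysctl vm.swappiness                 (docs recommend 0)
# cat /sys/block/sda/queue/scheduler   (noop or deadline for SSDs)
# mount | grep noatime                 (data volume mounted noatime)
# ulimit -n                            (open-files limit raised, e.g. 65536)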

Regards,
Josh

[1] http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/#Feature-Incompatibility
[2] https://github.com/basho/basho_docs/issues/1742
[3] https://slack-files.com/T031MU137-F0HRU4E94-c3ab1e776e

On Wed, Jan 6, 2016 at 6:39 AM, Luke Bakken <lbakken at basho.com> wrote:
> Hi Josh,
>
> 1024 is too large a ring size for 10 nodes. If it's possible to
> rebuild your cluster using a ring size of 128 or 256, that would be
> ideal (http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions).
> Ring resizing is possible as well
> (http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/).
>
> Have all of our recommended performance tunings been applied to every
> node in this cluster?
> (http://docs.basho.com/riak/latest/ops/tuning/linux/) - these can have
> a dramatic effect on cluster performance.
>
> --
> Luke Bakken
> Engineer
> lbakken at basho.com
>
> On Tue, Jan 5, 2016 at 10:52 AM, Josh Yudaken <josh at smyte.com> wrote:
>> Hi,
>>
>> We're attempting to use Riak as our primary key-value and search
>> database for an analytics-type solution for blocking spam/fraud.
>>
>> As we expect to eventually be handling a huge amount of data, I
>> started with a ring size of 1024. We currently have 10 nodes on Google
>> Cloud n1-standard-16 instances (16 cores, 60 GB RAM, 720 GB local SSD).
>> Disks are at about 60% usage (roughly 175 GB leveldb, 16 GB yz, 45 GB
>> anti_entropy, 6 GB yz_anti_entropy), and request-wise we're at about
>> 20k/min get, 4k/min set. Load average is usually around 6.
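>>
>> (For context on the sizing math, as I understand it: 1024 partitions
>> across 10 nodes works out to 1024 / 10 ≈ 102 vnodes per node, and with
>> Search 2.0 each vnode also carries Solr indexing and yz hashtree work,
>> so per-node overhead grows with ring size even when data volume doesn't.)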
>>
>> I'm assuming most of the issues we're seeing are Yokozuna-related, but
>> we're seeing a ton of TCP timeouts during handoffs, very slow get/set
>> queries, and a slew of other errors.
>>
>> Right now I'm trying to debug an issue where one of the 10 nodes has
>> pegged all the CPU cores, mostly in the `beam.smp` process (the Erlang VM).
>>
>> # riak-admin top
>> Output server crashed: connection_lost
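>>
>> Since `riak-admin top` keeps dying, I've been tempted to attach to the
>> node and pull the busiest processes by hand. A sketch, assuming the
>> recon library is on the code path (I believe Riak 2.x bundles it):
>>
>> # riak attach
>> 1> recon:proc_count(reductions, 10).        %% top 10 by work done
>> 2> recon:proc_count(message_queue_len, 10). %% top 10 by mailbox backlog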
>>
>> With few other options left (the node was causing slow queries across
>> the cluster), I stopped the server and saw hundreds of the following
>> (interesting) messages in the log:
>>
>> 2016-01-05 18:28:28.573 [info]
>> <0.4958.0>@yz_index_hashtree:close_trees:557 Deliberately marking YZ
>> hashtree {1458647141945490998441568260777384029383167049728,3} for
>> full rebuild on next restart
>>
>> As well as a ton of these (related, I think?):
>> 2016-01-05 18:28:31.153 [error] <0.5982.0>@yz_kv:index_internal:237
>> failed to index object
>> {{<<"features">>,<<"features">>},<<"0NKqMtj3O6_">>} with error
>> {noproc,{gen_server,call,[yz_entropy_mgr,{get_tree,1120389438774178506630754486017853682060456099840},infinity]}}
>> because
>>   [{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},
>>    {yz_kv,get_and_set_tree,1,[{file,"src/yz_kv.erl"},{line,452}]},
>>    {yz_kv,update_hashtree,4,[{file,"src/yz_kv.erl"},{line,340}]},
>>    {yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,295}]},
>>    {yz_kv,index_internal,5,[{file,"src/yz_kv.erl"},{line,224}]},
>>    {riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1619}]},
>>    {riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1607}]},
>>    {riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1398}]}]
>>
>> For reference, the TCP timeout error looks like:
>>
>> 2016-01-01 01:09:50.522 [error]
>> <0.8430.6>@riak_core_handoff_sender:start_fold:272 hinted transfer of
>> riak_kv_vnode from 'riak at riak25-2.c.authbox-api.internal'
>> 185542200051774784537577176028434367729757061120 to
>> 'riak at riak27-2.c.authbox-api.internal'
>> 185542200051774784537577176028434367729757061120 failed because of TCP
>> recv timeout
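>>
>> (One mitigation I'm considering, as a sketch: throttling handoff
>> concurrency so transfers stop tripping the recv timeout. The limit of
>> 2 below is just a guess at a gentler setting than the default.)
>>
>> # riak-admin transfers          (watch which transfers are active or stalled)
>> # riak-admin transfer-limit 2   (lower the cluster-wide handoff concurrency)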
>>
>> Any suggestions about where to look?
>>
>> Regards,
>> Josh



