Single node causing cluster to be extremely slow (leveldb)
graphex at graphex.com
Fri Jan 10 11:59:04 EST 2014
Excellent and informative explanation, thank you very much. We’re very happy that our adjustments have returned the cluster to its normal operating parameters. Also glad that Riak 2 will be handling this stuff programmatically, as prior to your spreadsheet and explanation it was pure voodoo for us. I think the automation will significantly decrease the number of animal sacrifices needed to appease the levelDB gods! :)
On Jan 10, 2014, at 9:18 AM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
> Attached is the spreadsheet I used for deriving the cache_size and max_open_files. The general guidelines of the spreadsheet are:
> vnode count: ring size divided by (number of nodes minus one)
> write_buf_min/max: don't touch … you will screw up my leveldb tuning
> cache_size: 8Mbytes is hard minimum
> max_open_files: this is NOT a file count in 1.4. It is 4Mbytes times the value. File cache is meta-data size based, not file count.
> lower cache_size and raise max_open_files as necessary to keep "remaining" close to zero AND cover your total file metadata size
> What is file metadata size? I looked at one vnode's LOG file for rough estimates:
> - Your total file count was 1,479 in one vnode
> - You typically hit the 75,000 key limit
> - A typical file size divided by the key count (75,000) comes to 496 bytes … used 496 as the average value size
> - Block_size is 4096. The 496-byte value size goes into the block size about 10 times (no need for fractions since block_size is a threshold, not a fixed value)
> - 75,000 total keys in the file, 10 keys per block … that means 7,500 keys in the file's index … 100 bytes per key is 750,000 bytes of keys in the index.
> - bloom filter is 2 bytes per key (all 75,000 keys) or 150,000 bytes
> - metadata loaded into the file cache is therefore 750,000 + 150,000 bytes per file, or 900,000 bytes.
> - 900,000 bytes per file times 1,479 files is 1,331,100,000 bytes of file cache needed …
> Your original 315 max_open_files is 1,279,262,720 in size (315 * 4Mbytes) … file cache is thrashing since 1,279,262,720 is less than 1,331,100,000.
> I told you 425 as a max_open_files setting; the spreadsheet has 400 as a more conservative number.
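[For anyone who wants to redo this arithmetic against their own LOG files, here is a small escript that walks through the same estimate. It is only a back-of-envelope sketch of the reasoning in Matthew's mail -- the file count, key limit, keys-per-block rounding, and the ~100 bytes per index entry are the figures quoted above, not universal constants, and the script is not a Basho-supplied tool.]

    #!/usr/bin/env escript
    %% file_cache_estimate.escript -- back-of-envelope sketch of the metadata
    %% math above; the figures are the ones quoted in this thread, so
    %% substitute your own numbers from a vnode's LOG file.
    main(_) ->
        Files        = 1479,                         % .sst files in one vnode
        KeysPerFile  = 75000,                        % compaction key limit typically hit
        KeysPerBlock = 10,                           % block_size (4096) / avg value (496), as rounded in the mail
        IndexKeys    = KeysPerFile div KeysPerBlock, % ~7,500 entries in the file's index
        IndexBytes   = IndexKeys * 100,              % ~100 bytes per index entry
        BloomBytes   = KeysPerFile * 2,              % bloom filter: 2 bytes per key
        PerFile      = IndexBytes + BloomBytes,      % ~900,000 bytes of metadata per file
        Needed       = PerFile * Files,              % ~1.33 GB of file cache needed
        MaxOpenFiles = 425,                          % candidate setting (1.4.x: 4 MB units)
        Budget       = MaxOpenFiles * 4 * 1024 * 1024,
        io:format("metadata needed:   ~p bytes~n", [Needed]),
        io:format("file cache budget: ~p bytes~n", [Budget]).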
> On Jan 10, 2014, at 9:41 AM, Martin May <martin at push.io> wrote:
>> Hi Matthew,
>> We applied this change to node 4, started it up, and it seems much happier (no crazy CPU). We’re going to keep an eye on it for a little while, and then apply this setting to all the other nodes as well.
>> Is there anything we can do to prevent this scenario in the future, or should the settings you suggested take care of that?
>> On Jan 10, 2014, at 6:42 AM, Matthew Von-Maszewski <matthewv at basho.com> wrote:
>>> I did some math based upon the app.config and LOG files. I am guessing that you are starting to thrash your file cache.
>>> This theory should be easy to prove / disprove. On that one node, change the cache_size and max_open_files to:
>>> cache_size 68435456
>>> max_open_files 425
>>> If I am correct, the node should come up and not cause problems. We are trading block cache space for file cache space. A miss in the file cache is far more costly than a miss in the block cache.
>>> Let me know how this works for you. It is possible that we might want to talk about raising your block size slightly to reduce file cache overhead.
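[For reference, the two values suggested above live in the eleveldb section of app.config and apply per vnode. A minimal sketch of that section with the change applied -- the data_root path is just a placeholder, and any other keys already in the section are assumed to stay as they are:]

    %% app.config -- eleveldb section only, showing the suggested change
    {eleveldb, [
        {data_root, "/var/lib/riak/leveldb"},  %% placeholder path
        {cache_size, 68435456},                %% block cache per vnode, in bytes
        {max_open_files, 425}                  %% 1.4.x: file cache budget of 425 * 4 MB
    ]}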
>>> On Jan 9, 2014, at 9:33 PM, Sean McKibben <graphex at graphex.com> wrote:
>>>> We have a 5 node cluster using elevelDB (1.4.2) and 2i, and this afternoon it started responding extremely slowly. CPU on member 4 was extremely high and we restarted that process, but it didn’t help. We temporarily shut down member 4 and cluster speed returned to normal, but as soon as we boot member 4 back up, the cluster performance goes to shit.
>>>> We’ve run into this before, but back then we were able to just start with a fresh set of data after wiping the machines, as it was before we migrated to this bare-metal cluster. Now it is causing some pretty significant issues and we’re not sure what we can do to get it back to normal; many of our queues are filling up and we’ll probably have to take node 4 off again just so we can provide a regular quality of service.
>>>> We’ve turned off AAE on node 4 but it hasn’t helped. We have some transfers that need to happen but they are going very slowly.
>>>> 'riak-admin top’ on node 4 reports this:
>>>> Load: cpu 610 Memory: total 503852 binary 231544
>>>> procs 804 processes 179850 code 11588
>>>> runq 134 atom 533 ets 4581
>>>> Pid Name or Initial Func Time Reds Memory MsgQ Current Function
>>>> <6175.29048.3> proc_lib:init_p/5 '-' 462231 51356760 0 mochijson2:json_bin_is_safe/1
>>>> <6175.12281.6> proc_lib:init_p/5 '-' 307183 64195856 1 gen_fsm:loop/7
>>>> <6175.1581.5> proc_lib:init_p/5 '-' 286143 41085600 0 mochijson2:json_bin_is_safe/1
>>>> <6175.6659.0> proc_lib:init_p/5 '-' 281845 13752 0 sext:decode_binary/3
>>>> <6175.6666.0> proc_lib:init_p/5 '-' 209113 21648 0 sext:decode_binary/3
>>>> <6175.12219.6> proc_lib:init_p/5 '-' 168832 16829200 0 riak_client:wait_for_query_results/4
>>>> <6175.8403.0> proc_lib:init_p/5 '-' 133333 13880 1 eleveldb:iterator_move/2
>>>> <6175.8813.0> proc_lib:init_p/5 '-' 119548 9000 1 eleveldb:iterator/3
>>>> <6175.8411.0> proc_lib:init_p/5 '-' 115759 34472 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
>>>> <6175.5679.0> proc_lib:init_p/5 '-' 109577 8952 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
>>>> Output server crashed: connection_lost
>>>> Based on that, is there anything anyone can think of to do to try to bring performance back into the land of usability? Does this appear to be something that may have been resolved in 1.4.6 or 1.4.7?
>>>> The only thing we can think of at this point might be to remove or force-remove the member and join in a new, freshly built one, but the last time we attempted that (on a different cluster) our secondary indexes got irreparably damaged and only regained consistency when we copied every individual key over to the (then) new cluster! Not a good experience :( but I’m hopeful that 1.4.6 may have addressed some of our issues.
>>>> Any help is appreciated.
>>>> Thank you,
>>>> Sean McKibben