bitcask merges & deletions

Johnny Tan johnnydtan at gmail.com
Wed Jun 15 13:22:09 EDT 2016


We're running riak-1.4.2

Every few weeks, we have a riak node that starts to slowly fill up on disk
space for several days, and then suddenly gain that space back again.

In looking into this more today, I think I see what's going on.

Per the console.log on a node that it's happening to right now, there are
an unusually large amount of merges happening right now. There are 6 total
nodes in our cluster, it's only happening to this node today. (In previous
weeks, it's been other nodes, but it's always been one node at a time.)

Normally, we get 50-70 merges per day per node (according to various nodes'
console.log, including the node in question). Yesterday and today, the node
in question has several hundred merges happening.

When I look inside the bitcask directory, I see a lot of files with this
set of permissions:
-rwSrw-r--

My understanding is that those are files marked for deletions after bitcask
merging.

The number of those files is currently growing, and from a spot-check, they
indeed match up as the files that have been merged.

So it seems the two are related: a lot of merges are happening, which then
causes a large number of files to be marked for deletion, and those marked
files are piling up and not getting deleted for some reason.

If I don't do anything, those files eventually get deleted, and everything
is good again for another couple weeks until it happens to another node.
But the disk usage does get high enough to alert us, and obviously we don't
want it to get anywhere near 100%.


I'm trying to figure out why there are times when this happens. One thing I
noticed is a difference in the merge log entries.

Here's one from a "normal" day, nearly all the entries for that day are
roughly this same length and same amount of time merging:
2016-06-10 05:27:39.426 UTC [info] <0.15230.160> Merged
{["/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/84000.bitcask.data","/var/lib/riak/bitcask/890602560248518965780370444936484965102833893376/83999.bitcask.data"],[]}
in 11.902028 seconds.

But here's one from today on the problematic node:
2016-06-15 17:13:40.626 UTC [info] <0.17903.500> Merged
{["/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83633.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83632.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83631.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83630.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83629.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83628.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83627.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83626.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83625.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83624.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83623.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83622.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83621.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83620.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83619.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83618.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83617.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83616.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83615.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83614.bitcask.data","/var/lib/riak/bitcask/1233142006497949337234359077604363797834693083136/83613.bitcask.data","/var/lib/riak/bitcask/12331420064979493372343590776043637978346930...",...],...}
in 220.186043 seconds.

It's not just that it takes 20x longer to merge, but it seems to be doing a
lot more files at once.

What is going on?

I'm not sure how much of the app.config is relevant, but I'll at least
paste just the bitcask and merge sections for now:
        {bitcask, [
                {data_root, "/var/lib/riak/bitcask"},
                {dead_bytes_merge_trigger, 268435456},
                {dead_bytes_threshold, 67108864},
                {frag_merge_trigger, 60},
                {frag_threshold, 40},
                {io_mode, erlang},
                {max_file_size, 1073741824},
                {small_file_threshold, 134217728}
        ]},
        {merge_index, [
                {buffer_rollover_size, 1048576},
                {data_root, "/var/lib/riak/merge_index"},
                {max_compact_segments, 20}
        ]},


Thanks for any insight,
johnny
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20160615/376f6809/attachment-0002.html>


More information about the riak-users mailing list