leveldb Hot Threads in 1.4.9?

Tom Lanyon tom+riak at oneshoeco.com
Mon Jul 7 22:54:46 EDT 2014

Hi Matthew,

On Sunday, 6 July 2014 at 3:04, Matthew Von-Maszewski wrote: 
> Tom,
> Basho prides itself on quickly responding to all user queries. I have failed that tradition in this case. Please accept my apologies.

No problem; I appreciate you taking the time to look into our LOG.

> The LOG data suggests leveldb is not stalling, especially not for 4 hours. Therefore the problem is related to disk utilization.

That matches our experience - leveldb itself is working hard on disk operations whilst Riak fails to respond to... anything, causing an apparent 'stall' from the client application's perspective.

> You appear to have large values. I see .sst files where the average value is 100K to 1Mbyte in size. Is this intentional, or might you have a sibling problem?

Yes, we have a split between very small (headers only, no body) items and 1MB binary chunks. If we had our time again we'd probably use multi-backend to store these 1MB chunks in bitcask and keep leveldb for the small body-less items which require 2i.

> My assessment is that your lower levels are full and therefore cascading regularly. "cascading" is like the typical champagne glass pyramid you see at weddings. Once all the glasses are full, new champagne at the top causes each subsequent layer to overflow into the one below that. You have the same problem, but with data. 
> Your large values have filled each of the lower levels and regularly cause cascading data between multiple levels. The cascading is causing each 100K value write to become the equivalent of a 300K or 500K value as levels overflow. This cascading is chewing up your hard disk performance (by reducing the amount of time the hard drive has available for read requests).
By increasing the size of the lower levels (as you show below), does this mean there's more capacity for writes to occur in those levels before compaction is triggered and hence compacting them less frequently?

I guess this turns your champagne fountain analogy into more of a 'tipping bucket' where the data is no longer 'flowing' through the levels but is instead building up in each level before tipping into the next when it's at capacity?  (pictorial representation: http://4.bp.blogspot.com/_DUDhlpPD8X8/SIcN8D66j9I/AAAAAAAAASs/2Va3_n3vamk/s400/23157087_261a5da413.jpg)
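To put rough numbers on the cascading described above, here is a minimal C++ sketch. It assumes, as a simplification not stated in this thread, that every level a value cascades into costs one more full rewrite of that value; the function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: bytes physically written for one logical write of
// `value_bytes`. Assumption: the value is written once on arrival and
// rewritten once more for every level it cascades into.
std::uint64_t bytes_written(std::uint64_t value_bytes, unsigned cascaded_levels) {
    return value_bytes * (1u + cascaded_levels);
}
```

Under that assumption a 100K value that cascades through two levels costs 300K of disk writes, and through four levels 500K, which is consistent with the 300K and 500K figures quoted above.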

> The leveldb code for Riak 2.0 has increased the size of all the levels. The table of sizes is found at the top of leveldb's db/version_set.cc. You could patch your current code if desired with this table from 2.0:
> { 
> {10485760, 262144000, 57671680, 209715200, 0, 420000000, true}, 
> {10485760, 82914560, 57671680, 419430400, 0, 209715200, true}, 
> {10485760, 314572800, 57671680, 3082813440, 200000000, 314572800, false}, 
> {10485760, 419430400, 57671680, 6442450944ULL, 4294967296ULL, 419430400, false}, 
> {10485760, 524288000, 57671680, 128849018880ULL, 85899345920ULL, 524288000, false}, 
> {10485760, 629145600, 57671680, 2576980377600ULL, 1717986918400ULL, 629145600, false}, 
> {10485760, 734003200, 57671680, 51539607552000ULL, 34359738368000ULL, 734003200, false} 
> }; 
> You cannot take the entire 2.0 leveldb into your 1.4 code base due to various option changes.

I assume leveldb will just 'handle' making the levels larger once nodes are restarted with this updated configuration? I also assume that it would not be wise to then roll back the change to smaller levels after this has been done?
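To make the new sizes easier to read, here is a minimal C++ sketch that prints the fourth column of each row of the quoted table in human-readable units. It assumes the fourth column is the per-level byte limit; the email does not label the columns, and the names `kFourthColumn`, `human` and `print_levels` are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Fourth column of each row in the quoted table, levels 0 through 6.
// Assumption: this column is the per-level byte limit (not labelled in the email).
static const std::uint64_t kFourthColumn[7] = {
    209715200ULL,     419430400ULL,      3082813440ULL,    6442450944ULL,
    128849018880ULL,  2576980377600ULL,  51539607552000ULL};

// Format a byte count with binary units, one decimal place.
std::string human(std::uint64_t bytes) {
    static const char* units[] = {"B", "KiB", "MiB", "GiB", "TiB"};
    double v = static_cast<double>(bytes);
    int u = 0;
    while (v >= 1024.0 && u < 4) {
        v /= 1024.0;
        ++u;
    }
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.1f %s", v, units[u]);
    return buf;
}

// Print one line per level, e.g. "level 0: 200.0 MiB".
void print_levels() {
    for (int level = 0; level < 7; ++level) {
        std::printf("level %d: %s\n", level, human(kFourthColumn[level]).c_str());
    }
}
```

Calling `print_levels()` from a `main` lists the seven levels. If the column assumption holds, level 0 comes out at 200 MiB and level 3 at 6 GiB.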
> Let me know if this helps. I have previously hypothesized that "grooming" compactions should be limited to one thread total. However my test datasets never demonstrated a benefit. Your dataset might be the case that proves the benefit. I will go find the grooming patch to hot_threads for you if the above table proves insufficient.

Do I understand correctly that this would mean compactions would continue, but limited to one thread, so that the rest of the application can still respond to client requests?  If so, that sounds like it may help a situation like ours - although I'd wonder whether the rate-limited compaction would ever "keep up" with the inflowing data.

