leveldb Hot Threads in 1.4.9?

Matthew Von-Maszewski matthewv at basho.com
Tue Jul 8 09:28:08 EDT 2014

Responses inline.

On Jul 7, 2014, at 10:54 PM, Tom Lanyon <tom+riak at oneshoeco.com> wrote:

> Hi Matthew,
> On Sunday, 6 July 2014 at 3:04, Matthew Von-Maszewski wrote: 
>> Tom,
>> Basho prides itself on quickly responding to all user queries. I have failed that tradition in this case. Please accept my apologies.
> No problem; I appreciate you taking the time to look into our LOG.
>> The LOG data suggests leveldb is not stalling, especially not for 4 hours. Therefore the problem is related to disk utilization.
> That matches our experience - leveldb itself is working hard on disk operations whilst Riak fails to respond to... anything, causing an apparent 'stall' from the client application's perspective.
>> You appear to have large values. I see .sst files where the average value is 100K to 1Mbyte in size. Is this intentional, or might you have a sibling problem?
> Yes, we have a split between very small (headers only, no body) items and 1MB binary chunks.  If we had our time again we'd probably use multi-backend to store these 1MB chunks in bitcask and keep leveldb for the small body-less items which require 2i.
>> My assessment is that your lower levels are full and therefore cascading regularly. "cascading" is like the typical champagne glass pyramid you see at weddings. Once all the glasses are full, new champagne at the top causes each subsequent layer to overflow into the one below that. You have the same problem, but with data. 
>> Your large values have filled each of the lower levels and regularly cause cascading data between multiple levels. The cascading is causing each 100K value write to become the equivalent of a 300K or 500K value as levels overflow. This cascading is chewing up your hard disk performance (by reducing the amount of time the hard drive has available for read requests).
> By increasing the size of the lower levels (as you show below), does this mean there's more capacity for writes to occur in those levels before compaction is triggered and hence compacting them less frequently?


> I guess this turns your champagne fountain analogy into more of a 'tipping bucket' where the data is no longer 'flowing' through the levels but is instead building up in each level before tipping into the next when it's at capacity?  (pictorial representation: http://4.bp.blogspot.com/_DUDhlpPD8X8/SIcN8D66j9I/AAAAAAAAASs/2Va3_n3vamk/s400/23157087_261a5da413.jpg)

Very good photo.  May have to save it for some future presentation.  Though I was visualizing champagne glasses versus large Oktoberfest beer mugs.

>> The leveldb code for Riak 2.0 has increased the size of all the levels. The table of sizes is found at the top of leveldb's db/version_set.cc. You could patch your current code if desired with this table from 2.0:
>> { 
>> {10485760, 262144000, 57671680, 209715200, 0, 420000000, true}, 
>> {10485760, 82914560, 57671680, 419430400, 0, 209715200, true}, 
>> {10485760, 314572800, 57671680, 3082813440, 200000000, 314572800, false}, 
>> {10485760, 419430400, 57671680, 6442450944ULL, 4294967296ULL, 419430400, false}, 
>> {10485760, 524288000, 57671680, 128849018880ULL, 85899345920ULL, 524288000, false}, 
>> {10485760, 629145600, 57671680, 2576980377600ULL, 1717986918400ULL, 629145600, false}, 
>> {10485760, 734003200, 57671680, 51539607552000ULL, 34359738368000ULL, 734003200, false} 
>> }; 
>> You cannot take the entire 2.0 leveldb into your 1.4 code base due to various option changes.
> I assume leveldb will just 'handle' making the levels larger once nodes are restarted with this updated configuration?  I also assume that it would not be wise to then rollback the change to smaller levels after this has been done?

Yes, it "just works".  Rolling back to the smaller sizes would cause leveldb to churn for a long time, as you assumed.

>> Let me know if this helps. I have previously hypothesized that "grooming" compactions should be limited to one thread total. However my test datasets never demonstrated a benefit. Your dataset might be the case that proves the benefit. I will go find the grooming patch to hot_threads for you if the above table proves insufficient.
> Do I understand correctly that this would mean compactions would continue, but limited to one thread, so that the rest of the application can still respond to client requests?  If so, that sounds like it may help a situation like ours - although I'd wonder whether the rate-limited compaction would ever "keep up" with the inflowing data.

The fourth and fifth columns of the table above represent compaction thresholds.  The fourth column is the size at which leveldb should start "grooming" a level (compacting data from this level into the next level).  The fifth column is the size at which leveldb decides that compactions are not keeping pace with incoming writes; this is the point where the write throttle gets applied to incoming user Write calls.  My theory is that, in the buffer zone between the fourth and fifth column sizes, compaction would be limited to one thread to keep I/O bandwidth available for your read operations.  If compactions get too far behind, all threads would activate again once the level size exceeds the fifth column amount.
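
A rough sketch of that theory follows. It is not leveldb's code; every name in it (LevelThresholds, AllowedCompactionThreads, WriteThrottleActive) is hypothetical. Note that in the table as quoted, the fourth value is larger than the fifth in rows three through seven, so the sketch identifies the two thresholds by role (groom versus throttle) rather than by column position.

```cpp
#include <cstdint>

// Hypothetical illustration only: none of these names come from leveldb.
struct LevelThresholds {
    uint64_t groom_bytes;     // level size at which grooming compactions begin
    uint64_t throttle_bytes;  // level size beyond which compactions are "behind"
};

// How many compaction threads may work on a level of the given size.
// Below the groom threshold: none. In the buffer zone between the two
// thresholds: exactly one, leaving I/O bandwidth for reads. Beyond the
// throttle threshold: every available thread.
constexpr int AllowedCompactionThreads(uint64_t level_bytes,
                                       LevelThresholds t,
                                       int total_threads) {
    return level_bytes < t.groom_bytes      ? 0
         : level_bytes <= t.throttle_bytes  ? 1
                                            : total_threads;
}

// The write throttle applies once the level exceeds the throttle threshold.
constexpr bool WriteThrottleActive(uint64_t level_bytes, LevelThresholds t) {
    return level_bytes > t.throttle_bytes;
}
```

For example, with thresholds of 4294967296 and 6442450944 bytes and four compaction threads, a 5 GB level would be groomed by one thread, while a 7 GB level would get all four threads plus the write throttle.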

And while we are talking tuning, I have data from different loads that suggests the fourth and fifth numbers of row four above should be doubled (6442450944, 4294967296 becoming 12884901888, 8589934592).  This makes the two values 10x smaller than those of row five.  Currently the row four thresholds are 20x smaller than row five.  No other rows change.
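
The arithmetic behind that suggestion can be checked at compile time. This is only a sanity check on the numbers from the table; the constant names are made up for the example.

```cpp
#include <cstdint>

// Fourth and fifth values of row four as they appear in the 2.0 table.
constexpr uint64_t kRow4Fourth = 6442450944ULL;
constexpr uint64_t kRow4Fifth  = 4294967296ULL;

// Fourth and fifth values of row five, which stay unchanged.
constexpr uint64_t kRow5Fourth = 128849018880ULL;
constexpr uint64_t kRow5Fifth  = 85899345920ULL;

// The proposed tuning simply doubles both row-four values.
constexpr uint64_t Doubled(uint64_t v) { return v * 2; }

// Doubling yields the proposed replacement values.
static_assert(Doubled(kRow4Fourth) == 12884901888ULL, "doubled fourth value");
static_assert(Doubled(kRow4Fifth)  == 8589934592ULL,  "doubled fifth value");

// Today row four is 20x smaller than row five; after doubling it is 10x.
static_assert(kRow5Fourth == 20 * kRow4Fourth, "current ratio, fourth value");
static_assert(kRow5Fifth  == 20 * kRow4Fifth,  "current ratio, fifth value");
static_assert(kRow5Fourth == 10 * Doubled(kRow4Fourth), "new ratio, fourth value");
static_assert(kRow5Fifth  == 10 * Doubled(kRow4Fifth),  "new ratio, fifth value");
```

If any of the numbers were mistyped when patching the table, one of these assertions would fail to compile.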

> Thanks,
> Tom
