Matthew Tovbin matthew at tovbin.com
Tue Aug 21 23:05:22 EDT 2012



On Tue, Aug 21, 2012 at 7:29 PM, David Yu <david.yu.ftw at gmail.com> wrote:

> On Wed, Aug 22, 2012 at 5:33 AM, Alexander Sicular <siculars at gmail.com> wrote:
>> I was in the Riak 1.2 webinar earlier today and asked a leveldb question
>> about insertion order and durability vs. bitcask's WOL architecture. Joe
>> was not able to get to my question then but took the time to write me a
>> detailed answer. Great engineers at Basho taking time to answer questions
>> is a great thing. Thanks Joe!
>> -Alexander Sicular
>> @siculars
>> Begin forwarded message:
>> *From: *Joseph Blomstedt <joe at basho.com>
>> *Subject: **LevelDB*
>> *Date: *August 21, 2012 3:45:45 PM EDT
>> *To: *siculars at gmail.com
>> Alexander,
>> I noticed your LevelDB question in the webinar as Reem was closing
>> things out, so I figured I'd follow up via email.
>> As you know, Bitcask maintains a strict set of write-logs and an
>> in-memory hash table that maps keys to (file, offset). Pretty
>> straightforward. Compaction is a separate thing that happens based on
>> independent triggers.
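To make the Bitcask description above concrete, here is a minimal sketch of an append-only log with an in-memory key directory. It is a toy model, not Bitcask's code: the class name `TinyBitcask`, the use of an in-memory `BytesIO` in place of a real file, and storing the value size next to (file, offset) are all assumptions made for illustration.

```python
import io


class TinyBitcask:
    """Toy model: append-only logs plus an in-memory key directory."""

    def __init__(self):
        # file_id -> log contents (kept in memory for this sketch)
        self.files = {0: io.BytesIO()}
        self.active = 0
        # key -> (file_id, offset, size); the size field is an assumption
        self.keydir = {}

    def put(self, key, value: bytes):
        log = self.files[self.active]
        offset = log.seek(0, io.SEEK_END)  # always append at the end
        log.write(value)
        # Older copies stay in the log; only the directory entry moves.
        self.keydir[key] = (self.active, offset, len(value))

    def get(self, key):
        file_id, offset, size = self.keydir[key]
        log = self.files[file_id]
        log.seek(offset)
        return log.read(size)
```

Overwriting a key only appends a new record and repoints the directory entry, which is why stale records pile up until a separate compaction pass removes them.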
>> LevelDB is rather different. LevelDB does maintain a WAL, but it's
>> short-lived and only for crash recovery. LevelDB writes to the WAL,
>> but also keeps the object in an in-memory write buffer (configurable
>> size, increased in Riak 1.2 by 10x from Riak 1.1). After the buffer
>> becomes full, LevelDB writes the data to disk as a Level-0 SST (data
>> in sorted order + sorted index at the end of the file).
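The write path described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not LevelDB's implementation: `TinyWritePath` and `buffer_limit` are hypothetical names, the buffer limit counts entries rather than bytes, and plain Python lists stand in for the on-disk log and SST files.

```python
class TinyWritePath:
    """Toy model of: append to a log, buffer in memory, flush sorted."""

    def __init__(self, buffer_limit=3):
        self.buffer_limit = buffer_limit  # entries, standing in for bytes
        self.wal = []        # crash-recovery log of (key, value) records
        self.memtable = {}   # in-memory write buffer
        self.level0 = []     # flushed files, oldest first

    def put(self, key, value):
        self.wal.append((key, value))   # log first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # A Level-0 file holds the buffered data in sorted key order.
        self.level0.append(sorted(self.memtable.items()))
        self.memtable = {}
        # Once the data is in a file, this sketch drops the log.
        self.wal = []
```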
>> There can be multiple Level-0 SSTs. To read a key, LevelDB looks at
>> the index in each SST starting from newest file to oldest. For
>> performance, there's an LRU cache of indexes so you're not always
>> hitting disk. LevelDB now also includes bloom filters (used in Riak
>> 1.2) to make it easier to skip non-interesting SSTs.
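As a rough illustration of the read path above, the sketch below searches Level-0 files from newest to oldest and uses a small bloom filter to skip files that cannot contain the key. Everything here is an assumption for illustration (the `Bloom` class, its size and hash choice, and the dict-based "SST"); LevelDB's real filter and file format differ.

```python
import hashlib


class Bloom:
    """Tiny bloom filter: may give false positives, never false negatives."""

    def __init__(self, nbits=256, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def make_sst(pairs):
    """Build a toy SST: the data plus a bloom filter over its keys."""
    data = dict(pairs)
    bloom = Bloom()
    for key in data:
        bloom.add(key)
    return {"data": data, "bloom": bloom}


def get(level0, key):
    """level0 is ordered oldest -> newest; search newest first."""
    for sst in reversed(level0):
        if not sst["bloom"].might_contain(key):
            continue  # definitely not in this file, skip the lookup
        if key in sst["data"]:
            return sst["data"][key]
    return None
```

The filter only saves work; correctness still comes from the lookup in the file itself, so a false positive costs one wasted check and nothing more.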
>> To make things more efficient, LevelDB does compaction/merging in a
>> background thread. A set of Level-0 files will be selected and merged
>> together into a larger Level-1 file. The format is the same, but the
>> file is now larger and includes the data from multiple Level-0 files.
>> The original Level-0 files are then removed. Likewise, Level-1 files
>> are merged into Level-2 files, and Level-2 into Level-3, etc., with
>> each level having larger files that hold a greater chunk of adjacent,
>> sorted data.
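The merge step described above amounts to a k-way merge of sorted files in which the newest value for each key wins. A minimal sketch, assuming each file is a sorted list of (key, value) pairs with unique keys and that the input list is ordered oldest to newest; the function name `compact` is hypothetical, and real compaction streams from disk rather than building lists in memory.

```python
import heapq


def compact(ssts):
    """Merge sorted (key, value) lists, oldest -> newest, into one list."""
    # Tag each record with -age so that, for equal keys, the record from
    # the newest file sorts first in the merged stream.
    tagged = [
        [(key, -age, value) for key, value in sst]
        for age, sst in enumerate(ssts)
    ]
    out = []
    for key, _, value in heapq.merge(*tagged):
        if not out or out[-1][0] != key:
            out.append((key, value))  # first record seen per key is newest
    return out
```

The output has the same shape as the inputs, only larger, so the same routine can in principle be applied again to merge one level into the next.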
>> To read, you check newest to oldest on Level 0, then Level 1, then Level
>> 2, etc.
>> While compaction is a background thing, LevelDB limits the number of
>> Level-0 files you can have. If you hit the limit, LevelDB will block
>> writes until files have been merged into Level-1. With a single
>> compaction thread, it was easy to max out LevelDB in Riak 1.1, and
>> these stalls were fairly frequent and hurt 95th-percentile-and-up
>> latencies, as well as greatly hurting throughput. Our change to use
>> multiple compaction threads has greatly improved how quickly
>> compaction occurs, and
>> writes rarely (if ever) end up stalling. To further improve things,
>> there's the adaptive write throttling that I mentioned that will slow
>> down writes (increased latency) in order to ensure compaction isn't
>> heavily affected and remains ahead of write traffic -- thus, further
>> preventing stalls. The net effect is somewhat higher latency and
>> lower throughput, but both are more consistent (i.e., the
>> 95th-percentile-and-up latencies sit closer to the average latency).
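The difference between a hard stall and throttling can be sketched as a policy on the number of Level-0 files. This is not Basho's adaptive algorithm, which the message does not spell out; the function `write_delay_ms` and the thresholds `slowdown_at`, `stop_at`, and `step_ms` are hypothetical values chosen only to show the shape of the idea.

```python
def write_delay_ms(level0_files, slowdown_at=8, stop_at=12, step_ms=1):
    """Return a delay to add before a write, or None if writes must block.

    All thresholds are illustrative assumptions, not real settings.
    """
    if level0_files >= stop_at:
        return None  # hard stall: wait until compaction frees Level-0
    if level0_files >= slowdown_at:
        # Slow writers down progressively so compaction can stay ahead.
        return (level0_files - slowdown_at + 1) * step_ms
    return 0  # compaction is keeping up, no delay
```

A small delay on every write while the backlog is growing trades some average latency for a much lower chance of reaching the blocking case.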
>> I hope this answers your question.
>> -Joe
>> Thanks for sharing!
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> --
> When the cat is away, the mouse is alone.
> - David Yu