Parameter Planning (eleveldb)

Matthew Von-Maszewski matthewv at basho.com
Tue Feb 5 08:10:31 EST 2013


30,000:  So that you never have to think about it again.

Matthew


On Feb 5, 2013, at 3:54, Simon Effenberg <seffenberg at team.mobile.de> wrote:

> Hey Matthew,
> 
> thank you very much!
> 
> I know that ulimit -n isn't only for normal file FDs, but we'll
> probably have no more than 4 client connections per server, and with 6
> servers in a cluster probably not that many interconnection links
> either (but correct me if I'm wrong), so I would certainly use a bigger
> ulimit, but 30,000? :)
> 
> Cheers,
> Simon
> 
> On Mon, 4 Feb 2013 08:47:21 -0500
> Matthew Von-Maszewski <matthewv at basho.com> wrote:
> 
>> - file handles / ulimit -n:  Linux "file handles" are not just about open files.  The setting includes open network sockets too.  You do NOT want to set this close to your expected number of files.  You want to set it to a multiple of your expected files.  I use 30,000 or 60,000 depending upon the machine.  In fact, those high numbers are normal for heavy server workloads.  Remember that the default Linux settings are chosen with a laptop in mind.
>> 
>> - disk scheduler:  "cfq" is really bad for servers, great for laptops.  Whether you use "noop" or "deadline", you are far better off than the Linux default.  I have never personally tested the difference between noop and deadline.  Different people have told me they found each of them best for spinning hard drives.  However, there seems to be more on-line discussion in recent months favoring deadline for spinning disks and noop (plus other settings) for SSDs.  Again, I feel your biggest gain is in not using "cfq".  The rest is testing and tuning.
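
As an illustration only, here is a minimal Python sketch of how the active scheduler can be checked and switched at runtime; the device name "sda" is an assumption, root privileges are required, and the change does not survive a reboot:

    # Sketch: inspect and change the I/O scheduler for one disk.
    # "sda" is an assumed device name; run as root.
    path = "/sys/block/sda/queue/scheduler"

    with open(path) as f:
        print(f.read().strip())   # e.g. "noop deadline [cfq]"; brackets mark the active scheduler

    with open(path, "w") as f:
        f.write("deadline")       # or "noop"; the point is to move away from cfq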
>> 
>> 
>> 
>> On Feb 4, 2013, at 2:18 AM, Simon Effenberg <seffenberg at team.mobile.de> wrote:
>> 
>>> Thanks again Matthew,
>>> 
>>> I think for now I can start with this. I'll have 256/6 partitions per
>>> node, and if one node dies the others have to handle its partitions,
>>> so 256/5 per node multiplied by 92 gives 4711 as the ulimit setting, right?
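
For illustration, that arithmetic as a small Python sketch, using the values from this thread; the leveldb files alone come to roughly 4,711, and the sockets and log files Matthew mentions sit on top of that, which is where the much larger 30,000 figure comes from:

    import math

    ring_size      = 256
    nodes          = 6
    max_open_files = 92    # per-vnode value derived further down in the quoted thread

    # Assume one node is down, so its partitions are spread over 5 survivors:
    vnodes_per_node = ring_size / (nodes - 1)               # 51.2
    leveldb_files   = math.ceil(vnodes_per_node * max_open_files)

    print(leveldb_files)   # 4711 -- file handles for leveldb only; sockets etc. come on top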
>>> 
>>> One question that comes to mind: the Tips & Tricks section talks about
>>> the noop elevator/disk scheduler, whereas this post:
>>> http://riak-users.197444.n3.nabble.com/Riak-performance-problems-when-LevelDB-database-grows-beyond-16GB-tp4025608p4025622.html
>>> is talking about deadline for spinning disks (which is what we will have). So
>>> which one is right, and which is outdated?
>>> 
>>> Thanks again for your help!!
>>> 
>>> Cheers,
>>> Simon
>>> 
>>> On Sun, 3 Feb 2013 17:55:38 -0500
>>> Matthew Von-Maszewski <matthewv at basho.com> wrote:
>>> 
>>>> I will assume you use the default write buffer settings … because that is a whole different discussion, and there are two settings, not one (_min and _max).
>>>> 
>>>> The default min is 32M and the default max is 64M … so your value is 48M for the average_write_buffer_size.
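
Put as a tiny sketch (the parameter names are spelled out here only for readability):

    # Average of the default write buffer bounds:
    write_buffer_size_min = 32 * 2**20
    write_buffer_size_max = 64 * 2**20
    average_write_buffer_size = (write_buffer_size_min + write_buffer_size_max) / 2
    print(average_write_buffer_size / 2**20)   # 48.0 MB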
>>>> 
>>>> The Super Bowl is about to start here, so I am not performing a detailed check of your math.  However, I can say that 92 open files looks correct compared to similar systems.
>>>> 
>>>> What questions remain?
>>>> 
>>>> Matthew
>>>> 
>>>> 
>>>> On Feb 3, 2013, at 5:44 PM, Simon Effenberg <seffenberg at team.mobile.de> wrote:
>>>> 
>>>>> Hi Matthew,
>>>>> 
>>>>> thanks a lot!
>>>>> 
>>>>> So now I have:
>>>>> 
>>>>> 6 nodes each having 32GB RAM:
>>>>> 
>>>>> vnode_working_memory = 16GB / (256/6) (50% of RAM divided by the number
>>>>> of vnodes per node) = 390 MB
>>>>> 
>>>>> open_file_memory =
>>>>> (max_open_files-10) * (
>>>>>  184 + (104MB/2048) *
>>>>>  (8 + ((16+14336)/2048 +1) *
>>>>>  0.6
>>>>> ))
>>>>> 
>>>>> Now I'm missing the max_open_files .. how to calculate it?
>>>>> I'm missing also average_write_buffer_size (see my question in Step 4).
>>>>> 
>>>>> If I use the default values for average_write_buffer_size, then
>>>>> max_open_files can be calculated like this:
>>>>> 
>>>>> memory/vnode = average_write_buffer_size + cache_size +
>>>>> open_file_memory + 20 MB
>>>>> <=> (memory/vnode) - 20 MB - cache_size - average_write_buffer_size =
>>>>>  open_file_memory
>>>>> 
>>>>> so with the default values:
>>>>> 
>>>>> open_file_memory = 390MB - 20MB - 8MB - 45MB = 317MB
>>>>> 
>>>>> and now max_open_files would be
>>>>> 
>>>>> open_file_memory = (max_open_files-10) * (184 + (104MB/2048) * (8 + ((16
>>>>>                 +14336)/2048 +1) * 0.6 ))
>>>>> <=> (max_open_files-10) = open_file_memory / (184 + (104MB/2048) * (8 +
>>>>>                        ((16+14336)/2048 +1) * 0.6 ))
>>>>> <=> max_open_files = open_file_memory / (184 + (104MB/2048) * (8 +
>>>>>                   ((16+14336)/2048 +1) * 0.6 )) + 10
>>>>> <=> max_open_files = 317MB / (184+53248*(8+67.8)) + 10
>>>>> <=> max_open_files = 317MB / 4036382.4 + 10
>>>>> <=> max_open_files ~= 92
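
The same derivation, restated as a short Python sketch; it simply mirrors the numbers plugged in above (including the per-file cost of roughly 4,036,382 bytes) rather than re-deriving them:

    vnode_memory              = 390 * 2**20   # step 2 result
    average_write_buffer_size =  45 * 2**20
    cache_size                =   8 * 2**20
    overhead                  =  20 * 2**20

    open_file_memory = vnode_memory - overhead - cache_size - average_write_buffer_size
    per_file_cost    = 4036382.4               # 184 + (104MB/2048) * (8 + 67.8), as above

    max_open_files = open_file_memory / per_file_cost + 10
    print(round(max_open_files))               # ~92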
>>>>> 
>>>>> That would be the maximum number of open files a server can handle
>>>>> (per vnode), am I right? But is this enough? How do I account for a
>>>>> temporary loss of 50% of the servers (3 of 6), and how is the number
>>>>> of keys/values taken into account? I'm somewhat lost :(
>>>>> 
>>>>> Cheers
>>>>> Simon
>>>>> 
>>>>> On Sun, 3 Feb 2013 16:12:25 -0500
>>>>> Matthew Von-Maszewski <matthewv at basho.com> wrote:
>>>>> 
>>>>>> First:  Step 2 is talking about how many vnodes exist on a physical server.  If your ring size is 256, but you have 8 servers … then your vnode count for step 2 is 32.
>>>>>> 
>>>>>> Second:  the 2048 is a constant forced by Google's leveldb implementation.  It is the portion of a file covered by a single bloom filter.  This calculation constant disappears with the upcoming 1.3 release.
>>>>>> 
>>>>>> Third:  yes there is a "block_size" parameter that is 4096.  Increase that only if you want to REDUCE the performance of the leveldb instance.  4096 is a very happy value.  We have customers and tests with 130K data values, all using 4096 block size.  The block_size only governs the minimum write unit (the aggregate size of small values that must be written together as one block).
>>>>>> 
>>>>>> Use 104Mbyte for your average sst file size.  It is "good enough".
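
Two of those points, restated with this thread's numbers as a sketch:

    # Step 2 counts vnodes per physical server, not the whole ring:
    ring_size, servers = 256, 8
    vnodes_per_server  = ring_size / servers            # 32

    # The 2048 constant is the portion of a file (in bytes) covered by one bloom filter:
    average_sst_filesize = 104 * 2**20
    filter_intervals     = average_sst_filesize / 2048  # 53248, as used in Simon's math
    print(vnodes_per_server, filter_intervals)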
>>>>>> 
>>>>>> 
>>>>>> I am not following your questions for Step 4 and beyond.  Please state them again.
>>>>>> 
>>>>>> Matthew
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Feb 3, 2013, at 3:44 PM, Simon Effenberg <seffenberg at team.mobile.de> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm not sure I understand all of this well enough to calculate the
>>>>>>> memory usage per file and the other values.
>>>>>>> 
>>>>>>> The web page walks me through some steps, but I'm completely unsure whether I understand all of the parameters.
>>>>>>> 
>>>>>>> "Step 1: Calculate Available Working Memory"
>>>>>>> 
>>>>>>> taking the example:
>>>>>>> 
>>>>>>> leveldb_working_memory = 32G * (1 - .50) = 16G
>>>>>>> 
>>>>>>> "Step 2: Calculate Working Memory per vnode"
>>>>>>> 
>>>>>>> vnode_working_memory = leveldb_working_memory / vnode_count
>>>>>>> 
>>>>>>> vnode_count = 256
>>>>>>> 
>>>>>>> => vnode_working_memory = 16G / 256 = 64MB/vnode
>>>>>>> 
>>>>>>> also easy
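
As a sketch, those two steps in Python:

    total_ram              = 32 * 2**30
    leveldb_working_memory = total_ram * (1 - 0.50)    # step 1: 16 GB
    vnode_count            = 256                       # vnodes on this server, per the example
    vnode_working_memory   = leveldb_working_memory / vnode_count
    print(vnode_working_memory / 2**20)                # step 2: 64.0 MB per vnode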
>>>>>>> 
>>>>>>> "Step 3: Estimate Memory Used by Open Files"
>>>>>>> 
>>>>>>> open_file_memory =
>>>>>>> (max_open_files-10) * (
>>>>>>>  184 + (average_sst_filesize/2048) *
>>>>>>>  (8 + ((average_key_size+average_value_size)/2048 +1) *
>>>>>>>  0.6
>>>>>>> ))
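
One way to read that estimate as code (a sketch; the grouping of the trailing 0.6 follows the parenthesization above, so double-check it against the docs):

    def open_file_memory(max_open_files, average_sst_filesize,
                         average_key_size, average_value_size):
        # Memory held per open .sst file beyond the first 10:
        per_file = 184 + (average_sst_filesize / 2048) * (
            8 + ((average_key_size + average_value_size) / 2048 + 1) * 0.6
        )
        return (max_open_files - 10) * per_file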
>>>>>>> 
>>>>>>> So how do I determine the average_sst_filesize (and what exactly is this
>>>>>>> value)? Is 2048 correct for both /2048 divisors, or is it 4096 in Riak 1.2?
>>>>>>> And how do I determine max_open_files?
>>>>>>> 
>>>>>>> 
>>>>>>> average_key_size could be 16 bytes (I have to ask someone, but I'll take that for now)
>>>>>>> average_value_size will be 14 kbyte
>>>>>>> 
>>>>>>> so for now
>>>>>>> 
>>>>>>> open_file_memory =
>>>>>>> (max_open_files-10) * (
>>>>>>>  184 + (average_sst_filesize/2048) *
>>>>>>>  (8 + ((16+14336)/2048 +1) *
>>>>>>>  0.6
>>>>>>> ))
>>>>>>> 
>>>>>>> (Side question: should I increase the block_size because of the large average value size?
>>>>>>> And should I leave the cache_size at the default value, as recommended?)
>>>>>>> 
>>>>>>> "Step 4: Calculate Average Write Buffer"
>>>>>>> 
>>>>>>> Should I increase these values or not? If only two write buffers are held in memory and I have, for
>>>>>>> example, 32GB of RAM like in this scenario, shouldn't I increase them?
>>>>>>> 
>>>>>>> "Step 5: Calculate vnode Memory Used"
>>>>>>> 
>>>>>>> memory/vnode = average_write_buffer_size + cache_size + open_file_memory + 20 MB
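
Or, as a one-line sketch that can be combined with the step 3 function above:

    def vnode_memory(average_write_buffer_size, cache_size, open_file_memory):
        # 20 MB covers the remaining per-vnode overhead, per step 5.
        return average_write_buffer_size + cache_size + open_file_memory + 20 * 2**20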
>>>>>>> 
>>>>>>> So for now I'm missing almost all 3 values :(.
>>>>>>> 
>>>>>>> To get an idea:
>>>>>>> 
>>>>>>> - 3 buckets
>>>>>>> - overall ~ 343347732 keys (but only 2/3 have 14 kbyte on average)
>>>>>>> 
>>>>>>> 
>>>>>>> Thx for help!
>>>>>>> Simon
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> riak-users mailing list
>>>>>>> riak-users at lists.basho.com
>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> Fon:     + 49-(0)30-8109 - 7173
> Fax:     + 49-(0)30-8109 - 7131
> 
> Mail:     seffenberg at team.mobile.de
> Web:    www.mobile.de
> 
> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> 
> 
> Geschäftsführer: Malte Krüger
> HRB Nr.: 18517 P, Amtsgericht Potsdam
> Sitz der Gesellschaft: Kleinmachnow 



