Issues with capacity planning pages on wiki

Nico Meyer nico.meyer at adition.com
Wed May 25 12:22:53 EDT 2011


Hi Anthony,

I think I can explain at least a big chunk of the difference in RAM and 
disk consumption you see.

Let's start with RAM. I could of course be wrong here, but I believe the 
'static bitcask per key overhead' is simply too small. Let me 
explain why.
The bitcask_keydir_entry struct for each entry looks like this:

typedef struct
{
     uint32_t file_id;   /* 4 bytes */
     uint32_t total_sz;  /* 4 bytes */
     uint64_t offset;    /* 8 bytes */
     uint32_t tstamp;    /* 4 bytes */
     uint16_t key_sz;    /* 2 bytes */
     char     key[0];    /* the serialized key follows directly */
} bitcask_keydir_entry;


This does indeed have a size of 22 bytes (the array 'key' has zero entries 
because the key is written to the memory directly after the 
keydir entry).
As is done in the capacity planner, you need to add the size of the 
bucket and key to get the size of the keydir entry, but that is not the 
whole story.

The thing that is actually stored in key is the result of this Erlang 
expression:

    erlang:term_to_binary( {<<"bucket">>,<<"key">>} )

that is, a tuple of two binaries converted to the Erlang external term 
format.

So let's see:

1>  term_to_binary({<<>>,<<>>}).
<<131,104,2,109,0,0,0,0,109,0,0,0,0>>
2>  iolist_size(term_to_binary({<<>>,<<>>})).
13
3>  iolist_size(term_to_binary({<<"a">>,<<"b">>})).
15
4>  iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
16
5>  iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
17

So even an empty bucket/key pair takes 13 bytes to store.
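
Those 13 bytes are just the envelope of the external term format; here is 
the empty pair from above, annotated byte by byte (the annotation is mine):

     <<131,             % external term format version tag
       104, 2,          % small tuple tag, arity 2
       109, 0,0,0,0,    % binary tag + 32 bit length (empty bucket)
       109, 0,0,0,0>>   % binary tag + 32 bit length (empty key)

The 5 bytes of header per binary are already included in the 13, which is 
why each additional byte of bucket or key adds exactly one byte in the 
shell output above.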

Also, since the hashtable storing the keydir entries is essentially an 
array of pointers to bitcask_keydir_entry objects, there is another 8 
bytes of overhead per key, assuming you are running a 64-bit system.

So the real static overhead per key is not 22 but 22 + 13 + 8 = 43 bytes.
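
As a sanity check, here is the same per-key arithmetic as a small Erlang 
snippet (my own helper, not something from the bitcask sources):

     %% Estimated keydir RAM per key on a 64-bit system:
     %% 22 byte struct + serialized {Bucket,Key} (13 bytes plus the
     %% bucket and key sizes) + 8 byte hash table pointer.
     KeydirBytes = fun(BucketSz, KeySz) -> 22 + 13 + BucketSz + KeySz + 8 end.

     %% KeydirBytes(10, 36) returns 89, which is the ( 43 + 10 + 36 ) used below.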

Let's run the numbers for your predicted memory consumption again:

   ( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 = 45.7 GB


Your actual RAM consumption of 70 GB seems to be at odds with the output 
of erlang:memory/0 that you sent:

{total,7281790968} =>   RAM: 7281790968 * 8 = 54.3 GB


So that is much closer, within about 20 percent. Some additional 
overhead is to be expected, but it is hard to say how much of that is 
due to Erlang's internal usage and how much is due to bitcask.

So let's examine the disk consumption next.
As you rightly concluded, the equation at 
http://wiki.basho.com/Cluster-Capacity-Planning.html is somewhat 
simplified, and you are also right that the real equation would be

( 14 + Key + Value ) * Num Entries * N_Val

On the other hand, 14 bytes plus the key size might be fairly irrelevant if 
your values are at least 2 KB in size (as in the example), which seems to 
be the general assumption in some aspects of the design of riak and bitcask.
As you also noticed, this additional small overhead brings you nowhere 
near the disk usage that you observe.

First, the key that is stored in the bitcask files is not the key part 
of the bucket/key pair that riak calls a key, but the serialized 
bucket/key pair described above, so the calculation becomes:

( 14 + ( 13 + Bucket + Key) + Value ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + 36 ) * 183915891 * 3 = 56 GB

Still not enough :-/.
So next let's examine what is actually stored as the value in bitcask. It 
is not simply the data you provide, but a riak object (an r_object record), 
which is again serialized with erlang:term_to_binary/1. So 
let's see. I create a new riak object with a zero-byte bucket, key and value:

3>  Obj = riak_object:new(<<>>,<<>>,<<>>).     
{r_object,<<>>,<<>>,
           [{r_content,{dict,0,16,16,8,80,48,
                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                             {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
                       <<>>}],
           [],
           {dict,1,16,16,8,80,48,
                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
           undefined}
4>  iolist_size(erlang:term_to_binary(Obj)).
205

Also, bucket and key are contained in the riak object itself (and 
therefore in what bitcask considers the value). So with this information 
the predicted disk usage becomes:

( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 205 + 10 + 36 ) ) * 183915891 * 3 = 166.5 GB

which is way closer to the 341 GB you observe.

But we can get even closer, although the details become somewhat fuzzier. 
Bear with me.
I again create a riak object, but this time with a non-empty bucket/key 
pair so I can store it in riak:

(ctag at 172.20.1.31)7>  Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
{r_object,<<"a">>,<<"a">>,
           [{r_content,{dict,0,16,16,8,80,48,
                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                             {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
                       <<>>}],
           [],
           {dict,1,16,16,8,80,48,
                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
           undefined}

(ctag at 172.20.1.31)8>  iolist_size(erlang:term_to_binary(Obj)).
207

(ctag at 172.20.1.31)9>  {ok,C}=riak:local_client().
{ok,{riak_client,'ctag at 172.20.1.31',<<2,123,179,255>>}}
(ctag at 172.20.1.31)10>  C:put(Obj,1,1).           
ok

(ctag at 172.20.1.31)12>  {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
          [{r_content,{dict,2,16,16,8,80,48,
			 {[],[],[],[],[],[],[],[],[],[],[],[],...},
                          {{[],[],[],[],[],[],[],[],[],[],...}}},
                      <<>>}],
               [{<<2,123,179,255>>,{1,63473554112}}],
               {dict,1,16,16,8,80,48,
                     {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                     {{[],[],[],[],[],[],[],[],[],[],[],...}}},
                undefined}}
(ctag at 172.20.1.31)13>  iolist_size(erlang:term_to_binary(ObjStored)).
358



OK, so what happened? The object we retrieved is considerably larger than 
the one we stored. One culprit is the vector clock data, which was an 
empty list for Obj and now has one entry:

(ctag at 172.20.1.31)14>  riak_object:vclock(Obj).
[]
(ctag at 172.20.1.31)15>  riak_object:vclock(ObjStored).
[{<<2,123,179,255>>,{1,63473554112}}]
(ctag at 172.20.1.31)23>  iolist_size(term_to_binary(riak_object:vclock(Obj))).     
2
(ctag at 172.20.1.31)24>  iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
30
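
Just to make the per-entry cost explicit (the subtraction is mine; the 
sizes are from the output above):

     iolist_size(term_to_binary(riak_object:vclock(ObjStored)))
       - iolist_size(term_to_binary(riak_object:vclock(Obj))).
     %% => 28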

So that's 28 bytes each time the object is updated with a new client ID 
(so always use a meaningful client ID!), until the vclock pruning sets 
in. The default bucket property is {big_vclock,50}, so in the worst case 
this could account for 28*50 = 1400 bytes!
But every object that has been stored has at least one entry in 
the vclock, so that is another 28 bytes of overhead.

The other part of the growth stems from some standard entries, which are 
added to the object metadata during the put operation:

(ctag at 172.20.1.31)35>  dict:to_list(riak_object:get_metadata(Obj)).
[]
(ctag at 172.20.1.31)37>  iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
60

(ctag at 172.20.1.31)36>  dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
  {<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
(ctag at 172.20.1.31)38>  iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
183

So there are the other 123 bytes (183 - 60).

In total this gives 356 bytes of overhead per object (the 358 bytes above 
minus the 2 bytes of bucket and key, which are already accounted for; 
equivalently 205 + 28 + 123), which leads us to the following calculation:

( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 356 + 10 + 36 ) ) * 183915891 * 3 = 244 GB


We are getting closer!
If you loaded the data via the REST API, the overhead is somewhat larger 
still, since the object will also contain 'content-type', 'X-Riak-Meta' 
and 'Links' metadata entries:

xxxx at node2:~$ curl -v -d '' -H "Content-Type: text/plain" http://127.0.0.1:8098/riak/a/a


(ctag at 172.20.1.31)44>  {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
               [{r_content,{dict,5,16,16,8,80,48,
                                 {[],[],[],[],[],[],[],[],[],[],[],[],...},
                                 {{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
                           <<>>}],
               [{<<5,134,53,93>>,{1,63473557230}}],
               {dict,1,16,16,8,80,48,
                     {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                     {{[],[],[],[],[],[],[],[],[],[],[],...}}},
               undefined}}
(ctag at 172.20.1.31)45>  dict:to_list(riak_object:get_metadata(ObjStored)).              
[{<<"Links">>,[]},
  {<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
  {<<"content-type">>,"text/plain"},
  {<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
  {<<"X-Riak-Meta">>,[]}]

(ctag at 172.20.1.31)46>  iolist_size(erlang:term_to_binary(ObjStored)).
449


Which leads to (remembering again to subtract the 2 bytes for bucket and key):

( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 447 + 10 + 36 ) ) * 183915891 * 3 = 290.8 GB


Nearly there!

Now there are also the hintfiles, which are a kind of index into the 
bitcask data files used to speed up the start of a riak node. The hintfiles 
contain one entry per key, and the code that creates one entry looks like 
this:

     [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
      <<TotalSz:?TOTALSIZEFIELD>>,<<Offset:?OFFSETFIELD>>, Key].


So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key.
So the final result, if you inserted the keys via the REST API, is:

( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 505 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val

( 505 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 374636669967 = 348.9 GB


And if you used Erlang (or probably any Protocol Buffers client):

( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 414 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val

( 414 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 324427631724 = 302.1 GB
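
Putting the whole disk-side estimate in one place, here is a small Erlang 
sketch of the formulas above (my own helper; the per-object overhead is a 
parameter, since it differs between the REST and Erlang/Protocol Buffers 
cases):

     %% Estimated bitcask disk usage. ObjOverhead is 447 for objects written
     %% via the REST API and 356 for the Erlang/Protocol Buffers case.
     DiskBytes = fun(ObjOverhead, Bucket, Key, Value, NumEntries, NVal) ->
         DataEntry = 14 + (13 + Bucket + Key) + (ObjOverhead + Bucket + Key + Value),
         HintEntry = 18 + (13 + Bucket + Key),
         (DataEntry + HintEntry) * NumEntries * NVal
     end.

     %% DiskBytes(447, 10, 36, 36, 183915891, 3) -> 374636669967 (~348.9 GB)
     %% DiskBytes(356, 10, 36, 36, 183915891, 3) -> 324427631724 (~302.1 GB)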


So the truth is somewhere in between. But as David wrote, there can be 
additional overhead due to the append-only nature of bitcask.

Cheers,
Nico

On 24.05.2011 23:48, Anthony Molinaro wrote:
> Just curious if anyone has any ideas, for the moment, I'm just taking
> the RAM calculation and multiplying by 2 and the Disk calculation and
> multiplying by 8, based on my findings with my current cluster.  But
> I would like to know why my values are so much higher than those I should
> be getting.
>
> Also, I'd still like to know how the forms calculate things as the disk
> calculation there does not match reality or the formula.
>
> Also, waiting to hear if there is any way to force merge to run so I can
> more accurately gauge whether multiple copies are effecting disk usage.
>
> Thanks,
>
> -Anthony
>
> On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
>> On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
>>> On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
>>>> On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
>>>> Thus, depending on
>>>> your merge triggers, more space can be used than is strictly necessary
>>>> to store the data.
>>> So the lack of any overhead in the calculation is expected?  I mean
>>> according to http://wiki.basho.com/Cluster-Capacity-Planning.html
>>>
>>> Disk = Estimated Total Objects * Average Object Size * n_val
>>>
>>> Which just seems wrong, doesn't it?  I don't quite understand the
>>> bitcask code well enough yet to see what the actual data it stores is,
>>> but the whitepaper suggested several things were involved in the on
>>> disk representation.
>> Okay, finally found the code for this part, I kept looking in the nif
>> but that's only the keydir, not the data files.  It looks like
>>
>>     %% Setup io_list for writing -- avoid merging binaries if we can help it
>>     Bytes0 = [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
>>               <<ValueSz:?VALSIZEFIELD>>, Key, Value],
>>     Bytes  = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>>  | Bytes0],
>>
>> And looking at the header, it seems that there's 14 bytes of overhead
>> (4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).
>>
>> So disk calculation should be
>>
>> ( 14 + Key + Value ) * Num Entries * N_Val
>>
>> So using my numbers from before that gives
>>
>> ( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB
>>
>> which actually isn't much closer to 341 GB than the previous calculation :(
>>
>> So all my questions from the previous email still apply.
>>
>> -Anthony
>>
>> -- 
>> ------------------------------------------------------------------------
>> Anthony Molinaro<anthonym at alumni.caltech.edu>
