Issues with capacity planning pages on wiki

Jonathan Langevin jlangevin at loomlearning.com
Wed May 25 12:55:14 EDT 2011


That was one hell of a response. You need to post that as a Wiki article or
such, after all that work :-O

Jonathan Langevin
Systems Administrator
Loom Inc.
Wilmington, NC: (910) 241-0433 - jlangevin at loomlearning.com -
www.loomlearning.com - Skype: intel352


On Wed, May 25, 2011 at 12:22 PM, Nico Meyer <nico.meyer at adition.com> wrote:

>  Hi Anthony,
>
> I think I can explain at least a big chunk of the difference in RAM and
> disk consumption you see.
>
> Let's start with RAM. I could of course be wrong here, but I believe the 'static
> bitcask per key overhead' is simply too small. Let me explain why.
> The bitcask_keydir_entry struct for each entry looks like this:
>
> typedef struct
> {
>     uint32_t file_id;
>     uint32_t total_sz;
>     uint64_t offset;
>     uint32_t tstamp;
>     uint16_t key_sz;
>     char     key[0];
> } bitcask_keydir_entry;
>
>
> This does indeed have a size of 22 bytes (the array 'key' has zero entries
> because the key is written to the memory address directly after the keydir
> entry).
> As is done in the capacity planner, you need to add the size of the bucket
> and key to get the size of the keydir entry, but that is not the whole
> story.
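
As a quick cross-check of the 22-byte figure, the fixed-width fields of the struct above can be summed with Python's struct module. This is only a sketch: the "<" prefix assumes no alignment padding, and on common 64-bit ABIs a C compiler may round sizeof up to 24 unless the struct is packed. Whether bitcask packs it is not stated in this thread. The names PACKED and keydir_entry_bytes are made up for illustration.

```python
import struct

# Fixed-width fields of bitcask_keydir_entry, in declaration order:
# file_id (uint32), total_sz (uint32), offset (uint64),
# tstamp (uint32), key_sz (uint16).
# "<" selects standard sizes with no alignment padding (an assumption).
PACKED = struct.Struct("<IIQIH")


def keydir_entry_bytes(key_len: int) -> int:
    """Fixed fields plus the key bytes stored directly after them."""
    return PACKED.size + key_len


print(PACKED.size)               # size of the fixed fields
print(keydir_entry_bytes(10))    # entry size for a 10-byte stored key
```
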
>
> The thing that is actually stored in key is the result of this Erlang
> expression:
>
>    erlang:term_to_binary( {<<"bucket">>, <<"key">>} )
>
> that is, a tuple of two binaries converted to the Erlang external term
> format.
>
> So let's see:
>
> 1> term_to_binary({<<>>,<<>>}).
> <<131,104,2,109,0,0,0,0,109,0,0,0,0>>
> 2> iolist_size(term_to_binary({<<>>,<<>>})).
> 13
> 3> iolist_size(term_to_binary({<<"a">>,<<"b">>})).
> 15
> 4> iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
> 16
> 5> iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
> 17
>
> So even an empty bucket/key pair takes 13 bytes to store.
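
The 13 bytes follow directly from the external term format: one version byte, a two-byte small-tuple header, and a five-byte header per binary. A minimal Python sketch that hand-rolls just this one shape (a 2-tuple of binaries) reproduces the shell output above. The function name encode_bucket_key is made up for illustration and it is not a general term_to_binary.

```python
import struct


def encode_bucket_key(bucket: bytes, key: bytes) -> bytes:
    """Encode {Bucket, Key} (two binaries) in the Erlang external term format."""

    def binary_ext(data: bytes) -> bytes:
        # BINARY_EXT: tag 109, 4-byte big-endian length, then the raw bytes
        return struct.pack(">BI", 109, len(data)) + data

    # 131 = version byte, 104 = SMALL_TUPLE_EXT, 2 = tuple arity
    return bytes([131, 104, 2]) + binary_ext(bucket) + binary_ext(key)


print(list(encode_bucket_key(b"", b"")))
for bucket, key in ((b"", b""), (b"a", b"b"), (b"aa", b"b"), (b"aa", b"bb")):
    print(len(encode_bucket_key(bucket, key)))
```

So the serialized size is always 13 + Bucket + Key bytes for this shape.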
>
> Also, since the hashtable storing the keydir entries is essentially an
> array of pointers to bitcask_keydir_entry objects, there is another 8 bytes
> of overhead per key, assuming you are running a 64bit system.
>
> So the real static overhead per key is not 22 but 22+13+8 = 43 bytes.
>
> Let's run the numbers for your predicted memory consumption again:
>
>   ( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 = 45.7 GB
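
The same RAM estimate as a small Python sketch, with the three overhead components from above as defaults. The function name and parameter names are hypothetical, and "GB" here means 2^30 bytes, which is what the figures in this thread use.

```python
GIB = 2 ** 30


def keydir_ram_bytes(num_entries, bucket_len, key_len, n_val,
                     struct_bytes=22, term_bytes=13, pointer_bytes=8):
    """Estimated keydir RAM: per-key overhead plus bucket and key, times replicas."""
    per_key = struct_bytes + term_bytes + pointer_bytes + bucket_len + key_len
    return per_key * num_entries * n_val


total = keydir_ram_bytes(183915891, 10, 36, 3)
print(total, round(total / GIB, 1))
```
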
>
>
> Your actual RAM consumption of 70 GB seems to be at odds with the output of
> erlang:memory/0 that you sent:
>
> {total,7281790968} =>  RAM: 7281790968 * 8 = 54.3 GB
>
>
> So that is much closer, within about 20 percent. Some additional overhead
> is to be expected, but it is hard to say how much of that is due to Erlang's
> internal usage and how much is due to bitcask.
>
> So let's examine the disk consumption next.
> As you rightly concluded, the equation here
> http://wiki.basho.com/Cluster-Capacity-Planning.html is somewhat
> simplified, and you are also right that the real equation would be
>
> ( 14 + Key + Value ) * Num Entries * N_Val
>
> On the other hand 14 bytes + keysize might be quite irrelevant if your
> values have a size of at least 2KB (as in the example), which seems to be
> the general assumption in some aspects of the design of riak and bitcask.
> As you also noticed, this additional small overhead brings you nowhere near
> the disk usage that you observe.
>
> First, the key that is stored in the bitcask files is not the key part of
> the bucket/key pair that riak calls a key, but the serialized bucket/key
> pair described above, so the calculation becomes:
>
> ( 14 + ( 13 + Bucket + Key) + Value ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36) + 36 ) * 183915891 * 3 = 56 GB
>
> Still not enough :-/.
> So next let's examine what is actually stored as the value in bitcask. It is
> not simply the data you provide, but a riak object (r_object record) which
> is again serialized by the erlang:term_to_binary/1 function. So let's see. I
> create a new riak object with zero byte bucket, key and value:
>
> 3> Obj = riak_object:new(<<>>,<<>>,<<>>).
> {r_object,<<>>,<<>>,
>           [{r_content,{dict,0,16,16,8,80,48,
>                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                             {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>                       <<>>}],
>           [],
>           {dict,1,16,16,8,80,48,
>                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>           undefined}
> 4> iolist_size(erlang:term_to_binary(Obj)).
> 205
>
> Also, bucket and key are contained in the riak object itself (and
> therefore in the bitcask notion of the value). So with this information the
> predicted disk usage becomes:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 205 + 10 + 36 + 36 ) ) * 183915891 * 3 = 185.0 GB
>
> which is way closer to the 341 GB you observe.
>
> But we can get even closer, although the details become somewhat
> fuzzier. Bear with me.
> I again create a riak object, but this time with a non empty bucket/key so
> I can store it in riak:
>
> (ctag at 172.20.1.31)7> Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
> {r_object,<<"a">>,<<"a">>,
>           [{r_content,{dict,0,16,16,8,80,48,
>                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                             {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>                       <<>>}],
>           [],
>           {dict,1,16,16,8,80,48,
>                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>           undefined}
>
> (ctag at 172.20.1.31)8> iolist_size(erlang:term_to_binary(Obj)).
> 207
>
> (ctag at 172.20.1.31)9> {ok,C}=riak:local_client().
> {ok,{riak_client,'ctag at 172.20.1.31',<<2,123,179,255>>}}
> (ctag at 172.20.1.31)10> C:put(Obj,1,1).
> ok
>
> (ctag at 172.20.1.31)12> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
> {ok,{r_object,<<"a">>,<<"a">>,
>               [{r_content,{dict,2,16,16,8,80,48,
>                                 {[],[],[],[],[],[],[],[],[],[],[],[],...},
>                                 {{[],[],[],[],[],[],[],[],[],[],...}}},
>                           <<>>}],
>               [{<<2,123,179,255>>,{1,63473554112}}],
>               {dict,1,16,16,8,80,48,
>                     {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                     {{[],[],[],[],[],[],[],[],[],[],[],...}}},
>               undefined}}
> (ctag at 172.20.1.31)13> iolist_size(erlang:term_to_binary(ObjStored)).
> 358
>
> OK, what happened? The object we retrieved is considerably larger than the
> one we stored. One culprit is the vector clock data, which was an empty list
> for Obj, and now has one entry:
>
> (ctag at 172.20.1.31)14> riak_object:vclock(Obj).
> []
> (ctag at 172.20.1.31)15> riak_object:vclock(ObjStored).
> [{<<2,123,179,255>>,{1,63473554112}}]
> (ctag at 172.20.1.31)23> iolist_size(term_to_binary(riak_object:vclock(Obj))).
> 2
> (ctag at 172.20.1.31)24> iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
> 30
>
> So that's 28 bytes each time the object is updated with a new client ID (so
> always use a meaningful client ID!), until the vclock pruning sets in. The
> default bucket property is {big_vclock,50}, so in the worst case this could
> account for 28*50=1400 bytes!
> But each object that has been stored has at least one entry in the
> vclock, so that is another 28 bytes of overhead.
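
To see how the vclock size grows with more entries, here is a rough Python sketch of a tiny subset of the external term format (binaries, non-negative integers, tuples, and lists of non-integer terms). It is hand-written for illustration and not a full term_to_binary. The second client ID is invented, and real entries may differ slightly in size as counters and timestamps grow.

```python
import struct


def encode(term) -> bytes:
    """Encode a small subset of Erlang terms (without the leading version byte)."""
    if isinstance(term, bytes):
        return struct.pack(">BI", 109, len(term)) + term          # BINARY_EXT
    if isinstance(term, int) and term >= 0:
        if term <= 255:
            return bytes([97, term])                               # SMALL_INTEGER_EXT
        if term < 2 ** 31:
            return struct.pack(">Bi", 98, term)                    # INTEGER_EXT
        digits = term.to_bytes((term.bit_length() + 7) // 8, "little")
        return bytes([110, len(digits), 0]) + digits               # SMALL_BIG_EXT
    if isinstance(term, tuple):
        return bytes([104, len(term)]) + b"".join(map(encode, term))  # SMALL_TUPLE_EXT
    if isinstance(term, list):
        if not term:
            return bytes([106])                                    # NIL_EXT
        body = b"".join(map(encode, term))
        return struct.pack(">BI", 108, len(term)) + body + bytes([106])  # LIST_EXT + nil tail
    raise TypeError(f"unsupported term: {term!r}")


def term_size(term) -> int:
    return 1 + len(encode(term))  # +1 for the version byte (131)


entry = (bytes([2, 123, 179, 255]), (1, 63473554112))
other = (bytes([5, 134, 53, 93]), (1, 63473557230))   # hypothetical second client
print(term_size([]), term_size([entry]), term_size([entry, other]))
```

By this sketch the first entry costs 28 bytes (5 of them for the list header) and each further entry of the same shape costs 23, so the 1400-byte worst case is, if anything, a slight overestimate.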
>
> The other part of the growth stems from some standard entries, which are
> added to the object metadata during the put operation:
>
> (ctag at 172.20.1.31)35> dict:to_list(riak_object:get_metadata(Obj)).
> []
> (ctag at 172.20.1.31)37> iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
> 60
>
> (ctag at 172.20.1.31)36> dict:to_list(riak_object:get_metadata(ObjStored)).
> [{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
>  {<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
> (ctag at 172.20.1.31)38> iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
> 183
>
> So there are the other 123 bytes.
>
> In total, this 356-byte overhead per object (2 of the 358 bytes above came
> from the bucket and key, which are already accounted for) leads us to the
> following calculation:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 356 + 10 + 36 + 36 ) ) * 183915891 * 3 = 262.6 GB
>
>
> We are getting closer!
> If you loaded the data via the REST API the overhead is somewhat larger
> still, since the object will also contain 'content-type', 'X-Riak-Meta' and
> 'Link' metadata entries:
>
> xxxx at node2:~$ curl -v -d '' -H "Content-Type: text/plain" http://127.0.0.1:8098/riak/a/a
>
>
> (ctag at 172.20.1.31)44> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
> {ok,{r_object,<<"a">>,<<"a">>,
>               [{r_content,{dict,5,16,16,8,80,48,
>                                 {[],[],[],[],[],[],[],[],[],[],[],[],...},
>                                 {{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
>                           <<>>}],
>               [{<<5,134,53,93>>,{1,63473557230}}],
>               {dict,1,16,16,8,80,48,
>                     {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                     {{[],[],[],[],[],[],[],[],[],[],[],...}}},
>               undefined}}
> (ctag at 172.20.1.31)45> dict:to_list(riak_object:get_metadata(ObjStored)).
> [{<<"Links">>,[]},
>  {<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
>  {<<"content-type">>,"text/plain"},
>  {<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
>  {<<"X-Riak-Meta">>,[]}]
>
> (ctag at 172.20.1.31)46> iolist_size(erlang:term_to_binary(ObjStored)).
> 449
>
>
> Which leads to: (remember again to subtract 2 bytes)
>
> ( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 447 + 10 + 36 + 36 ) ) * 183915891 * 3 = 309.3 GB
>
>
> Nearly there!
>
> Now there are also the hintfiles, which are a kind of index into the
> bitcask data files to speed up the start of a riak node. The hintfiles
> contain one entry per key and the code that creates one entry looks like
> this:
>
>     [<<Tstamp:?TSTAMPFIELD>>, <<KeySz:?KEYSIZEFIELD>>,
>      <<TotalSz:?TOTALSIZEFIELD>>, <<Offset:?OFFSETFIELD>>, Key].
>
>
> So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key.
> So the final result, if you inserted the keys via the REST API, is:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value )
>   + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
> = ( 505 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
>
> ( 505 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 374636669967 = 348.9 GB
>
>
> And if you used Erlang (or probably any ProtocolBuffers client):
>
> ( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value )
>   + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
> = ( 414 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
>
> ( 414 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 324427631724 = 302.1 GB
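
Both final formulas can be folded into one small Python sketch where only the per-object overhead differs (447 bytes via REST, 356 via the Erlang client, as measured above). The function and parameter names are hypothetical, and "GB" again means 2^30 bytes.

```python
GIB = 2 ** 30


def disk_bytes(num_entries, bucket, key, value, n_val, object_overhead):
    """Estimated bitcask disk usage: data file record plus hintfile entry per key."""
    term_key = 13 + bucket + key                       # serialized {Bucket, Key}
    stored_value = object_overhead + bucket + key + value  # serialized riak object
    data_record = 14 + term_key + stored_value         # 14-byte bitcask record header
    hint_record = 18 + term_key                        # 18-byte hintfile entry header
    return (data_record + hint_record) * num_entries * n_val


for label, overhead in (("REST", 447), ("Erlang/PB", 356)):
    total = disk_bytes(183915891, 10, 36, 36, 3, overhead)
    print(label, total, round(total / GIB, 1))
```

Plugging in your own bucket, key and value sizes should give a lower bound, since space not yet reclaimed by merges is not modelled here.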
>
>
> So the truth is somewhere in between. But as David wrote, there can be
> additional overhead due to the append-only nature of bitcask.
>
> Cheers,
> Nico
>
> On 24.05.2011 23:48, Anthony Molinaro wrote:
>
> Just curious if anyone has any ideas, for the moment, I'm just taking
> the RAM calculation and multiplying by 2 and the Disk calculation and
> multiplying by 8, based on my findings with my current cluster.  But
> I would like to know why my values are so much higher than those I should
> be getting.
>
> Also, I'd still like to know how the forms calculate things as the disk
> calculation there does not match reality or the formula.
>
> Also, waiting to hear if there is any way to force merge to run so I can
> more accurately gauge whether multiple copies are affecting disk usage.
>
> Thanks,
>
> -Anthony
>
> On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
>
>  On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
>
>  On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
>
>  On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
> Thus, depending on
> your merge triggers, more space can be used than is strictly necessary
> to store the data.
>
>  So the lack of any overhead in the calculation is expected?  I mean
> according to http://wiki.basho.com/Cluster-Capacity-Planning.html
>
> Disk = Estimated Total Objects * Average Object Size * n_val
>
> Which just seems wrong, doesn't it?  I don't quite understand the
> bitcask code well enough yet to see what the actual data it stores is,
> but the whitepaper suggested several things were involved in the on
> disk representation.
>
>  Okay, finally found the code for this part, I kept looking in the nif
> but that's only the keydir, not the data files.  It looks like
>
>    %% Setup io_list for writing -- avoid merging binaries if we can help it
>    Bytes0 = [<<Tstamp:?TSTAMPFIELD>>, <<KeySz:?KEYSIZEFIELD>>,
>              <<ValueSz:?VALSIZEFIELD>>, Key, Value],
>    Bytes  = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>> | Bytes0],
>
> And looking at the header, it seems that there's 14 bytes of overhead
> (4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).
>
> So disk calculation should be
>
> ( 14 + Key + Value ) * Num Entries * N_Val
>
> So using my numbers from before that gives
>
> ( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB
>
> which actually isn't much closer to 341 GB than the previous calculation :(
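
The 14-byte header can be illustrated with a short Python sketch of the record layout described above. It assumes big-endian fields of 32, 16 and 32 bits for timestamp, key size and value size, plus a 32-bit CRC in front. zlib.crc32 stands in for erlang:crc32 here, so treat the exact bytes as illustrative and only the sizes as the point. The function name and sample timestamp are made up.

```python
import struct
import zlib

HEADER_BYTES = 4 + 4 + 2 + 4   # CRC + timestamp + key size + value size


def bitcask_record(tstamp: int, key: bytes, value: bytes) -> bytes:
    """Build one data-file record: CRC, timestamp, sizes, then key and value."""
    body = struct.pack(">IHI", tstamp, len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


record = bitcask_record(1306334912, b"k" * 36, b"v" * 36)
print(len(record))                     # header plus 36-byte key and value
print(len(record) * 183915891 * 3)     # all entries, three replicas
```
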
>
> So all my questions from the previous email still apply.
>
> -Anthony
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym at alumni.caltech.edu>