Riak Cluster Crash down on heavy load Benchmarking

Ryan Zezeski rzezeski at basho.com
Tue Jun 19 14:51:24 EDT 2012


Hi Amol, my response is inline.

On Sat, Jun 16, 2012 at 3:43 AM, Amol Rajoba <amolrajoba at gmail.com> wrote:
>
> Clients were connected using protocol buffer api.
> {pb_backlog, 100000}, in app.config

Why are you setting the backlog so high?  AFAIK that's 3 orders of
magnitude above the default.  I don't pretend to know the details of
the Linux TCP/IP stack, but a backlog of that size does not come free.
From a quick Google search it appears the TCB structure can be
anywhere from 280 to 1300 bytes [1].  Let's assume a middle ground of
790 bytes.  Given that and the 100K backlog we are looking at
790 * 100000 = 79000000 bytes, or ~75MB.  I'm not sure if there is
other overhead for keeping that many TCBs, but it seems sketchy to me.
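
For anyone who wants to redo the estimate with other numbers, here is
a minimal Python sketch of the same arithmetic.  The 280 and 1300 byte
bounds are the ones from [1]; the function name and the choice of the
midpoint are only illustrative assumptions, and real kernel overhead
per queued connection may differ.

```python
# Rough estimate of memory tied up by a full listen backlog.
# Assumption: each queued connection costs one TCB of `tcb_bytes` bytes.

def backlog_memory_bytes(backlog, tcb_bytes):
    """Return the estimated bytes needed if every backlog slot is in use."""
    return backlog * tcb_bytes

BACKLOG = 100_000              # pb_backlog from the quoted app.config
TCB_MIN, TCB_MAX = 280, 1300   # per-TCB size range cited in [1]
TCB_MID = (TCB_MIN + TCB_MAX) // 2

for label, size in (("min", TCB_MIN), ("mid", TCB_MID), ("max", TCB_MAX)):
    total = backlog_memory_bytes(BACKLOG, size)
    print(f"{label}: {size} B/TCB -> {total} bytes (~{total / 2**20:.1f} MiB)")
```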


>
> Nodes: 2  (I know cluster of 5 is best but this is just test setup)
> OS: Ubuntu 12.04 32bit
> CPU: Core i3
> RAM: 4GB
> HDD: 500GB

In a real production cluster we recommend a minimum of 5 nodes.  With
two nodes you'll have replicas overlapping which will increase IO
contention and reduce availability/safety in node down/crash
scenarios.  Trying to infer how 5 nodes will run from 2 nodes is a
guessing game IMO.

>
> app.config [changes only]
>
> %% eLevelDB Config
>  {eleveldb, [
>              {data_root, "/data/riak/leveldb"},
>              {block_size, 262144}, %% 256k
>              {cache_size, 104857600}, %% 100MB - default cache size 8MB per-partition
>              {write_buffer_size, 524288000}, %% 500MB in bytes
>              {write_buffer_size_min, 524288000}, %% 500MB in bytes
>              {write_buffer_size_max, 524288000}, %% 500MB in bytes
>              {max_open_files, 100} %% Maximum number of files open at once per partition - Default: 20 - Minimum: 20
>             ]},
>
>
> vm.args [changes only]
> ## Enable kernel poll and a few async threads
> +K true
> +A 128

Do you realize those eleveldb configs apply to each instance (one per
vnode)?  Assuming the default ring size of 64, two nodes means 32
leveldb instances per node.  I would recommend trying the defaults
first.
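
To put a number on that, here is a rough Python sketch of the
per-node worst case.  It assumes the default ring size of 64 and that
every vnode fills both its cache and its write buffer; how leveldb
actually allocates memory is more nuanced, so treat the result as an
upper bound rather than a measurement.  The function name is
hypothetical.

```python
# Upper-bound estimate of leveldb memory per node when settings apply
# per vnode. Assumption: ring size 64 (the default), spread evenly.

MIB = 2**20

def per_node_bytes(ring_size, nodes, cache_size, write_buffer_size):
    """Worst case: every vnode on a node uses its full cache and buffer."""
    vnodes_per_node = ring_size // nodes
    return vnodes_per_node * (cache_size + write_buffer_size)

RING_SIZE = 64                   # assumed default ring size
NODES = 2                        # from the quoted test setup
CACHE_SIZE = 104857600           # cache_size from the quoted config
WRITE_BUFFER_SIZE = 524288000    # write_buffer_size from the quoted config

total = per_node_bytes(RING_SIZE, NODES, CACHE_SIZE, WRITE_BUFFER_SIZE)
print(f"{RING_SIZE // NODES} vnodes per node, "
      f"up to {total / MIB:.0f} MiB per node")
```

Under those assumptions the bound comes out to 19200 MiB per node,
far more than the 4GB of RAM on each machine.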

-Z

[1]: http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_9-4/syn_flooding_attacks.html



