Riak 1.4.2 10G Ethernet Performance Problems

Earl Ruby earl_ruby at xyratex.com
Fri Jun 20 21:22:46 EDT 2014


True, but if I test random writes of 1M objects using fio on the filesystem
-- no Riak involved -- the aggregate random-write speed (32 fio
threads all writing randomly to one node's array) is 9.4 Gbits/sec ***per
node***, and I have 6 nodes. If Riak were distributing all writes evenly and
I had infinite network bandwidth I'd expect something closer to 9.4 * 6 =
56.4 Gbits/s over the whole cluster.
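
For reference, a fio job file along the lines of the test described above
might look like this -- a sketch only; the directory, total size, and runtime
are placeholders, not the actual job I ran:

```ini
; randwrite-1m.fio -- hypothetical job approximating the test above:
; 32 jobs doing 1M random writes against one node's array.
[global]
ioengine=libaio
direct=1
rw=randwrite
bs=1m
size=4g
time_based
runtime=60
group_reporting

[randwrite-1m]
directory=/data/riak
numjobs=32
```

Run with `fio randwrite-1m.fio` and compare the aggregate bandwidth line
against the per-node figure above.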

However, iperf3 shows the network topping out at 9.92 Gbits/s, so I expect
to run out of network bandwidth before disk speed becomes a bottleneck.
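
The iperf3 measurement was along these lines (illustrative commands only;
the IP is a placeholder):

```shell
# On one node, start the server:
iperf3 -s

# On a second node, run the client with the MSS mentioned below;
# -t 30 runs the measurement for 30 seconds:
iperf3 -c 10.0.0.1 -M 9000 -t 30
```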

If I use basho_bench to test RiakCS with a new bucket, an 8M file_size and
a 1M ibrowse_chunk_size, the aggregate network speed (checked with
jnettop) hovers right around 3Gbits/s, and my CPUs are mostly idle -- it
seems like writes should be roughly 3x faster.
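
For context, the basho_bench config for that run was along these lines.
This is a sketch: file_size, ibrowse_chunk_size, and the concurrency figure
are the settings mentioned in this thread; the driver name, duration, and
operation mix are assumptions, not my exact config:

```erlang
%% Hypothetical basho_bench config approximating the run described above.
{mode, max}.
{duration, 10}.
{concurrent, 200}.
{driver, basho_bench_driver_cs}.
{operations, [{insert, 1}]}.
{file_size, 8388608}.            %% 8M objects
{ibrowse_chunk_size, 1048576}.   %% 1M chunks
```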

I just read an interesting blog post,
http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/, which
describes what sounds like the same issue. I'm going to try some of the
tests suggested there and see if I'm hitting the problem the author describes.



On 20 June 2014 11:09, Evan Vigil-McClanahan <emcclanahan at basho.com> wrote:

> In my testing with large (though smaller, median 40k) binaries, I found
> that the settings gave a noticeable bump (8000 -> 10000 ops/s), but
> only so far as the disk could keep up (and the disk cache, of course).
> Typically, for larger objects, you're going to be disk limited most of
> the time.  Remember that riak is basically doing a bunch of random,
> mid-sized (to the disk, at least) reads and writes here, so disk
> limitations are going to make it really hard to get near your
> disk's theoretical maximum.
>
> On Fri, Jun 20, 2014 at 10:53 AM, Chris Read <chris.read at gmail.com> wrote:
> > We still have this problem (we're on riak 1.4.9) and it's very
> > frustrating!
> >
> > Our average object size right now is ~250k. We're running with:
> >
> > +zdbbl 2097151
> >
> >
> > I've tried the settings above on a 5 node test cluster, no improvement.
> >
> > I then bumped both buffers up to 1048576 on all nodes - no improvement.
> >
> > Finally I tried putting the buffers up to 4194304 - still no improvement.
> >
> > For the record my kernel is Ubuntu 3.13.0-27, with the following
> > network settings:
> >
> > net.core.netdev_max_backlog = 10000
> > net.core.rmem_default = 8388608
> > net.core.rmem_max = 104857600
> > net.core.somaxconn = 4000
> > net.core.wmem_default = 8388608
> > net.core.wmem_max = 104857600
> > net.ipv4.tcp_congestion_control = cubic
> > net.ipv4.tcp_fin_timeout = 15
> > net.ipv4.tcp_low_latency = 0
> > net.ipv4.tcp_max_syn_backlog = 40000
> > net.ipv4.tcp_slow_start_after_idle = 0
> > net.ipv4.tcp_tw_reuse = 1
> >
> > Chris
> >
> > On Wed, Jun 18, 2014 at 7:32 PM, Evan Vigil-McClanahan
> > <emcclanahan at basho.com> wrote:
> >> Hi Earl,
> >>
> >> There are some known internode bottlenecks in riak 1.4.x.  We've
> >> addressed some of them in 2.0, but others likely remain.  If you're
> >> willing to run some code at the console, running the following at the
> >> console (from `riak attach`) should tell you whether or not the 2.0
> >> changes are likely to help you.  I am not sure when 2.0 ready versions
> >> of CS are slated for, however.
> >>
> >> -----
> >> [inet:setopts(Port, [{sndbuf, 393216}, {recbuf, 786432}])
> >>   || {_Node, Port} <- erlang:system_info(dist_ctrl)].
> >>
> >> or to run this on all nodes (which you'll have to do to see if it helps):
> >>
> >> FF = fun() ->
> >>          [inet:setopts(Port, [{sndbuf, 393216}, {recbuf, 786432}])
> >>            || {_Node, Port} <- erlang:system_info(dist_ctrl)]
> >>      end.
> >> rpc:multicall(erlang, apply, [FF, []]).
> >>
> >> You should not run any of this on production machines without
> >> extensive testing first.  Also if you have huge objects, like in a CS
> >> cluster, it may help to increase the buffer sizes somewhat.
> >>
> >> Note that increasing +zdbbl in your vm.args can also help somewhat, if
> >> it isn't already prohibitively large.
> >>
> >> Hope that this helps.  Let us know what you find.
> >>
> >> Evan
> >>
> >> On Wed, Jun 18, 2014 at 4:57 PM, Earl Ruby <earl_ruby at xyratex.com> wrote:
> >>> Chris Read:
> >>>
> >>> Back in 2013 you reported a performance problem with Riak 1.4.2 running
> >>> on a 10GbE network where Riak would never hit speeds faster than
> >>> 2.5Gbps on the network.
> >>>
> >>> I'm seeing the same thing with Riak 1.4.2 and RiakCS. I've followed all
> >>> of the tuning suggestions, my MTU is set to 9000 on the ethernet
> >>> interfaces, I have one 10GbE network just for the backend inter-node
> >>> data and one 10GbE "public" network where RiakCS listens for
> >>> connections and which basho_bench uses to generate the load. I have
> >>> 1-4 client systems on the public side running basho_bench, and no
> >>> matter how much traffic I generate with basho_bench I never see more
> >>> than 3Gbits/s on the network. (It doesn't seem to matter if I run 1 or
> >>> 4 clients, each with 200 concurrent sessions; the network data rate is
> >>> about the same.) I'm running jnettop in two different windows during
> >>> the tests to watch the aggregate network traffic on the private
> >>> inter-node data network and the "public" basho_bench traffic-generating
> >>> network.
> >>>
> >>> I've tested the network with iperf3 and it shows 9.92Gbits/s throughput
> >>> with a TCP maximum segment size of 9000.
> >>>
> >>> I've tested the filesystems on each of the 6 Riak nodes using fio, and
> >>> I can write to the filesystems at ~12.8Gbits/s, so the filesystem is
> >>> not the bottleneck. Each node has 128GB RAM and is running the bitcask
> >>> backend. The servers are mostly idle.
> >>>
> >>> I tried Sean's solution of increasing these values to:
> >>>
> >>> {riak_core, [
> >>>     {handoff_batch_threshold, 4194304},
> >>>     {handoff_concurrency, 10} ]}
> >>>
> >>> ... as described in
> >>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-October/013787.html,
> >>> but that had no effect.
> >>>
> >>> With my current hardware I'd expect the 10GbE network to be the
> >>> bottleneck, and I'd expect write speeds to top out near the top end of
> >>> the network speed.
> >>>
> >>> There was no follow-up message on the mailing list to indicate how or
> >>> if you'd solved the problem. Did you find a solution?
> >>>
> >>> (Please direct replies to the mailing list.)
> >>>
> >>>
> >>> _______________________________________________
> >>> riak-users mailing list
> >>> riak-users at lists.basho.com
> >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >>>
> >>
>



-- 

*Earl C. Ruby III*
*Software Staff Engineer (Cloud Platform)*
*+1 (415) 527-7275*