How to profile a cluster of Riak nodes

Greg Burd greg at basho.com
Thu Aug 9 13:02:03 EDT 2012


Amir,

I'll add one more major consideration to Ryan's excellent list: check your network for TCP Incast.  Every cluster at reasonable scale has to manage this issue carefully, and 20 nodes is more than enough to create this kind of problem (I have seen it with as few as 9).  Here's more information: http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/

Please also let us know how you've tuned your TCP stack on the cluster nodes, what your networking topology is (equipment/models/configuration, connections, etc.), and what other network traffic you might be putting on that same interface.  We commonly separate inbound client connections (HTTP, and Protobufs over TCP) from internal distributed Erlang and handoff traffic.  That can be key in reducing the number of TCP retransmits and slow starts.
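
One rough way to watch for retransmits while a benchmark runs is to sample the kernel's TCP counters before and after a short interval.  The sketch below assumes Linux nodes that expose /proc/net/snmp; the function names and the 5-second default interval are illustrative choices of mine, not part of Riak or basho_bench.

```python
import time


def parse_tcp_counters(snmp_text):
    """Return the Tcp: counters from /proc/net/snmp text as a dict of ints."""
    # Each protocol has a header line and a value line, both starting
    # with the protocol name, e.g. "Tcp:".
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("no Tcp: header/value pair found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def retransmit_ratio(before, after):
    """Ratio of the RetransSegs delta to the OutSegs delta between samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def measure(interval=5, path="/proc/net/snmp"):
    """Sample the local counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_tcp_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_tcp_counters(f.read())
    return retransmit_ratio(first, second)
```

Calling measure() on each node during a run gives one number per node to compare as the cluster grows; a ratio that climbs with node count would be consistent with the incast and retransmit problems described above, though it does not prove them.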

In a distributed database, the network is part of the thing being stressed and measured, so it should be given as much attention in your paper and in your preparation as anything else.

best of luck,

-greg


On Aug 9, 2012, at 12:12 PM, Ryan Zezeski <rzezeski at basho.com> wrote:

> Amir,
> 
> Are you using one node to run basho bench?  If so, have you tried running multiple basho bench instances on separate nodes (or tried other benchmark tools)?  There could be many reasons for your plateau, but I would first rule out that you're maxing out the basho bench instance or the node it is running on.
> 
> It would also help to know the hardware, any modifications to your app.config, and the basho bench config you are running.
> 
> As for profiling tools, everyone has their favorite, but some that come to mind:
> 
> 1. boundary - We've had much success internally at Basho using Boundary to view network traffic.
> 
> 2. iostat - Run it continuously at 1s intervals and watch for spikes.
> 
> 3. vmstat - Look for paging.
> 
> Finally, we've noticed that a lot of users run horrible OS settings, such as non-zero swappiness and other such things, which make a database server unhappy.  See this link: http://wiki.basho.com/LevelDB.html#Tuning-LevelDB
> 
> -Z
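
Ryan's paging and swappiness checks above can be roughed out in a few lines.  This is a minimal sketch, assuming Linux nodes with /proc/vmstat and /proc/sys/vm/swappiness; the function names are hypothetical, and it is no replacement for watching vmstat and iostat directly.

```python
import time


def parse_vmstat(vmstat_text):
    """Parse /proc/vmstat text ("name value" per line) into a dict of ints."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Pages swapped in and out between two /proc/vmstat samples."""
    return (after["pswpin"] - before["pswpin"],
            after["pswpout"] - before["pswpout"])


def swappiness_warning(swappiness_text):
    """Return a warning string if vm.swappiness is non-zero, else None."""
    value = int(swappiness_text.strip())
    if value != 0:
        return "vm.swappiness is %d (non-zero)" % value
    return None


def check_node(interval=1):
    """Sample the local node over `interval` seconds (Linux only)."""
    with open("/proc/sys/vm/swappiness") as f:
        warning = swappiness_warning(f.read())
    with open("/proc/vmstat") as f:
        first = parse_vmstat(f.read())
    time.sleep(interval)
    with open("/proc/vmstat") as f:
        second = parse_vmstat(f.read())
    return warning, swap_activity(first, second)
```

check_node() returns the swappiness warning (or None) together with the pages swapped in and out during the interval; any non-zero swap counts while the benchmark is running are worth investigating before blaming Riak.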
> 
> On Thu, Aug 9, 2012 at 10:32 AM, amir ghaffari <ami.ghaffari at gmail.com> wrote:
> Hi there,
> 
> I ran a scalability benchmark of the Riak DBMS, and we couldn't scale the throughput beyond 20 Riak nodes. The benchmark was run with Basho_Bench on a 31-node cluster in which each node has its own hard disk, but the maximum throughput is reached at 20 nodes.
> 
> I’d like to understand why Riak didn’t scale, e.g. whether it is the connection or other network traffic. I’d like to use some profiling tools to get more information. Could you please recommend a helpful profiling tool?
> 
> Thanks in advance,
> 
> Amir
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
