Large ring_creation_size

Greg Nelson grourk at dropcam.com
Wed Apr 20 13:47:15 EDT 2011


Hi Jon,

Thanks for looking into this! The changes you suggest do make sense. What ring_creation_size did you use?

I was able to get a cluster of 3 with ring_creation_size=2048 up and running on 3 nodes that have only 2GB or RAM each. Without having to patch riak_core_ring_handler, basically by joining the nodes slowly. So, let the first node get up and running and settle in to where it's using little or no CPU or RAM, then join the second node and wait several minutes for them to stabilize, then join the third. It also seems that the biggest spike is when the second node joins the first, so it probably tapers off with each subsequent join.

I'm wondering how much of an impact it had to disable stats and stop os_mon?

Greg
On Tuesday, April 19, 2011 at 8:24 AM, Jon Meredith wrote: 
> Hi Greg,
> 
> I had a chance to play around a little over the weekend. I was able to get a couple of nodes to join and store data over the weekend by doing the following.
> 
> * Disabling riak_core_ring_handler (commenting it out in riak_core_app around line 70). This means that vnodes will be started on-demand when they are first accessed and any fallbacks that were running when the node is stopped will not be restarted automatically. There is still a mechanism to pick a random vnode and start it up just in case it has some data, but it will take a long time to find it. 
> 
> * Disable stats in app.config
> 
> * Stop os_mon after startup - application:stop(os_mon) from the console.
> 
> * Bumped ERL_MAX_PORTS to 20000 or so etc/vm.args 
> 
> I got it to store data with ETS (after adding a -env ERL_MAX_ETS_TABLES 20000 to vm.args). The two nodes were able to join with <2gb memory per beam process on my 64-bit osx box.
> 
> Next I tried running with bitcask, but ran out of *os* level file descriptors and made my laptop sad and haven't had a chance to raise the limit even higher and retry.  If I have problems with that, my next tack was to try innostore as you can bound the number of open file handles using that. Once you have a large number of physical nodes then the per-node file handle requirements will drop, but you need to system to stay up long enough to get to that point. 
> 
> Good luck,
> Jon Meredith
> Basho Technologies.
> 
> 
> 
> On Thu, Apr 14, 2011 at 12:00 PM, Jon Meredith <jmeredith at basho.com> wrote:
> > Hi Greg,
> > 
> > I played with this a little last night and this morning and I can reproduce the behavior you are seeing - my two nodes ate more than a combined 15gig of memory with 16384 partitions and were promptly killed by the O/S. 
> > 
> > I haven't had a chance to analyze yet, so this is pure speculation. I suspect that implementations of functions for calculating partition owners, preference lists and which partitions to take when nodes join perform acceptably for the <= 1024 partition case but blow up in some way beyond that. I was hoping that 4096 would work for you, but it sounds like that has problems too. 
> > 
> > I'm finishing up some high priority work at the moment but will investigate as time permits.
> > 
> > Jon
> > 
> > On Thu, Apr 14, 2011 at 11:48 AM, Greg Nelson <grourk at dropcam.com> wrote:
> > > We have a exact idea of the amount of data we'll be storing, and the kinds of machines we'll be storing them on. The simple math of (total data we'll be storing 6 months from now) / (total capacity of a single node) * (number of duplicates of each datum we'd like to store for redundancy) gives us a number a concrete number of machines we'll need, whether we're using Riak or something else... 
> > > 
> > > However, that's a little off track from the question I'm trying to answer. Which is: 
> > > 
> > > Why is it when I start two nodes with a large-ish ring_creation_size -- as soon as I join the second node to the first, CPU and memory usage of both nodes goes through the roof? No data stored yet. This even happens with a ring_creation_size I wouldn't consider huge, like 4096. 
> > > 
> > > So that is the small configuration I'm starting with and that is the limit I'm hitting.
> > > 
> > > I realize that the feasibility of building out a single 1000 node cluster is a larger question. But I can tell you we'll get there; having to shard across 10 100-node clusters is an option. Regardless, I'd like to have an understanding of what the resource overhead of each vnode is... 
> > > On Thursday, April 14, 2011 at 6:38 AM, Sean Cribbs wrote:
> > > > Good points, Dave.
> > > > 
> > > > Also, it's worth mentioning that we've seen that many customers and open-source users think they will need many more nodes than they actually do. Many are able to start with 5 nodes and are happy for quite a while. The only way to tell what you actually need is to start with a baseline configuration and simulate some percentage above your current load. Once you've figured out what size that initial cluster is, start with (number of nodes) * 50 as the ring_creation_size (rounded to the nearest power of 2 of course). This gives you a growth factor of about 5 before you need to consider changing.
> > > > 
> > > > As well, there's some ops "common sense" that says the lifetime of any single architecture is 18 months or less. That doesn't necessarily mean that you'll be building a new cluster with a larger ring size in 18 months, but just that your needs will be different at that time and are hard to predict. Plan for now, worry about the 1000 node cluster when you actually need it.
> > > > 
> > > >  Sean Cribbs <sean at basho.com>
> > > > Developer Advocate
> > > > Basho Technologies, Inc.
> > > > http://basho.com/
> > > > 
> > > > 
> > > > On Apr 14, 2011, at 9:09 AM, Dave Barnes wrote:
> > > > > Sorry I feel compelled to chime in.
> > > > > 
> > > > > Maybe you could assess your physical node limits and start with a small configuration, then increase  it and increase it until you hit a limit.
> > > > > 
> > > > > Work small to large.
> > > > > 
> > > > >  Once you find the pain point, lets us know what resource ran out.
> > > > > 
> > > > > You will learn a lot along the way on how your servers behave and we'll discover a lot when you share the results.
> > > > > 
> > > > > Thanks for digging in,
> > > > > 
> > > > > Dave
> > > > > 
> > > > > On Wed, Apr 13, 2011 at 5:11 PM, Greg Nelson <grourk at dropcam.com> wrote:
> > > > > > Ok, how about in this case I described? It runs out of memory with a single pair of nodes...
> > > > > > 
> > > > > > (Or did you mean there's a connection between each pair of vnodes?) 
> > > > > > On Wednesday, April 13, 2011 at 1:56 PM, Jon Meredith wrote:
> > > > > > > Hi Greg et al,
> > > > > > > 
> > > > > > > As you say largest known is not largest possible. Internally within Basho, the largest cluster we've experimented with so far had 50 nodes. 
> > > > > > > 
> > > > > > > Going beyond that it's speculation from me about pain points. 
> > > > > > > 
> > > > > > > 1) It is true that you need enough file descriptors to start up all partitions when a node restarts - Riak checks if there is any handoff data pending for each partition. We have work scheduled to address that in the medium term. The plan is to only spin up partitions the node owns and any that have been started as fallbacks that handoff has not completed for. Until that work is done you will need a high ulimit with large ring sizes. 
> > > > > > > 
> > > > > > > 2) It is also true that Erlang runs a fully connected network, so there will be connections between each node pair in the cluster. We haven't determined the point at which it becomes a problem. 
> > > > > > > 
> > > > > > > So it looks like you'll be pushing the known limits. Basho will do our very best to help overcome any obstacles as you encounter them.
> > > > > > > 
> > > > > > > Jon Meredith
> > > > > > > Basho Technologies.
> > > > > > > 
> > > > > > > On Wed, Apr 13, 2011 at 1:41 PM, Greg Nelson <grourk at dropcam.com> wrote:
> > > > > > > > The largest known riak cluster != the largest possible riak cluster. ;-)
> > > > > > > > 
> > > > > > > > The inter node communication of the cluster depends on the data set and usage pattern, doesn't it? Or is there some constant overhead that tops out at a few hundred nodes? I should point out that we'll have big data, but not a huge number of keys. 
> > > > > > > > 
> > > > > > > > The number of vnodes in the cluster should be equal to the ring_creation_size under normal circumstances, shouldn't it? So when I have a one node cluster, that node is running ring_creation_size vnodes... File descriptors probably isn't a problem -- these machines won't be doing anything else, and the limits are set to 65536. 
> > > > > > > > 
> > > > > > > > Thinking about the internode communication you mentioned, that's probably where the resource hog is.. socket buffers, etc.
> > > > > > > > 
> > > > > > > > Anyway, I'd also love to hear more from basho. :)
> > > > > > > > On Wednesday, April 13, 2011 at 12:33 PM, siculars at gmail.com wrote:
> > > > > > > > > Ill just chime in and say that this is not practical for a few reasons. The largest known riak cluster has like 50 or 60 nodes. Afaik, inter node communication of erlang clusters top out at a few hundred nodes. I'm also under the impression that each physical node has to have enough file descriptors to accommodate every virtual node in the cluster. 
> > > > > > > > > 
> > > > > > > > > I'd love to hear more from basho. 
> > > > > > > > > 
> > > > > > > > > -alexander 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Sent from my Verizon Wireless BlackBerry
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Greg Nelson <grourk at dropcam.com>
> > > > > > > > >  Sender: riak-users-bounces at lists.basho.com
> > > > > > > > > Date: Wed, 13 Apr 2011 12:13:34 
> > > > > > > > > To: <riak-users at lists.basho.com>
> > > > > > > > >  Subject: Large ring_creation_size
> > > > > > > > > 
> > > > > > > > > _______________________________________________
> > > > > > > > > riak-users mailing list
> > > > > > > > > riak-users at lists.basho.com
> > > > > > > > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > >  riak-users mailing list
> > > > > > > > riak-users at lists.basho.com
> > > > > > > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > _______________________________________________
> > > > > >  riak-users mailing list
> > > > > > riak-users at lists.basho.com
> > > > > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > > > 
> > > > > 
> > > > >  _______________________________________________
> > > > > riak-users mailing list
> > > > > riak-users at lists.basho.com
> > > > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > > 
> > > > 
> > > 
> > > _______________________________________________
> > >  riak-users mailing list
> > > riak-users at lists.basho.com
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > 
> > 
> > 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110420/464a19c8/attachment.html>


More information about the riak-users mailing list