Large ring_creation_size

Dave Barnes dbarnes001 at gmail.com
Thu Apr 14 16:47:21 EDT 2011


Greg,

What is the size of the HW or VM you plan to deploy as 1000 nodes (memory
and disk space)?
I'm very interested in the trade-off between hardware and software...

Dave

On Thu, Apr 14, 2011 at 2:00 PM, Jon Meredith <jmeredith at basho.com> wrote:

> Hi Greg,
>
> I played with this a little last night and this morning and I can reproduce
> the behavior you are seeing - my two nodes ate more than a combined 15gig of
> memory with 16384 partitions and were promptly killed by the O/S.
>
> I haven't had a chance to analyze yet, so this is pure speculation. I
> suspect that implementations of functions for calculating partition owners,
> preference lists and which partitions to take when nodes join perform
> acceptably for the <= 1024 partition case but blow up in some way beyond
> that. I was hoping that 4096 would work for you, but it sounds like that has
> problems too.
>
> I'm finishing up some high priority work at the moment but will investigate
> as time permits.
>
> Jon
>
> On Thu, Apr 14, 2011 at 11:48 AM, Greg Nelson <grourk at dropcam.com> wrote:
>
>>  We have a exact idea of the amount of data we'll be storing, and the
>> kinds of machines we'll be storing them on.  The simple math of (total data
>> we'll be storing 6 months from now) / (total capacity of a single node) *
>> (number of duplicates of each datum we'd like to store for redundancy) gives
>> us a number a concrete number of machines we'll need, whether we're using
>> Riak or something else...
>>
>> However, that's a little off track from the question I'm trying to answer.
>>  Which is:
>>
>> Why is it when I start two nodes with a large-ish ring_creation_size -- as
>> soon as I join the second node to the first, CPU and memory usage of both
>> nodes goes through the roof?  No data stored yet.  This even happens with a
>> ring_creation_size I wouldn't consider huge, like 4096.
>>
>> So that is the small configuration I'm starting with and that is the limit
>> I'm hitting.
>>
>> I realize that the feasibility of building out a single 1000 node cluster
>> is a larger question.  But I can tell you we'll get there; having to shard
>> across 10 100-node clusters is an option.  Regardless, I'd like to have an
>> understanding of what the resource overhead of each vnode is...
>>
>> On Thursday, April 14, 2011 at 6:38 AM, Sean Cribbs wrote:
>>
>> Good points, Dave.
>>
>> Also, it's worth mentioning that we've seen that many customers and
>> open-source users think they will need many more nodes than they actually
>> do. Many are able to start with 5 nodes and are happy for quite a while.
>>  The only way to tell what you actually need is to start with a baseline
>> configuration and simulate some percentage above your current load.  Once
>> you've figured out what size that initial cluster is, start with (number of
>> nodes) * 50 as the ring_creation_size (rounded to the nearest power of 2 of
>> course).  This gives you a growth factor of about 5 before you need to
>> consider changing.
>>
>> As well, there's some ops "common sense" that says the lifetime of any
>> single architecture is 18 months or less.  That doesn't necessarily mean
>> that you'll be building a new cluster with a larger ring size in 18 months,
>> but just that your needs will be different at that time and are hard to
>> predict.  Plan for now, worry about the 1000 node cluster when you actually
>> need it.
>>
>>  Sean Cribbs <sean at basho.com>
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>>
>> On Apr 14, 2011, at 9:09 AM, Dave Barnes wrote:
>>
>> Sorry I feel compelled to chime in.
>>
>> Maybe you could assess your physical node limits and start with a small
>> configuration, then increase  it and increase it until you hit a limit.
>>
>> Work small to large.
>>
>> Once you find the pain point, lets us know what resource ran out.
>>
>> You will learn a lot along the way on how your servers behave and we'll
>> discover a lot when you share the results.
>>
>> Thanks for digging in,
>>
>> Dave
>>
>> On Wed, Apr 13, 2011 at 5:11 PM, Greg Nelson <grourk at dropcam.com> wrote:
>>
>>   Ok, how about in this case I described?  It runs out of memory with a
>> single pair of nodes...
>>
>> (Or did you mean there's a connection between each pair of vnodes?)
>>
>> On Wednesday, April 13, 2011 at 1:56 PM, Jon Meredith wrote:
>>
>> Hi Greg et al,
>>
>> As you say largest known is not largest possible.  Internally within
>> Basho, the largest cluster we've experimented with so far had 50 nodes.
>>
>> Going beyond that it's speculation from me about pain points.
>>
>> 1) It is true that you need enough file descriptors to start up all
>> partitions when a node restarts - Riak checks if there is any handoff data
>> pending for each partition.  We have work scheduled to address that in the
>> medium term. The plan is to only spin up partitions the node owns and any
>> that have been started as fallbacks that handoff has not completed for.
>> Until that work is done you will need a high ulimit with large ring sizes.
>>
>> 2) It is also true that Erlang runs a fully connected network, so there
>> will be connections between each node pair in the cluster.  We haven't
>> determined the point at which it becomes a problem.
>>
>> So it looks like you'll be pushing the known limits.  Basho will do our
>> very best to help overcome any obstacles as you encounter them.
>>
>> Jon Meredith
>> Basho Technologies.
>>
>> On Wed, Apr 13, 2011 at 1:41 PM, Greg Nelson <grourk at dropcam.com> wrote:
>>
>>  The largest known riak cluster != the largest possible riak cluster.
>>  ;-)
>>
>> The inter node communication of the cluster depends on the data set and
>> usage pattern, doesn't it?  Or is there some constant overhead that tops out
>> at a few hundred nodes?  I should point out that we'll have big data, but
>> not a huge number of keys.
>>
>> The number of vnodes in the cluster should be equal to the
>> ring_creation_size under normal circumstances, shouldn't it?  So when I have
>> a one node cluster, that node is running ring_creation_size vnodes...  File
>> descriptors probably isn't a problem -- these machines won't be doing
>> anything else, and the limits are set to 65536.
>>
>> Thinking about the internode communication you mentioned, that's probably
>> where the resource hog is..  socket buffers, etc.
>>
>> Anyway, I'd also love to hear more from basho.  :)
>>
>> On Wednesday, April 13, 2011 at 12:33 PM, siculars at gmail.com wrote:
>>
>> Ill just chime in and say that this is not practical for a few reasons.
>> The largest known riak cluster has like 50 or 60 nodes. Afaik, inter node
>> communication of erlang clusters top out at a few hundred nodes. I'm also
>> under the impression that each physical node has to have enough file
>> descriptors to accommodate every virtual node in the cluster.
>>
>> I'd love to hear more from basho.
>>
>> -alexander
>>
>>
>> Sent from my Verizon Wireless BlackBerry
>>
>> -----Original Message-----
>> From: Greg Nelson <grourk at dropcam.com>
>> Sender: riak-users-bounces at lists.basho.com
>> Date: Wed, 13 Apr 2011 12:13:34
>> To: <riak-users at lists.basho.com>
>> Subject: Large ring_creation_size
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110414/519f9a09/attachment.html>


More information about the riak-users mailing list