Riak for Messaging Project Question
jeremiah.peschka at gmail.com
Wed Feb 22 18:15:30 EST 2012
Managing Director, Brent Ozar PLF, LLC
On Wed, Feb 22, 2012 at 2:10 PM, <charles at contentomni.com> wrote:
> I'm building an online tool/app that is heavily dependent on messaging. This
> messaging is simple text, nothing complicated, and it will take place
> between my server back-end and the desktop/device. These messages would be
> very easy to store in Riak.
> Each message is created after a specific user event e.g. a user posts a
> request, etc. In turn, each message created could spawn another 200 to 3,000
> messages (based on some other social networking features I can't say too
> much about to keep this short). I believe, in this case, we can assume each
> message will be a Riak Object.
> All told, by my estimation, I'm looking at 400,000 messages/objects
> generated per user per year. With an estimated active user base of 20
> million (I hope some day), that would be 8 billion keys generated each year.
> The size of each object is about 2Kb max. So that works out about 16
> Terabytes of data generated per year.
> 1. Is Riak a good fit for this solution going up to and beyond 20 million
> users (i.e. terabytes upon terabytes added per year)?
I think Riak is a good fit for this solution in terms of the ability
to handle data size.
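
As a rough check on the data size, here is a minimal sketch that multiplies out the quoted figures. It assumes decimal units and takes the "8 billion keys" and "2Kb max" numbers at face value. Note that 8 billion keys across 20 million users works out to 400 keys per user per year; 400,000 per user would be 8 trillion keys, so it is not clear from the thread which figure was intended.

```python
# Back-of-the-envelope sizing from the figures quoted above.
# Assumption: decimal units (1 KB = 1000 bytes, 1 TB = 1000**4 bytes).
KB = 1000
TB = 1000 ** 4

keys_per_year = 8_000_000_000   # "8 billion keys generated each year"
object_size_bytes = 2 * KB      # "about 2Kb max"
users = 20_000_000              # "active user base of 20 million"

# Raw data per year, before any replication (n_val) is applied.
raw_tb_per_year = keys_per_year * object_size_bytes / TB

# Per-user key count implied by the 8 billion total.
keys_per_user = keys_per_year / users

print(raw_tb_per_year)  # 16.0
print(keys_per_user)    # 400.0
```

This is raw data only; every replica multiplies the on-disk footprint.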
> 2. I plan to use 2i, which means I would be using the LevelDB backend. Will
> this be reasonably performant for billions of keys added each year?
LevelDB is a good backend fit, especially for when the size of your
keyspace exceeds the size of RAM.
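
For reference, a hedged sketch of what tagging and querying a message with 2i looks like over Riak's HTTP interface. The node address, the bucket name `messages`, and the index name `recipient_bin` are made up for illustration; the code only builds the request pieces and does not contact a server.

```python
from urllib.parse import quote

RIAK_HTTP = "http://127.0.0.1:8098"  # hypothetical node address (8098 is the default HTTP port)


def store_request(bucket, key, recipient):
    """Build the method, URL and headers for storing an object with one 2i entry."""
    url = f"{RIAK_HTTP}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"
    headers = {
        "Content-Type": "text/plain",
        # Secondary index entries travel as x-riak-index-<name> headers.
        "x-riak-index-recipient_bin": recipient,
    }
    return "PUT", url, headers


def index_query_url(bucket, index, value):
    """Build the URL for an exact-match 2i lookup."""
    return (f"{RIAK_HTTP}/buckets/{quote(bucket, safe='')}"
            f"/index/{quote(index, safe='')}/{quote(value, safe='')}")


method, url, headers = store_request("messages", "msg-12345", "user-42")
query = index_query_url("messages", "recipient_bin", "user-42")
print(method, url)
print(query)
```

The lookup returns matching keys, so fetching the message bodies is a second round of reads.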
> 3. I'm using what I have here
> (http://wiki.basho.com/Cluster-Capacity-Planning.html) as my guide for
> capacity planning. I plan on using Rackspace Cloud Servers for this specific
> project. Can I just keep adding servers as the size of my data grows?!
This planning guide is aimed at planning for Bitcask specifically, but
most of the advice still applies.
You can keep adding servers, but you need to be careful about the
initial size of your ring. The ring size defaults to 64 virtual nodes
and it can't be changed once you put data in the cluster, so you'll
need to do some careful planning up front. Having more virtual nodes
will enable you to safely increase the size of your cluster. I believe
the current guidance is that you want no fewer than 10 v-nodes per
physical server in the cluster. Also, I seem to recall reading that
the number of v-nodes should be a power of two. Going by this,
you'll want to start with 2048 v-nodes, which could prove somewhat
problematic on a small cluster.
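
The rule of thumb above can be sketched as a small calculation. It assumes the two constraints as stated (at least 10 v-nodes per physical server, ring size a power of two); the function name is made up for illustration. The 2048 figure corresponds to roughly 200 servers.

```python
def suggested_ring_size(servers, vnodes_per_server=10):
    """Smallest power of two giving at least vnodes_per_server v-nodes per server."""
    needed = servers * vnodes_per_server
    size = 1
    while size < needed:
        size *= 2
    return size


print(suggested_ring_size(5))    # 64
print(suggested_ring_size(200))  # 2048
print(suggested_ring_size(400))  # 4096
```

Because the ring size is fixed once data is in the cluster, the server count to plug in is the largest cluster you expect to grow to, not the one you start with.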
> 4. From the guide mentioned in 3 above, it appears I will need about 400
> [4GbRAM 160GbHDD] servers for 20 million users (assuming an n_val of 4).
> This means I would need to add 20 servers annually for each million active
> users I add. Is it plausible to have an n_val of 4 for this many servers?!
> Wouldn't going higher just mean I'd have to add many more servers
I'm not sure I understand the question. Basically, each node in the
cluster is aware of where data belongs. When you query a Riak node,
it'll route the request to the nodes that should have the key in
question. With an n_val of 4 and say 200 servers, you'll still be
querying a maximum of 5 servers (one for whichever node coordinates
the request, and up to 4 servers sending data back). With a large
number of servers, I would be more concerned about traffic around the
ring. However, some of the changes outlined in Riak 1.1's release
notes make me think that it isn't that big of a concern.
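
To illustrate why the number of servers touched per request stays small, here is a simplified sketch of key placement on a ring. It is not Riak's actual hashing or claim algorithm: the key encoding and the round-robin assignment of partitions to servers are assumptions made for the example. The point is only that a key maps to n_val consecutive partitions regardless of cluster size.

```python
import hashlib

RING_SPACE = 2 ** 160  # size of the SHA-1 output space


def preference_list(bucket, key, ring_size, n_val):
    """Return n_val consecutive partition indexes for a key (simplified)."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    partition_width = RING_SPACE // ring_size
    first = position // partition_width
    return [(first + i) % ring_size for i in range(n_val)]


def owners(partitions, servers):
    """Hypothetical round-robin assignment of partitions to physical servers."""
    return [p % servers for p in partitions]


partitions = preference_list("messages", "msg-12345", ring_size=2048, n_val=4)
nodes = owners(partitions, servers=200)
print(partitions, nodes)
```

With 200 servers or 2,000, the list of replicas for a given key still has only n_val entries.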
As an aside, 4GB of RAM and a 160GB HDD sounds like the specs of a
low-end cable box. You can avoid having 200+ servers by using servers
with more RAM and more drives. It's something to plan for in the long
run, but you can fit an incredible amount of storage and RAM into some
server chassis. E.g. the Dell C2200 can hold 192GB of RAM and many TB
of storage - 12 bays in the chassis - and the server doesn't cost that
much in the grand scheme of things.
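
A rough sketch of how disk size per server drives the server count, counting disk space only. It assumes the 16 TB per year raw estimate and an n_val of 4 from the question, decimal units, and a hypothetical 12 x 2TB drive configuration for the larger chassis. It ignores RAM, free-space headroom and backend overhead, so treat the results as lower bounds.

```python
import math


def servers_needed(raw_tb, n_val, disk_gb_per_server):
    """Servers required to hold raw_tb of data replicated n_val times (disk only)."""
    total_gb = raw_tb * 1000 * n_val
    return math.ceil(total_gb / disk_gb_per_server)


# One year of data on the 160GB servers from the question.
print(servers_needed(16, 4, 160))        # 400

# Same data on a hypothetical chassis with 12 x 2TB drives.
print(servers_needed(16, 4, 12 * 2000))  # 3
```

The first result matches the 400-server estimate in the question. The second is far below what you would actually run, but it shows how much of that count is an artifact of the tiny disks.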
> 5. Should I put all my keys in one bucket (considering I'm using 2i, does it
Buckets are a logical namespace - use as many or as few as you want.
Of course, using multiple buckets could make it easier to logically
move some of your data to another cluster if you find that one cluster
can't handle the load.
> I'd appreciate some assistance with this.
A word of warning: I/O is your enemy in shared hosting environments.
Be wary of Rackspace's I/O pipeline. Most cloud providers are using
low-end commodity servers with low-end commodity storage in the back
end. That means you're going to share a host server with multiple
tenants, and you'll be sharing the same single crappy Broadcom ethernet
port with everyone else on that box. You will most likely also be
sharing the same Dell EqualLogic or EMC Isilon (I think Rackspace use
the Isilon unless you ask for a VMAX).
Point is: you'll have a terribly narrow and shared pipeline to your
disk subsystem. Expect I/O throughput of 70MB/s or lower.
Or... what you'd expect from a USB flash drive.
Edit: Rackspace claim to be using local storage, so you'll be
fighting with everyone else on your server for access to the same four
7200 RPM drives ;) Again, expect terrible performance and you won't be