First stab at sizing a cluster

N. Tucker ntucker-ml-riak-users at august20th.com
Sat Oct 19 14:47:45 EDT 2013


Hi all, I've been experimenting with using riak to index a large
amount of log data collected from a bunch of different app instances
across different machines.  I have our app code instrumented such that
it attaches secondary indexes to log entries based on some interesting
metadata (for example, the date, the thread id, the hostname, the
identity of the user on whose behalf we were doing something, if
appropriate) and then submits them to riak.  So far I have this
working against a 1-node riak cluster on a very small slice of
production log data, which obviously doesn't really add much benefit.
Time to see about scaling it up.

Ultimately, I'd like my database to reflect an N-most-recent-days
window of our logs, to make querying them easier than grepping
gigabytes and gigabytes of logs across dozens of machines.  The
secondary indexes are especially appealing, because the most common
task is "give me all the logs associated with this user across all
machines for a given date window".  This seems like a problem riak is
well suited for, given the appropriate secondary indexes.
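
For illustration, here is one way the "user plus date window" lookup could be expressed as a single secondary-index range query. This is only a sketch under assumptions: the composite index name `user_date_bin`, the bucket name `logs`, and the `user_YYYYMMDD` term layout are all hypothetical and not something I have running. The URL shape follows Riak's HTTP 2i range-query form, `/buckets/<bucket>/index/<index>/<start>/<end>`.

```python
from datetime import date
from urllib.parse import quote


def user_date_term(user_id: str, day: date) -> str:
    """Build a composite index term: user id, then the day as YYYYMMDD.

    A fixed-width date suffix keeps lexicographic order equal to
    chronological order within one user's terms.
    """
    return f"{user_id}_{day:%Y%m%d}"


def user_window_query_path(bucket: str, user_id: str, start: date, end: date) -> str:
    """Return the HTTP path for a 2i range query over one user's date window."""
    index = "user_date_bin"  # hypothetical index name (assumption)
    lo = quote(user_date_term(user_id, start), safe="")
    hi = quote(user_date_term(user_id, end), safe="")
    return f"/buckets/{quote(bucket, safe='')}/index/{index}/{lo}/{hi}"


if __name__ == "__main__":
    print(user_window_query_path("logs", "alice", date(2013, 10, 12), date(2013, 10, 19)))
```

The idea is that each log entry would carry this composite term in addition to the plain date and user indexes, so one range query covers the whole window for one user instead of one query per day.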

Having no riak sizing experience to speak of and no outside guidance,
my plan was basically to start out with a 3- or 5-node
cluster of SoftLayer's "small" riak nodes (see
http://www.softlayer.com/solutions/big-data/riak-hosting ) or
comparable hardware, then start shoveling data into it and see how
large a window I can retain (and query against with reasonable
performance) while still writing at full blast (assuming I can
actually write full blast to it -- that remains to be seen).

But then I realized there are probably a few people on this list who
might be able to give me at least a rough recommendation if I give
some details on the data load.  The average log entry is around 140
bytes of message and maybe another 60 bytes of metadata for secondary
indexes.  We churn out about 400 million of these log entries per day,
which averages out to roughly 4,600 per second.
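
To make the load concrete, the arithmetic behind those numbers can be sketched as below. The entry size and daily volume come from the figures above. The replication factor of 3 is an assumption (Riak's default `n_val`), the 7-day window is purely an illustrative value for N, and the result ignores per-object, index, and LevelDB overhead, so treat it as a lower bound on raw payload rather than a disk-sizing answer.

```python
ENTRIES_PER_DAY = 400_000_000   # from the figures above
MESSAGE_BYTES = 140             # average message size
INDEX_METADATA_BYTES = 60       # average secondary-index metadata size
SECONDS_PER_DAY = 86_400
N_VAL = 3                       # assumption: Riak's default replication factor


def estimate(window_days: int) -> dict:
    """Rough payload-only estimate; excludes storage and index overhead."""
    entry_bytes = MESSAGE_BYTES + INDEX_METADATA_BYTES
    raw_gb_per_day = ENTRIES_PER_DAY * entry_bytes / 1e9
    replicated_gb_per_day = raw_gb_per_day * N_VAL
    return {
        "entries_per_sec": ENTRIES_PER_DAY / SECONDS_PER_DAY,
        "raw_gb_per_day": raw_gb_per_day,
        "replicated_gb_per_day": replicated_gb_per_day,
        "replicated_gb_in_window": replicated_gb_per_day * window_days,
    }


if __name__ == "__main__":
    # window_days=7 is a hypothetical value for N, chosen only for illustration
    for name, value in estimate(window_days=7).items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions that is about 80 GB of raw payload per day, or about 240 GB per day across the cluster once replicated three ways, before any overhead.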

Is this something we should be able to handle on a smallish riak
cluster using a LevelDB backend?  I'm trying to puzzle out just how
much this scheme will end up costing us.  Also, what would be a good
approach for pruning items as they get outside the sliding N-day
window?  TTL? Delete query by date?  Will this be expensive?  I've
also seen some threads recently about LevelDB never actually shrinking
when data is deleted.  Is that a problem I'll run into quickly?

Thanks in advance for any guidance you can give.  Even if the advice
is "give it up, just use <x>, which was designed for exactly this",
I'm interested in that type of response, too.  Maybe I'm overlooking
something much easier.  Something like Splunk is worth consideration,
although I'm a pretty big believer in reducing dependencies on outside
services.  I'm also happy to provide more details on our use case if
what I've provided isn't enough.



