First stab at sizing a cluster

Mark Phillips mark at basho.com
Mon Oct 21 02:02:39 EDT 2013


Hi Nathan,

One alternative to the pure 2i-based solution for this would be time
boxing. Sean referenced it a few months back on the list [1] and it's
worth investigating. There are a few other resources I'm failing to
remember at the moment but I'll send them along tomorrow if I do.
That said, 2i will most likely work for your queries, too. I would
prototype both and let performance testing be your guide.
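
To make the time-boxing idea concrete, here is a minimal sketch in
Python. It assumes a hypothetical key scheme of "<user>:<YYYYMMDDHH>"
with one-hour boxes; neither the scheme nor the function names come
from the thread or from any Riak client, so treat them as placeholders.
The point is that the keys covering a date window can be computed up
front and fetched directly, rather than found through an index query.

```python
from datetime import datetime, timedelta

# Assumed box width: one hour. box_start() below hard-codes the same
# granularity, so change both together if you pick a different width.
BOX = timedelta(hours=1)


def box_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its one-hour box."""
    return ts.replace(minute=0, second=0, microsecond=0)


def box_key(user_id: str, ts: datetime) -> str:
    """Hypothetical key for the box that holds this user's entries at ts."""
    return f"{user_id}:{box_start(ts):%Y%m%d%H}"


def keys_for_window(user_id: str, start: datetime, end: datetime) -> list:
    """Return the key of every box that overlaps the window [start, end)."""
    keys = []
    cur = box_start(start)
    while cur < end:
        keys.append(box_key(user_id, cur))
        cur += BOX
    return keys
```

A query for one user over a date window then becomes a list of direct
key lookups, one per box, and expiring old data means dropping whole
boxes that fall outside the window. Whether that beats 2i depends on
the box width and on how many entries land in each box.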

On the topic of cluster sizing, it's tough to pin it down precisely
before you're up and running. That said, I would start with five of the
SoftLayer Smalls at the very least.

Hope that helps.

Mark
twitter.com/pharkmillups

PS - You might also want to experiment with lower N, R, and W values
as log data tends to be immutable and you can pick up some performance
gains by cutting down on how many replicas you're storing and
querying.
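
As a rough back-of-the-envelope sketch of what a lower N buys you, the
Python below uses only the figures from your message (about 140 bytes
of message plus 60 bytes of index metadata per entry, 400 million
entries per day). It deliberately ignores Riak's per-object overhead,
2i index storage, and LevelDB compaction overhead, so read the results
as a lower bound and not as a sizing recommendation.

```python
ENTRY_BYTES = 140 + 60           # message + index metadata, per the original post
ENTRIES_PER_DAY = 400_000_000    # per the original post
SECONDS_PER_DAY = 24 * 60 * 60


def raw_bytes_per_day() -> int:
    """Payload bytes written per day before replication."""
    return ENTRY_BYTES * ENTRIES_PER_DAY


def replicated_bytes_per_day(n_val: int) -> int:
    """Payload bytes stored per day across the cluster with n_val replicas."""
    return raw_bytes_per_day() * n_val


def writes_per_second() -> float:
    """Average client write rate implied by the daily volume."""
    return ENTRIES_PER_DAY / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"average writes/sec: {writes_per_second():.0f}")
    for n_val in (3, 2):
        gb = replicated_bytes_per_day(n_val) / 1e9
        print(f"n_val={n_val}: {gb:.0f} GB/day of payload across the cluster")
```

Under these assumptions, the payload alone comes to 80 GB per day
before replication, so going from three replicas to two would take the
cluster-wide total from 240 GB to 160 GB per retained day.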

[1] (Dietrich's talk Sean links to is a great resource)
http://riak.markmail.org/search/?q=timebox#query:timebox+page:1+mid:e3a7ivrn5eyw3vtz+state:results

On Sat, Oct 19, 2013 at 11:47 AM, N. Tucker
<ntucker-ml-riak-users at august20th.com> wrote:
> Hi all, I've been experimenting with using riak to index a large
> amount of log data collected from a bunch of different app instances
> across different machines.  I have our app code instrumented such that
> it attaches secondary indexes to log entries based on some interesting
> metadata (for example, the date, the thread id, the hostname, the
> identity of the user on whose behalf we were doing something, if
> appropriate) and then submits them to riak.  So far I have this
> working against a 1-node riak cluster on a very small slice of
> production log data, which obviously doesn't really add much benefit.
> Time to see about scaling it up.
>
> Ultimately, I'd like my database to reflect an N-most-recent-days
> window of our logs, to make querying them easier than grepping
> gigabytes and gigabytes of logs across dozens of machines.  The
> secondary indexes are especially appealing, because the most common
> task is "give me all the logs associated with this user across all
> machines for a given date window".  This seems like a problem riak is
> well suited for, given the appropriate secondary indexes.
>
> Having no riak sizing experience to speak of and no outside guidance,
> my approach was basically going to be to start out with a 3 or 5 node
> cluster of SoftLayer's "small" riak nodes (see
> http://www.softlayer.com/solutions/big-data/riak-hosting ) or
> comparable hardware, then start shoveling data into it and see how
> large a window I can retain (and query against with reasonable
> performance) while still writing at full blast (assuming I can
> actually write full blast to it -- that remains to be seen).
>
> But then I realized there are probably a few people on this list that
> might be able to give me at least a rough recommendation if I can give
> some details on the data load.  The average log entry is around 140
> bytes of message and maybe another 60 bytes of metadata for secondary
> indexes.  We churn out about 400 million of these log entries per day,
> so in the neighborhood of 4500 per second.
>
> Is this something we should be able to handle on a smallish riak
> cluster using a LevelDB backend?  I'm trying to puzzle out just how
> much this scheme will end up costing us.  Also, what would be a good
> approach for pruning items as they get outside the sliding N-day
> window?  TTL? Delete query by date?  Will this be expensive?  I've
> also seen some threads recently about LevelDB never actually shrinking
> when data is deleted.  Is that a problem I'll run into quickly?
>
> Thanks in advance for any guidance you can give.  Even if the advice
> is "give it up, just use <x>, which was designed for exactly this",
> I'm interested in that type of response, too.  Maybe I'm overlooking
> something much easier.  Stuff like splunk is worth consideration,
> although I'm a pretty big believer in reducing dependencies on outside
> services.  I'm also happy to provide more details on our use case if
> what I've provided isn't enough.
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



