First stab at sizing a cluster
mark at basho.com
Mon Oct 21 02:02:39 EDT 2013
One alternative to a pure 2i-based solution for this would be
time-boxing. Sean referenced it a few months back on the list, and
it's worth investigating. There are a few other resources I'm failing
to remember at the moment, but I'll send them along tomorrow if they
come back to me.
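
The gist of time-boxing: write each day's (or hour's) entries into
their own bucket, so a date-window query only touches a handful of
boxes and expiry means retiring old boxes rather than issuing per-key
deletes across the whole keyspace. A rough sketch with the Python
client -- the 'logs-YYYYMMDD' naming is my invention, and this is
untested:

    import time

    import riak

    client = riak.RiakClient()

    def timebox_bucket(ts):
        # One box per UTC day; pick a granularity that matches
        # your retention window and typical query span.
        day = time.strftime('%Y%m%d', time.gmtime(ts))
        return client.bucket('logs-' + day)

    def write_entry(ts, key, entry):
        timebox_bucket(ts).new(key, data=entry).store()

One caveat: Riak has no single "drop bucket" operation, so retiring a
box still means deleting its keys -- but at least they're already
grouped by day, so you never have to scan live data to find them.
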
That said, 2i will most likely work for your queries, too. I would
prototype both and let performance testing be your guide.
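
One wrinkle to test for: a 2i query runs against a single index at a
time (no server-side intersection), so the "this user in this date
window" lookup is usually handled with a composite index. A sketch,
again Python and untested, with the 'user_date_bin' name made up:

    import riak

    client = riak.RiakClient()
    logs = client.bucket('logs')

    def store_log(key, message, user, day):
        obj = logs.new(key, data={'msg': message})
        # Composite index: user and zero-padded YYYYMMDD day
        # concatenated, so one range query covers the common
        # user-plus-date-window lookup.
        obj.add_index('user_date_bin', '%s|%s' % (user, day))
        obj.store()

    # Everything alice logged from Oct 15 through Oct 19:
    keys = logs.get_index('user_date_bin',
                          'alice|20131015', 'alice|20131019')
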
On the topic of cluster sizing, it's tough to pin down precisely
before you're up and running. That said, I would start with five of
the SoftLayer Smalls at the very least.
Hope that helps.
PS - You might also want to experiment with lower N, R, and W values:
log data tends to be immutable, and you can pick up some performance
gains by cutting down on how many replicas you store and how many of
them have to respond on each read and write. (Dietrich's talk Sean
links to is a great resource.)
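
If you experiment with that, N/R/W are ordinary bucket properties.
The values below are illustrative only -- and whatever you settle on,
set n_val before loading data, since changing it on a bucket that
already holds data is discouraged:

    import riak

    client = riak.RiakClient()
    logs = client.bucket('logs')

    # Two replicas instead of three; reads and writes return as
    # soon as one vnode responds. Tune against your own tests.
    logs.set_properties({'n_val': 2, 'r': 1, 'w': 1})
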
On Sat, Oct 19, 2013 at 11:47 AM, N. Tucker
<ntucker-ml-riak-users at august20th.com> wrote:
> Hi all, I've been experimenting with using riak to index a large
> amount of log data collected from a bunch of different app instances
> across different machines. I have our app code instrumented such that
> it attaches secondary indexes to log entries based on some interesting
> metadata (for example, the date, the thread id, the hostname, the
> identity of the user on whose behalf we were doing something, if
> appropriate) and then submits them to riak. So far I have this
> working against a 1-node riak cluster on a very small slice of
> production log data, which obviously doesn't really add much benefit.
> Time to see about scaling it up.
> Ultimately, I'd like my database to reflect an N-most-recent-days
> window of our logs, to make querying them easier than grepping
> gigabytes and gigabytes of logs across dozens of machines. The
> secondary indexes are especially appealing, because the most common
> task is "give me all the logs associated with this user across all
> machines for a given date window". This seems like a problem riak is
> well suited for, given the appropriate secondary indexes.
> Having no riak sizing experience to speak of and no outside guidance,
> my approach was basically going to be to start out with a 3 or 5 node
> cluster of SoftLayer's "small" riak nodes (see
> http://www.softlayer.com/solutions/big-data/riak-hosting ) or
> comparable hardware, then start shoveling data into it and see how
> large a window I can retain (and query against with reasonable
> performance) while still writing at full blast (assuming I can
> actually write full blast to it -- that remains to be seen).
> But then I realized there are probably a few people on this list that
> might be able to give me at least a rough recommendation if I can give
> some details on the data load. The average log entry is around 140
> bytes of message and maybe another 60 bytes of metadata for secondary
> indexes. We churn out about 400 million of these log entries per day,
> so in the neighborhood of 4500 per second.
> Is this something we should be able to handle on a smallish riak
> cluster using a LevelDB backend? I'm trying to puzzle out just how
> much this scheme will end up costing us. Also, what would be a good
> approach for pruning items as they get outside the sliding N-day
> window? TTL? Delete query by date? Will this be expensive? I've
> also seen some threads recently about LevelDB never actually shrinking
> when data is deleted. Is that a problem I'll run into quickly?
> Thanks in advance for any guidance you can give. Even if the advice
> is "give it up, just use <x>, which was designed for exactly this",
> I'm interested in that type of response, too. Maybe I'm overlooking
> something much easier. Stuff like Splunk is worth consideration,
> although I'm a pretty big believer in reducing dependencies on outside
> services. I'm also happy to provide more details on our use case if
> what I've provided isn't enough.