High volume data series storage and queries
jeremiah.peschka at gmail.com
Mon Aug 8 15:32:43 EDT 2011
It sounds like a potentially interesting use case.
The questions that immediately enter my head are:
* How much data do you currently have?
* How much data do you plan to have?
* Do you have a data retention policy? If so, what is it? How do you plan to implement it?
* What's the anticipated rate of growth per day? Week? Year?
* What type of queries will you have? Is it a fixed set of queries? Is it a decision support system?
* What does your read to write ratio look like?
Your plan to support Riak with a hybrid system isn't that out of whack; it's very doable.
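To make the hybrid idea concrete, here is a minimal sketch of the plan from the quoted message (batches in Riak, a start-time/end-time index in a relational database). Everything in it is an assumption for illustration: a Python dict stands in for a Riak bucket, an in-memory SQLite database stands in for MySQL or PostgreSQL, and the names `store_batch`, `first_n` and `batch_index` are hypothetical.

```python
import sqlite3

# Stand-ins (assumptions): a dict plays the role of a Riak bucket and an
# in-memory SQLite database plays the role of the MySQL/PostgreSQL index.
riak_batches = {}
index = sqlite3.connect(":memory:")
index.execute(
    "CREATE TABLE batch_index ("
    " source TEXT, start_time INTEGER, end_time INTEGER,"
    " size INTEGER, batch_id TEXT)"
)
index.execute("CREATE INDEX by_source_time ON batch_index (source, start_time)")


def store_batch(source, points):
    """Store one batch of (timestamp, value) points and index its time span."""
    points = sorted(points)
    start, end = points[0][0], points[-1][0]
    batch_id = f"{source}:{start}"  # hypothetical batch ID scheme
    riak_batches[batch_id] = points
    index.execute(
        "INSERT INTO batch_index VALUES (?, ?, ?, ?, ?)",
        (source, start, end, len(points), batch_id),
    )


def first_n(source, t1, t2, n):
    """Return the first n points for source with t1 <= timestamp <= t2."""
    # The relational index answers the range question; the batch store is
    # only ever read by exact batch ID.
    rows = index.execute(
        "SELECT batch_id FROM batch_index"
        " WHERE source = ? AND end_time >= ? AND start_time <= ?"
        " ORDER BY start_time",
        (source, t1, t2),
    )
    result = []
    for (batch_id,) in rows:
        for ts, value in riak_batches[batch_id]:
            if t1 <= ts <= t2:
                result.append((ts, value))
                if len(result) == n:
                    return result
    return result
```

The point of the split is that the range predicate runs against an ordered index in the relational database, while the batch store only sees lookups by exact ID, which is the access pattern a key/value store handles well.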
You can certainly do the type of querying you've described through careful choice of key names, sorting in memory, and only using the first N data points in a given MapReduce query result. The main reason not to perform range queries in Riak is that they result in full key space scans across the Riak cluster. If you're using Bitcask as your backend, that's an in-memory scan; otherwise you're doing a much more costly scan from disk. And since keys are hashed when they are partitioned across the cluster, you won't get the benefit of sequential disk scan performance that you might get with a traditional database.
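One way to picture the "careful choice of key names" approach is sketched below. It is only an illustration under stated assumptions: a dict stands in for a Riak bucket, timestamps are integer seconds, each stored object holds one hour of points for one source, and the `source/bucket_start` key format and all function names are hypothetical. Because every key in a time range can be computed up front, the query fetches objects by exact key instead of scanning the key space, then sorts in memory and keeps the first N points.

```python
BUCKET_SECONDS = 3600  # assumed width of one stored object: one hour

store = {}  # stand-in for a Riak bucket: key -> list of (timestamp, value)


def key_for(source, ts):
    """Build a predictable key from the source and the bucket start time."""
    bucket_start = ts - ts % BUCKET_SECONDS
    return f"{source}/{bucket_start}"


def put_point(source, ts, value):
    """Append a point to the object that covers its hour."""
    store.setdefault(key_for(source, ts), []).append((ts, value))


def keys_between(source, t1, t2):
    """Enumerate every key that can hold points in [t1, t2], in time order."""
    first = t1 - t1 % BUCKET_SECONDS
    return [f"{source}/{b}" for b in range(first, t2 + 1, BUCKET_SECONDS)]


def query_first_n(source, t1, t2, n):
    """Fetch by computed keys, filter to [t1, t2], sort in memory, keep n."""
    points = []
    for key in keys_between(source, t1, t2):
        points.extend(p for p in store.get(key, []) if t1 <= p[0] <= t2)
    points.sort()
    return points[:n]
```

The bucket width is the main tuning knob in a scheme like this: wider buckets mean fewer fetches per query but larger objects to read and rewrite.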
The only thing that worries me is the phrase "should grow more than what a 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB? I'm trying to get a handle on what size and performance characteristics you're looking for before diving into how to look at your system vs. saying "Hell if I know, does someone else on the list have a good idea?"
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP
On Aug 8, 2011, at 11:21 AM, Paul O wrote:
> Hello Riak enthusiasts,
> I am trying to design a solution for storing time series data coming from a very large number of potential high-frequency sources.
> I thought Riak could be of help, though based on what I read about it I can't use it without some other layer on top of it.
> The problem is I need to be able to do range queries over this data, by the source. Hence, I want to be able to say "give me the N first data points for source S between time T1 and time T2."
> I need to store this data for a rather long time, and the expected volume should grow more than what a "vanilla" RDBMS would support.
> Another thing to note is that I can restrict the number of data points to be returned by a query, so no query would return more than MaxN data points.
> I thought about doing this the following way:
> 1. Bundle date time series in batches of MaxN, to ensure that any query would require reading at most two batches. The batches would be stored inside Riak.
> 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or PostgreSQL) DB.
> My thinking is that such a strategy would allow me to persist data in Riak and grow linearly with the data, while the index would be kept in an RDBMS for fast range queries.
> Does it sound sensible to use Riak this way? Does this make you laugh/cry/shake your head in disbelief? Am I overlooking something from Riak which would make all this much better?
> Thanks and best regards,