High volume data series storage and queries

Paul O pcotec at gmail.com
Mon Aug 8 14:21:23 EDT 2011

Hello Riak enthusiasts,

I am trying to design a solution for storing time series data coming from a
very large number of potential high-frequency sources.

I thought Riak could be of help, though based on what I read about it I
can't use it without some other layer on top of it.

The problem is that I need to be able to do range queries over this data, per
source. That is, I want to be able to say "give me the first N data points for
source S between time T1 and time T2."

I need to store this data for a rather long time, and I expect the volume to
grow beyond what a "vanilla" RDBMS can support.

Another thing to note is that I can restrict the number of data points to be
returned by a query, so no query would return more than MaxN data points.

I thought about doing this the following way:

1. Bundle the time series data in batches of MaxN, to ensure that any query
would require reading at most two batches. The batches would be stored inside
Riak.
2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or
PostgreSQL) DB.
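
To make the two steps above concrete, here is a rough Python sketch, not a real implementation. Everything in it is an assumption for illustration: a plain dict stands in for the Riak bucket, an in-memory SQLite table stands in for the MySQL/PostgreSQL index, and the names store_batch, query, MAX_N and the "source:start" key scheme are made up. It also assumes every batch of a source except the newest one holds exactly MaxN points, which is what makes "at most two batches" hold.

```python
import json
import sqlite3

MAX_N = 4  # hypothetical batch size and per-query cap (MaxN in the text)

riak = {}  # stand-in for a Riak bucket: batch ID -> serialized batch
index = sqlite3.connect(":memory:")  # stand-in for the MySQL/PostgreSQL index
index.execute(
    "CREATE TABLE batches (source TEXT, start_time INTEGER, end_time INTEGER,"
    " size INTEGER, batch_id TEXT)"
)


def store_batch(source, points):
    """Store one batch of at most MAX_N (time, value) points, sorted by time."""
    assert 0 < len(points) <= MAX_N
    start, end = points[0][0], points[-1][0]
    batch_id = f"{source}:{start}"  # hypothetical key scheme
    riak[batch_id] = json.dumps(points)
    index.execute(
        "INSERT INTO batches VALUES (?, ?, ?, ?, ?)",
        (source, start, end, len(points), batch_id),
    )


def query(source, t1, t2, n):
    """Return the first n points of `source` with t1 <= time <= t2."""
    n = min(n, MAX_N)
    # Range lookup in the index: the two earliest batches overlapping [t1, t2].
    rows = index.execute(
        "SELECT batch_id FROM batches"
        " WHERE source = ? AND end_time >= ? AND start_time <= ?"
        " ORDER BY start_time LIMIT 2",
        (source, t1, t2),
    ).fetchall()
    out = []
    for (batch_id,) in rows:
        # Fetch each batch by key and keep only the points inside the range.
        for t, v in json.loads(riak[batch_id]):
            if t1 <= t <= t2:
                out.append((t, v))
    return out[:n]


# Ten points for one source, stored as batches of 4, 4 and 2.
series = [(t, t * 10) for t in range(10)]
for i in range(0, len(series), MAX_N):
    store_batch("s1", series[i:i + MAX_N])

result = query("s1", 2, 9, 4)
```

In this sketch the range query itself only ever touches the index; Riak is used purely for key lookups of the one or two batches the index returns.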

My thinking is that such a strategy would let me persist the data in Riak and
grow linearly with it, while the index would be kept in an RDBMS for fast
range queries.

Does it sound sensible to use Riak this way? Does this make you
laugh/cry/shake your head in disbelief? Am I overlooking something from Riak
which would make all this much better?

Thanks and best regards,

