High volume data series storage and queries

Ciprian Dorin Craciun ciprian.craciun at gmail.com
Mon Aug 8 15:34:11 EDT 2011


On Mon, Aug 8, 2011 at 21:21, Paul O <pcotec at gmail.com> wrote:
> Hello Riak enthusiasts,
> I am trying to design a solution for storing time-series data coming from a
> very large number of potentially high-frequency sources.
> I thought Riak could be of help, though based on what I read about it I
> can't use it without some other layer on top of it.
> The problem is I need to be able to do range queries over this data, by the
> source. Hence, I want to be able to say "give me the first N data points for
> source S between time T1 and time T2."
> I need to store this data for a rather long time, and the volume is expected
> to grow beyond what a "vanilla" RDBMS would support.
> Another thing to note is that I can restrict the number of data points to be
> returned by a query, so no query would return more than MaxN data points.
> I thought about doing this the following way:
> 1. bundle the time-series data in batches of MaxN, to ensure that any query
> would require reading at most two batches. The batches would be stored inside
> Riak.
> 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or
> PostgreSQL) DB.
> My thinking is that such a strategy would allow me to persist the data in
> Riak, scaling linearly with it, while the index would be kept in an RDBMS for
> fast range queries.
> Does it sound sensible to use Riak this way? Does this make you
> laugh/cry/shake your head in disbelief? Am I overlooking something in Riak
> which would make all this much better?
> Thanks and best regards,
> Paul

    Hello all!

    (Disclaimer: I'm not a Riak-KV user per se, but I've built
something on top of Riak-Core. Also, the solution I'm proposing is not
just something "on top" of Riak-KV, but on top of Riak-Core.)
    (Context: Some time ago we had a similar problem in a project of
ours: storing time series of power consumption from a large number of
devices. The current solution uses Informix, but that is out of scope
here.)

    First, I would say that storing data in batches will carry a high
penalty on writes -- and such an application, I would guess, is write
intensive. (As far as I know there is no support for appending to
values, so every "append" becomes a full read-modify-write; see the
sketch below. Or is there?)
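
    To make the write cost concrete, here is a minimal sketch using
the riak-erlang-client (the append/4 function and the term-encoded
batch layout are just illustrative assumptions of mine): every
"append" fetches the whole batch, rebuilds it in memory, and writes
it back.

        %% Hypothetical batch append; there is no server-side append,
        %% so each call costs a full get + put over the whole batch.
        -module(ts_batch).
        -export([append/4]).

        append(Pid, Bucket, Key, Point) ->
            Obj = case riakc_pb_socket:get(Pid, Bucket, Key) of
                      {ok, Obj0} ->
                          %% Decode the existing batch (assuming no
                          %% siblings) and prepend the new point.
                          Batch = binary_to_term(riakc_obj:get_value(Obj0)),
                          riakc_obj:update_value(
                              Obj0, term_to_binary([Point | Batch]));
                      {error, notfound} ->
                          %% First point under this batch key.
                          riakc_obj:new(Bucket, Key,
                                        term_to_binary([Point]))
                  end,
            riakc_pb_socket:put(Pid, Obj).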

    Now, because you have a "very large number of sources", you could
do the following on top of Riak-Core:
    * take the identifier of your data source and hash it -- just like
Riak-KV does with a key -- and use this as the key for Riak-Core;
    * build a Vnode module that handles writes and queries, and which
stores the data in either:
        * RRDtool -- one data-file per data-source;
        * or an embedded database allowing sorted data-sets (we've
had some very nice experiments with BerkeleyDB); just use as key the
concatenation of the source key and the timestamp, taking care that
the resulting key sorts correctly lexicographically, i.e. keep the
source key and timestamp at fixed lengths (maybe padded) and
big-endian encoded (see the sketch right after this list);
    * all that remains to be solved is data replication -- similar to
how Riak handles it.
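
    As a minimal sketch of that key encoding (the module name and the
96/64-bit field widths are illustrative assumptions):

        %% Fixed-width, big-endian fields, so the byte-wise
        %% lexicographic order of the whole key equals its numeric
        %% (source, timestamp) order.
        -module(ts_key).
        -export([encode/2, decode/1]).

        encode(SourceKey, Timestamp)
                when is_integer(SourceKey), is_integer(Timestamp) ->
            <<SourceKey:96/big-unsigned, Timestamp:64/big-unsigned>>.

        decode(<<SourceKey:96/big-unsigned, Timestamp:64/big-unsigned>>) ->
            {SourceKey, Timestamp}.

    With such keys, "the points for source S between T1 and T2"
becomes a plain cursor scan from encode(S, T1) up to encode(S, T2).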

    Thinking about it a little more, you could actually also do this:
    * Riak-KV uses 160 bits for keys, thus you could "partition" this
key into two parts, "sensor-key" | "timestamp"; for example 96 bits
for the sensor and 64 bits for the timestamp (see the sketch after
this list);
    * implement the riak_kv_backend module on top of such an embedded
database as described above (e.g. BerkeleyDB or LevelDB);
    * when storing and querying data you'll still need to bypass
Riak-KV's normal key hashing;
    * you'll have replication and balancing for free.
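
    And a minimal sketch of such a partitioned 160-bit key (the 96/64
split, and deriving the sensor part by truncating a SHA-1 hash, are
illustrative assumptions):

        %% The top 96 bits identify the sensor, the low 64 bits carry
        %% the big-endian timestamp, so all keys of one sensor are
        %% contiguous and time-ordered in the backend.
        -module(ts_riak_key).
        -export([partitioned_key/2]).

        partitioned_key(SensorId, Timestamp)
                when is_binary(SensorId), is_integer(Timestamp) ->
            <<SensorBits:96, _:64>> = crypto:hash(sha, SensorId),
            <<SensorBits:96, Timestamp:64/big-unsigned>>.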

    Ciprian.



