High volume data series storage and queries

Paul O pcotec at gmail.com
Mon Aug 8 21:17:10 EDT 2011

I sent this earlier to Ciprian without cc-ing the list.
Hi Ciprian,

Regarding the write penalty for batches, I plan to have a pre-write cache.
The query step will then also have to check that cache, but I'm expecting
the pre-cache volume to be small, so I'd consider an RDBMS for it.
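
A minimal sketch of how the read path might combine the batch index with the
pre-write cache. Everything in it is an assumption for illustration: the
BatchStore class and its method names are hypothetical, a dict stands in for
Riak, a sorted list per source stands in for the RDBMS index, and points are
assumed to arrive in increasing time order per source.

```python
import bisect
from dataclasses import dataclass, field

MAX_N = 4  # hypothetical MaxN; kept tiny so the example is easy to follow


@dataclass
class BatchStore:
    """In-memory stand-in: `batches` plays Riak, `index` plays the RDBMS."""
    batches: dict = field(default_factory=dict)  # (source, batch_id) -> [(ts, value)]
    index: dict = field(default_factory=dict)    # source -> [(start_ts, end_ts, batch_id)]
    cache: dict = field(default_factory=dict)    # source -> pending [(ts, value)]

    def write(self, source, ts, value):
        """Buffer a point; flush one opaque batch once MAX_N points are pending."""
        pending = self.cache.setdefault(source, [])
        pending.append((ts, value))
        if len(pending) >= MAX_N:
            rows = self.index.setdefault(source, [])
            batch_id = len(rows)
            self.batches[(source, batch_id)] = list(pending)
            rows.append((pending[0][0], pending[-1][0], batch_id))
            pending.clear()

    def query(self, source, t1, t2, n):
        """Return the first min(n, MAX_N) points with t1 <= ts <= t2."""
        n = min(n, MAX_N)
        out = []
        rows = self.index.get(source, [])
        # Index lookup: first batch whose end time is >= t1.
        ends = [end for _, end, _ in rows]
        i = bisect.bisect_left(ends, t1)
        while i < len(rows) and rows[i][0] <= t2 and len(out) < n:
            for ts, value in self.batches[(source, rows[i][2])]:
                if t1 <= ts <= t2 and len(out) < n:
                    out.append((ts, value))
            i += 1
        # The pre-write cache holds the newest points, so check it last.
        for ts, value in self.cache.get(source, []):
            if t1 <= ts <= t2 and len(out) < n:
                out.append((ts, value))
        return out
```

With full batches of MAX_N points and at most MAX_N results per query, the
loop above touches at most two stored batches before the cache is consulted.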

As I wrote in my clarifications for Jeremiah, I'm expecting lots of data
even per source. That's why I'd like to take advantage of the kind of data
and limit the queries, to avoid having to go through all the values, even
those of a single source.

I'm also thinking about multiple readers and writers at the same time. I
thought (but am not sure) that storing opaque batches would be the most
predictable approach and would allow the data store to grow relatively
linearly (this is where Riak should help :-)).

The strategy you're suggesting would eliminate the need for the intermediary
index, am I understanding this correctly?

Anyway, I still have to think about your suggestions. Using Riak-core and
then storing the individual data files in an embedded DB is a neat idea and
might allow more flexibility later (operations across all batches, etc.).



On Mon, Aug 8, 2011 at 3:34 PM, Ciprian Dorin Craciun <
ciprian.craciun at gmail.com> wrote:

> On Mon, Aug 8, 2011 at 21:21, Paul O <pcotec at gmail.com> wrote:
> > Hello Riak enthusiasts,
> > I am trying to design a solution for storing time series data coming from a
> > very large number of potential high-frequency sources.
> > I thought Riak could be of help, though based on what I read about it I
> > can't use it without some other layer on top of it.
> > The problem is I need to be able to do range queries over this data, by the
> > source. Hence, I want to be able to say "give me the N first data points for
> > source S between time T1 and time T2."
> > I need to store this data for a rather long time, and the expected volume
> > should grow more than what a "vanilla" RDBMS would support.
> > Another thing to note is that I can restrict the number of data points to be
> > returned by a query, so no query would return more than MaxN data points.
> > I thought about doing this the following way:
> > 1. Bundle the time series data in batches of MaxN, to ensure that any query
> > would require reading at most two batches. The batches would be stored
> > inside Riak.
> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or
> > PostgreSQL) DB.
> > My thinking is that such a strategy would allow me to persist data in Riak
> > and grow linearly with the data, and the index would be kept in an RDBMS for
> > fast range queries.
> > Does it sound sensible to use Riak this way? Does this make you
> > laugh/cry/shake your head in disbelief? Am I overlooking something from Riak
> > which would make all this much better?
> > Thanks and best regards,
> > Paul
>     Hello all!
>    (Disclaimer: I'm not a Riak-KV user per se, but I've built
> something on top of Riak-Core. Also, the solution I'm proposing is not
> just something "on top" of Riak-KV, but on top of Riak-Core.)
>    (Context: Some time ago we had a similar problem in a project of
> ours: store time-series of power consumption from a lot of devices.
> The current solution is using Informix, but this is out of scope.)
>    First I would say that storing data in batches will have a high
> penalty on writes -- and such an application, I guess, is write
> intensive. (As far as I know there is no support for appending to
> values. Or is there?)
>    Now, because you have a "very large number of sources", you could
> do the following on top of Riak-Core:
>    * take the identifier of your data source and hash it -- just like
> Riak-KV does with a key -- and use this as the key for Riak-Core;
>    * build a Vnode module that handles writes and queries, and which
> stores the data in either:
>        * RRDtool -- one data-file per data-source;
>        * or an embedded database allowing sorted data-sets (we've
> had some very nice experiments with BerkeleyDB); just use as the key
> the concatenation of the source key and the timestamp, being careful
> to have the resulting key sort correctly lexicographically (i.e.
> keep the key and timestamp at fixed length (maybe padded) and
> big-endian encoded);
>    * all that remains to be solved is data-replication -- similar to
> how Riak handles it;
>    Thinking a little bit about it, you could actually also do this:
>    * Riak-KV uses 160 bits for keys, thus you could "partition" this
> key in two parts: "sensor-key" | "timestamp"; for example 96 bits for
> sensor, and 64 bits for timestamp;
>    * implement the riak_kv_backend module on top of such an embedded
> database as described above (i.e. BerkeleyDB, or LevelDB);
>    * when storing and querying data you'll still need to bypass the
> Riak-KV normal key hashing;
>    * you'll have replication and balancing for free;
>    Ciprian.
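
To make the composite key from the quoted suggestion concrete, here is a
small sketch of a fixed-length "sensor-key" | "timestamp" key: 96 bits for
the sensor and 64 bits for the timestamp, big-endian, so that byte-wise
ordering matches time ordering within one source. The function names are
hypothetical, truncating a SHA-1 digest to 96 bits is only one assumed way
to get the sensor part, and a sorted Python list stands in for a sorted
embedded store such as BerkeleyDB or LevelDB.

```python
import bisect
import hashlib
import struct

SENSOR_BYTES = 12  # 96 bits for the sensor part


def make_key(source_id: str, timestamp: int) -> bytes:
    """Build a 20-byte (160-bit) key: sensor part | big-endian timestamp.

    Assumption: the sensor part is a SHA-1 digest truncated to 96 bits.
    Assumption: timestamps are non-negative integers below 2**64.
    """
    sensor = hashlib.sha1(source_id.encode("utf-8")).digest()[:SENSOR_BYTES]
    return sensor + struct.pack(">Q", timestamp)


def timestamp_of(key: bytes) -> int:
    """Decode the 64-bit timestamp from the tail of a key."""
    return struct.unpack(">Q", key[SENSOR_BYTES:])[0]


def range_scan(sorted_keys, source_id, t1, t2, limit):
    """Return up to `limit` keys for one source with t1 <= timestamp <= t2.

    `sorted_keys` must be sorted byte-wise, as a sorted store would keep them.
    """
    lo = bisect.bisect_left(sorted_keys, make_key(source_id, t1))
    hi = bisect.bisect_right(sorted_keys, make_key(source_id, t2))
    return sorted_keys[lo:min(hi, lo + limit)]
```

Because every key has the same length and the timestamp is big-endian, a
query for "the N first data points for source S between T1 and T2" becomes
one contiguous slice of the sorted key space.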