High volume data series storage and queries

Ciprian Dorin Craciun ciprian.craciun at gmail.com
Tue Aug 9 14:02:20 EDT 2011

On Tue, Aug 9, 2011 at 04:17, Paul O <pcotec at gmail.com> wrote:
> I sent this earlier to Ciprian without cc-ing the list.
> ---
> Hi Ciprian,
> Regarding the write penalty for batches, I plan to have a pre-write cache
> (hence the query step will also have to include a step for this
> verification, but I'm expecting the pre-cache volume to be small so I'd
> consider an RDBMS for it.)

    Got it.

> As I wrote in my clarifications for Jeremiah, I'm expecting lots of data
> even per source, that's why I'd like to take advantage of the data kind and
> limit the queries to avoid having to go through all values, even those
> of a single source.

    From what I've seen of your estimates, the amount of data you're
going to store is huge, and so is the required bandwidth. (Assuming you
have a 200 MBit connection and you send each reading over UDP at 128
bytes in total (headers + payload), a simple calculation shows you'll
only be able to handle about 16384 sensors. Thus maybe you should
reduce the reading rate.)
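    Spelled out, the back-of-the-envelope calculation looks like this
(the 12.5 readings per second per sensor is my own assumption, chosen
so the numbers come out to 16384):

```python
# Back-of-the-envelope bandwidth estimate for UDP sensor readings.
# Assumed figures: a 200 MBit/s link, 128 bytes per datagram (headers +
# payload), and -- my assumption -- 12.5 readings/second per sensor.

link_bits_per_s = 200 * 1024 * 1024       # 200 MBit/s
packet_bits = 128 * 8                     # 128 bytes per reading
readings_per_sensor_per_s = 12.5          # assumed sampling rate

packets_per_s = link_bits_per_s / packet_bits            # 204800
max_sensors = packets_per_s / readings_per_sensor_per_s

print(int(max_sensors))  # 16384
```

At a higher sampling rate the sensor count shrinks proportionally,
which is why reducing the reading rate is the first knob to turn.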

> Think about multiple readers and writers at the same time; I thought (but
> am not sure) that storing opaque batches would be the most predictable and
> would allow the data store to grow relatively linearly (this is where Riak
> should help :-)
> The strategy you're suggesting would eliminate the need for the intermediary
> index, am I understanding this correctly?

    Yes, what I was describing would use Riak-core only for load
distribution and cluster management, and would eliminate the need for
any index -- the main data-store would be the index itself.

> Anyway, I still have to think about your suggestions, using Riak-core and
> then storing the individual data files in an embedded DB is a neat idea and
> might allow more flexibility later (operations across all batches, etc.)
> Regards,
> Paul

    I wouldn't store the "data files" inside the embedded DB, but the
actual raw readings.


> On Mon, Aug 8, 2011 at 3:34 PM, Ciprian Dorin Craciun
> <ciprian.craciun at gmail.com> wrote:
>> On Mon, Aug 8, 2011 at 21:21, Paul O <pcotec at gmail.com> wrote:
>> > Hello Riak enthusiasts,
>> > I am trying to design a solution for storing time series data coming
>> > from a
>> > very large number of potential high-frequency sources.
>> > I thought Riak could be of help, though based on what I read about it I
>> > can't use it without some other layer on top of it.
>> > The problem is I need to be able to do range queries over this data, by
>> > the
>> > source. Hence, I want to be able to say "give me the N first data points
>> > for
>> > source S between time T1 and time T2."
>> > I need to store this data for a rather long time, and the expected
>> > volume
>> > should grow more than what a "vanilla" RDBMS would support.
>> > Another thing to note is that I can restrict the number of data points
>> > to be
>> > returned by a query, so no query would return more than MaxN data
>> > points.
>> > I thought about doing this the following way:
>> > 1. bundle the time-series data in batches of MaxN, to ensure that
>> > any query would require reading at most two batches. The batches
>> > would be stored inside Riak.
>> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or
>> > PostgreSQL) DB.
>> > My thinking is such a strategy would allow me to persist data in Riak
>> > and grow linearly with the data, and the index would be kept in an
>> > RDBMS for fast range queries.
>> > Does it sound sensible to use Riak this way? Does this make you
>> > laugh/cry/shake your head in disbelief? Am I overlooking something from
>> > Riak
>> > which would make all this much better?
>> > Thanks and best regards,
>> > Paul
>>    Hello all!
>>    (Disclaimer: I'm not a Riak-KV user per se, but I've built
>> something on top of Riak-Core. Also, the solution I'm proposing is not
>> just something "on top" of Riak-KV, but on top of Riak-Core.)
>>    (Context: Some time ago we had a similar problem in a project of
>> ours: store time-series of power consumption from a lot of devices.
>> The current solution is using Informix, but this is out of scope.)
>>    First I would say that storing data in batches will incur a high
>> write penalty -- and I guess such an application is write-intensive.
>> (As far as I know there is no support for appending to values -- or
>> is there?)
>>    Now, because you have a "very large number of sources", you could
>> do the following on top of Riak-Core:
>>    * take the identifier of your data source and hash it -- just like
>> Riak-KV does with a key -- and use this as the key for Riak-Core;
>>    * build a Vnode module that handles writes and queries, and which
>> stores the data in either:
>>        * RRDtool -- one data-file per data-source;
>>        * or an embedded database allowing sorted data-sets (we've
>> had some very nice experiments with BerkeleyDB); use as the key the
>> concatenation of the source key and the timestamp, taking care that
>> the resulting keys sort correctly lexicographically (i.e. keep the
>> source key and timestamp at fixed length (maybe padded) and
>> big-endian encoded);
>>    * all that remains to be solved is data-replication -- similar to
>> how Riak handles it;
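    To make that key encoding concrete, here is a rough Python sketch
(the 16-byte source width and 64-bit timestamp are illustrative; in
practice this would live in the Vnode module, in Erlang):

```python
import struct

# Sketch: build a fixed-width, big-endian key so that plain byte-wise
# (lexicographic) ordering equals (source, timestamp) ordering.
# The 16-byte source id and 8-byte timestamp widths are illustrative.

def make_key(source_id: bytes, timestamp: int) -> bytes:
    # ">16sQ" pads the source id with zero bytes to a fixed 16 bytes,
    # then appends the timestamp as an unsigned 64-bit big-endian int.
    return struct.pack(">16sQ", source_id, timestamp)

k1 = make_key(b"sensor-42", 1000)
k2 = make_key(b"sensor-42", 1001)
k3 = make_key(b"sensor-43", 0)

# Byte-wise sorting now matches the intended (source, time) order,
# which is exactly what a sorted store like BerkeleyDB ranges over:
assert k1 < k2 < k3
```

Without the fixed width and big-endian encoding, e.g. with decimal
string timestamps, "10" would sort before "9" and range scans would
return readings out of order.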
>>    Thinking about it a little more, you could actually also do this:
>>    * Riak-KV uses 160bits for keys, thus you could "partition" this
>> key in two parts: "sensor-key" | "timestamp"; for example 96 bits for
>> sensor, and 64 bits for timestamp;
>>    * implement the riak_kv_backend module on top of such an embedded
>> database as described above (i.e. BerkeleyDB, or LevelDB);
>>    * when storing and querying data you'll still need to bypass the
>> Riak-KV normal key hashing;
>>    * you'll have replication and balancing for free;
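    The 96/64 bit split above could look roughly like this in Python
(the SHA-1 truncation to 96 bits and the helper names are my choice,
not anything Riak-KV provides):

```python
import hashlib
import struct

# Sketch of the 160-bit key partitioning: 96 bits of sensor-id hash
# followed by a 64-bit big-endian timestamp.

def riak_key(sensor_id: bytes, timestamp: int) -> bytes:
    sensor_part = hashlib.sha1(sensor_id).digest()[:12]   # 96 bits
    time_part = struct.pack(">Q", timestamp)              # 64 bits
    return sensor_part + time_part                        # 160 bits

def range_bounds(sensor_id: bytes, t1: int, t2: int):
    # Endpoints for a "readings of S between T1 and T2" range scan
    # against a sorted backend (e.g. BerkeleyDB or LevelDB); this is
    # the part that bypasses Riak-KV's normal key hashing.
    return riak_key(sensor_id, t1), riak_key(sensor_id, t2)

lo, hi = range_bounds(b"sensor-42", 1000, 2000)
assert len(lo) == 20 and lo < hi   # 20 bytes == 160 bits
```

All keys for one sensor share the same 96-bit prefix, so a range scan
between the two endpoints touches only that sensor's readings, in
timestamp order.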
>>    Ciprian.

More information about the riak-users mailing list