High volume data series storage and queries
Ciprian Dorin Craciun
ciprian.craciun at gmail.com
Tue Aug 9 14:02:20 EDT 2011
On Tue, Aug 9, 2011 at 04:17, Paul O <pcotec at gmail.com> wrote:
> I sent this earlier to Ciprian without cc-ing the list.
> Hi Ciprian,
> Regarding the write penalty for batches, I plan to have a pre-write
> cache (hence the query step will also have to include a check against
> this cache, but I'm expecting the pre-cache volume to be small, so I'd
> consider an RDBMS for it.)
> As I wrote in my clarifications for Jeremiah, I'm expecting lots of
> data even per source; that's why I'd like to take advantage of the
> kind of data and limit the queries, to avoid having to go through all
> the values, even those of a single source.
From what I've seen of your estimates, the amount of data you're
going to store is huge. Not only that, the bandwidth required is also
considerable. (Assuming you have a 200 MBit connection and each
reading is sent over UDP in 128 bytes total (headers + payload), a
simple calculation shows that you'll only be able to handle 16384
sensors. So maybe you should reduce the reading rate.)
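As a sanity check, here is that back-of-the-envelope calculation in
Python (the 12.5 readings/second/sensor rate is an assumed figure,
chosen only because it reproduces the 16384 result; your actual
per-sensor rate may well differ):

```python
LINK_BITS_PER_SEC = 200 * 1024 * 1024   # assumed 200 MBit link capacity
PACKET_BITS = 128 * 8                   # 128 bytes per UDP reading (headers + payload)
READINGS_PER_SENSOR_PER_SEC = 12.5      # hypothetical per-sensor reading rate

packets_per_sec = LINK_BITS_PER_SEC / PACKET_BITS
max_sensors = packets_per_sec / READINGS_PER_SENSOR_PER_SEC
print(int(packets_per_sec), int(max_sensors))  # 204800 16384
```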
> Thinking about multiple readers and writers at the same time, I
> thought (but am not sure) that storing opaque batches would be the
> most predictable approach and would allow the data store to grow
> relatively linearly (this is where Riak should help :-)
> The strategy you're suggesting would eliminate the need for the
> intermediary index, am I understanding this correctly?
Yes, what I was describing would use Riak-Core only for load
distribution and cluster management, and would eliminate the need for
any index -- the main data store would be the index itself.
> Anyway, I still have to think about your suggestions; using Riak-Core
> and then storing the individual data files in an embedded DB is a neat
> idea and might allow more flexibility later (operations across all
> batches, etc.)
I wouldn't store the "data files" inside the embedded DB, but the
actual raw readings.
> On Mon, Aug 8, 2011 at 3:34 PM, Ciprian Dorin Craciun
> <ciprian.craciun at gmail.com> wrote:
>> On Mon, Aug 8, 2011 at 21:21, Paul O <pcotec at gmail.com> wrote:
>> > Hello Riak enthusiasts,
>> > I am trying to design a solution for storing time series data
>> > coming from a very large number of potentially high-frequency
>> > sources.
>> > I thought Riak could be of help, though based on what I read about
>> > it I can't use it without some other layer on top of it.
>> > The problem is I need to be able to do range queries over this
>> > data, by source. Hence, I want to be able to say "give me the first
>> > N data points for source S between time T1 and time T2."
>> > I need to store this data for a rather long time, and the expected
>> > volume should grow beyond what a "vanilla" RDBMS would support.
>> > Another thing to note is that I can restrict the number of data
>> > points returned by a query, so no query would return more than MaxN
>> > data points.
>> > I thought about doing this the following way:
>> > 1. Bundle the time series data in batches of MaxN, to ensure that
>> > any query would require reading at most two batches. The batches
>> > would be stored inside Riak.
>> > 2. Store the start-time, end-time, size, and Riak batch ID in a
>> > MySQL (or PostgreSQL) DB.
>> > My thinking is that such a strategy would allow me to persist data
>> > in Riak and grow linearly with the data, while the index would be
>> > kept in an RDBMS for fast range queries.
>> > Does it sound sensible to use Riak this way? Does this make you
>> > laugh/cry/shake your head in disbelief? Am I overlooking something
>> > in Riak which would make all this much better?
>> > Thanks and best regards,
>> > Paul
>> Hello all!
>> (Disclaimer: I'm not a Riak-KV user per se, but I've built
>> something on top of Riak-Core. Also, the solution I'm proposing is
>> not just something "on top" of Riak-KV, but on top of Riak-Core.)
>> (Context: Some time ago we had a similar problem in a project of
>> ours: storing time-series of power consumption from a lot of devices.
>> The current solution uses Informix, but that is out of scope.)
>> First, I would say that storing data in batches will carry a high
>> penalty on writes -- and an application like this is, I would guess,
>> write-intensive. (As far as I know, there is no support for appending
>> to existing values, so each new reading would force a read-modify-write
>> of the whole batch.)
>> Now, because you have a "very large number of sources", you could do
>> the following on top of Riak-Core:
>> * take the identifier of your data source and hash it -- just like
>> Riak-KV does with a key -- and use this as the key for Riak-Core;
>> * build a Vnode module that handles writes and queries, and which
>> stores the data in either:
>> * RRDtool -- one data-file per data-source;
>> * or an embedded database that allows sorted data-sets; (we've
>> had some very nice experiments with BerkeleyDB;) just use as the key
>> the concatenation of the source key and the timestamp, taking care
>> that the resulting key sorts correctly lexicographically (i.e. keep
>> the source key and timestamp at fixed lengths (padded if necessary)
>> and big-endian encoded);
>> * all that remains to be solved is data-replication -- similar to
>> how Riak handles it;
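To make the key layout above concrete, here is a small sketch of
that encoding (Python, purely for illustration; the function name and
the 64-bit field widths are my own choices, not anything Riak or
BerkeleyDB prescribes):

```python
import struct

def make_key(source_id: int, timestamp: int) -> bytes:
    # Fixed-width, big-endian fields: plain byte-wise (lexicographic)
    # comparison then sorts first by source, then chronologically.
    return struct.pack(">QQ", source_id, timestamp)

# Keys for one source form a contiguous, time-ordered interval, so a
# range query "source S between T1 and T2" becomes a cursor scan from
# make_key(S, T1) to make_key(S, T2) in any sorted key-value store.
assert make_key(7, 100) < make_key(7, 200) < make_key(8, 0)
```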
>> Thinking about it a little more, you could actually also do this:
>> * Riak-KV uses 160 bits for keys, thus you could "partition" this
>> key into two parts: "sensor-key" | "timestamp"; for example 96 bits
>> for the sensor and 64 bits for the timestamp;
>> * implement the riak_kv_backend module on top of such an embedded
>> database as described above (i.e. BerkeleyDB or LevelDB);
>> * when storing and querying data you'll still need to bypass the
>> normal Riak-KV key hashing;
>> * you'll have replication and balancing for free;
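A quick sketch of that 96 + 64 bit split (again Python for
illustration only; the helper names are made up, and a real
riak_kv_backend would of course be written in Erlang):

```python
def pack_key(sensor: int, timestamp: int) -> bytes:
    # 160-bit key: high 96 bits = sensor id, low 64 bits = timestamp,
    # big-endian, so byte-wise order equals (sensor, timestamp) order.
    assert 0 <= sensor < 1 << 96 and 0 <= timestamp < 1 << 64
    return ((sensor << 64) | timestamp).to_bytes(20, "big")

def unpack_key(key: bytes) -> tuple:
    n = int.from_bytes(key, "big")
    return n >> 64, n & ((1 << 64) - 1)

# Round-trips, and the sensor part dominates the ordering:
assert unpack_key(pack_key(12345, 1600000000)) == (12345, 1600000000)
assert pack_key(1, 999) < pack_key(2, 0)
```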
More information about the riak-users mailing list