High volume data series storage and queries

Paul O pcotec at gmail.com
Mon Aug 8 16:25:14 EDT 2011

Hi Jeremiah,

This is for a yet-to-exist system, so the existing data characteristics are
not that important.

The volume of data would be something like: an average of 10 events per
second per source, meaning about 320 million events per source per year, for
tens of thousands of sources, potentially hundreds of thousands.
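
As a rough check of these figures, the arithmetic can be sketched as below. It assumes the 320 million figure refers to one year of events per source. The 365-day year, the 5-year horizon, and the source counts of 10,000 and 100,000 are illustrative assumptions standing in for "tens of thousands" and "hundreds of thousands".

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def events_per_source_per_year(events_per_second):
    """Events produced by one source in a year at a constant average rate."""
    return events_per_second * SECONDS_PER_YEAR


def total_events(events_per_second, sources, years):
    """Events stored across all sources over the whole retention period."""
    return events_per_source_per_year(events_per_second) * sources * years


per_source = events_per_source_per_year(10)
print(f"per source per year: {per_source:,}")

# Illustrative source counts and a 5-year retention period (assumptions).
for sources in (10_000, 100_000):
    print(f"{sources:,} sources, 5 years: {total_events(10, sources, 5):,}")
```

At 10 events per second this gives roughly 315 million events per source per year, which is where the "about 320 million" estimate comes from.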

Data retention policy would be in the range of years, probably 5 years.

Most of the figures above are averages; some sources might be sampled even
hundreds of times per second. There is also a layer of creating aggregates
for "regressive granularity" (a la RRD) but it's a bit less of a concern
(i.e. the same strategy I'm describing could be used for storing the

The strategy I've described tries to make the most common query (time range
per source with a max number of elements) predictable and as performant as
possible. I.e. for any range I know at most three batches need to be read
from Riak (or equivalent) so I can say that, if reading a batch takes 20 ms
and the initial query takes 10 ms I can predictably respond to most such
requests under 100 ms.
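
A minimal sketch of that query path, with plain Python structures standing in for the real components: a sorted list plays the role of the relational index and a dict plays the role of Riak. The names `BatchRef`, `batches_for_range` and `query`, and the batch key format, are hypothetical and only illustrate the idea.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchRef:
    start: int     # timestamp of the first point in the batch
    end: int       # timestamp of the last point in the batch
    batch_id: str  # hypothetical key of the batch in the blob store


def batches_for_range(index, t1, t2):
    """Return the batches overlapping [t1, t2].

    `index` is one source's list of BatchRef, sorted by start time and
    non-overlapping. In the real design this would be a query against
    the relational index rather than an in-memory list.
    """
    ends = [ref.end for ref in index]
    starts = [ref.start for ref in index]
    lo = bisect_left(ends, t1)      # first batch ending at or after t1
    hi = bisect_right(starts, t2)   # batches starting after t2 are excluded
    return index[lo:hi]


def query(index, store, t1, t2, n):
    """First n points with t1 <= timestamp <= t2, in time order."""
    points = []
    for ref in batches_for_range(index, t1, t2):
        for ts, value in store[ref.batch_id]:  # one batch read per iteration
            if t1 <= ts <= t2:
                points.append((ts, value))
                if len(points) == n:
                    return points  # stop before reading further batches
    return points


# Three batches of ten points each for a single source (made-up data).
index = [BatchRef(0, 9, "s1:0"), BatchRef(10, 19, "s1:1"), BatchRef(20, 29, "s1:2")]
store = {ref.batch_id: [(t, float(t)) for t in range(ref.start, ref.end + 1)]
         for ref in index}

result = query(index, store, 7, 25, 5)
print([ts for ts, _ in result])
```

Because the loop returns as soon as N points have been collected, the number of batch reads is bounded by how many batches are needed to cover N points, not by the width of the time range.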

So as long as I can benchmark individual aspects of the strategy I hope to
get a predictable query cost and an idea of how to grow the system.
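
One way to turn such benchmarks into a cost estimate is sketched below. It assumes batch reads are issued one after another (parallel reads would lower the bound). The function names and the sample timings are hypothetical; only the 10 ms, 20 ms and three-batch figures come from this message.

```python
import math


def worst_case_latency_ms(index_query_ms, batch_read_ms, max_batches):
    """Upper bound for one range query, assuming sequential batch reads."""
    return index_query_ms + max_batches * batch_read_ms


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of timings."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Figures from the message: 10 ms index query, 20 ms per batch, 3 batches.
print(worst_case_latency_ms(10, 20, 3))

# Made-up benchmark samples (ms) for each stage, measured independently.
index_samples = [8, 9, 10, 12, 30]
batch_samples = [15, 18, 20, 22, 25]
budget = worst_case_latency_ms(
    percentile(index_samples, 80),
    percentile(batch_samples, 80),
    3,
)
print(budget)
```

Adding per-stage percentiles gives a rough, pessimistic budget rather than the exact percentile of the end-to-end latency, but it keeps the estimate computable from independent benchmarks.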

As for the read-to-write ratio, I don't have an exact estimate (the system
will be generic and consumption applications will be built on top of it) but
the system is expected to be a lot more write intensive than read intensive.
Most data might go completely unused, while some data might be rather "hot",
so additional caching might be implemented later, but I'm trying to design
the underlying system so that at least some performance axioms are computable.

Does this clarify things or confuse them further?



On Mon, Aug 8, 2011 at 3:32 PM, Jeremiah Peschka <jeremiah.peschka at gmail.com
> wrote:

> It sounds like a potentially interesting use case.
> The questions that immediately enter my head are:
> * How much data do you currently have?
> * How much data do you plan to have?
> * Do you have a data retention policy? If so, what is it? How do you plan
> to implement it?
> * What's the anticipated rate of growth per day? Week? Year?
> * What type of queries will you have? Is it a fixed set of queries? Is it a
> decision support system?
> * What does your read to write ratio look like?
> Your plan to support Riak with a hybrid system isn't that out of whack;
> it's very doable.
> You can certainly do the type of querying you've described through careful
> choice of key names, sorting in memory, and only using the first N data
> points in a given Map Reduce query result. The main reason to not perform
> range queries in Riak is that they'll result in full key space scans across
> the Riak cluster. If you're using bitcask as your backend then it's an in
> memory scan, otherwise you're doing a much more costly scan from disk. And,
> since key names are hashed as they are partitioned across the cluster,
> you're not going to get the benefit of sequential disk scan performance like
> you might get with a traditional database.
> The only thing that worries me is the phrase "should grow more than what a
> 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB? I'm
> trying to get a handle on what size and performance characteristics you're
> looking for before diving into how to look at your system vs. saying "Hell
> if I know, does someone else on the list have a good idea?"
> ---
> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
> Microsoft SQL Server MVP
> On Aug 8, 2011, at 11:21 AM, Paul O wrote:
> > Hello Riak enthusiasts,
> >
> > I am trying to design a solution for storing time series data coming from
> a very large number of potential high-frequency sources.
> >
> > I thought Riak could be of help, though based on what I read about it I
> can't use it without some other layer on top of it.
> >
> > The problem is I need to be able to do range queries over this data, by
> the source. Hence, I want to be able to say "give me the N first data points
> for source S between time T1 and time T2."
> >
> > I need to store this data for a rather long time, and the expected volume
> should grow more than what a "vanilla" RDBMS would support.
> >
> > Another thing to note is that I can restrict the number of data points to
> be returned by a query, so no query would return more than MaxN data points.
> >
> > I thought about doing this the following way:
> >
> > 1. bundle date time series in batches of MaxN, to ensure that any query
> would require reading at most two batches. The batches would be stored inside
> Riak.
> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or
> PostgreSQL) DB.
> >
> > My thinking is such a strategy would allow me to persist data in Riak and
> linearly grow with the data, and the index would be kept in an RDBMS for fast
> range queries.
> >
> > Does it sound sensible to use Riak this way? Does this make you
> laugh/cry/shake your head in disbelief? Am I overlooking something from Riak
> which would make all this much better?
> >
> > Thanks and best regards,
> >
> > Paul
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com