High volume data series storage and queries

Jeremiah Peschka jeremiah.peschka at gmail.com
Mon Aug 8 17:53:08 EDT 2011

Responses inline

On Aug 8, 2011, at 1:25 PM, Paul O wrote:

> Hi Jeremiah,
> This is for a yet-to-exist system, so the existing data characteristics are not that important.

Will any existing data be imported? If this is totally greenfield, then you're free to do whatever zany things you want!

> The volume of data would be something like : average 10 events per second per source meaning about 320 million events per source, for tens of thousands of sources, potentially hundreds of thousands.

Ah, so you need IOPS throughput, not storage capacity. On the hardware side, make sure your storage subsystem can keep up - don't cheap out on disks just because you have a lot of nodes. A single rotational HDD can only handle about 180 IOPS on average. There's a lot you can do on the storage backend to keep up with that write rate.
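To make that concrete, here's a back-of-envelope calculation using the figures from this thread (10 events/sec/source, tens of thousands of sources, ~180 IOPS per rotational disk). It assumes, worst case, one disk I/O per event; batching writes reduces this dramatically, so treat it as an upper bound:

```python
# Back-of-envelope write throughput, using the figures from this thread.
# Worst case assumes one disk I/O per event; batched writes do far better.

EVENTS_PER_SEC_PER_SOURCE = 10
SOURCES = 50_000            # "tens of thousands, potentially hundreds"
HDD_IOPS = 180              # rough average for a single rotational disk

total_writes_per_sec = EVENTS_PER_SEC_PER_SOURCE * SOURCES
disks_needed_unbatched = -(-total_writes_per_sec // HDD_IOPS)  # ceiling division

print(total_writes_per_sec)    # 500000 events/sec
print(disks_needed_unbatched)  # 2778 spindles if every event were its own I/O
```

The gap between 2,778 spindles and whatever you can actually afford is exactly why batching events before they hit disk matters so much here.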

> Data retention policy would be in the range of years, probably 5 years.

You may want to look into ways to force Riak to clean up its Bitcask files. I don't entirely remember how it handles cleaning up deleted records, but you might run into tricky situations where merges (compactions) aren't occurring.

> Most of the above-mentioned are averages, some sources might be sampled even hundreds of times per second. There is also a layer of creating aggregates for "regressive granularity" (a la RRD) but it's a bit less of a concern (i.e. the same strategy I'm describing could be used for storing the aggregates.)

I like this - write more to avoid aggregations on reads. We relational types do this, too.
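A minimal sketch of the "regressive granularity" idea: fold each raw point into coarser buckets at write time, so reads over coarse ranges are plain lookups rather than aggregations. The bucket widths and the sum/count averaging rule here are my assumptions for illustration, not anything RRD-specific:

```python
from collections import defaultdict

# Write-time rollup: each incoming (timestamp, value) point is folded into
# per-minute and per-hour buckets as it arrives. Bucket widths are illustrative.
GRANULARITIES = {"1m": 60, "1h": 3600}

# granularity -> bucket_start -> (running sum, count)
buckets = {g: defaultdict(lambda: (0.0, 0)) for g in GRANULARITIES}

def record(ts: int, value: float) -> None:
    for gran, width in GRANULARITIES.items():
        start = ts - (ts % width)
        s, c = buckets[gran][start]
        buckets[gran][start] = (s + value, c + 1)

def average(gran: str, bucket_start: int) -> float:
    s, c = buckets[gran][bucket_start]
    return s / c

for t, v in [(0, 1.0), (30, 3.0), (90, 5.0)]:
    record(t, v)

print(average("1m", 0))  # 2.0  (points at t=0 and t=30)
print(average("1h", 0))  # 3.0  (all three points)
```

In the real system each bucket would itself be batched and written to Riak like the raw data, so the same key/index strategy covers both.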

> The strategy I've described tries to make the most common query (time range per source with a max number of elements) predictable and as performant as possible. I.e. for any range I know at most three batches need to be read from Riak (or equivalent) so I can say that, if reading a batch takes 20 ms and the initial query takes 10 ms I can predictably respond to most such requests under 100 ms.

Riak reads are pretty close to constant time with Bitcask. The tricky part with the amount of data you're describing is that Bitcask requires (I think) that all keys fit into memory. As your data volume grows, you'll need a combination of scaling up and scaling out: scale up RAM in the nodes, then add nodes to handle load. RAM helps with key volume; more nodes help with write throughput.
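You can estimate the RAM that keydir will cost you per node. The per-entry overhead figure below (~40 bytes on top of the key bytes themselves) and the key length are my assumptions - check the Bitcask documentation for your version before sizing hardware:

```python
# Rough keydir RAM estimate per node. Bitcask keeps every key in memory;
# the per-entry overhead (~40 bytes beyond the key itself) is an assumption.

KEY_BYTES = 30                # e.g. "<source_id>:<batch_start_ts>"
OVERHEAD_BYTES = 40           # assumed keydir bookkeeping per entry
KEYS_PER_NODE = 100_000_000   # 100M batch keys on one node

ram_bytes = KEYS_PER_NODE * (KEY_BYTES + OVERHEAD_BYTES)
print(ram_bytes / 2**30)  # ~6.5 GiB just for the keydir
```

Note this is why batching matters twice over: every MaxN points folded into one batch is MaxN-1 keys that never hit the keydir.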

> So as long as I can benchmark individual aspects of the strategy I hope to a predictable query cost and an idea of how to grow the system.

You're speaking my language! I love it. 
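Your own numbers upthread already give you a worst-case budget - one index query plus at most three batch reads:

```python
# Worst-case read latency from the figures upthread: one index query
# plus at most three sequential batch reads from Riak.

INDEX_QUERY_MS = 10
BATCH_READ_MS = 20
MAX_BATCHES = 3

worst_case_ms = INDEX_QUERY_MS + MAX_BATCHES * BATCH_READ_MS
print(worst_case_ms)  # 70 -- comfortably under the 100 ms target
```

And since the index query hands you all the batch keys up front, the batch reads can go out in parallel, which would pull the sequential 70 ms down toward 10 + 20 = 30 ms.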

> As for the read to write ratio I don't have an exact estimate (the system will be generic and consumption applications will be built on top of it) but the system is expected to be a lot more write-intensive than read-intensive. Most data might go completely unused, some data might be rather "hot" so additional caching might be implemented later but I'm trying to design the underlying system so at least some performance axioms are computable.
> Does this clarify or confuse further?

Very clear.

Since you're searching mostly on time ranges, you could build time indexes in your RDBMS. The nice thing is that querying temporal data is well documented in the relational world, especially in data warehousing. In your case, I'd create a dates table and have the index table reference it via foreign key, making it easy to search by date. Querying your time table will be fast, which reduces the need for scans of your index table.


CREATE TABLE timeseries (
  time_key INT PRIMARY KEY,
  datestring VARCHAR(30),
  year SMALLINT,
  month TINYINT,
  day TINYINT,
  day_of_week TINYINT
  -- etc
);

CREATE TABLE riak_index (
  time_key INT NOT NULL REFERENCES timeseries(time_key),
  riak_key VARCHAR(100) NOT NULL
);

-- YYYYMMDD strings compare correctly lexicographically
SELECT ri.riak_key
FROM timeseries ts
JOIN riak_index ri ON ts.time_key = ri.time_key
WHERE ts.datestring BETWEEN '20090702' AND '20100702';

Without going too far into RDBMS fun, this pattern keeps the RDBMS side fast, and combining it with Riak's predictable read performance gives you a really good idea of how quick any query will be.
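To make the two-step read path concrete, here's a sketch: the index table resolves a date range to Riak keys, then each batch is fetched by key. Both backends are stubbed with in-memory structures so the control flow is runnable - the key format and data shapes are illustrative, not a real client API:

```python
# Two-step read path: (1) resolve a date range to Riak keys via the index
# table, (2) fetch each batch by key. Both backends are stubbed with plain
# Python data so the flow is runnable; swap in real SQL/Riak clients.

# Stub for the riak_index table: (datestring, riak_key) rows
index_rows = [
    ("20090810", "src42:20090810T00"),
    ("20091201", "src42:20091201T00"),
    ("20100901", "src42:20100901T00"),  # outside the queried range
]

# Stub for the Riak bucket: key -> batch of (timestamp, value) points
riak_bucket = {
    "src42:20090810T00": [(1, 10.0), (2, 11.0)],
    "src42:20091201T00": [(3, 12.0)],
    "src42:20100901T00": [(4, 13.0)],
}

def query_range(lo: str, hi: str, max_n: int):
    # Step 1: equivalent of SELECT ... WHERE datestring BETWEEN lo AND hi
    keys = [k for d, k in index_rows if lo <= d <= hi]
    # Step 2: fetch batches by key, truncating at max_n points
    points = []
    for key in keys:
        points.extend(riak_bucket[key])
        if len(points) >= max_n:
            return points[:max_n]
    return points

print(query_range("20090702", "20100702", max_n=10))
# [(1, 10.0), (2, 11.0), (3, 12.0)]
```

Because step 1 returns a bounded set of keys, the Riak fetches in step 2 are the constant-time lookups discussed above, never a key-space scan.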

> Regards,
> Paul
> On Mon, Aug 8, 2011 at 3:32 PM, Jeremiah Peschka <jeremiah.peschka at gmail.com> wrote:
> It sounds like a potentially interesting use case.
> The questions that immediately enter my head are:
> * How much data do you currently have?
> * How much data do you plan to have?
> * Do you have a data retention policy? If so, what is it? How do you plan to implement it?
> * What's the anticipated rate of growth per day? Week? Year?
> * What type of queries will you have? Is it a fixed set of queries? Is it a decision support system?
> * What does your read to write ratio look like?
> Your plan to support Riak with a hybrid system isn't that out of whack; it's very doable.
> You can certainly do the type of querying you've described through careful choice of key names, sorting in memory, and only using the first N data points in a given Map Reduce query result. The main reason to not perform range queries in Riak is that they'll result in full key space scans across the Riak cluster. If you're using bitcask as your backend then it's an in memory scan, otherwise you're doing a much more costly scan from disk. And, since key names are hashed as they are partitioned across the cluster, you're not going to get the benefit of sequential disk scan performance like you might get with a traditional database.
> The only thing that worries me is the phrase "should grow more than what a 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB? I'm trying to get a handle on what size and performance characteristics you're looking for before diving into how to look at your system vs. saying "Hell if I know, does someone else on the list have a good idea?"
> ---
> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
> Microsoft SQL Server MVP
> On Aug 8, 2011, at 11:21 AM, Paul O wrote:
> > Hello Riak enthusiasts,
> >
> > I am trying to design a solution for storing time series data coming from a very large number of potential high-frequency sources.
> >
> > I thought Riak could be of help, though based on what I read about it I can't use it without some other layer on top of it.
> >
> > The problem is I need to be able to do range queries over this data, by the source. Hence, I want to be able to say "give me the N first data points for source S between time T1 and time T2."
> >
> > I need to store this data for a rather long time, and the expected volume should grow more than what a "vanilla" RDBMS would support.
> >
> > Another thing to note is that I can restrict the number of data points to be returned by a query, so no query would return more than MaxN data points.
> >
> > I thought about doing this the following way:
> >
> > 1. Bundle the time series data in batches of MaxN, to ensure that any query would require reading at most two batches. The batches would be stored inside Riak.
> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or PostgreSQL) DB.
> >
> > My thinking is such a strategy would allow me to persist data in Riak and linearly grow with the data, and the index would be kept in a RDBM for fast range queries.
> >
> > Does it sound sensible to use Riak this way? Does this make you laugh/cry/shake your head in disbelief? Am I overlooking something from Riak which would make all this much better?
> >
> > Thanks and best regards,
> >
> > Paul
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP
