High volume data series storage and queries
jeremiah.peschka at gmail.com
Thu Aug 11 12:02:18 EDT 2011
On Aug 11, 2011, at 7:49 AM, Paul O wrote:
> However, I'm still a bit in the dark regarding some aspects of storing
> everything in one big database. Are these records stored in an "append
> only" fashion? If they are, then how can the range query be done
> without considering potentially a huge number of records? If they are
> not "append only" then the DB might have to shuffle data around which
> would degrade performance or store an index.
Having one monolithic table could result in a lot of random reads since data will likely be written in insertion order. With appropriate indexing, you can reduce the randomness considerably. In a worst case scenario your database will effectively say "Up yours, I'm not doing this many random reads, it's going to be cheaper to scan the whole table."
> Anyway, I'll read some more on this, it's probably just my ignorance
> regarding how those databases work, but if the DBs can be append-only
> then I need one DB per data source and even then I have to contend
> with data insertions in the past (a requirement which I hope I made
> explicit in previous emails.) And if I have to use multiple DBs are
> these solutions able to juggle multiple DBs, open/close them quickly,
This depends a lot on your database back end. Both Oracle and PostgreSQL don't have any physical insertion order other than the order the rows come in - they use append only structures on disk. PostgreSQL compounds this by storing different versions of the row in the main table (e.g. if there is a modification).
SQL Server and MySQL use index ordered tables, Oracle can be told to do this on a table by table basis. An index ordered table is often referred to as a table with a clustered index. Ciprian mentioned that a few times in his benchmark.
A good fit solution to your problem would be to create a single table per data source and instruct the database to leave a bit of extra space on each data page (by setting a fill factor around 80% or so). That gives you some room for inserts of historical records. You'd want to adjust your fill factor as real world usage evolves. By putting each data source in a separate physical table, you can do a lot of interesting things with storage techniques and whatnot, but you also get the advantage of only having to perform reads on a single table at a time to get data from any given source. With a monolithic table, you can run into all kinds of problems with data statistics and accurately predicting the kind of data access you need, especially as your database grows in size.
If you need to perform analytical queries across all of your data, you can always build views to help you out there, too.
All of these features vary slightly from vendor to vendor, and some of them even have specific features to make this kind of design easier to implement and administer. This is why Ciprian and I were talking about an RDBMS solution requiring domain specific knowledge of a single product/vendor. The advantage of something like Riak is that you stand up some nodes and you're going to get relatively predictable performance.
I hate to scare you off from any solution, but Ciprian has experience building this on top of Riak core and I have experience building this solution on top of an RDBMS.
> I hope I'm not too off-list-topic with this and hope at least some
> other people find the discussion useful.
I find data storage systems to be fascinating :)
> riak-users mailing list
> riak-users at lists.basho.com
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP
More information about the riak-users