High volume data series storage and queries

Paul O pcotec at gmail.com
Thu Aug 11 10:49:15 EDT 2011


Hi Ciprian, first of all, thanks a lot for the detailed explanations
and references. There is a lot of food for thought here and I'm not
sure I grok it all just yet. I'll inline my comments.

On Wed, Aug 10, 2011 at 9:14 AM, Ciprian Dorin Craciun
<ciprian.craciun at gmail.com> wrote:
[...]
>    So the first problem I see with any solution would be the 1
> billion records. I would try to find an alternative solution for
> "historical data", like dumping it in flat files, and keeping only
> recent data in the database. By "recent" I don't have a specific
> timeframe in mind, but whatever you determine works in your
> particular case.

Well, this is kind of the crux of the problem in the end. Regardless
of how I store the older data, I still need a predictable query time
for it. So if I dump it in flat files, for instance the pre-indexed
bundles described in my initial solution, then I'm by definition back
at the starting point. And if that (1 index check + 3 bundle reads +
a quick range chop) is quick enough, then everything else is just
caching for more recent info. I'm not saying that to discount the
solution, just that I can't ignore the historical data. (Actually, I
might have to degrade historical-data performance in the end; I'm
only saying that if the degraded solution is good enough, it could
serve for recent info, too.)
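
For reference, and so we're talking about the same thing, here is
roughly how I picture that lookup path. The bundle layout, the fixed
16-byte record format and the in-memory index below are placeholders
of my own for illustration, not the actual implementation:

~~~~
import bisect
import struct

# Hypothetical layout: each bundle file holds fixed-size (timestamp,
# value) records for one source, sorted by timestamp; a small
# in-memory index maps the first timestamp of each bundle to its path.
RECORD = struct.Struct(">qd")   # 8-byte timestamp, 8-byte float value

def query_range(index, t_start, t_end):
    # index: sorted list of (first_timestamp, path) tuples, one source.
    starts = [first for first, _ in index]
    # 1 index check: locate the bundle that could contain t_start.
    first_bundle = max(bisect.bisect_right(starts, t_start) - 1, 0)
    results = []
    # A few bundle reads: scan forward until the bundles pass t_end.
    for first, path in index[first_bundle:]:
        if first > t_end:
            break
        with open(path, "rb") as f:
            data = f.read()
        records = [RECORD.unpack_from(data, i * RECORD.size)
                   for i in range(len(data) // RECORD.size)]
        # Quick range chop: records are sorted, so trim via bisect.
        lo = bisect.bisect_left(records, (t_start, float("-inf")))
        hi = bisect.bisect_right(records, (t_end, float("inf")))
        results.extend(records[lo:hi])
    return results
~~~~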


>    Now about the solution: I'm not proposing any batching or
> partitioning based on either timestamp or source; neither as
> individual files on the file system nor as opaque values stored
> inside the database.
>
>    Instead I suggest just putting every data point in the same
> "table" or "database", as described below:
>    * the first requirement for the embedded database is to support
> lexicographically sorted keys and cursors which can be positioned to
> start from a particular (or greater) key;
>    * the key of the record would be the binary representation of the
> source id padded with 0's to an exact length, concatenated with the
> timestamp, which follows the same rules;
>
>    Now, your concern about executing range queries without
> considering the whole data set is already solved by the database,
> which doesn't need to look at all the data but only to seek to the
> first data point matching your query. Actually, behind the scenes it
> does something similar to -- but even better than -- what you're
> proposing with the MaxN records opaque value.
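
Just to make sure I'm reading the key scheme right, below is a minimal
sketch of how I picture the encoding and the range scan. The 8-byte
big-endian widths are my assumption, and I'm faking the store's cursor
with a sorted list plus bisect; a real embedded database would do the
"seek to the first key >= start" step itself:

~~~~
import bisect
import struct

# Composite key: 8-byte big-endian source id, then 8-byte big-endian
# timestamp (both widths assumed). Fixed-width big-endian integers
# sort lexicographically in the same order as their numeric values,
# which is what makes the cursor trick work.
KEY = struct.Struct(">QQ")

def make_key(source_id, timestamp):
    return KEY.pack(source_id, timestamp)

def range_query(sorted_keys, source_id, t_start, t_end):
    # A real store would position a cursor at the first key >= start;
    # bisect plays that role here over an in-memory sorted list.
    start = make_key(source_id, t_start)
    end = make_key(source_id, t_end)
    i = bisect.bisect_left(sorted_keys, start)
    while i < len(sorted_keys) and sorted_keys[i] <= end:
        yield KEY.unpack(sorted_keys[i])
        i += 1

# Example: three sources, a handful of points each.
keys = sorted(make_key(s, t) for s in (1, 2, 3) for t in range(0, 100, 10))
print(list(range_query(keys, 2, 20, 60)))   # only source 2, ts 20..60
~~~~

If that is indeed the idea, then the range query never touches other
sources' data at all, which would answer my earlier worry.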
>
>    Furthermore, to back my claims I've put on GitHub the old code
> which I used to benchmark different databases -- Riak and other
> key-value stores are not included, except BerkeleyDB, but I've
> included others like Hypertable which could be a good choice -- at
> the following address (you can easily extend the "framework" to
> include new backends):
>        https://github.com/cipriancraciun/sds-benchmark
>
>    Also, there is a "so-called" report I put up at that time in the
> same place -- the second link is the content of the wiki page, and
> there are also images which GitHub doesn't display in the page:
>        https://github.com/cipriancraciun/sds-benchmark/tree/master/results
>        https://github.com/cipriancraciun/sds-benchmark/blob/master/results/results.mediawiki
>
>    For the record, we've tested up to 100 million records on some
> data stores, but on some others (like PostgreSQL and MySQL) we
> stopped at 10 million, as the insertion rate dropped tremendously.
>
>    My conclusion from this experiment was: any database which is
> backed by a tree-like data structure (almost all use B-trees or
> derivatives) will grind to a halt in insert speed if the clustering
> keys (client id + timestamp in this case) don't exhibit any kind of
> locality (as in your case). See the quote from my report:
> ~~~~
> * this section applies to Postgres and SQLite, as MonetDB behaved OK;
> * it seems that the initial insert speed is good for the first
> couple of million records;
> * as the data accumulates and new inserts are done, the indices
> start to be rewritten and consume the entire disk bandwidth; (the
> initial good speed is due to the fact that the indices fit into RAM;)
> * if the inserts are done without any indices defined, the insert
> speed is incredible (600k in the case of Postgres), but the scan
> speed is under 100;
> * maybe it is possible that this behavior is specific to any
> database which uses trees (like BerkeleyDB?);
> ~~~~
>
>    Hope this helps you,

This is very good information, thanks a lot for it. I was not
expecting such in-depth benchmarks, and I'm wondering why yours didn't
turn up in my Google searches.
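
As a sanity check on the locality argument, the kind of miniature
experiment I picture is sketched below. SQLite is used only because
it's the easiest thing to script; the row counts and the composite
index are my own choices, and I'm not claiming the absolute numbers
reproduce your results:

~~~~
import random
import sqlite3
import time

def time_inserts(rows, label):
    # Bulk-insert under a (source_id, ts) index; scattered keys should
    # slow down sharply once the index stops fitting in cache.
    conn = sqlite3.connect(":memory:")  # point at a file to see disk effects
    conn.execute("CREATE TABLE points (source_id INTEGER, ts INTEGER, value REAL)")
    conn.execute("CREATE INDEX points_idx ON points (source_id, ts)")
    start = time.time()
    conn.executemany("INSERT INTO points VALUES (?, ?, 0.0)", rows)
    conn.commit()
    print(label, round(time.time() - start, 2), "s")
    conn.close()

n = 1_000_000
ordered = [(1, ts) for ts in range(n)]
scattered = [(random.randrange(10_000), random.randrange(n)) for _ in range(n)]
time_inserts(ordered, "ordered keys:")
time_inserts(scattered, "scattered keys:")
~~~~

If the scattered case really is dominated by index rewrites, that is
the effect I'd expect to show up here as well.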

However, I'm still a bit in the dark regarding some aspects of storing
everything in one big database. Are these records stored in an
"append-only" fashion? If they are, then how can the range query be
done without considering a potentially huge number of records? If
they are not append-only, then the DB either has to shuffle data
around, which would degrade performance, or has to maintain an index.

Anyway, I'll read some more on this; it's probably just my ignorance
about how those databases work. But if the DBs are append-only, then
I need one DB per data source, and even then I have to contend with
data insertions in the past (a requirement which I hope I made
explicit in previous emails). And if I have to use multiple DBs, are
these solutions able to juggle many of them, opening and closing them
quickly, etc.?

I hope I'm not straying too far off-topic for the list with this, and
that at least some other people find the discussion useful.

Regards,

Paul



