High volume data series storage and queries
pcotec at gmail.com
Mon Aug 8 21:40:30 EDT 2011
Quite a few interesting points, thanks!
On Mon, Aug 8, 2011 at 5:53 PM, Jeremiah Peschka <jeremiah.peschka at gmail.com> wrote:
> Responses inline
> On Aug 8, 2011, at 1:25 PM, Paul O wrote:
> Will any existing data be imported? If this is totally greenfield, then
> you're free to do whatever zany things you want!
Almost totally greenfield, yes. Some data will need to be imported but it's
already in the format described.
> Ah, so you need IOPS throughput, not storage capacity. On the hardware side
> make sure your storage subsystem can keep up - don't cheap out on disks just
> because you have a lot of nodes. A single rotational HDD can only handle
> about 180 IOPS on average. There's a lot you can do on the storage backend
> to make sure you're able to keep up there.
Indeed, storage capacity is also an issue but IOPS would be important, too.
I assume that sending batches to Riak (opaque blobs) would help a lot with
the quantity of writes, but it's still a very important point.
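
A minimal sketch of the batching idea, to make the write reduction concrete: samples are buffered per source and written as one opaque blob once MaxN of them have accumulated, so MaxN samples cost one write instead of MaxN. Everything named here is an assumption for illustration: the `BundleWriter` class, the `put` callback (a stand-in for whatever client call actually stores the blob), the `source_id:first_timestamp` key format, and the 16-byte binary record layout.

```python
import struct
from typing import Callable, Dict, List, Tuple

# Assumed record shape: (unix timestamp in seconds, measured value).
Sample = Tuple[int, float]

# Assumed on-disk layout: little-endian int64 timestamp + float64 value.
RECORD_FORMAT = "<qd"


class BundleWriter:
    """Buffers samples per source and emits one opaque blob per full bundle."""

    def __init__(self, max_n: int, put: Callable[[str, bytes], None]) -> None:
        self.max_n = max_n  # samples per bundle
        self.put = put      # hypothetical storage callback: put(key, blob)
        self.buffers: Dict[str, List[Sample]] = {}

    def add(self, source_id: str, ts: int, value: float) -> None:
        """Buffer one sample; write the bundle once it holds max_n samples."""
        buf = self.buffers.setdefault(source_id, [])
        buf.append((ts, value))
        if len(buf) >= self.max_n:
            self.flush(source_id)

    def flush(self, source_id: str) -> None:
        """Write whatever is buffered for this source as a single blob."""
        buf = self.buffers.pop(source_id, [])
        if not buf:
            return
        key = f"{source_id}:{buf[0][0]}"  # hypothetical key naming scheme
        blob = b"".join(struct.pack(RECORD_FORMAT, ts, v) for ts, v in buf)
        self.put(key, blob)


def decode_bundle(blob: bytes) -> List[Sample]:
    """Unpack a blob back into its list of (timestamp, value) samples."""
    return list(struct.iter_unpack(RECORD_FORMAT, blob))
```

With `max_n=1000`, a source producing one sample per second would issue one write roughly every 17 minutes rather than one per second; a partially filled buffer is only written when `flush` is called explicitly.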
> You may want to look into ways to force Riak to clean up the bitcask files.
> I don't entirely remember how it's going to handle cleaning up deleted
> records, but you might run into some tricky situations where compactions
> aren't occurring.
Hm, any references regarding that? It would be a major snag in the whole
scheme if Riak doesn't properly reclaim space for deleted records.
> Riak is pretty constant time for Bitcask. The tricky part with the amount of
> data you're describing is that Bitcask requires (I think) that all keys fit
> into memory. As your data volume increases, you'll need to do a combination
> of scaling up and scaling out. Scale up RAM in the nodes and then add
> additional nodes to handle load. RAM will help with data volume, more nodes
> will help with write throughput.
Indeed, for high-frequency sources that would create lots of bundles, even
the MaxN-to-1 reduction for key names might still generate loads of keys.
Any idea how much RAM Riak requires per record, or a reference that would
point me to it?
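
For a rough feel of the numbers, here is a back-of-the-envelope sketch of key count and key memory. It assumes that memory scales with (key size + a fixed per-key overhead) for every stored copy of every key. The `per_key_overhead_bytes` default of 40 and the `n_val` default of 3 are placeholder assumptions, not figures from this thread; the real per-key overhead should come from the Bitcask documentation for the version in use.

```python
import math


def bundle_key_count(num_sources: int, samples_per_sec: float,
                     seconds: int, max_n: int) -> int:
    """Keys created if each source stores one key per bundle of max_n samples."""
    bundles_per_source = math.ceil(samples_per_sec * seconds / max_n)
    return num_sources * bundles_per_source


def keydir_ram_bytes(num_keys: int, avg_key_bytes: int,
                     per_key_overhead_bytes: int = 40,  # ASSUMED placeholder
                     n_val: int = 3) -> int:            # ASSUMED replica count
    """Cluster-wide RAM estimate for holding every key of every replica."""
    return num_keys * (avg_key_bytes + per_key_overhead_bytes) * n_val


# Example: 1000 sources at 1 sample/sec for one day, 1000 samples per bundle.
keys = bundle_key_count(num_sources=1000, samples_per_sec=1,
                        seconds=86400, max_n=1000)
ram = keydir_ram_bytes(keys, avg_key_bytes=20)
```

The result is a cluster-wide total, so dividing by the node count gives a per-node figure, and doubling MaxN roughly halves it.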
> Since you're searching on time series, mostly, you could build time indexes
> in your RDBMS. The nice thing is that querying temporal data is well
> documented in the relational world, especially in the data warehousing
> world. In your case, I'd create a dates table and have a foreign key
> relating to my RDBMS index table to make it easy to search for dates.
> Querying your time table will be fast which reduces the need for scans in
> your index table.
> CREATE TABLE timeseries (
> time_key INT PRIMARY KEY,
> date TIMESTAMP,
> datestring VARCHAR(30),
> year SMALLINT,
> month TINYINT,
> day TINYINT,
> day_of_week TINYINT
> -- etc
> );
>
> CREATE TABLE riak_index (
> id INT NOT NULL,
> time_key INT NOT NULL REFERENCES timeseries(time_key),
> riak_key VARCHAR(100) NOT NULL
> );
>
> SELECT ri.riak_key
> FROM timeseries ts
> JOIN riak_index ri ON ts.time_key = ri.time_key
> WHERE ts.date BETWEEN '20090702' AND '20100702';
My plan was to have the riak_index contain something like: (id, start_time,
end_time, source_id, record_count.)
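
A minimal sketch of that index layout, using Python's built-in sqlite3 purely as a stand-in for the real RDBMS. The columns follow the list above; the extra `riak_key` column, the composite index, the integer timestamps, and the helper names are assumptions added for illustration. A bundle is relevant to a query window when the two time ranges overlap, that is, when it starts no later than the window ends and ends no earlier than the window starts.

```python
import sqlite3
from typing import List


def open_index() -> sqlite3.Connection:
    """Create an in-memory index table with the columns described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE riak_index (
            id           INTEGER PRIMARY KEY,
            start_time   INTEGER NOT NULL,  -- first sample in the bundle
            end_time     INTEGER NOT NULL,  -- last sample in the bundle
            source_id    TEXT    NOT NULL,
            record_count INTEGER NOT NULL,
            riak_key     TEXT    NOT NULL   -- ASSUMED: key of the stored blob
        )
    """)
    conn.execute(
        "CREATE INDEX idx_source_time "
        "ON riak_index (source_id, start_time, end_time)"
    )
    return conn


def add_bundle(conn: sqlite3.Connection, source_id: str, start_time: int,
               end_time: int, record_count: int, riak_key: str) -> None:
    """Register one stored bundle in the index."""
    conn.execute(
        "INSERT INTO riak_index "
        "(start_time, end_time, source_id, record_count, riak_key) "
        "VALUES (?, ?, ?, ?, ?)",
        (start_time, end_time, source_id, record_count, riak_key),
    )


def bundles_overlapping(conn: sqlite3.Connection, source_id: str,
                        t0: int, t1: int) -> List[str]:
    """Return keys of bundles whose time range overlaps [t0, t1], oldest first."""
    rows = conn.execute(
        "SELECT riak_key FROM riak_index "
        "WHERE source_id = ? AND start_time <= ? AND end_time >= ? "
        "ORDER BY start_time",
        (source_id, t1, t0),
    )
    return [row[0] for row in rows]
```

The keys returned would then be fetched from Riak and the samples outside the requested window trimmed on the client side, since the first and last bundle usually extend past it.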
> Without going too much into RDBMS fun, this pattern can get your RDBMS
> running pretty quickly and then you can combine that with Riak's performance
> and have a really good idea of how quick any query will be.
That's roughly the plan, thanks again for your contributions to the