High volume data series storage and queries

Jeremiah Peschka jeremiah.peschka at gmail.com
Tue Aug 9 10:24:50 EDT 2011

Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
 Microsoft SQL Server MVP

On Aug 8, 2011, at 6:40 PM, Paul O wrote:

> Indeed, storage capacity is also an issue but IOPS would be important, too. I assume that sending batches to Riak (opaque blobs) would help a lot with the quantity of writes, but it's still a very important point.
> You may want to look into ways to force Riak to clean up the bitcask files. I don't entirely remember how it's going to handle cleaning up deleted records, but you might run into some tricky situations where compactions aren't occurring.
> Hm, any references regarding that? It would be a major snag in the whole schema if Riak doesn't properly reclaim space for deleted records.

You might have to tweak the merge settings (http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings) depending on how and when data is deleted. You could bypass these configuration settings by manually running bitcask:merge.

More info here: https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted
and here: http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-July/005055.html

> Riak is pretty constant time for Bitcask. The tricky part with the amount of data you're describing is that Bitcask requires (I think) that all keys fit into memory. As your data volume increases, you'll need to do a combination of scaling up and scaling out. Scale up RAM in the nodes and then add additional nodes to handle load. RAM will help with data volume, more nodes will help with write throughput.
> Indeed, for high frequency sources that would create lots of bundles even the MaxN to 1 reduction for key names might still generate loads of keys. Any idea how much RAM Riak requires per record, or a reference that would point me to it?

There's a capacity planning page: http://wiki.basho.com/Bitcask-Capacity-Planning.html
And some additional information about RAM and disk requirements here: http://wiki.basho.com/Cluster-Capacity-Planning.html
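
For a rough back-of-the-envelope check before reading those pages, the arithmetic can be sketched in Python. This is only an illustration: the function names and parameters are mine, and the per-key overhead is deliberately left as an input because I'm not going to quote Bitcask's real figure from memory. Take it from the capacity planning page. The one thing assumed here is that every key (bucket name plus key name) lives in RAM once per replica.

```python
def estimate_key_ram_bytes(num_keys, avg_key_bytes, avg_bucket_bytes,
                           per_key_overhead_bytes, n_val=3):
    """Rough cluster-wide RAM needed to hold every key in memory.

    per_key_overhead_bytes is an ASSUMPTION supplied by the caller;
    look up the real value on the Bitcask capacity planning page.
    n_val is the replication factor (3 is Riak's default).
    """
    per_entry = per_key_overhead_bytes + avg_key_bytes + avg_bucket_bytes
    return num_keys * per_entry * n_val


def ram_per_node_bytes(total_bytes, num_nodes):
    """Spread the cluster-wide total evenly across nodes."""
    return total_bytes / num_nodes


if __name__ == "__main__":
    # Hypothetical workload: 100 million bundles, 30-byte keys,
    # 10-byte bucket names, and a placeholder 40-byte overhead.
    total = estimate_key_ram_bytes(100_000_000, 30, 10, 40)
    print(total / 1e9, "GB across the cluster")
    print(ram_per_node_bytes(total, 6) / 1e9, "GB per node on 6 nodes")
```

Plugging in your own key counts per source should show quickly whether the MaxN to 1 bundling keeps the key count within what the nodes can hold.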

> Since you're searching on time series, mostly, you could build time indexes in your RDBMS. The nice thing is that querying temporal data is well documented in the relational world, especially in the data warehousing world. In your case, I'd create a dates table and have a foreign key relating to my RDBMS index table to make it easy to search for dates. Querying your time table will be fast which reduces the need for scans in your index table.
> CREATE TABLE timeseries (
>  time_key INT NOT NULL PRIMARY KEY, -- must be unique so riak_index can reference it
>  date TIMESTAMP,
>  datestring VARCHAR(30),
>  year SMALLINT,
>  month TINYINT,
>  day TINYINT,
>  day_of_week TINYINT
>  -- etc
> );
> CREATE TABLE riak_index (
>  time_key INT NOT NULL REFERENCES timeseries(time_key),
>  riak_key VARCHAR(100) NOT NULL
> );
> SELECT ri.riak_key
> FROM timeseries ts
> JOIN riak_index ri ON ts.time_key = ri.time_key
> WHERE ts.date BETWEEN '20090702' AND '20100702';
> My plan was to have the riak_index contain something like: (id, start_time, end_time, source_id, record_count.)
> Without going too much into RDBMS fun, this pattern can get your RDBMS running pretty quickly and then you can combine that with Riak's performance and have a really good idea of how quick any query will be.
> That's roughly the plan, thanks again for your contributions to the discussion!
> Paul 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
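
If it helps to see the date-table pattern from the quoted SQL end to end, here is a small runnable sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whatever RDBMS you end up using, the tables are trimmed to the columns the query needs, dates are stored as ISO strings so BETWEEN compares them correctly, and the Riak keys are made up for the demo.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny date table and index table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE timeseries (
            time_key INTEGER PRIMARY KEY,
            date     TEXT NOT NULL          -- ISO 'YYYY-MM-DD'
        );
        CREATE TABLE riak_index (
            time_key INTEGER NOT NULL REFERENCES timeseries(time_key),
            riak_key TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO timeseries (time_key, date) VALUES (?, ?)",
        [(1, "2009-07-01"), (2, "2009-07-02"),
         (3, "2010-07-02"), (4, "2010-07-03")],
    )
    # Hypothetical Riak keys, one or more bundles per day.
    conn.executemany(
        "INSERT INTO riak_index (time_key, riak_key) VALUES (?, ?)",
        [(1, "source1/bundle-a"), (2, "source1/bundle-b"),
         (3, "source1/bundle-c"), (3, "source2/bundle-a"),
         (4, "source1/bundle-d")],
    )
    return conn


def find_riak_keys(conn, start_date, end_date):
    """Return the Riak keys whose date falls in [start_date, end_date]."""
    rows = conn.execute(
        """
        SELECT ri.riak_key
        FROM timeseries ts
        JOIN riak_index ri ON ts.time_key = ri.time_key
        WHERE ts.date BETWEEN ? AND ?
        ORDER BY ts.date, ri.riak_key
        """,
        (start_date, end_date),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    for key in find_riak_keys(conn, "2009-07-02", "2010-07-02"):
        print(key)  # these are the keys you would then fetch from Riak
```

The relational side only ever returns keys; the actual bundles still come out of Riak, so query time is roughly the index lookup plus one Riak fetch per key returned.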
