Is Riak suitable for a short-term scatter/gather sort of data store?

Keith Irwin keith at
Sat Nov 12 17:19:07 EST 2011


(Apologies up front for the length of this.)

I'd like to know whether Riak is a good fit for the simple, not-quite-key-value scenario described below. MongoDB or (say) PostgreSQL seems a more natural fit conceptually, but I really, really like Riak's distribution strategy.

## context

The basic overview is this: 

50K devices push data once a second to web services which need to store that data in short-term storage (Riak). Once an hour, a sweeper needs to take an hour's worth of data per device (if there is any) and ship it off to long term storage, then delete it from short-term storage. Ideally, there'd only ever be slightly more than 1 hour's worth of data still in short-term storage for any given device. The goal is to write down the data as simply and safely as possible, with little or no processing on that data.

Each second's worth of data is:

* A device identifier
* A timestamp (epoch seconds, integer) for the slice of time the data represents
* An opaque blob of binary data (2 to 4 KB)
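
To put rough numbers on the write load, here is a back-of-the-envelope sketch. It assumes "2 to 4k" means 2 to 4 KiB per blob and ignores replication and per-object overhead, so the real figures would be higher.

```python
# Back-of-the-envelope sizing for one hour of short-term data.
# Assumption: each blob is 2 to 4 KiB (1 KiB = 1024 bytes); replication
# and per-object overhead are not counted.
DEVICES = 50_000
SECONDS_PER_HOUR = 3600
BLOB_MIN_BYTES = 2 * 1024
BLOB_MAX_BYTES = 4 * 1024

writes_per_second = DEVICES  # one write per device per second
records_per_device_per_hour = SECONDS_PER_HOUR
records_per_hour = DEVICES * records_per_device_per_hour
min_bytes_per_hour = records_per_hour * BLOB_MIN_BYTES
max_bytes_per_hour = records_per_hour * BLOB_MAX_BYTES

print(f"writes per second:       {writes_per_second:,}")
print(f"records per device/hour: {records_per_device_per_hour:,}")
print(f"records per hour:        {records_per_hour:,}")
print(f"raw bytes per hour:      {min_bytes_per_hour:,} to {max_bytes_per_hour:,}")
```

So the hourly sweep has to find, ship, and delete on the order of 180 million small objects, which is why the "find" step matters so much.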

Once an hour, I'd like to do something like:

* For each device:
	* Find (and concat) all the data between time1 and time2 (an hour).
	* Move that data to long-term storage (not Riak) as a single blob.
	* Delete that data from Riak.
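
The sweep above can be sketched against a plain in-memory dict standing in for the short-term store. The `Store` type, the `sweep` function, and the `ship` callback are hypothetical names for illustration, and the half-open window (`start <= timestamp < end`) is an assumption; none of this is Riak-specific.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for short-term storage: (device_id, timestamp) -> blob.
Store = Dict[Tuple[str, int], bytes]


def sweep(store: Store, start: int, end: int,
          ship: Callable[[str, bytes], None]) -> int:
    """Roll up, ship, and delete every record with start <= timestamp < end.

    Returns the number of records removed from the store.
    """
    # Gather: group the matching records by device.
    per_device: Dict[str, List[Tuple[int, bytes]]] = defaultdict(list)
    for (device_id, ts), blob in store.items():
        if start <= ts < end:
            per_device[device_id].append((ts, blob))

    removed = 0
    for device_id, records in per_device.items():
        # Concatenate in timestamp order into a single blob.
        records.sort(key=lambda record: record[0])
        ship(device_id, b"".join(blob for _, blob in records))
        # Delete only after the rollup has been handed to long-term storage.
        for ts, _ in records:
            del store[(device_id, ts)]
            removed += 1
    return removed
```

Passing something like `shipped.__setitem__` (for a dict named `shipped`) as `ship` is enough to try it out locally. The part this sketch glosses over is exactly the part I'm asking about: it scans every record to find the ones in the window.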

For an SQL db, this is a really simple problem, conceptually. You can have a table with three columns: device-id, timestamp, blob. You can index the first two columns and roll up the data easily enough and then delete it via single SQL statements (or buffer as needed). The harder part is partitioning, replication, etc, etc.
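
As a concrete version of the SQL approach, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a real RDBMS. The table name `samples`, the column names, and the helper functions are made up for illustration, and the time window is assumed to be half-open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three columns, with a composite index over the first two.
conn.execute(
    "CREATE TABLE samples ("
    " device_id TEXT NOT NULL,"
    " ts INTEGER NOT NULL,"
    " data BLOB NOT NULL)"
)
conn.execute("CREATE INDEX samples_device_ts ON samples (device_id, ts)")


def store_sample(conn: sqlite3.Connection, device_id: str, ts: int, data: bytes) -> None:
    """Write one second's worth of data for one device."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (device_id, ts, data))


def roll_up(conn: sqlite3.Connection, device_id: str, start: int, end: int) -> bytes:
    """Concatenate one device's blobs for start <= ts < end, in time order."""
    rows = conn.execute(
        "SELECT data FROM samples"
        " WHERE device_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (device_id, start, end),
    )
    return b"".join(row[0] for row in rows)


def delete_range(conn: sqlite3.Connection, device_id: str, start: int, end: int) -> int:
    """Delete the same range with a single statement; returns rows deleted."""
    cursor = conn.execute(
        "DELETE FROM samples WHERE device_id = ? AND ts >= ? AND ts < ?",
        (device_id, start, end),
    )
    return cursor.rowcount
```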

For MongoDB, it's also fairly simple. Just use a document with the same device-id, timestamp and binary-array data (as JSON), make sure indexes are declared, and query/delete just as in SQL. MongoDB provides sharding, replica-sets, recovery, etc. Setup, while less complicated than for an RDBMS, still seems way more complicated than necessary.

These solutions also provide sorting (which, while nice, isn't a requirement for my case).

## question

I've been reading the Riak docs, and I'm just not sure this simple "queryable" case really fits all that well. I'm not so concerned about having to send 50K "deletes" to delete data. I'm more concerned about being able to find it. I may also be stuck in the index/query mentality described above, such that I'm just not seeing the Riak way of doing things.

Anyway, I can "tag" (via the secondary index feature) each blob of data with the device-id and the timestamp. I could then do a range query similar to:

    GET /buckets/devices/index/timestamp_int/start/end

However, this doesn't allow me to group based on device-id. I could create a separate bucket for every device, such that I could do:

    GET /buckets/device-id/index/timestamp_int/start/end

but if I do this, how can I get a list of the device-ids I need so that I can create that specific URL? The docs say listing buckets and keys is problematic.
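
To make the two query shapes concrete, here is a small sketch that only builds the HTTP pieces and sends nothing. It assumes, from my reading of the secondary index docs, that index entries are set with `x-riak-index-` headers on the write and that integer index names end in `_int` (binary ones in `_bin`). The index names `device_bin` and `timestamp_int` and the helper functions are my own choices for illustration.

```python
from typing import Dict
from urllib.parse import quote


def index_headers(device_id: str, timestamp: int) -> Dict[str, str]:
    """Headers that would tag one stored blob with its device and timestamp."""
    return {
        "Content-Type": "application/octet-stream",
        "x-riak-index-device_bin": device_id,         # assumed index name
        "x-riak-index-timestamp_int": str(timestamp),  # assumed index name
    }


def range_query_path(bucket: str, start: int, end: int) -> str:
    """Path for a timestamp range query against a single bucket."""
    return f"/buckets/{quote(bucket, safe='')}/index/timestamp_int/{start}/{end}"


# One shared bucket: returns keys for all devices in the time range.
shared_bucket_path = range_query_path("devices", 1321135200, 1321138800)

# One bucket per device: the bucket name has to be known up front.
per_device_path = range_query_path("device-0042", 1321135200, 1321138800)
```

The second form is where the problem shows up: `"device-0042"` has to come from somewhere, and listing buckets to discover it is the operation the docs warn against.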

It may be that Riak just isn't a good fit for this sort of thing, especially given that I want to use it for short-term transient data, and that's fine. But I wanted to ask you all just to make sure that I'm not missing something somewhere.

For instance, might link walking help? How about a map/reduce to find a unique list of device-ids within a given time-horizon, and a streaming map job to gather the data for export? Does that seem pretty reasonable?
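
For the "unique list of device-ids" idea, this is the logic I have in mind, simulated locally in plain Python rather than written as a real Riak map/reduce job. The record layout and all function names are hypothetical. The reduce step is written so it can safely be re-applied to its own partial results, which I assume a distributed reduce would need.

```python
from typing import Dict, Iterable, List, Union

# Hypothetical record layout: {"device_id": str, "timestamp": int}.
Record = Dict[str, Union[str, int]]


def map_device_id(record: Record) -> List[str]:
    """Map phase: emit the device-id of one stored record."""
    return [str(record["device_id"])]


def reduce_unique(values: Iterable[str]) -> List[str]:
    """Reduce phase: de-duplicate; safe to re-run on partial results."""
    return sorted(set(values))


def devices_in_window(records: Iterable[Record], start: int, end: int,
                      chunk_size: int = 2) -> List[str]:
    """Unique device-ids with at least one record in start <= timestamp < end."""
    mapped: List[str] = []
    for record in records:
        if start <= int(record["timestamp"]) < end:
            mapped.extend(map_device_id(record))

    # Reduce in small chunks, then reduce the partial results again,
    # to mimic a reduce that runs on several nodes before a final pass.
    partials: List[str] = []
    for i in range(0, len(mapped), chunk_size):
        partials.extend(reduce_unique(mapped[i:i + chunk_size]))
    return reduce_unique(partials)
```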



More information about the riak-users mailing list