Is Riak suitable for a short-term scatter/gather sort of data store?
keith at zentrope.com
Sat Nov 12 18:43:35 EST 2011
On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
> Keith I have an idea that might work for you. This is a bit vague but I would be glad to put together a more concrete example if you like.
Okay, thanks! Not sure I understand everything, though.
> Use secondary indexes to tag each entry with the device id.
I get the tagging part, but I'm not sure what the bucket and key being tagged would look like. Are you talking about a single bucket for all data?
Something like that?
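To make the question concrete, here is a minimal sketch of one possible single-bucket layout. The bucket name `samples`, the `device-id:timestamp` key format, and the index field names `device` and `timestamp` are all assumptions for illustration, not something from Gordon's message. It only builds the HTTP request pieces (no network calls), using Riak's `x-riak-index-<field>_bin` / `_int` header convention for secondary indexes.

```python
from urllib.parse import quote

BUCKET = "samples"  # hypothetical bucket holding every device's data


def sample_put_request(device_id: str, ts: int, blob: bytes):
    """Return (method, path, headers, body) for storing one second of data."""
    # Assumed key layout: device id and timestamp joined by ':' so that
    # every sample gets its own unique key.
    key = f"{device_id}:{ts}"
    path = f"/buckets/{quote(BUCKET, safe='')}/keys/{quote(key, safe='')}"
    headers = {
        "Content-Type": "application/octet-stream",
        # Secondary index tags: the _bin suffix marks a string index,
        # the _int suffix marks an integer index.
        "x-riak-index-device_bin": device_id,
        "x-riak-index-timestamp_int": str(ts),
    }
    return "PUT", path, headers, blob


method, path, headers, body = sample_put_request("dev-42", 1321140000, b"\x00\x01")
print(method, path)
print(headers)
```

With this layout, every entry lives in one bucket and is findable either by device (`device_bin`) or by time (`timestamp_int`).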
> You can then find all of the entries for a given device by using the secondary index to feed into a simple map phase operation that returns only the entries that you want; i.e., those that are in a given time range.
This I don't know how to do based on my reading of the docs. Something like:
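A rough sketch of what such a job might look like, under assumptions: the bucket is called `samples`, entries are tagged with a `device_bin` index, and keys follow a hypothetical `device-id:timestamp` layout so the map function can read the timestamp from the key. The code only builds the JSON body that would be POSTed to Riak's `/mapred` endpoint; whether the `arg` value reaches the JavaScript function exactly as shown should be checked against the docs.

```python
import json

# JavaScript map function (runs inside Riak). It keeps only keys whose
# timestamp part falls in [arg.start, arg.end). Assumes keys look like
# "device-id:timestamp", which is an illustrative convention.
MAP_SOURCE = """
function(v, keyData, arg) {
  var ts = parseInt(v.key.substring(v.key.lastIndexOf(':') + 1), 10);
  return (ts >= arg.start && ts < arg.end) ? [v.key] : [];
}
"""


def device_range_job(device_id, start, end, bucket="samples"):
    """Build a MapReduce job fed by a secondary index lookup on one device."""
    return {
        # Index input: every entry in the bucket tagged with this device id.
        "inputs": {"bucket": bucket, "index": "device_bin", "key": device_id},
        "query": [
            {
                "map": {
                    "language": "javascript",
                    "source": MAP_SOURCE,
                    "arg": {"start": start, "end": end},
                    "keep": True,
                }
            }
        ],
    }


# One hour of data for a single device, serialized for POST /mapred.
print(json.dumps(device_range_job("dev-42", 1321138800, 1321142400), indent=2))
```

The job returns matching keys rather than the blobs themselves, since binary values do not travel well through JSON; the sweeper could then fetch and delete each key individually.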
> In addition, to easily find all of the registered device ids you can create one entry for each device. The key can be most anything (even the device id if you encode it properly -- hash it), and you could tag each of those entries with a secondary index whose field is something like "type" or whatever and whose value is "deviceid". The value for each entry could be just a simple text/plain value whose content is just the device id of the registered device.
Okay, I think I get this:
When a device comes in, just do something like:
When I want a list of device IDs, I can:
and get them all, right? This is more efficient than the various list functions? That makes sense to me.
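The registration and lookup steps just described could be sketched as follows. The bucket name `devices`, the SHA-1 hashing of the device id into a key, and the `type_bin` index field are assumptions for illustration that follow Gordon's description loosely. As before, the code only builds the request pieces and makes no network calls.

```python
import hashlib

REGISTRY_BUCKET = "devices"  # hypothetical bucket with one entry per device


def register_device_request(device_id: str):
    """Return (method, path, headers, body) for registering one device."""
    # Hash the device id so the key is always URL-safe.
    key = hashlib.sha1(device_id.encode("utf-8")).hexdigest()
    path = f"/buckets/{REGISTRY_BUCKET}/keys/{key}"
    headers = {
        "Content-Type": "text/plain",
        # Every registry entry carries the same tag so one exact-match
        # index query can find them all.
        "x-riak-index-type_bin": "deviceid",
    }
    # The stored value is simply the device id itself.
    return "PUT", path, headers, device_id.encode("utf-8")


def list_devices_path() -> str:
    """Path for the exact-match index query that finds every registry entry."""
    return f"/buckets/{REGISTRY_BUCKET}/index/type_bin/deviceid"


print(register_device_request("dev-42"))
print("GET", list_devices_path())
```

Note that the index query returns the matching keys (here, the hashes), so the device ids themselves would still have to be read from the stored values, either with one GET per key or by feeding the index into a map phase that returns the values.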
I guess I'll have to try a few examples and see what happens. What you're telling me is that what I want to do is possible, or is at least not pressing against Riak's particular trade-offs too much. Or at least I hope that's what you're telling me. ;)
> On Nov 12, 2011, at 16:19 , Keith Irwin wrote:
>> (Apologies up front for the length of this.)
>> I'm wondering if you can let me know if Riak is a good fit for a simple not-quite-key-value scenario described below. MongoDB or (say) PostgreSQL seem a more natural fit conceptually, but I really, really like Riak's distribution strategy.
>> ## context
>> The basic overview is this:
>> 50K devices push data once a second to web services which need to store that data in short-term storage (Riak). Once an hour, a sweeper needs to take an hour's worth of data per device (if there is any) and ship it off to long term storage, then delete it from short-term storage. Ideally, there'd only ever be slightly more than 1 hour's worth of data still in short-term storage for any given device. The goal is to write down the data as simply and safely as possible, with little or no processing on that data.
>> Each second's worth of data is:
>> * A device identifier
>> * A timestamp (epoch seconds, integer) for the slice of time the data represents
>> * An opaque blob of binary data (2 to 4k)
>> Once an hour, I'd like to do something like:
>> * For each device:
>>   * Find (and concat) all the data between time1 and time2 (an hour).
>>   * Move that data to long-term storage (not Riak) as a single blob.
>>   * Delete that data from Riak.
>> For an SQL db, this is a really simple problem, conceptually. You can have a table with three columns: device-id, timestamp, blob. You can index the first two columns and roll up the data easily enough and then delete it via single SQL statements (or buffer as needed). The harder part is partitioning, replication, etc, etc.
>> For MongoDB, it's also fairly simple. Just use a document with the same device-id, timestamp and binary-array data (as JSON), make sure indexes are declared, and query/delete just as in SQL. MongoDB provides sharding, replica-sets, recovery, etc. Setup, while less complicated than an RDBMS, still seems way more complicated than necessary.
>> These solutions also provide sorting (which, while nice, isn't a requirement for my case).
>> ## question
>> I've been reading the Riak docs, and I'm just not sure if this simple "queryable" case can really fit all that well. I'm not so concerned about having to send 50K "deletes" to delete data. I'm more concerned about being able to find it. Given what I've written above, I may be blocked conceptually by the above index/query mentality such that I'm just not seeing the Riak way of doing things.
>> Anyway, I can "tag" (via the secondary index feature) each blob of data with the device-id and the timestamp. I could then do a range query similar to:
>> GET /buckets/devices/index/timestamp/start/end
>> However, this doesn't allow me to group based on device-id. I could create a separate bucket for every device, such that I could do:
>> GET /buckets/device-id/index/timestamp/start/end
>> but if I do this, how can I get a list of the device-ids I need so that I can create that specific URL? The docs say listing buckets and keys is problematic.
>> Might be that Riak just isn't a good case for this sort of thing, especially given I want to use it for short-term transient data, and that's fine. But I wanted to ask you all just to make sure that I'm not missing something somewhere.
>> For instance, might link walking help? How about a map/reduce to find a unique list of device-ids within a given time-horizon, and a streaming map job to gather the data for export? Does that seem pretty reasonable?