Is Riak suitable for a short-term scatter/gather sort of data store?

Gordon Tillman gtillman at mezeo.com
Sat Nov 12 20:41:20 EST 2011


Keith, you are pretty close!
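Your PUT sketch below is essentially what I had in mind.  Spelled out a bit more (host, device id, and timestamp here are made up; the index field name is whatever you like, as long as it keeps the _bin suffix):

    PUT /buckets/mydata/keys/FF06541287AB-1321137600 HTTP/1.1
    Host: riak.example.com:8098
    Content-Type: application/octet-stream
    x-riak-index-device_bin: FF06541287AB

    <the 2-4k opaque blob for that second>

The x-riak-index-device_bin header is what creates the secondary index entry; the _bin suffix tells Riak to treat the value as a binary (string) index.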

Everything can go into one bucket; that's not really an issue.  About this:

> 
> This I don't know how to do based on my reading of the docs. Something like:
> 
>    get /buckets/mydata/index/device_bin/FF345678912
> 
> which would return a list of ... what, device-timestamp compound keys? And then would I feed a potentially huge list of "bucket/key" pairs into a gigantic JavaScript query for the map-reduce phase?


You wouldn't do a GET on it; you would initiate a map-reduce operation by issuing a POST to the /mapred endpoint.  You can see an example here:  http://wiki.basho.com/Secondary-Indexes.html  Look for the section titled "Exact Match Query".
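For your case the request body would look something like this (the bucket, index value, and module/function names are placeholders; the "arg" carries your time range):

    POST /mapred HTTP/1.1
    Content-Type: application/json

    {"inputs": {"bucket": "mydata",
                "index": "device_bin",
                "key": "FF345678912"},
     "query": [{"map": {"language": "erlang",
                        "module": "my_mapred",
                        "function": "map_in_range",
                        "arg": [1321137600, 1321141200],
                        "keep": true}}]}

The "inputs" section feeds every key matching the index entry straight into the map phase, so you never have to materialize that huge bucket/key list yourself.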

You just need a simple map-phase function that emits only the objects whose timestamps fall in the desired range.  It's faster than you might expect, because the map function executes on all nodes in the cluster simultaneously.

And for best performance, use a map function written in Erlang rather than JavaScript; JavaScript functions incur some marshalling overhead that you avoid with a native Erlang function.
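Something along these lines would do it (an untested sketch; the module name is made up, it assumes the <device>-<timestamp> key layout from your example, and the time range arrives as the "arg" from the query above):

    %% my_mapred.erl -- module and function names are made up.
    -module(my_mapred).
    -export([map_in_range/3]).

    %% Emit the stored blob when the timestamp encoded in the key falls
    %% inside [Start, End]; emit nothing otherwise.
    map_in_range(Obj, _KeyData, [Start, End]) ->
        Key = riak_object:key(Obj),
        [_DeviceId, TsBin] = binary:split(Key, <<"-">>),
        Ts = list_to_integer(binary_to_list(TsBin)),
        if
            Ts >= Start, Ts =< End -> [riak_object:get_value(Obj)];
            true -> []
        end.

You'd compile that and make sure the beam is on the code path of every node (the add_paths setting in app.config), since the function runs wherever the data lives.  One caveat: results come back through the JSON interface, so if your blobs are raw binary you may want to emit keys instead, or base64-encode the values.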

I do believe that you can use Riak very well to handle what your application requires.

Give me a shout off-list if you like and I'll put together a working example to get you started.

--gordon


On Nov 12, 2011, at 17:43, Keith Irwin wrote:

> On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
> 
>> Keith, I have an idea that might work for you.  This is a bit vague, but I would be glad to put together a more concrete example if you like.
> 
> Okay, thanks! Not sure I understand everything, though.
> 
>> Use secondary indexes to tag each entry with the device id.
> 
> I get the tagging part, but I'm not sure what the bucket and key being tagged would look like. Are you talking about a single bucket for all data?
> 
> put /buckets/mydata/keys/<device>-<timestamp>
> x-riak-index-device_bin: FF06541287AB
> 
> Something like that?
> 
>> You can then find all of the entries for a given device by using the secondary index to feed into a simple map-phase operation that returns only the entries that you want; i.e., those that are in a given time range.
> 
> This I don't know how to do based on my reading of the docs. Something like:
> 
>    get /buckets/mydata/index/device_bin/FF345678912
> 
> which would return a list of ... what, device-timestamp compound keys? And then would I feed a potentially huge list of "bucket/key" pairs into a gigantic JavaScript query for the map-reduce phase?
> 
>> In addition, to find all of the registered device ids easily, you can create one entry for each device.  The key can be most anything (even the device id if you encode it properly -- hash it), and you could tag each of those entries with a secondary index whose field is something like "type" and whose value is "deviceid".  The value for each entry could be a simple text/plain body whose content is just the device id of the registered device.
> 
> Okay, I think I get this:
> 
> When a device comes in, just do something like:
> 
> put /buckets/devices/keys/<device-id>
> x-riak-index-type_bin: "device"
> 
> When I want a list of device IDs, I can:
> 
> get /buckets/devices/index/type_bin/device
> 
> and get them all, right? This is more efficient than the various list functions? That makes sense to me.
> 
> I guess I'll have to try a few examples and see what happens. What you're telling me is that what I want to do is possible, or is at least not pressing against Riak's particular trade-offs too much. Or at least I hope that's what you're telling me. ;)
> 
> Keith
> 
> 
>> 
>> --gordon
>> 
>> On Nov 12, 2011, at 16:19, Keith Irwin wrote:
>> 
>>> Folks--
>>> 
>>> (Apologies up front for the length of this.)
>>> 
>>> I'm wondering if you can let me know if Riak is a good fit for a simple not-quite-key-value scenario described below. MongoDB or (say) PostgreSQL seem a more natural fit conceptually, but I really, really like Riak's distribution strategy.
>>> 
>>> ## context
>>> 
>>> The basic overview is this: 
>>> 
>>> 50K devices push data once a second to web services, which need to store that data in short-term storage (Riak). Once an hour, a sweeper needs to take an hour's worth of data per device (if there is any), ship it off to long-term storage, and then delete it from short-term storage. Ideally, there'd only ever be slightly more than one hour's worth of data in short-term storage for any given device. The goal is to write down the data as simply and safely as possible, with little or no processing on that data.
>>> 
>>> Each second's worth of data is:
>>> 
>>> * A device identifier
>>> * A timestamp (epoch seconds, integer) for the slice of time the data represents
>>> * An opaque blob of binary data (2 to 4k)
>>> 
>>> Once an hour, I'd like to do something like:
>>> 
>>> * For each device:
>>> 	* Find (and concat) all the data between time1 and time2 (an hour).
>>> 	* Move that data to long-term storage (not Riak) as a single blob.
>>> 	* Delete that data from Riak.
>>> 
>>> For an SQL db, this is a really simple problem, conceptually. You can have a table with three columns: device-id, timestamp, blob. You can index the first two columns and roll up the data easily enough and then delete it via single SQL statements (or buffer as needed). The harder part is partitioning, replication, etc, etc.
>>> 
>>> For MongoDB, it's also fairly simple. Just use a document with the same device-id, timestamp, and binary-array data (as JSON), make sure indexes are declared, and query/delete just as in SQL. MongoDB provides sharding, replica sets, recovery, etc. Setup, while less complicated than for an RDBMS, still seems way more complicated than necessary.
>>> 
>>> These solutions also provide sorting (which, while nice, isn't a requirement for my case).
>>> 
>>> ## question
>>> 
>>> I've been reading the Riak docs, and I'm just not sure if this simple "queryable" case really fits all that well. I'm not so concerned about having to send 50K "deletes" to delete data. I'm more concerned about being able to find it. I may be blocked conceptually by the index/query mentality above, such that I'm just not seeing the Riak way of doing things.
>>> 
>>> Anyway, I can "tag" (via the secondary index feature) each blob of data with the device-id and the timestamp. I could then do a range query similar to:
>>> 
>>>  GET /buckets/devices/index/timestamp_int/start/end
>>> 
>>> However, this doesn't allow me to group based on device-id. I could create a separate bucket for every device, such that I could do:
>>> 
>>>  GET /buckets/<device-id>/index/timestamp_int/start/end
>>> 
>>> but if I do this, how can I get a list of the device-ids I need so that I can create that specific URL? The docs say listing buckets and keys is problematic.
>>> 
>>> Might be that Riak just isn't a good fit for this sort of thing, especially given that I want to use it for short-term, transient data, and that's fine. But I wanted to ask you all just to make sure that I'm not missing something somewhere.
>>> 
>>> For instance, might link walking help? How about a map/reduce to find a unique list of device-ids within a given time horizon, and a streaming map job to gather the data for export? Does that seem pretty reasonable?
>>> 
>>> Thanks!
>>> 
>>> Keith
