Storage of time-series data

Alexander Sicular siculars at gmail.com
Wed May 19 02:57:30 EDT 2010


That is exactly correct. As best I can tell, most everything performance-wise in Riak when it comes to map/reduce revolves around the total number of objects in a bucket. If your architecture can be constructed so that buckets hold tens of thousands of keys rather than hundreds of thousands or millions, the list-keys operation that runs when you pass a bucket name via the input field of a map/reduce job will finish that much quicker. Also note that the input field will accept bucket/key pairs, although I have not tested passing thousands of such pairs into a single map/reduce job.
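
For what it's worth, here is a minimal sketch of the two input styles against Riak's HTTP /mapred interface (the bucket and key names are hypothetical, the Python requests library is assumed, and the map phase just uses the stock Riak.mapValuesJson built-in):

import json
import requests  # assumed HTTP client; any would do

MAPRED_URL = "http://127.0.0.1:8098/mapred"
QUERY = [{"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}]

# Style 1: pass the bucket name; Riak must first list every key in the bucket.
job_bucket = {"inputs": "events", "query": QUERY}

# Style 2: pass explicit [bucket, key] pairs; no list-keys pass is needed.
job_pairs = {"inputs": [["events", "2010-05-18T09.46-0001"],
                        ["events", "2010-05-18T09.46-0002"]],
             "query": QUERY}

for job in (job_bucket, job_pairs):
    resp = requests.post(MAPRED_URL, data=json.dumps(job),
                         headers={"Content-Type": "application/json"})
    print(resp.json())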

In my testing of Riak I have been bypassing the internal list-keys operation as much as possible in favor of my own bucket index, kept either in a separate _index bucket of my own fashion or, better, in a Redis set. My testing tops out at under 100k keys per bucket, though, which is enough for my use cases. Redis is a fantastic pairing with Riak, IMHO, specifically for the atomicity of its operations like set and increment, which Riak does not offer. When using Redis purely as an index you do not have to be as concerned with its memory footprint.
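
To make that concrete, a rough sketch of what I mean (assumes the redis-py client, hypothetical bucket and key names, and Riak's default HTTP interface on /riak):

import json
import redis     # assumed redis-py client
import requests  # assumed HTTP client

r = redis.Redis(decode_responses=True)

def store_event(bucket, key, doc):
    # Write the object to Riak over HTTP.
    requests.put("http://127.0.0.1:8098/riak/%s/%s" % (bucket, key),
                 data=json.dumps(doc),
                 headers={"Content-Type": "application/json"})
    # Record the key in a Redis set that serves as the bucket index (SADD is atomic).
    r.sadd("index:" + bucket, key)

def mapred_inputs(bucket):
    # Build map/reduce inputs from the Redis index instead of calling Riak's list-keys.
    return [[bucket, k] for k in r.smembers("index:" + bucket)]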

Here is an exercise: let's say you have a bucket with 10,000 keys. You use the bucket name as the input in one map/reduce job and list out all 10,000 bucket/key pairs in another. Which would run faster?

-Alexander

On May 18, 2010, at 10:51 PM, Daniel Einspanjer wrote:

> I do a lot of temporal aggregate statistics in the Mozilla Socorro project using HBase.  The problem is made much easier there because you can have a rowkey that uses the timestamp as a prefix making it easy to do a range query, and then HBase also has an atomic increment function that can be used to easily accumulate and store the aggregates.
> 
> Thinking about this problem from what I've learned so far about Riak (which I confess I am still learning), it seems to me that the hardest part would be querying for the particular subset of bucket objects over which you wish to aggregate statistics.  If you don't expect to store so many documents that it would be unreasonable to map/reduce over the entire bucket and filter for only the time range you are interested in, then you shouldn't have a problem.  If you are expecting massive quantities of documents, you could partition the data into a bucket per day or week or whatever interval gives you a collection small enough to map over.
> 
> Once the problem of the input data set is resolved, I suspect you could have the reduce phase build a JSON object containing all the relevant aggregate statistics for that time period, then store that object in a "metrics" bucket with the time period as the key.  I'm thinking of something along these lines (based on https://wiki.mozilla.org/Socorro:HBase#special_records):
> 
> bucket: "metrics"
> key: "2010-05-19T00"
> value: {
>  widgets_sold: 15000,
>  website_visits: 2,
>  sum_page_views: 46,
>  average_page_views_per_visit: 23,
>  sum_visit_duration_seconds: 1216,
>  average_visit_duration_seconds: 608
> }
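> 
> To make the shape concrete, here is a rough client-side sketch of deriving the averages and storing such an object (field names follow the example above; the key format and the HTTP call are assumptions, and this stands in for, rather than implements, the reduce phase):
> 
> import json
> import requests  # assumed HTTP client
> 
> def store_metrics(period_key, sums):
>     # sums holds the accumulated totals for the period (e.g. handed back by a reduce phase).
>     visits = sums["website_visits"]
>     doc = dict(sums,
>                average_page_views_per_visit=sums["sum_page_views"] / visits,
>                average_visit_duration_seconds=sums["sum_visit_duration_seconds"] / visits)
>     requests.put("http://127.0.0.1:8098/riak/metrics/" + period_key,
>                  data=json.dumps(doc),
>                  headers={"Content-Type": "application/json"})
> 
> store_metrics("2010-05-19T00", {"widgets_sold": 15000,
>                                 "website_visits": 2,
>                                 "sum_page_views": 46,
>                                 "sum_visit_duration_seconds": 1216})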
> 
> 
> On 5/18/10 11:01 PM, Sean Cribbs wrote:
>> Buckets are essentially free if you are not changing their properties from the defaults (which you can set globally in app.config).  Keep in mind the options I presented are not the only ones, just points of departure for your own schema design.
>> 
>> Sean Cribbs<sean at basho.com>
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>> 
>> On May 18, 2010, at 8:03 PM, Joel Pitt wrote:
>> 
>>> Thanks Sean. Looks like 3 might be the best plan.
>>> 
>>> And, pre/post-commit hooks... cool! I didn't see those - that's
>>> something I've been looking for (since I'd prefer to keep that kind of
>>> stuff happening on the data nodes rather than in the client/app
>>> itself).
>>> 
>>> One further question: is there any limit on how far the number of
>>> buckets can scale? If you're recommending using them to box data by
>>> minute, I'm guessing the number of buckets can grow without worry, but is
>>> this still the case if, say, I started binning into buckets by second?
>>> 
>>> J
>>> 
>>> On Wed, May 19, 2010 at 1:53 AM, Sean Cribbs<sean at basho.com>  wrote:
>>>> Joel,
>>>> 
>>>> Riak's only query mechanism aside from simple key retrieval is map-reduce.  However, there are a number of strategies you could take, depending on what you want to query. I don't know the requirements of your application, but here are some options:
>>>> 
>>>> 1) Store the data either keyed on the timestamp, or as separate objects linked from a timestamp object.
>>>> 2) Create buckets for each time-window you want to track.  For example, if I wanted to box data by minute, I'd make bucket names that look like: 2010-05-18T09.46.  Then if I want all the data from that minute, I'd run a map-reduce query with that bucket name as the inputs.
>>>> 3) Create your own secondary indexes with a post-commit hook or code in your application for year, month, day, etc.  The secondary index would be, like #1, keys that only contain links to the actual data.
>>>> 
>>>> With any of these options (which are by no means exhaustive), your map-reduce query will need to sort the data in a reduce phase if you require chronological ordering. Also, if you're building your own indexes in separate buckets, depending on the write throughput of your application, you might want to build in some sort of conflict resolution and turn on allow_mult so that concurrent updates are not lost.
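>>>> 
>>>> As a rough illustration of #2 and #3 together (a sketch only: the bucket naming, the per-day index layout, and the HTTP calls are assumptions, not a prescribed schema):
>>>> 
>>>> import json
>>>> import requests  # assumed HTTP client
>>>> from datetime import datetime
>>>> 
>>>> RIAK = "http://127.0.0.1:8098/riak"
>>>> HEADERS = {"Content-Type": "application/json"}
>>>> 
>>>> def store_sample(key, doc, ts):
>>>>     # Option 2: box data by minute by deriving the bucket name from the timestamp.
>>>>     bucket = ts.strftime("%Y-%m-%dT%H.%M")   # e.g. "2010-05-18T09.46"
>>>>     requests.put("%s/%s/%s" % (RIAK, bucket, key),
>>>>                  data=json.dumps(doc), headers=HEADERS)
>>>>     # Option 3, application-side variant: a per-day index object whose value
>>>>     # is just a list of [bucket, key] pointers to the real data.
>>>>     idx_url = "%s/index_by_day/%s" % (RIAK, ts.strftime("%Y-%m-%d"))
>>>>     resp = requests.get(idx_url)
>>>>     links = resp.json() if resp.status_code == 200 else []
>>>>     links.append([bucket, key])
>>>>     requests.put(idx_url, data=json.dumps(links), headers=HEADERS)
>>>> 
>>>> store_sample("sample-0001", {"temp_c": 21.4}, datetime(2010, 5, 18, 9, 46))
>>>> 
>>>> The read-modify-write on that index object is exactly where allow_mult and your own sibling resolution come in if several writers hit the same day concurrently.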
>>>> 
>>>> Sean Cribbs<sean at basho.com>
>>>> Developer Advocate
>>>> Basho Technologies, Inc.
>>>> http://basho.com/
>>>> 
>>>> On May 17, 2010, at 8:31 PM, Joel Pitt wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I'm trying to work out the best way of storing temporal data in Riak.
>>>>> 
>>>>> I've been investigating several NoSQL solutions and originally started
>>>>> out using CouchDB; however, I want to move to a DB that scales more
>>>>> gradually (CouchDB scales, but you really have to set up the
>>>>> architecture beforehand, and I'd prefer to be able to build a cluster
>>>>> a node at a time).
>>>>> 
>>>>> In CouchDB, I use a multi-level key in a map-reduce view to create an
>>>>> index by time. Each reduce level corresponds to year, month, day,
>>>>> time, and so on, so I can easily get aggregate data for, say, a month.
>>>>> 
>>>>> In addition to Riak I'm investigating Cassandra. In Cassandra the way
>>>>> to store time series is by making the column keys timestamps and
>>>>> sorting columns by TimeUUID. This allows one to do slices across a
>>>>> range of time. This isn't exactly the same as what I have in CouchDB,
>>>>> but by consensus it seems to be the way to store a time index.
>>>>> 
>>>>> Any suggestions for working with or creating time indexes in Riak?
>>>>> 
>>>>> Ideally I'd be able to query documents with a time range to either get
>>>>> the documents, or to calculate aggregate statistics using a map-reduce
>>>>> task.
>>>>> 
>>>>> Any information appreciated :-)
>>>>> 
>>>>> Joel Pitt, PhD | http://ferrouswheel.me | +64 21 101 7308
>>>>> NetEmpathy Co-founder | http://netempathy.com
>>>>> OpenCog Developer | http://opencog.org
>>>>> Board member, Humanity+ | http://humanityplus.org
>>>>> 
>>>> 
>> 
> 




