Schema design - version history and time travel

Patrik Sundberg patrik.sundberg at
Tue Aug 28 05:47:07 EDT 2012

On Tue, Aug 28, 2012 at 7:11 AM, Mark Phillips <mark at> wrote:

> Hi Patrik,
> Sorry for the late response here.
> On Fri, Aug 17, 2012 at 9:37 AM, Patrik Sundberg
> <patrik.sundberg at> wrote:
> > Hi,
> >
> > I'll simplify the case to something easier to follow. The typical
> > question I have is: find piece of data X as of time Y. A piece of data
> > X has a start time and end time, can think of as positive integers
> > (that I could have 2i indices for). I'm trying to find the version of
> > X whose start and end time integer range includes the integer Y.
> >
> > I don't see how I make the query with 2i, and doing it via Search seems
> > wrong since don't need the overhead of converting to text etc. Do I
> > need to do a 2step procedure where I get all the possible X versions
> > (think intervals) via a 2i query (I can organize that easily), then a
> > map reduce on those results to find the right interval covering Y? The
> > number of versions for the map reduce will typically be in the range of
> > 10s to 1000s at the maximum, not more.
> >
> My initial thoughts (based on a quick reading of this email) is that a
> 2i range query that feeds the resulting keys to a M/R job [0] would do
> the trick.
I came to the same conclusion for the design where each version is a
different key. I may go with a design that uses a single key for all
versions though, see below.

> What type of response times are you looking for with these queries?
> When you say "The number of versions for the map reduce will typically
> be in the range of 10s to 1000s at the maximum" do you mean that the
> total number of keys you'll be map-reduce'ing over will be in the 10s
> to 1000s range? Or the result set you'll be producing with that M/R
> job will be on the order of that?
I meant that for an object with ID X, there will be 10s to 1000s of versions
of that object over time. So the 2i result would be 10s to 1000s of keys,
and those keys would be the inputs to M/R, with just 1 result.
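
To make the second step concrete, here is a rough sketch in plain Python of
the selection that would run over the candidate versions. It makes no Riak
calls. The record shape (`start`, `end`, `data`) and the name
`find_covering` are assumptions for illustration only, `end=None` stands in
for the implicit "infinity" end of the latest version, and intervals are
assumed to be half-open, [start, end).

```python
def find_covering(versions, y):
    """Return the version whose [start, end) range contains y, or None."""
    for v in versions:
        end = v["end"]
        # end is None for the latest version (open-ended interval).
        if v["start"] <= y and (end is None or y < end):
            return v
    return None


# Hypothetical candidates, as they might come back from the first step.
candidates = [
    {"start": 1, "end": 10, "data": "v1"},
    {"start": 10, "end": 25, "data": "v2"},
    {"start": 25, "end": None, "data": "v3"},
]

match = find_covering(candidates, 12)
```

With 10s to 1000s of candidates a linear scan like this should be cheap
wherever it ends up running.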

Response times are not my main concern. I doubt they can be too bad unless
I mess up completely and everything turns into huge M/R jobs. Ease of use
and low complexity score higher for the time being.

However, I'm now thinking that including all versions in the value for a
single key is the way to go for me. It will be easier, and by my estimate
the values won't become too big. I can add some archiving later if it
becomes apparent that I need it. Since the versions are ordered by time,
I get natural archiving by time: data from far back will be used less
often, so the hit of a double dispatch to trawl the archives is not that
important. So: include all those 10s to 1000s of versions of the object
with ID X in one value, hold the versions sorted by time, and have an easy
way to find the right version in application logic.
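
A minimal sketch of that application logic in Python, assuming the value is
a JSON list of versions sorted by start time, and that each version's end is
implied by the start of the next one (the last version being open-ended).
The field names and the helpers `version_as_of` and `append_version` are
hypothetical, and nothing here talks to Riak.

```python
import bisect
import json


def version_as_of(versions, y):
    """Return the version in effect at time y, or None if y predates all."""
    starts = [v["start"] for v in versions]
    # Index of the last version whose start is <= y.
    i = bisect.bisect_right(starts, y) - 1
    return versions[i] if i >= 0 else None


def append_version(versions, start, data):
    """Add a new version; the past cannot change, so it must come last."""
    if versions and start <= versions[-1]["start"]:
        raise ValueError("new version must start after the latest one")
    versions.append({"start": start, "data": data})


# The whole history for one object ID, as it might be stored in one value.
stored = json.dumps([
    {"start": 1, "data": {"colour": "red"}},
    {"start": 10, "data": {"colour": "blue"}},
])

history = json.loads(stored)
append_version(history, 25, {"colour": "green"})
current = version_as_of(history, 12)
```

The binary search keeps lookups cheap even at the upper end of 1000s of
versions, and splitting off the oldest entries into an archive value later
would only mean slicing the front of the list.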

Thanks for the input!

> Hope that helps.
> Mark
> [0] All the way at the bottom of this --->
> > Any input would be great!
> >
> > On Wed, Aug 15, 2012 at 11:54 AM, Patrik Sundberg
> > <patrik.sundberg at> wrote:
> >>
> >> Hi,
> >>
> >> I have a domain where I want to be able to "time travel". I don't have
> >> many updates (many more reads), but when there is an update I need to
> >> preserve history and create new versions. Setting my local
> >> "application time" determines which version of a particular piece of
> >> data is fetched, and I can go back in time and recreate how things
> >> looked previously. One can't change the past, just create new versions
> >> in the "future" relative to last version. Using a model of "starting
> >> point + replaying deltas" to get to a given time is not a good idea,
> >> it's an ever evolving state where snapshots are cheap enough to store
> >> and reduce complexity a lot.
> >>
> >> My domain objects are in the order of a couple of hundred types, each
> >> type having some pure data properties (10s, up to hundreds, easily
> >> represented as JSON blobs) and in the order of tens, maximum hundreds
> >> of has_one and has_many type relationships to other objects (which can
> >> be of different type). The relationships only require one direction,
> >> always from parent to child (sourced to destination). An object has a
> >> given unique ID, and a version of that object has a given unique valid
> >> time period (with the latest version having an implicit "infinity" end
> >> of period).
> >>
> >> The queries are mostly to find a data property or a relationship for a
> >> given object. A few special cases may be for range queries and exact
> >> queries on properties, easily taken care of by 2i queries.
> >>
> >> I'm trying to think of if and how my domain would be fitted into a riak
> >> "schema". My hunch of starting point:
> >> - map object types to buckets
> >> - make the unique object IDs the keys in the bucket to represent the
> >> concept of that object
> >> - not sure how to represent the links to versions of that particular
> >> object
> >> - the versions themselves may be either in the same bucket or in
> >> another bucket (think "cars" and "car-versions" or using "cars" for
> >> both)
> >> - a version has a JSON value with its properties, some 2i for any
> >> possible exact and range queries I need
> >> - the has_one and has_many links i could do in several ways. first
> >> decision is if to point them to the object identity or directly to a
> >> specific version. then can use Link, can use 2i, can store IDs in the
> >> and do a 2 query fetch to get there
> >> - 99% of read operations are of the type "given the time of X, give me
> >> the property or relation Y of object with ID Z"
> >>
> >> Anyone having built something similar with a time snapshot/version angle
> >> with experience to share? Any input in general appreciated.
> >>
> >> Thanks,
> >> Patrik
> >>
> >
> >