Alternative to Post-Commit in EDS

Anthony Molinaro anthonym at alumni.caltech.edu
Wed Apr 4 17:37:59 EDT 2012


Okay, so here's what I'm thinking now after reading through some of
the M/R docs.  Suppose I did this.

1. Create 2 buckets
   - one for K/V pairs
   - one for changed keys keyed by a timestamp or bin or something
     (run in post-commit on source colo).
2. Replicate both buckets to remote colo
3. Use a key filter with M/R to get keys changed from some time in the past
4. Run M/R regularly to publish key changes (probably to a rabbit queue)
5. Have local consumer read key changes then grab updated Values from first
   bucket
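
A minimal sketch of what the key-filter M/R query in the list above might
look like, written as the JSON job that Riak's HTTP /mapred endpoint accepts.
The bucket name "changed_keys" and the helper build_changed_keys_job are
hypothetical, and the map phase assumes the change entries are stored as JSON.
The string_to_int and greater_than key filters and the built-in
Riak.mapValuesJson function are as I understand them from the Riak docs.

```python
import json


def build_changed_keys_job(bucket, since):
    """Build a MapReduce job for change entries whose integer key is > since.

    `bucket` is the (hypothetical) bucket of change entries keyed by timestamp.
    """
    return {
        "inputs": {
            "bucket": bucket,
            "key_filters": [
                ["string_to_int"],        # keys are stored as strings
                ["greater_than", since],  # keep only entries newer than `since`
            ],
        },
        "query": [
            # Assumes each change entry's value is JSON naming the changed key.
            {"map": {"language": "javascript",
                     "name": "Riak.mapValuesJson",
                     "keep": True}},
        ],
    }


if __name__ == "__main__":
    # The printed JSON would be POSTed to /mapred with
    # Content-Type: application/json.
    print(json.dumps(build_changed_keys_job("changed_keys", 1333575479), indent=2))
```

The sketch only builds the job body, so it can be inspected without a
running cluster.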

I think this will all work.  I'm not totally sure about the key filtering, but
it seems like a second bucket with time-based keys would work best.  I plan
to serialize all writes to each bucket, as that is a requirement for auditing,
so a single integer key holding the time the entry was written will probably
work, along with a key filter doing a simple greater-than.  I can even overlap
time windows to pick up any late additions caused by backups in replication,
since I only keep track of changed keys and always read the most current
value.  I guess the timestamp-based bucket could end up replicating faster
than the K/V bucket, giving data drift.  Hmm, that could be an issue.
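
The overlapping-window idea can be sketched in a few lines.  This is only an
illustration under assumptions: the function name keys_to_refresh, the
(timestamp, key) tuple layout, and the overlap value are all hypothetical.
It relies on the point made above that re-reading a key is harmless because
the consumer always fetches the most current value.

```python
def keys_to_refresh(entries, last_run, overlap_secs):
    """Return changed keys newer than (last_run - overlap_secs), de-duplicated.

    `entries` is an iterable of (timestamp, key) pairs from the change bucket.
    The overlap re-scans a little of the previous window so entries that
    arrived late through replication are still picked up.
    """
    cutoff = last_run - overlap_secs
    seen = set()
    ordered = []
    for ts, key in sorted(entries):
        if ts > cutoff and key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


if __name__ == "__main__":
    changes = [(100, "a"), (105, "b"), (98, "c"), (110, "a"), (80, "d")]
    # With last_run=100 and a 5 second overlap, the entry at 98 is included.
    print(keys_to_refresh(changes, last_run=100, overlap_secs=5))
```

This does nothing about the drift case, where a change entry replicates
before the value it points to.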

Maybe a secondary index on time might work better.  I believe I need
some sort of secondary index, as otherwise iterating over all the entries
in a bucket would be costly.  I don't know exact numbers, but I would guess
I'm looking at worst case several million K/V pairs per bucket, so maybe M/R
on that isn't so bad.  Is there any speed up with 2i and a key filter?  (Can
you even create a key filter based on 2i?)
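
For comparison, here is a sketch of a 2i range query used as M/R input.  As
far as I can tell from the docs, a 2i lookup is given as its own input spec
(bucket, index, start, end) rather than as a key filter.  The index name
"modified_int" and the helper build_2i_range_job are hypothetical; the _int
suffix is Riak's convention for integer indexes, and the reduce_identity
phase is the one used in the documented 2i examples.

```python
import json


def build_2i_range_job(bucket, index, start, end):
    """Build a MapReduce job whose input is a 2i range query.

    Returns the bucket/key pairs whose `index` value lies in [start, end].
    """
    return {
        "inputs": {
            "bucket": bucket,
            "index": index,   # hypothetical integer index holding write time
            "start": start,
            "end": end,
        },
        "query": [
            # Pass the matching bucket/key pairs through unchanged.
            {"reduce": {"language": "erlang",
                        "module": "riak_kv_mapreduce",
                        "function": "reduce_identity",
                        "keep": True}},
        ],
    }


if __name__ == "__main__":
    job = build_2i_range_job("kv_data", "modified_int", 1333575479, 1333579079)
    print(json.dumps(job, indent=2))
```

With this shape the time index would live on the K/V objects themselves, so
no second bucket has to stay in step with the first.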

Anyway, still searching for a way to do this efficiently,

-Anthony

On Wed, Apr 04, 2012 at 09:20:04AM -0700, Anthony Molinaro wrote:
> 
> On Wed, Apr 04, 2012 at 08:10:29AM -0600, Jon Meredith wrote:
> > Riak does have a last modified field, but it's last modified by client so
> > is deliberately left untouched on replication. Similarly the vclock is not
> > incremented either (the vclocks/siblings from both sides are resolved using
> > the two vclocks).
> 
> That's great, as I'd want to know on the far end when the client modified
> it.
> 
> > There are no obvious mechanisms for doing what you want currently.  I'll
> > think about options and somebody will get back to you.
> 
> Is it not possible to use the last modified field in a Map/Reduce?  I've
> not actually played with M/R in Riak yet (as I've only ever used it
> previously as a Key/Value store).  I'll try to dig into it a bit today
> but I assumed I could do something to map over all records in a bucket
> checking last modified, and return the set modified since a certain
> time (or better yet put them in a rabbit queue to be consumed by my
> systems which will cache the data).
> 
> Alternatively, I could maybe have a second bucket representing the changed
> keys, where each time a key is changed in the primary bucket, I could
> add an entry to the other bucket.  I could then replicate that bucket
> and just list keys on the remote side (maybe also deleting so subsequent
> list keys only get changes, but then I think the replicator will replace
> those keys, so I'd have to have some sort of bidirectional replication
> for those buckets, sounds messy).
> 
> Anyway, hopefully someone will have an idea,
> 
> -Anthony
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym at alumni.caltech.edu>
> 
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym at alumni.caltech.edu>



