Secondary indexing

Jason J. W. Williams jasonjwwilliams at gmail.com
Fri Apr 22 17:13:59 EDT 2011


We implemented a scheme for maintaining secondary indexes using
MapReduce and KeyFilters. It's written in Python but there's a
description of how the design works and a test script for validating
other implementations:

https://github.com/williamsjj/txriakidx

Nor sure if that will help...it's reasonably performant and much
faster than regular Riak MR on the values.

-J

On Fri, Apr 22, 2011 at 2:45 PM, Bryan O'Sullivan <bos at mailrank.com> wrote:
> In my application, I have a frequent need to store collections of data where
> I rarely access the items individually, but much more commonly in aggregate
> form. In principle, the mapreduce feature is well suited to this, but it
> must be fed a set of {bucket,key} pairs to consume, and maintaining these
> collections is painful.
> The naive approach would be to store each collection in its own bucket, and
> use listkeys to fetch the keys that I can then mapreduce over. However,
> listkeys has red flags all over it, presumably because it's implemented as a
> scan+filter of all buckets and keys.
> The slightly less naive approach would be to store the keys for a collection
> in a list. Unfortunately, for a collection of even modest size (10,000
> elements), the cost of retrieving and storing this index to make small
> changes (e.g. "I'm adding one element to the collection") rapidly becomes
> untenable, and that's before such knotty issues as "how do I delete items
> from this index?" arise. This is essentially the approach taken by the
> riak_link_index project.
> A more subtle approach would be to maintain a hierarchical index, with a
> "current" index that is modified whenever items are added, and promoted to
> "frozen" once it reaches a certain size. Frozen indices would be maintained
> in an index of their own, which itself would be frozen, and so on.
> Performance would be maintained by limiting the maximum size of each index
> object. This is a similar scheme to the traditional filesystem approach of
> direct, single indirect, and double indirect block indexing. It's also quite
> a lot of work to implement.
> Does Basho have any plans to address this? It seems likely to be a very
> frequently arising need.
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>




More information about the riak-users mailing list