Secondary indexing

Kresten Krab Thorup krab at trifork.com
Sat Apr 23 05:28:33 EDT 2011


My colleague Erik is working on something that sounds like a solution for your problem, http://polymorphictypist.blogspot.com/2011/04/multi-version-collections-in-riak.html

In combination with a set of commit hooks like riak_link_index, it may be just what you're looking for. 

I believe that he has implemented some of it, but It hasn't been released yet.  He's at github.com/eriksoe.

Kresten

--
301 Moved Permanently
Email: Kresten Krab Thorup <kresten at krab.dk>

On 22/04/2011, at 22.46, "Bryan O'Sullivan" <bos at mailrank.com> wrote:

> In my application, I have a frequent need to store collections of data where I rarely access the items individually, but much more commonly in aggregate form. In principle, the mapreduce feature is well suited to this, but it must be fed a set of {bucket,key} pairs to consume, and maintaining these collections is painful.
> 
> The naive approach would be to store each collection in its own bucket, and use listkeys to fetch the keys that I can then mapreduce over. However, listkeys has red flags all over it, presumably because it's implemented as a scan+filter of all buckets and keys.
> 
> The slightly less naive approach would be to store the keys for a collection in a list. Unfortunately, for a collection of even modest size (10,000 elements), the cost of retrieving and storing this index to make small changes (e.g. "I'm adding one element to the collection") rapidly becomes untenable, and that's before such knotty issues as "how do I delete items from this index?" arise. This is essentially the approach taken by the riak_link_index project.
> 
> A more subtle approach would be to maintain a hierarchical index, with a "current" index that is modified whenever items are added, and promoted to "frozen" once it reaches a certain size. Frozen indices would be maintained in an index of their own, which itself would be frozen, and so on. Performance would be maintained by limiting the maximum size of each index object. This is a similar scheme to the traditional filesystem approach of direct, single indirect, and double indirect block indexing. It's also quite a lot of work to implement.
> 
> Does Basho have any plans to address this? It seems likely to be a very frequently arising need.
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




More information about the riak-users mailing list