Secondary indexing

Bryan O'Sullivan bos at mailrank.com
Fri Apr 22 16:45:50 EDT 2011


In my application, I have a frequent need to store collections of data where
I rarely access the items individually, but much more commonly in aggregate
form. In principle, the mapreduce feature is well suited to this, but it
must be fed a set of {bucket,key} pairs to consume, and maintaining these
collections is painful.

The naive approach would be to store each collection in its own bucket, and
use listkeys to fetch the keys that I can then mapreduce over. However,
listkeys has red flags all over it, presumably because it's implemented as a
scan+filter of all buckets and keys.

The slightly less naive approach would be to store the keys for a collection
in a list. Unfortunately, for a collection of even modest size (10,000
elements), the cost of retrieving and storing this index to make small
changes (e.g. "I'm adding one element to the collection") rapidly becomes
untenable, and that's before such knotty issues as "how do I delete items
from this index?" arise. This is essentially the approach taken by the
riak_link_index project.

A more subtle approach would be to maintain a hierarchical index, with a
"current" index that is modified whenever items are added, and promoted to
"frozen" once it reaches a certain size. Frozen indices would be maintained
in an index of their own, which itself would be frozen, and so on.
Performance would be maintained by limiting the maximum size of each index
object. This is a similar scheme to the traditional filesystem approach of
direct, single indirect, and double indirect block indexing. It's also quite
a lot of work to implement.

Does Basho have any plans to address this? It seems likely to be a very
frequently arising need.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110422/6f434d09/attachment.html>


More information about the riak-users mailing list