Riak Secondary Index Limits

Sean Cribbs sean at basho.com
Fri Aug 29 09:46:54 EDT 2014

I made a minor mistake in my example, the PrimaryKey is part of the
index key, whereas the value contains nothing. It's more like this:

{i, IndexName, IndexTerm, PrimaryKey} => <<>>

So for the initial seek, we construct a key like so:

{i, <<"foobar_bin">>, <<"baz">>, <<>>}

On Fri, Aug 29, 2014 at 8:44 AM, Sean Cribbs <sean at basho.com> wrote:
> Hi Bryan,
> Index entries are just keys in LevelDB like normal values are. So,
> performance is relatively constant at write time but is O(N) at read
> (because you are scanning the index). The high-cardinality term will
> definitely be expensive to enumerate, but the low-cardinality terms
> will be much less so. The index entries in LevelDB look roughly like
> this:
> {i, IndexName, IndexTerm} => PrimaryKey
> Since the keys are encoded with sext, we encode the start key of the
> range (or the equality query), start an iterator and then seek to that
> start key and start reading values. When a key is reached that exceeds
> the range or equality query, we stop iterating.
> Of course, that is an oversimplification, there are issues with
> backpressure from the request coordinator process against the
> iterators, streaming merge sort if you want the results back in order
> or paginated, etc. Also, all 2I queries are run via coverage, which
> ensures that the entire keyspace is "covered" but hits most if not all
> nodes in the cluster.
> On Thu, Aug 28, 2014 at 9:18 PM, Bryan <bryan at go-factory.net> wrote:
>> Hi Everyone,
>> Apologies as this has probably been asked before. Unfortunately I have not
>> been able to parse through the list serve to find a reasonable answer and
>> the Basho wiki docs seem to be missing this information. I have read up on
>> the secondary index docs.
>> I am interested to better understand how the secondary indexes perform when
>> there is a very low distribution of values that are indexed. For example,
>> lets say I have a bucket with 1 million objects that I create a secondary
>> index on. Now lets say the index is on a value that has an uneven
>> distribution where one of the values is not selective while the others are,
>> such that 60% of the values fall into a single indexed value, while the
>> remaining 40% have a good distribution.
>> For example, I have a record (i.e. object) where the indexed field is
>> ‘foobar_bin'. I have 1 million objects in the bucket that have 100 unique
>> ‘foobar’ values distributed over the 1 million objects. One of the values
>> repeats for 60% of the records (600K) and the rest have an even distribution
>> of about 4%.
>> How will the secondary indexes perform with this and is this an appropriate
>> use of the secondary indexes? Finally, what I have read is not completely
>> clear on what happens if the indexed value is updated when the value has
>> such a low degree of selectivity?
>> We have less than 512 partitions and are using the erlang client.
>> Thanks in advance - any insights will be much appreciated!
>> Cheers,
>> Bryan
>> ----
>> Bryan Hughes
>> Go Factory
>> http://www.go-factory.net
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> --
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/

Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.

More information about the riak-users mailing list