Riak Secondary Index Limits

Bryan bryan at go-factory.net
Fri Aug 29 12:01:13 EDT 2014

Hi Sean,

Sweet! Thanks for the explanation. Much appreciated and very helpful.  Just a bit more clarification, on an equality lookup, where the ‘foobar’ key has a value ‘barfoo’ that is the very low-cardinality, are those indexed objects individually stored as a key/value term which then is enumerated over, versus a key/list_of_values term? In other words, if I have 600K records where foobar=barfoo, will this be 600K reads, or 1 read with a list of 600K entries which are then enumerated over in memory?


Bryan Hughes
Go Factory


On Aug 29, 2014, at 6:46 AM, Sean Cribbs <sean at basho.com> wrote:

> I made a minor mistake in my example, the PrimaryKey is part of the
> index key, whereas the value contains nothing. It's more like this:
> {i, IndexName, IndexTerm, PrimaryKey} => <<>>
> So for the initial seek, we construct a key like so:
> {i, <<"foobar_bin">>, <<"baz">>, <<>>}
> On Fri, Aug 29, 2014 at 8:44 AM, Sean Cribbs <sean at basho.com> wrote:
>> Hi Bryan,
>> Index entries are just keys in LevelDB like normal values are. So,
>> performance is relatively constant at write time but is O(N) at read
>> (because you are scanning the index). The high-cardinality term will
>> definitely be expensive to enumerate, but the low-cardinality terms
>> will be much less so. The index entries in LevelDB look roughly like
>> this:
>> {i, IndexName, IndexTerm} => PrimaryKey
>> Since the keys are encoded with sext, we encode the start key of the
>> range (or the equality query), start an iterator and then seek to that
>> start key and start reading values. When a key is reached that exceeds
>> the range or equality query, we stop iterating.
>> Of course, that is an oversimplification, there are issues with
>> backpressure from the request coordinator process against the
>> iterators, streaming merge sort if you want the results back in order
>> or paginated, etc. Also, all 2I queries are run via coverage, which
>> ensures that the entire keyspace is "covered" but hits most if not all
>> nodes in the cluster.
>> On Thu, Aug 28, 2014 at 9:18 PM, Bryan <bryan at go-factory.net> wrote:
>>> Hi Everyone,
>>> Apologies as this has probably been asked before. Unfortunately I have not
>>> been able to parse through the list serve to find a reasonable answer and
>>> the Basho wiki docs seem to be missing this information. I have read up on
>>> the secondary index docs.
>>> I am interested to better understand how the secondary indexes perform when
>>> there is a very low distribution of values that are indexed. For example,
>>> lets say I have a bucket with 1 million objects that I create a secondary
>>> index on. Now lets say the index is on a value that has an uneven
>>> distribution where one of the values is not selective while the others are,
>>> such that 60% of the values fall into a single indexed value, while the
>>> remaining 40% have a good distribution.
>>> For example, I have a record (i.e. object) where the indexed field is
>>> ‘foobar_bin'. I have 1 million objects in the bucket that have 100 unique
>>> ‘foobar’ values distributed over the 1 million objects. One of the values
>>> repeats for 60% of the records (600K) and the rest have an even distribution
>>> of about 4%.
>>> How will the secondary indexes perform with this and is this an appropriate
>>> use of the secondary indexes? Finally, what I have read is not completely
>>> clear on what happens if the indexed value is updated when the value has
>>> such a low degree of selectivity?
>>> We have less than 512 partitions and are using the erlang client.
>>> Thanks in advance - any insights will be much appreciated!
>>> Cheers,
>>> Bryan
>>> ----
>>> Bryan Hughes
>>> Go Factory
>>> http://www.go-factory.net
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> --
>> Sean Cribbs <sean at basho.com>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://basho.com/
> -- 
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/

More information about the riak-users mailing list