Riak Secondary Index Limits

Sean Cribbs sean at basho.com
Fri Aug 29 09:44:35 EDT 2014


Hi Bryan,

Index entries are just keys in LevelDB like normal values are. So,
performance is relatively constant at write time but is O(N) at read
(because you are scanning the index). The high-cardinality term will
definitely be expensive to enumerate, but the low-cardinality terms
will be much less so. The index entries in LevelDB look roughly like
this:

{i, IndexName, IndexTerm} => PrimaryKey

Since the keys are encoded with sext, we encode the start key of the
range (or the equality query), start an iterator and then seek to that
start key and start reading values. When a key is reached that exceeds
the range or equality query, we stop iterating.

Of course, that is an oversimplification, there are issues with
backpressure from the request coordinator process against the
iterators, streaming merge sort if you want the results back in order
or paginated, etc. Also, all 2I queries are run via coverage, which
ensures that the entire keyspace is "covered" but hits most if not all
nodes in the cluster.

On Thu, Aug 28, 2014 at 9:18 PM, Bryan <bryan at go-factory.net> wrote:
> Hi Everyone,
>
> Apologies as this has probably been asked before. Unfortunately I have not
> been able to parse through the list serve to find a reasonable answer and
> the Basho wiki docs seem to be missing this information. I have read up on
> the secondary index docs.
>
> I am interested to better understand how the secondary indexes perform when
> there is a very low distribution of values that are indexed. For example,
> lets say I have a bucket with 1 million objects that I create a secondary
> index on. Now lets say the index is on a value that has an uneven
> distribution where one of the values is not selective while the others are,
> such that 60% of the values fall into a single indexed value, while the
> remaining 40% have a good distribution.
>
> For example, I have a record (i.e. object) where the indexed field is
> ‘foobar_bin'. I have 1 million objects in the bucket that have 100 unique
> ‘foobar’ values distributed over the 1 million objects. One of the values
> repeats for 60% of the records (600K) and the rest have an even distribution
> of about 4%.
>
> How will the secondary indexes perform with this and is this an appropriate
> use of the secondary indexes? Finally, what I have read is not completely
> clear on what happens if the indexed value is updated when the value has
> such a low degree of selectivity?
>
> We have less than 512 partitions and are using the erlang client.
>
> Thanks in advance - any insights will be much appreciated!
>
>
> Cheers,
> Bryan
>
> ----
>
> Bryan Hughes
> Go Factory
> http://www.go-factory.net
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>



-- 
Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.
http://basho.com/




More information about the riak-users mailing list