Riak Secondary Index Limits

Sean Cribbs sean at basho.com
Fri Aug 29 12:17:15 EDT 2014


Correct, there is a key in LevelDB for each Riak key that has the
index term attached. This is somewhat mitigated by Snappy compression
(600K records might very well compress into a single block), but it is
nowhere near the storage efficiency of something like Solr's indexes.
It still has to scan.

On Fri, Aug 29, 2014 at 11:01 AM, Bryan <bryan at go-factory.net> wrote:
> Hi Sean,
>
> Sweet! Thanks for the explanation. Much appreciated and very helpful.  Just a bit more clarification, on an equality lookup, where the ‘foobar’ key has a value ‘barfoo’ that is the very low-cardinality, are those indexed objects individually stored as a key/value term which then is enumerated over, versus a key/list_of_values term? In other words, if I have 600K records where foobar=barfoo, will this be 600K reads, or 1 read with a list of 600K entries which are then enumerated over in memory?
>
> Cheers,
> Bryan
> ----
>
> Bryan Hughes
> Go Factory
>
> http://www.go-factory.net
>
> On Aug 29, 2014, at 6:46 AM, Sean Cribbs <sean at basho.com> wrote:
>
>> I made a minor mistake in my example, the PrimaryKey is part of the
>> index key, whereas the value contains nothing. It's more like this:
>>
>> {i, IndexName, IndexTerm, PrimaryKey} => <<>>
>>
>> So for the initial seek, we construct a key like so:
>>
>> {i, <<"foobar_bin">>, <<"baz">>, <<>>}
>>
>> On Fri, Aug 29, 2014 at 8:44 AM, Sean Cribbs <sean at basho.com> wrote:
>>> Hi Bryan,
>>>
>>> Index entries are just keys in LevelDB like normal values are. So,
>>> performance is relatively constant at write time but is O(N) at read
>>> (because you are scanning the index). The high-cardinality term will
>>> definitely be expensive to enumerate, but the low-cardinality terms
>>> will be much less so. The index entries in LevelDB look roughly like
>>> this:
>>>
>>> {i, IndexName, IndexTerm} => PrimaryKey
>>>
>>> Since the keys are encoded with sext, we encode the start key of the
>>> range (or the equality query), start an iterator and then seek to that
>>> start key and start reading values. When a key is reached that exceeds
>>> the range or equality query, we stop iterating.
>>>
>>> Of course, that is an oversimplification, there are issues with
>>> backpressure from the request coordinator process against the
>>> iterators, streaming merge sort if you want the results back in order
>>> or paginated, etc. Also, all 2I queries are run via coverage, which
>>> ensures that the entire keyspace is "covered" but hits most if not all
>>> nodes in the cluster.
>>>
>>> On Thu, Aug 28, 2014 at 9:18 PM, Bryan <bryan at go-factory.net> wrote:
>>>> Hi Everyone,
>>>>
>>>> Apologies as this has probably been asked before. Unfortunately I have not
>>>> been able to parse through the list serve to find a reasonable answer and
>>>> the Basho wiki docs seem to be missing this information. I have read up on
>>>> the secondary index docs.
>>>>
>>>> I am interested to better understand how the secondary indexes perform when
>>>> there is a very low distribution of values that are indexed. For example,
>>>> lets say I have a bucket with 1 million objects that I create a secondary
>>>> index on. Now lets say the index is on a value that has an uneven
>>>> distribution where one of the values is not selective while the others are,
>>>> such that 60% of the values fall into a single indexed value, while the
>>>> remaining 40% have a good distribution.
>>>>
>>>> For example, I have a record (i.e. object) where the indexed field is
>>>> ‘foobar_bin'. I have 1 million objects in the bucket that have 100 unique
>>>> ‘foobar’ values distributed over the 1 million objects. One of the values
>>>> repeats for 60% of the records (600K) and the rest have an even distribution
>>>> of about 4%.
>>>>
>>>> How will the secondary indexes perform with this and is this an appropriate
>>>> use of the secondary indexes? Finally, what I have read is not completely
>>>> clear on what happens if the indexed value is updated when the value has
>>>> such a low degree of selectivity?
>>>>
>>>> We have less than 512 partitions and are using the erlang client.
>>>>
>>>> Thanks in advance - any insights will be much appreciated!
>>>>
>>>>
>>>> Cheers,
>>>> Bryan
>>>>
>>>> ----
>>>>
>>>> Bryan Hughes
>>>> Go Factory
>>>> http://www.go-factory.net
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users at lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>
>>>
>>>
>>> --
>>> Sean Cribbs <sean at basho.com>
>>> Software Engineer
>>> Basho Technologies, Inc.
>>> http://basho.com/
>>
>>
>>
>> --
>> Sean Cribbs <sean at basho.com>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://basho.com/
>



-- 
Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.
http://basho.com/




More information about the riak-users mailing list