Secondary Indexes - Feedback?

Nate Lawson nate at
Wed Nov 16 16:04:17 EST 2011

On Nov 16, 2011, at 11:41 AM, Rusty Klophaus wrote:

>> 2. We need a guaranteed order of inputs from a 2I query. If we select on a range, each key we get on a given node in the M-R job should be ordered according to the 2I values. Of course we understand that keys won't be ordered across nodes (that would require a reduce phase to merge sort) but it would be good for each node to know that its portion of the covering set is ordered.
> Can you elaborate on this? Trying to think of a problem where it helps to have the results ordered by node but not necessarily overall, and drawing a blank.

For each key returned, we check some other data in the value. Assume a 2I for "date created" on each key and the ability for each node in the M-R job to get an ordered list of keys by date range.

During M-R, each node sees an ordered subset of the total keys, but the subset is sparse (meaning keys are not necessarily contiguous by their date on a given node). Each node then walks through its keys in date order and retrieves the value for each. It does some operations on the value. If the value is found to not be a member of the desired relevant set, it would terminate its map phase without fetching values for other keys.

Example key subsets per node, showing 2I values in descending order:

Node 1: (8, 2, 1)
Node 2: (8, 7, 6, 5)
Node 3: (9, 2)

>  4. Need any of the nodes to be able to abort a M-R job in the middle with a successful completion status. Combined with the above, you could return partial results (say, top 10 matches) without having to generate all the keys in a given range.
> Pagination is one clear case where this would be useful. Are you facing any other scenarios where this is necessary?
> Also, are you always looking for top X results, where X is known before the query starts? Or does X change depending on the results?

It is like pagination, but X is not known a priori. Fetching and examining the value for each key determines if we have reached the end of that subset of the data. Once all nodes have reached the end of their sorted subset, the job completes with a reduce phase.

We can't yet do this in Riak so we're using something else for this. I think features 1-4 would be needed for us to do it in Riak.


More information about the riak-users mailing list