forked riaksearch for better perf

Gary William Flake gary at flake.org
Fri May 13 12:09:04 EDT 2011


Our team has a fork of the riaksearch source that we believe adds a very simple but highly desirable improvement.  We've reached out to Basho directly with the idea but, as they say, code talks.  The fork can be found at https://github.com/gpascale/riak_search and we will be issuing a pull request today for it.  Moreover, we have a single line fork of riak-js as well located at https://github.com/gpascale/riak-js.

In our use case, we need to access documents from a search index in time based order (not relevance).  Moreover, we need to paginate over these results.  As some of you know, bug 867 - https://issues.basho.com/show_bug.cgi?id=867 - prevents sort, start, and rows from being correctly applied.  The current blessed work around is to write a M/R job to handle it.  We found, however, that the degradation in performance for having to do a M/R for each query was problematic.

For very typical scenarios where you need to jump to a slice of results that are ordered by something besides relevance, we are seeing a 10-100x performance gain per query within our fork.  We believe the technique we've implemented can be used in several other examples.  For example:

1. You need a bucket to act like an enormous container or queue.  With our fork you can quickly ask for the most (or least) recent N items, process them, then remove them.

2. Your data is hierarchical (like employees in a company) and you need search results to be ordered by seniority within the hierarchy. 

3. You are building a twitter clone and you want search of tweets to show the most recent.

4. Your data has a "natural" order such as price or amount that is always preferred to TFIDF based relevance.

If you have one of these scenarios, I would encourage you to read further.  In the rest of this note I'll explain (A) why bug 867 is a show stopper for you, (B) a representation hack that you can use to speed up your M/R calls if you want to work around this with the current builds of riaksearch, and (C) show you how to use our fork to see the most significant performance gains.

(A). Bug 867

Suppose you have a query that will normally return 300 results.  You desire the results to be ordered by a key within your results.  Moreover, you want the middle 100 results after the sort has been applied.  The semantics of the SOLR API for riaksearch are such that start=100&rows=100&sort=KEY is supposed to do the job.

Bug 867 prevents this from working because riaksearch performs the slice and sort in the wrong order, yielding you 100 results, but the 100 that are in the middle of the results if they are sorted by relevance and THEN reordered by KEY afterwards.

Fixing this bug is non-trivial because the most straightforward implementation requires one to retrieve all of the documents in order to perform the sort by KEY, whereas the relevance sort only needs the information in the the index.

The work around for this issue is to do something like the following (shown in riak-js):

> db.addSearch(bucketName, query)
>     .map('Riak.mapValuesJson')
>     .reduce('Riak.reduceSort', 'function(a,b){return a.key-b.key;}')
>     .reduce('Riak.reduceSlice', [start, start + count])
>     .run(callback);

Here, you are seeding a M/R phase with the search results, pulling out the records for all results, doing the sort as a reduce with your own comparison function, then doing the slice.

For a query that returns a few thousand results, you can expect such a M/R job to take a couple of seconds on typical hardware, which is hardly acceptable for interactive applications.


(B). A representation hack

The M/R phase above suffers in that it has to retrieve the documents for all valid results, not just the ones that are part of the final sliced result set.

We can improve upon this by putting our sort key as a prefix to the document ID.  Care must be taken to properly pad the keys so that they are of fixed length.  Moreover, you'll want your synthetic document keys to have something unique as well.  We set our document iDs to be the time stamp of object creation (properly padded) plus a GUID.  Thus, each document is unique, but sorts over keys do what we want.  With this change, we can rewrite the above M/R to:

> db.addSearch(bucketName, query)
>     .map('function (v) { return [v.key]; }')
>     .reduce('Riak.reduceSort', 'function(a,b){return (a)-(b)}')
>     .reduce('Riak.reduceSlice', [start, start + count])
>     .run(callback);


The difference is that our new map() does not pull out all of the documents.  Instead, this M/R completes with with a list of documents that are in the final result (which we may still need to retrieve for our application).

We typically see 3-10x performance gains over the previous method, but this is highly dependent on typical result sizes as well as typical document sizes.


(C).  Our fork of Riak search.

The change we've introduced today takes the idea behind these synthetic keys, but puts all of the work back into the search core, thereby eliminating the M/R.

Where the search core performs a sort by relevance (prior to doing the slice specified by the SOLR arguments), we replace this with a sort by key.  That's the entire essence of the change.

By avoiding the M/R in it's entirety, we now often see performance gains of 3-10x over the previous method and, therefore, 10-100x gains over the most simplistic approach described first.

We've also introduced an option called "presort" which can be be set to "key" or "score".  The former triggers our new behavior.  The latter preserves the original behavior.  Leaving the option unspecified is the same as setting it to "score", so our fork should not introduce any breaking changes.


Anyhow, these have turned out to be essential changes for us, which is why we wanted to make them available to others.

Best,
-- GWF

Founder, Clipboard, Inc.
 




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110513/b0f76621/attachment.html>


More information about the riak-users mailing list