Solr search performance

sean mcevoy sean.mcevoy at
Wed Sep 21 10:48:04 EDT 2016

Hi Fred,

Thanks for the pointer! 'cursorMark' is indeed much more performant,
though it turns out it doesn't suit our use case.

I've written a loop function using OTP's httpc that reads each page, gets
the cursorMark and repeats, and it returns all 147 pages with consistent
times in the 40-60ms bracket which is an excellent improvement!
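For the record, the loop looks roughly like this (a sketch, not our exact code: the endpoint, the index name "my_index" and the JSON handling are stand-ins):

```erlang
%% Sketch of the httpc cursorMark loop. Assumes inets is started
%% (application:ensure_all_started(inets)) and Riak's Solr HTTP search
%% endpoint is on localhost:8098. Note that cursorMark requires the
%% sort to include the unique key (_yz_id in Yokozuna) as a
%% tie-breaker, and the first request passes cursorMark=*.
-module(cursor_walk).
-export([fetch_all/0]).

fetch_all() ->
    fetch_pages(<<"*">>, []).

fetch_pages(Mark, Acc) ->
    Url = "http://localhost:8098/search/query/my_index"
          "?q=sub_category_id_s:test_1+AND+status_s:active"
          "&sort=first_name_s+asc,last_name_s+asc,_yz_id+asc"
          "&rows=100&wt=json&cursorMark=" ++ binary_to_list(Mark),
    {ok, {{_, 200, _}, _Hdrs, Body}} = httpc:request(Url),
    case next_cursor_mark(Body) of
        Mark -> lists:reverse([Body | Acc]);  %% mark unchanged: done
        Next -> fetch_pages(Next, [Body | Acc])
    end.

next_cursor_mark(Body) ->
    %% OTP 27+ ships a stdlib json module; on earlier OTPs use jsx/jiffy
    Map = json:decode(unicode:characters_to_binary(Body)),
    maps:get(<<"nextCursorMark">>, Map).
```

Solr signals the end of the result set by returning the same cursorMark you sent, which is what terminates the recursion.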

I was going to ask about the effort involved in making the protocol
buffers client support this, but our GUI guys insist that they need
to request a page number, as sometimes they want to start in the middle of a
set of data.

So I'm almost back to square one.
Can you shed any light on the internal workings of Solr that produce the
slow-down in my original question?
I'm hoping I can find a way to restructure my index data without having to
change the higher-level APIs that I support.
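For what it's worth, my working theory on the slow-down (an assumption on my part, not something I've confirmed) is that with classic offset paging every queried partition has to collect and sort all documents up to the requested offset before the coordinator can merge them and discard everything below it:

```erlang
%% Hypothetical illustration of classic offset paging. For page N,
%% each queried partition collects and sorts roughly Start + Rows
%% documents and ships them to the coordinator, so the per-page cost
%% grows with the offset. cursorMark instead resumes from the last
%% sort key, keeping the cost of every page roughly constant.
page_params(PageNo, PageSize) ->
    Start = PageNo * PageSize,          %% work grows with this offset
    [{start, Start}, {rows, PageSize}].
```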


On Mon, Sep 19, 2016 at 10:00 PM, Fred Dushin <fdushin at> wrote:

> All great questions, Sean.
> A few things.  First off, for result sets that are that large, you are
> probably going to want to use Solr cursor marks [1], which are supported in
> the current version of Solr we ship.  Riak allows queries using cursor
> marks through the HTTP interface.  At present, it does not support cursors
> using the protobuf API, due to some internal limitations of the server-side
> protobuf library, but we do hope to fix that in the future.
> Secondly, we have found sorting with distributed queries to be far more
> performant using Solr 4.10.4.  Currently released versions of Riak use Solr
> 4.7, but as you can see on github [2], Solr 4.10.4 support has been merged
> into the develop-2.2 branch, and is in the pipeline for release.  I can't
> say when the next version of Riak carrying this Solr version will ship,
> because of indeterminacy around bug triage, but it should not be too long.
> I would start to look at using cursor marks and measure their relative
> performance in your scenario.  My guess is that you should see some
> improvement there.
> -Fred
> [1]
> [2] f64e19cef107d982082f5b95ed598da96fb419b0
> > On Sep 19, 2016, at 4:48 PM, sean mcevoy <sean.mcevoy at> wrote:
> >
> > Hi All,
> >
> > We have an index with ~548,000 entries, ~14,000 of which match one of
> our queries.
> > We read these in a paginated search and the first page (of 100 hits)
> returns quickly in ~70ms.
> > This response time seems to increase exponentially as we walk through
> the pages:
> > the 4th page takes ~200ms,
> > the 8th page takes ~1200ms
> > the 12th page takes ~2100ms
> > the 16th page takes ~6100ms
> > the 20th page takes ~24000ms
> >
> > And by the time we're searching for the 22nd page it regularly times out
> at the default 60 seconds.
> >
> > I have a good understanding of Riak KV internals but absolutely nothing
> of Lucene, which I think is what's most relevant here. If anyone in the know
> can point me towards any relevant resource or can explain what's happening
> I'd be much obliged :-)
> > As I would also be if anyone with experience of using Riak/Lucene can
> tell me:
> > - Is 500K a crazy number of entries to put into one index?
> > - Is 14K a crazy number of entries to expect to be returned?
> > - Are there any methods we can use to make the search time more constant
> across the full search?
> I read one blog post on inlining but it was a bit old and it wasn't obvious
> how to implement it using riakc_pb_socket calls.
> >
> > And out of curiosity, do we not traverse the full range of hits for each
> page? I naively thought that because I'm sorting the returned values we'd
> have to get them all first and then sort, but the response times suggests
> otherwise. Does Lucene store the data sorted by each field just in case a
> query asks for it? Or what other magic is going on?
> >
> >
> > For the technical details, we use the "_yz_default" schema and all the
> fields stored are strings:
> > - entry_id_s: unique within the DB, the aim of the query is to gather a
> list of these
> > - type_s: has one of 2 values
> > - sub_category_id_s: in the query described above all 14K hits will
> match on this, in the DB of ~500K entries there are ~43K different values
> for this field, with each category typically having 2-6 sub categories
> > - category_id_s: not matched in this query, in the DB of ~500K entries
> there are ~13K different values for this field
> > - status_s: has one of 2 values, in the query described above all hits
> will have the value "active"
> > - user_id_s: unique within the DB but not matched in this query
> > - first_name_s: almost unique within the DB, this query will sort by
> this field
> > - last_name_s: almost unique within the DB, this query will sort by this
> field
> >
> > This search query looks like:
> > <<"sub_category_id_s:test_1 AND status_s:active AND
> type_s:sub_category">>
> >
> > Our options parameter has the sort directive:
> > {sort, <<"first_name_s asc, last_name_s asc">>}
> >
> > The query was run on a 5-node cluster with n_val of 3.
> >
> > Thanks in advance for any pointers!
> > //Sean.
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users at
> >

More information about the riak-users mailing list