riaksearch performace when numFound is high

Greg Pascale greg at clipboard.com
Fri Jul 15 16:01:21 EDT 2011


Ryan,

Thanks for the detailed reply.

We're a node.js shop and we're hitting riak via the riak-js library, which
AFAIK just uses the http client.

We haven't actually scaled out to the point where we're seeing huge #s of
results returned, but if we extrapolate from the numbers we are seeing, it's
clear that we'll start to hit these limits. We have a lot of #2 type queries
because we support full text search, and unfortunately our "text" tends to
be a lot longer than 140 characters, so I really don't think inlining it is
practical.

-- Greg
Clipboard <http://www.clipboard.com/> is hiring<http://www.clipboard.com/jobs>
!


On Fri, Jul 15, 2011 at 11:49 AM, Ryan Zezeski <rzezeski at basho.com> wrote:

> Greg,
>
> I'm curious to know how you are querying search (i.e. which client/API) and
> what your setup looks like (i.e. # of nodes, disk, network between them,
> etc).  How many of these #2 query types are you seeing and what's the
> average # of results being returned?  Is your search instance opened up to
> user queries?
>
> Ignoring any other inefficiencies in the Search code I'd say that there are
> two main points working against you in query #2
>
> 1.) Search uses a _global index_ [1] which means that, as Rusty would say,
> it's partitioned by term (as opposed to a _local index_ which would
> partition by document).  It has been found, in cases where the query
> contains less terms than the processor count, that a global index scales out
> better as it has better concurrency characteristics [1].  However, a local
> index can balance load better and in the case of #2 could possibly drop
> latency times.  When you perform that "text:guitar AND date:07/14/11" each
> term is queried by one, and only one, vnode.  These results are then
> combined by a coordinator (called a broker in [1]).  How much using a local
> index would change your observed latency is unknown to me, but it would have
> other effects that may offset any latency drops.
>
> 2.) Search is not limiting the results returned.  Instead of sorting the
> results based on relevance and only taking the top N from each query term
> it's getting all data from each query, passing it to the coordinator and
> saying "hey, you figure this out."  When you run that #2 query, as you said,
> you are bounded by your largest result set because that's the only way for
> search to know it's correctly answered the query.  Said another way, you may
> only have 10 entries for the 14th but you still need to search all 100K
> guitar entries to determine which ones fall on that date.  Performing some
> sort of relevance calculation beforehand and limiting the result
> would certainly help but it also means you won't get the full result set.  I
> don't think there's any way to have your cake and eat it too unless you can
> afford to put everything in memory.
>
> At this point in time I think inline fields will be your best bet.  Inline
> fields help dramatically here because you are essentially bringing part of
> your document into each term entry in the index.  This means you can query
> on high cardinality terms and filter on lower ones, i.e. low latency (high
> and low being relative here).  If you're worried about disk space then you
> can make use of the `only` value for the inline attribute.  This tells
> Search to store this field's value inline but _don't_ index it.  If you're
> only using the `text` field to filter results then this is exactly what you
> should do.  In fact, I would recommend you do that because any search
> against the `text` field for a corpus of tweets is probably going to have a
> large result set.
>
>
> HTH,
> -Ryan
>
> [1]: C.S. Badue.  Distributed query processing using partitioned inverted
> files.  Master's thesis, Federal University of Minas Gerais, Belo Horizonte,
> Minas Gerias, Brazil, March 2001.
>
> On Thu, Jul 14, 2011 at 6:20 PM, Greg Pascale <greg at clipboard.com> wrote:
>
>> Hi Ryan,
>>
>> Yes we are using 14.2.
>>
>> I think I get what you are saying about inline fields. It looks like it
>> will fix some of our problems, but not all of them. If you'll indulge me in
>> a little contrived example, I think I can explain what I mean.
>>
>> Let's say I'm implementing a simple twitter clone. In my system, tweets
>> have a user, text, date and can be either public or private - so one might
>> look like
>>
>> {
>>     username: greg,
>>     public: true,
>>     text: "I bought a new guitar!",
>>     date: 07/14/11
>> }
>>
>> Let's say that my service is popular - I have 100k users, each of whom has
>> published exactly 100 tweets, and exactly half of the tweets are public, the
>> rest are private. So I have 10 million tweets total and 5 million of them
>> are public.
>>
>> Query 1: "username:greg AND public:false" - this finds all tweets by greg
>> that are private. In this case "username:greg" matches 100 results and
>> "public:false" matches 5 million. My tests have shown that the performance
>> of this compound query will be roughly equivalent to the worse of the two
>> ("public:false") - so this query will be way too slow.
>> However, since "public" values are nice and small, I can inline it for a
>> big win.
>>
>> Query 2: "text:guitar AND date:07/14/11" - this finds all tweets from
>> today that contain "guitar". Suppose there are 100k tweets that contain
>> "guitar", but it's early in the day so there have only been 1k tweets in
>> total on 07/14/11. My result set is therefore no bigger than 1k (probably
>> much smaller unless all my users bought new guitars today), but this query
>> is still bounded by the "text:guitar" piece which matches 100k results. In
>> this case, inlining date wouldn't help because it's not the slow part, and
>> indexing on the text field isn't practical - it's too big.
>>
>> If you could follow that, am I understanding this correctly? In our
>> system, we have some queries like #1 above, but since we support full text
>> search, many like #2 as well. Do you have any suggestions for what we could
>> do to make queries like #2 performant?
>>
>> Thanks,
>> -Greg
>>
>>
>>
>> On Wed, Jul 13, 2011 at 7:19 PM, Ryan Zezeski <rzezeski at basho.com> wrote:
>>
>>> Greg,
>>>
>>> I'm assuming you are using 14.2.
>>>
>>> 1) There is a bug in 14.2 that will cause a (potentially very fast
>>> growing) memory leak when using AND.  This is unfortunate, sorry.  The good
>>> news I have since patched it [1].
>>>
>>> 2) This is your best course of action, and you were so close but you've
>>> actually crossed your fields.  That is, the inline field should be the one
>>> that contains the more common term (i.e. the 'text' field).  So you should
>>> perform a range query on your date with a filter on the text inline field.
>>>  Obviously, the more terms in this field the more the index will inflate
>>> (space-wise), but if you can live with that then it should reduce your
>>> latency substantially (famous last words).  Please try this and get back to
>>> me.
>>>
>>> 3) That is a very well written article, props to the author.  However, I
>>> would leave this as a last resort.  Try what I mentioned in #2, and if
>>> that's not enough to get you by then let's brainstorm.
>>>
>>> [1]:
>>> https://github.com/basho/riak_search/commit/cd910c2519f94e9d7e8a8e21894db9d0eecdd5b4
>>>
>>>
>>> On Wed, Jul 6, 2011 at 2:43 PM, Greg Pascale <greg at clipboard.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm looking at ways to improve riaksearch queries that produce a lot of
>>>> matches.
>>>>
>>>> In my use case, I only ever want the top 20 results for any query, and
>>>> results should be ordered by date (which is encoded in the key). For
>>>> searches with few matches (numFound < ~1000), performance is great. For
>>>> searches with more matches (numFound > ~10000), performance starts to lag
>>>> even though I only ever want the top 20. I assume this is because the system
>>>> needs to fetch and sort all of the results to know what the top 20 are, but
>>>> I'm hoping I can exploit the constraints of my use case in some way to
>>>> increase performance. I've looked at the following approaches.
>>>>
>>>> 1) AND the "text:" term with a small date range (e.g. text:<common word>
>>>> AND date:[<yesterday to today>]). This reduces the result set, but
>>>> performance does not improve. At best, the performance is as good as simply
>>>> doing the "text:<common word>" search without the date range, and in some
>>>> cases worse.
>>>>
>>>> 2) Same as above, but make the date an inline field. From what I could
>>>> find on the topic, it sounded like this is exactly what inline fields or
>>>> for, but I was disappointed to discover it performed far worse than even the
>>>> compound query above.
>>>>
>>>> 3) In this article <http://blog.inagist.com/searching-with-riaksearch>,
>>>> which I was linked to from somewhere on the basho site, the author describes
>>>> a technique in which he calls search_fold directly and stops after he's
>>>> received enough results. He claims this is possible in his case because
>>>> results are returned in key order, and he's chosen his keys to match the
>>>> desired ordering of his results. My keys have the same property, as I'm
>>>> already using the presort=key option. Is this behavior of search_fold a
>>>> lucky side-effect, or is this actually guaranteed to work?
>>>>
>>>> Am I simply expecting too much of riaksearch here, or is there a way to
>>>> make this work? If all else fails, I suppose I could divide my data into
>>>> more buckets, but I'm hoping to avoid that as it would make querying much
>>>> more complex.
>>>>
>>>> Thanks,
>>>> -Greg
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> riak-users at lists.basho.com
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.basho.com/pipermail/riak-users_lists.basho.com/attachments/20110715/810ad4ad/attachment.html>


More information about the riak-users mailing list