riaksearch performace when numFound is high
greg at clipboard.com
Fri Jul 15 16:01:21 EDT 2011
Thanks for the detailed reply.
We're a node.js shop and we're hitting riak via the riak-js library, which
AFAIK just uses the http client.
We haven't actually scaled out to the point where we're seeing huge #s of
results returned, but if we extrapolate from the numbers we are seeing, it's
clear that we'll start to hit these limits. We have a lot of #2 type queries
because we support full text search, and unfortunately our "text" tends to
be a lot longer than 140 characters, so I really don't think inlining it is
Clipboard <http://www.clipboard.com/> is hiring<http://www.clipboard.com/jobs>
On Fri, Jul 15, 2011 at 11:49 AM, Ryan Zezeski <rzezeski at basho.com> wrote:
> I'm curious to know how you are querying search (i.e. which client/API) and
> what your setup looks like (i.e. # of nodes, disk, network between them,
> etc). How many of these #2 query types are you seeing and what's the
> average # of results being returned? Is your search instance opened up to
> user queries?
> Ignoring any other inefficiencies in the Search code I'd say that there are
> two main points working against you in query #2
> 1.) Search uses a _global index_  which means that, as Rusty would say,
> it's partitioned by term (as opposed to a _local index_ which would
> partition by document). It has been found, in cases where the query
> contains less terms than the processor count, that a global index scales out
> better as it has better concurrency characteristics . However, a local
> index can balance load better and in the case of #2 could possibly drop
> latency times. When you perform that "text:guitar AND date:07/14/11" each
> term is queried by one, and only one, vnode. These results are then
> combined by a coordinator (called a broker in ). How much using a local
> index would change your observed latency is unknown to me, but it would have
> other effects that may offset any latency drops.
> 2.) Search is not limiting the results returned. Instead of sorting the
> results based on relevance and only taking the top N from each query term
> it's getting all data from each query, passing it to the coordinator and
> saying "hey, you figure this out." When you run that #2 query, as you said,
> you are bounded by your largest result set because that's the only way for
> search to know it's correctly answered the query. Said another way, you may
> only have 10 entries for the 14th but you still need to search all 100K
> guitar entries to determine which ones fall on that date. Performing some
> sort of relevance calculation beforehand and limiting the result
> would certainly help but it also means you won't get the full result set. I
> don't think there's any way to have your cake and eat it too unless you can
> afford to put everything in memory.
> At this point in time I think inline fields will be your best bet. Inline
> fields help dramatically here because you are essentially bringing part of
> your document into each term entry in the index. This means you can query
> on high cardinality terms and filter on lower ones, i.e. low latency (high
> and low being relative here). If you're worried about disk space then you
> can make use of the `only` value for the inline attribute. This tells
> Search to store this field's value inline but _don't_ index it. If you're
> only using the `text` field to filter results then this is exactly what you
> should do. In fact, I would recommend you do that because any search
> against the `text` field for a corpus of tweets is probably going to have a
> large result set.
> : C.S. Badue. Distributed query processing using partitioned inverted
> files. Master's thesis, Federal University of Minas Gerais, Belo Horizonte,
> Minas Gerias, Brazil, March 2001.
> On Thu, Jul 14, 2011 at 6:20 PM, Greg Pascale <greg at clipboard.com> wrote:
>> Hi Ryan,
>> Yes we are using 14.2.
>> I think I get what you are saying about inline fields. It looks like it
>> will fix some of our problems, but not all of them. If you'll indulge me in
>> a little contrived example, I think I can explain what I mean.
>> Let's say I'm implementing a simple twitter clone. In my system, tweets
>> have a user, text, date and can be either public or private - so one might
>> look like
>> username: greg,
>> public: true,
>> text: "I bought a new guitar!",
>> date: 07/14/11
>> Let's say that my service is popular - I have 100k users, each of whom has
>> published exactly 100 tweets, and exactly half of the tweets are public, the
>> rest are private. So I have 10 million tweets total and 5 million of them
>> are public.
>> Query 1: "username:greg AND public:false" - this finds all tweets by greg
>> that are private. In this case "username:greg" matches 100 results and
>> "public:false" matches 5 million. My tests have shown that the performance
>> of this compound query will be roughly equivalent to the worse of the two
>> ("public:false") - so this query will be way too slow.
>> However, since "public" values are nice and small, I can inline it for a
>> big win.
>> Query 2: "text:guitar AND date:07/14/11" - this finds all tweets from
>> today that contain "guitar". Suppose there are 100k tweets that contain
>> "guitar", but it's early in the day so there have only been 1k tweets in
>> total on 07/14/11. My result set is therefore no bigger than 1k (probably
>> much smaller unless all my users bought new guitars today), but this query
>> is still bounded by the "text:guitar" piece which matches 100k results. In
>> this case, inlining date wouldn't help because it's not the slow part, and
>> indexing on the text field isn't practical - it's too big.
>> If you could follow that, am I understanding this correctly? In our
>> system, we have some queries like #1 above, but since we support full text
>> search, many like #2 as well. Do you have any suggestions for what we could
>> do to make queries like #2 performant?
>> On Wed, Jul 13, 2011 at 7:19 PM, Ryan Zezeski <rzezeski at basho.com> wrote:
>>> I'm assuming you are using 14.2.
>>> 1) There is a bug in 14.2 that will cause a (potentially very fast
>>> growing) memory leak when using AND. This is unfortunate, sorry. The good
>>> news I have since patched it .
>>> 2) This is your best course of action, and you were so close but you've
>>> actually crossed your fields. That is, the inline field should be the one
>>> that contains the more common term (i.e. the 'text' field). So you should
>>> perform a range query on your date with a filter on the text inline field.
>>> Obviously, the more terms in this field the more the index will inflate
>>> (space-wise), but if you can live with that then it should reduce your
>>> latency substantially (famous last words). Please try this and get back to
>>> 3) That is a very well written article, props to the author. However, I
>>> would leave this as a last resort. Try what I mentioned in #2, and if
>>> that's not enough to get you by then let's brainstorm.
>>> On Wed, Jul 6, 2011 at 2:43 PM, Greg Pascale <greg at clipboard.com> wrote:
>>>> I'm looking at ways to improve riaksearch queries that produce a lot of
>>>> In my use case, I only ever want the top 20 results for any query, and
>>>> results should be ordered by date (which is encoded in the key). For
>>>> searches with few matches (numFound < ~1000), performance is great. For
>>>> searches with more matches (numFound > ~10000), performance starts to lag
>>>> even though I only ever want the top 20. I assume this is because the system
>>>> needs to fetch and sort all of the results to know what the top 20 are, but
>>>> I'm hoping I can exploit the constraints of my use case in some way to
>>>> increase performance. I've looked at the following approaches.
>>>> 1) AND the "text:" term with a small date range (e.g. text:<common word>
>>>> AND date:[<yesterday to today>]). This reduces the result set, but
>>>> performance does not improve. At best, the performance is as good as simply
>>>> doing the "text:<common word>" search without the date range, and in some
>>>> cases worse.
>>>> 2) Same as above, but make the date an inline field. From what I could
>>>> find on the topic, it sounded like this is exactly what inline fields or
>>>> for, but I was disappointed to discover it performed far worse than even the
>>>> compound query above.
>>>> 3) In this article <http://blog.inagist.com/searching-with-riaksearch>,
>>>> which I was linked to from somewhere on the basho site, the author describes
>>>> a technique in which he calls search_fold directly and stops after he's
>>>> received enough results. He claims this is possible in his case because
>>>> results are returned in key order, and he's chosen his keys to match the
>>>> desired ordering of his results. My keys have the same property, as I'm
>>>> already using the presort=key option. Is this behavior of search_fold a
>>>> lucky side-effect, or is this actually guaranteed to work?
>>>> Am I simply expecting too much of riaksearch here, or is there a way to
>>>> make this work? If all else fails, I suppose I could divide my data into
>>>> more buckets, but I'm hoping to avoid that as it would make querying much
>>>> more complex.
>>>> riak-users mailing list
>>>> riak-users at lists.basho.com
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the riak-users