Riak Search analyzers

Rusty Klophaus rusty at basho.com
Mon Oct 18 14:14:06 EDT 2010


Hi Dmitry,

On Wed, Oct 13, 2010 at 11:11 PM, Dmitry Demeshchuk <demeshchuk at gmail.com> wrote:

> Greetings.
>
> I have a couple of questions regarding the analyzers, mainly the Java ones.
>
> 1. Which platform is preferable for use: OpenJDK or Sun's Java? Say, I
> won't have any uses for JVM so it will be used just for analyzers.
>

We have not seen any appreciable difference between the platforms; either
one should be fine. Search isn't relying on the JVM to do anything overly
complicated.


> 2. Could you please give a brief description of the difference between
> the analyzers?
>

Sure:

*com.basho.search.analysis.DefaultAnalyzerFactory* uses Lucene's
StandardTokenizer, filters out words shorter than 3 characters, converts
tokens to lower case, and filters out the stopwords listed in Lucene's
StopAnalyzer.java
(http://www.koders.com/java/fid5FBD7DCAFB544D74598A9B1D82A341CD648DA47F.aspx?s=java).

*com.basho.search.analysis.WhitespaceAnalyzerFactory* uses Lucene's
WhitespaceTokenizer.

*com.basho.search.analysis.IntegerAnalyzerFactory* parses the field as
integers and, by default, pads them to 10 places.
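
To illustrate why padding matters for an index that compares terms as
strings, here is a minimal Python sketch. It is not the Java implementation:
the helper name pad_integer is hypothetical, and I am assuming the padding is
done with leading zeros and ignoring negative numbers.

```python
def pad_integer(value, width=10):
    """Hypothetical helper: left-pad a non-negative integer with zeros.

    Assumption: padding uses leading zeros, so that string comparison of
    the padded terms agrees with numeric comparison of the values.
    """
    number = int(value)
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(number).zfill(width)


if __name__ == "__main__":
    for raw in ("7", "42", "1000"):
        print(raw, "->", pad_integer(raw))
```

Without padding, the string "1000" sorts before "42"; with both padded to the
same width, string order and numeric order agree, which is what range queries
on an integer field would need.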

*{erlang, text_analyzers, default_analyzer_factory}* treats words as runs of
the characters 0-9, a-z, or A-Z, filters out words shorter than 3 characters,
converts tokens to lower case, and filters out the same list of stopwords as
DefaultAnalyzerFactory.
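
As a rough illustration of the steps just described for the Erlang analyzer,
here is a Python sketch. It is not the actual Erlang code: the function name
analyze and the STOPWORDS constant are made up for the example, and the
stopword list is my recollection of Lucene's English stop set, so check it
against the StopAnalyzer.java source before relying on it.

```python
import re

# Assumption: approximates the English stop set in Lucene's StopAnalyzer.
STOPWORDS = frozenset([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
])


def analyze(text, min_length=3):
    """Hypothetical sketch of the analysis steps described above."""
    # 1. A word is a run of the characters 0-9, a-z, or A-Z.
    tokens = re.findall(r"[0-9a-zA-Z]+", text)
    # 2. Drop words shorter than min_length characters.
    tokens = [t for t in tokens if len(t) >= min_length]
    # 3. Convert the remaining tokens to lower case.
    tokens = [t.lower() for t in tokens]
    # 4. Drop stopwords.
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(analyze("The quick-brown Fox is at 42nd St"))
```

Note that punctuation such as the hyphen acts as a separator here, so
"quick-brown" becomes two tokens. A whitespace-only tokenizer would keep it
as one.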

Two things to note:
- You can create your own analyzers in Java or Erlang; see the source code
under apps/qilr/java_src.
- Due to a regression bug, field-level analyzer settings are not used when
running a query. Whatever default analyzer you set for the schema is used
for all fields.



> 3. I guess you have already made some benchmarks regarding the
> analyzers, haven't you?
>


We have run some rudimentary benchmarks, which show that Erlang analyzers
are currently faster than Java-based analyzers due to the communication
overhead. We will be working on this in future iterations.


> I remember that you are going to add a special page into wiki about
> the subject. Hope this will also help you to gather up the information
> a bit.
>

Absolutely, we will continue to update the wiki with more information about
Search going forward.

Hope that helps!

Best,
Rusty