"Failed to compact" in RiakSearch

Rusty Klophaus rusty at basho.com
Fri Apr 15 10:40:20 EDT 2011


Hi Morten,

It looks like at least one of the offending fields is "photos_value", which,
as you said, should be skipped according to your schema. This makes me think
that either the schema isn't set correctly or one or more nodes are caching
an old copy of the schema.

Can you try running "search-cmd show-schema INDEXNAME" on each node to
verify that the schema is set correctly? Also, you can run "search-cmd
clear-schema-cache" to clear the schema cache across all nodes.
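
A dry-run sketch of that check, run from one admin host. The hostnames are the four that appear later in this thread (an assumption), and INDEXNAME stays a placeholder; the leading "echo" makes this print the commands instead of executing them — remove it to run them for real:

```shell
# Print the schema-check command for each node (dry run).
for srv in nosql1 nosql2 nosql4 nosql5; do
  echo ssh "$srv" search-cmd show-schema INDEXNAME
done
# Then, from any one node, clear the cache cluster-wide:
echo search-cmd clear-schema-cache
```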

Also, the number of segments is *way* higher than it should be, and this is
the reason for the "too many DB tables" error. These problems appear to be
linked: the compaction errors are preventing the system from combining
segments, which leads to the large number of segments, so solving one
problem should solve the other.
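
One way to confirm the fix is working is to watch the segment count on an affected node; it should fall once compaction resumes. A minimal sketch, assuming the data path quoted later in this thread:

```shell
# Count merge_index segment files on this node.
# DATAPATH is an assumption; substitute the node's real Riak data directory.
DATAPATH=/var/lib/riaksearch
find "$DATAPATH"/merge_index/*/*.data 2>/dev/null | wc -l
```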

Best,
Rusty

On Fri, Apr 15, 2011 at 4:27 AM, Morten Siebuhr <sbhr+lists at sbhr.dk> wrote:

> Hi Rusty,
>
> On Thu, Apr 14, 2011 at 8:00 PM, Rusty Klophaus <rusty at basho.com> wrote:
> > Hi Morten,
> > Thanks for sending the log files. I was able to figure out, at least
> > partially, what's going on here.
>
> Fantastic - thanks!
>
> > The "Failed to compact" message is a result of trying to index a token
> > that's greater than 32kb in size. (The index storage engine, called
> > merge_index, assumes token sizes smaller than 32kb.) I was able to
> > decode part of the term in question by pulling data from the log file,
> > and it looks like you may be indexing HTML with base64-encoded inline
> > images, i.e. <img src="data:image/jpeg;base64,iVBORw0KG..."> The inline
> > image is being treated as a single token, and it's greater than 32kb.
>
> That's odd - in the search schema, I asked it to ignore everything
> besides a few specific fields:
>
> {
>     schema,
>     [
>         {version, "0.1"},
>         {default_field, "_owner"},
>         {n_val, 1}
>     ],
>     [
>         %% Don't parse _id and _owner, just treat it as single token
>         {field, [
>             {name, "id"},
>             {required, true},
>             {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>         ]},
>         {field, [
>             {name, "_owner"},
>             {required, true},
>             {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>         ]},
>
>         %% Parse Name fields for full-text indexing
>         {field, [
>             {name, "displayName"},
>             {aliases, ["nickname", "preferredUsername", "name_formatted", "name_displayName"]},
>             {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>         ]},
>
>         {field, [
>             {name, "emails_value"},
>             {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>         ]},
>
>         %% Add modification dates
>         {field, [
>             {name, "published"},
>             {aliases, ["updated"]},
>             {type, date}
>         ]},
>
>         %% Skip all else...
>         {dynamic_field, [
>             {name, "*"},
>             {skip, true}
>         ]}
>     ]
> }.
>
> (We're indexing Portable Contacts, where the user images reside in a
> 'image'-field.)
>
> > The short term workaround is to either:
> > 1) Preprocess your data to avoid this situation.
> > 2) Or, create a custom analyzer that limits the size of terms.
> > (See http://wiki.basho.com/Riak-Search---Schema.html for more information
> > about analyzers and custom analyzers.)
> > The long term solution is for us to increase the maximum token size in
> > merge_index. I've filed a bugzilla issue for this, trackable here:
> > https://issues.basho.com/show_bug.cgi?id=1069
> > Still investigating the "Too many db tables" error. This is being caused
> > by the system opening too many ETS tables. It *may* be related to the
> > compaction error described above, but I'm not sure.
> > Search (specifically merge_index) uses ETS tables heavily, and the number
> > of tables is affected by a few different factors. Can you send me some
> > more information to help debug, specifically:
> >
> > How many partitions (vnodes) are in your cluster? (If you haven't changed
> > any settings, then the default is 64.)
>
> It's 64 (no defaults changed at all).
>
> > How many machines are in your cluster?
>
> Four.
>
> > How many segments are on the node where you are seeing these errors?
> > (Run: "find DATAPATH/merge_index/*/*.data | wc -l", replacing DATAPATH
> > with the path to your Riak data directory for that node.)
>
> foreach srv ( nosql1 nosql2 nosql4 nosql5 )
>     echo -n "$srv "; ssh $srv sh -c 'find /var/lib/riaksearch/merge_index/*/*.data | wc -l'
> end
> nosql1 32434
> nosql2 14170
> nosql4 15480
> nosql5 13501
>
> (nosql1 is the one the error log is lifted from - but the errors
> seemed to come from all of the servers.)
>
> > Approximately how much data are you loading (# Docs and # MB), and how
> > quickly are you trying to load it?
>
> ~17m records, weighing in just shy of four GB.
>
> While I didn't do the loading, I believe we did it with 25 concurrent
> threads, using the four machines in round-robin fashion.
>
> /Siebuhr
>
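
Workaround (1) above, preprocessing the data, could look roughly like the following sketch: strip inline base64 data URIs out of the HTML before indexing, so no single token can exceed merge_index's 32kb limit. The regex and file names are assumptions for illustration, not the thread's actual pipeline:

```shell
# Replace any base64 data URI in a src attribute with an empty string.
# input.html is a placeholder; feed your real documents through instead.
sed -E 's/src="data:[^"]*"/src=""/g' input.html > cleaned.html
```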


More information about the riak-users mailing list