"Failed to compact" in RiakSearch

Morten Siebuhr sbhr+lists at sbhr.dk
Fri Apr 15 04:27:30 EDT 2011


Hi Rusty,

On Thu, Apr 14, 2011 at 8:00 PM, Rusty Klophaus <rusty at basho.com> wrote:
> Hi Morten,
> Thanks for sending the log files. I was able to figure out, at least
> partially, what's going on here.

Fantastic - thanks!

> The "Failed to compact" message is a result of trying to index a token
> that's greater than 32kb in size. (The index storage engine, called
> merge_index, assumes token sizes smaller than 32kb.) I was able to decode
> part of the term in question by pulling data from the log file, and it looks
> like you may be indexing HTML with base64 encoded inline images, ie: <img
> src="data:image/jpeg;base64,iVBORw0KG..."> The inline image is being treated
> as a single token, and it's greater than 32kb.

That's odd - in the search schema, I asked it to ignore everything
besides a few specific fields:

{
	schema,
	[
		{version, "0.1"},
		{default_field, "_owner"},
		{n_val, 1}
	],
	[
		%% Don't parse id and _owner, just treat them as single tokens
		{field, [
				{name, "id"},
				{required, true},
				{analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
			]},
		{field, [
				{name, "_owner"},
				{required, true},
				{analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
			]},

		%% Parse Name fields for full-text indexing
		{field, [
				{name, "displayName"},
				{aliases, ["nickname", "preferredUsername", "name_formatted",
"name_displayName"]},
				{analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
			]},

		{field, [
				{name, "emails_value"},
				{analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
			]},

		%% Add modification dates
		{field, [
				{name, "published"},
				{aliases, ["updated"]},
				{type, date}
			]},

		%% Skip all else...
		{dynamic_field, [
				{name, "*"},
				{skip, true}
			]}
	]
}.

(We're indexing Portable Contacts data, where the user images reside in
an 'image' field.)
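
For what it's worth, here's roughly how I intend to double-check which
field actually carries the oversized value in one of the offending
documents - an untested sketch from an attached console, assuming the
stored documents are plain JSON objects (mochijson2 ships with Riak;
Body is the raw JSON binary of one document, and only top-level string
fields are inspected):

	{struct, Fields} = mochijson2:decode(Body),
	%% List the fields whose string values exceed the 32kb token limit.
	Oversized = [{Name, byte_size(Value)}
	             || {Name, Value} <- Fields,
	                is_binary(Value), byte_size(Value) > 32768].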

> The short term workaround is to either:
> 1) Preprocess your data to avoid this situation.
> 2) Or, create a custom analyzer that limits the size of terms
> (See http://wiki.basho.com/Riak-Search---Schema.html for more information
> about analyzers and custom analyzers.)
> The long term solution is for us to increase the maximum token size in
> merge_index. I've filed a bugzilla issue for this, trackable
> here: https://issues.basho.com/show_bug.cgi?id=1069
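
Thanks - a size-capping analyzer sounds like the way we'll go for now.
Is something along these lines what you have in mind? Completely
untested sketch; I'm assuming the factory gets called as
Fun(Text, Options) and returns {ok, Tokens} like the text_analyzers
functions do, and that any non-binary markers in the token list should
simply pass through:

	%% capped_analyzer.erl -- wrap the standard analyzer and drop any
	%% token larger than 32kb so it never reaches merge_index.
	-module(capped_analyzer).
	-export([capped_analyzer_factory/2]).

	-define(MAX_TOKEN, 32768).

	capped_analyzer_factory(Text, Options) ->
	    {ok, Tokens} = text_analyzers:standard_analyzer_factory(Text, Options),
	    Capped = [T || T <- Tokens,
	                   not is_binary(T) orelse byte_size(T) =< ?MAX_TOKEN],
	    {ok, Capped}.

I assume we'd then point the text fields at {erlang, capped_analyzer,
capped_analyzer_factory} and put the compiled module on each node's
code path.
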
> Still investigating the "Too many db tables" error. This is being caused by
> the system opening too many ETS tables. It *may* be related to the
> compaction error described above, but I'm not sure.
> Search (specifically merge_index) uses ETS tables heavily, and the number of
> tables is affected by a few different factors. Can you send me some more
> information to help debug, specifically:
>
> How many partitions (vnodes) are in your cluster? (If you haven't changed
> any settings, then the default is 64.)

It's 64 (no defaults changed at all).

> How many machines are in your cluster?

Four.

> How many segments are on the node where you are seeing these errors?
> (Run: "find DATAPATH/merge_index/*/*.data | wc -l", replacing DATAPATH with
> the path to your Riak data directory for that node.)

foreach srv ( nosql1 nosql2 nosql4 nosql5 )
	echo -n "$srv "; ssh $srv "find /var/lib/riaksearch/merge_index/*/*.data | wc -l"
end
nosql1 32434
nosql2 14170
nosql4 15480
nosql5 13501

(nosql1 is the one the error log is lifted from - but the errors
seemed to come from all of the servers.)
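
On the ETS side: as far as I understand, "Too many db tables" means we
are hitting the VM's ETS table limit, which (if I read the docs right)
is controlled by the ERL_MAX_ETS_TABLES environment variable - we
haven't touched it, so it should be whatever the default is. If it
helps, I can sample the live table count on nosql1 from an attached
console next time it happens:

	%% Number of ETS tables currently allocated on this node.
	length(ets:all()).

	%% Configured limit, if the environment variable is set
	%% (returns false when it is not, i.e. the VM default applies).
	os:getenv("ERL_MAX_ETS_TABLES").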

> Approximately how much data are you loading (# Docs and # MB), and how
> quickly are you trying to load it?

~17 million records, weighing in at just shy of 4 GB.

While I didn't do the loading myself, I believe it was done with 25
concurrent threads, spread across the four machines in round-robin fashion.

/Siebuhr