Follow-up: Riak / Map, Reduce - error [preflist_exhausted]

Bryan Fink bryan at basho.com
Tue Jun 12 13:17:19 EDT 2012


On Tue, Jun 5, 2012 at 9:51 AM,  <claudef at br.ibm.com> wrote:
> Any idea how to adjust the processing capacity of the Riak JavaScript in the
> Map / Reduce process to avoid the error [preflist_exhausted] ?
> Checking the user posts, I also see a similar error report sent at 29/05 by
> Mr. Sati, Hohit.

Hello, Claude and Mohit.  I apologize for the delay in this response.
Hopefully I can help you sort out this JS VM vs. preflist_exhausted
situation.

The number of JS VMs you will need in the map pool to be sure of
*never* having contention produce preflist_exhausted is:

     (number of expected concurrent MR queries)
   * (number of map phases in each query)
   * (number of partitions a node has claimed)

The reason is that each node might start up a map worker for each
partition it has claimed, for each map phase, in each query.

The size needed for the reduce JS VM pool, if you're not using
pre-reduce, is:

     (number of concurrent MR queries)
   * (number of reduce phases in each query)

This doesn't depend on partition claim count, because reduce is
performed by exactly one worker.  If you *are* using pre-reduce, you
have to add a factor for that distribution:

     (number of concurrent MR queries)
   * (number of reduce phases in each query)
   * (1 + number of partitions a node has claimed)

So, as an example, if you're doing only one MR query at a time, with
just one map phase, on a 4-node, 64-partition cluster (so each node
has 16 partitions claimed), you will need 1 * 1 * 16 = 16 JS VMs in
the map pool to be assured of never seeing preflist_exhausted.  If
you're doing 100 MR queries concurrently, each with three map phases,
on a single node claiming 64 partitions, you will need 100 * 3 * 64 =
19200 JS VMs in the map pool to be assured of never seeing
preflist_exhausted.

That large second example sounds crazy, and it is.  Most of the time,
the contention should not be bad enough that every single map-phase
worker process will need its own JS VM to work with.  If each
computation they're doing is quick, a single JS VM should be able to
be picked up and released by many workers in succession.

These numbers are upper bounds to guarantee that there is never a
dearth of JS VMs.  A lower number should be able to be found via
monitoring and experimentation.  There is also a hard upper bound you
should never have to cross for either pool, which is:

      (the configured value of riak_pipe worker_limit)
    * (number of partitions a node has claimed)

Riak Pipe will refuse to start more workers than the configured limit,
and thus there should never be more than that number of JS VMs in use.
The default worker_limit (set in the riak_pipe section of your
app.config) is 50.

Unfortunately, I think this situation may be complicated by a similar
behavior as is found in https://github.com/basho/riak_kv/issues/290.
That is, while a worker is waiting on a JS VM to free up, its work
queue is filling up.  If any worker times out while waiting on a JS
VM, it will try to forward that input to one of its peer workers, in
hopes that it will succeed in finding a free JS VM.  If that peer has
already been waiting long enough for its input queue to fill up, Riak
Pipe will fail the forwarding operation and raise an error.  Methods
for tweaking those queue sizes are not exposed in the MapReduce
interface.

Of course, none of this precludes some other bug causing trouble, but
if each of you could check your JS VM settings against the formulas
above, I'd be interested in hearing what you find.

-Bryan




More information about the riak-users mailing list