Follow-up: Riak / Map, Reduce - error [preflist_exhausted]

Sati, Mohit Mohit.Sati at ask.com
Tue Jun 12 14:57:15 EDT 2012


Hi Bryan,

I was able to get around this problem more than 2 weeks ago. Initially I wasn't able to query data for more than 1 day. The business requirement was to query data for 1 month. I was using default configuration which is low. I tried with increasing the mapper count to 16,32,64,128,256 and 512. I started getting results at 512 for one month's data. The reducer count was increased to 384. I did not change any memory configuration.

I added more nodes to the cluster which also helped and I could go beyond one month.

Thanks for your help...

Regards
Mohit

 

-----Original Message-----
From: Bryan Fink [mailto:bryan at basho.com] 
Sent: Tuesday, June 12, 2012 10:17 AM
To: claudef at br.ibm.com; Sati, Mohit
Cc: Riak-Users
Subject: Re: Follow-up: Riak / Map, Reduce - error [preflist_exhausted]

On Tue, Jun 5, 2012 at 9:51 AM,  <claudef at br.ibm.com> wrote:
> Any idea how to adjust the processing capacity of the Riak JavaScript 
> in the Map / Reduce process to avoid the error [preflist_exhausted] ?
> Checking the user posts, I also see a similar error report sent at 
> 29/05 by Mr. Sati, Hohit.

Hello, Claude and Mohit.  I apologize for the delay in this response.
Hopefully I can help you sort out this JS VM vs. preflist_exhausted situation.

The number of JS VMs you will need in the map pool to be sure of
*never* having contention produce preflist_exhausted is:

     (number of expected concurrent MR queries)
   * (number of map phases in each query)
   * (number of partitions a node has claimed)

The reason is that each node might start up a map worker for each partition it has claimed, for each map phase, in each query.

The size needed for the reduce JS VM pool, if you're not using pre-reduce, is:

     (number of concurrent MR queries)
   * (number of reduce phases in each query)

This doesn't depend on partition claim count, because reduce is performed by exactly one worker.  If you *are* using pre-reduce, you have to add a factor for that distribution:

     (number of concurrent MR queries)
   * (number of reduce phases in each query)
   * (1 + number of partitions a node has claimed)

So, as an example, if you're doing only one MR query at a time, with just one map phase, on a 4-node, 64-partition cluster (so each node has 16 partitions claimed), you will need 1 * 1 * 16 = 16 JS VMs in the map pool to be assured of never seeing preflist_exhausted.  If you're doing 100 MR queries concurrently, each with three map phases, on a single node claiming 64 partitions, you will need 100 * 3 * 64 =
19200 JS VMs in the map pool to be assured of never seeing preflist_exhausted.

That large second example sounds crazy, and it is.  Most of the time, the contention should not be bad enough that every single map-phase worker process will need its own JS VM to work with.  If each computation they're doing is quick, a single JS VM should be able to be picked up and released by many workers in succession.

These numbers are upper bounds to guarantee that there is never a dearth of JS VMs.  A lower number should be able to be found via monitoring and experimentation.  There is also a hard upper bound you should never have to cross for either pool, which is:

      (the configured value of riak_pipe worker_limit)
    * (number of partitions a node has claimed)

Riak Pipe will refuse to start more workers than the configured limit, and thus there should never be more than that number of JS VMs in use.
The default worker_limit (set in the riak_pipe section of your
app.config) is 50.

Unfortunately, I think this situation may be complicated by a similar behavior as is found in https://github.com/basho/riak_kv/issues/290.
That is, while a worker is waiting on a JS VM to free up, its work queue is filling up.  If any worker times out while waiting on a JS VM, it will try to forward that input to one of its peer workers, in hopes that it will succeed in finding a free JS VM.  If that peer has already been waiting long enough for its input queue to fill up, Riak Pipe will fail the forwarding operation and raise an error.  Methods for tweaking those queue sizes are not exposed in the MapReduce interface.

Of course, none of this precludes some other bug causing trouble, but if each of you could check your JS VM settings against the formulas above, I'd be interested in hearing what you find.

-Bryan


More information about the riak-users mailing list