Bad MapReduce job brings the Riak to a screeching halt?

Kelly McLaughlin kelly at basho.com
Thu Aug 30 18:16:17 EDT 2012


On Aug 29, 2012, at 9:07 PM, Brad Heller <brad at cloudability.com> wrote:
> 
> So my question is: Why did this completely kill Riak? This makes me pretty nervous--a bug in our app has the potential to bring down the ring! Is there anything we can do to protect against this?
> 

Riak 1.2 included a lot of changes to leveldb, and one of them was a switch from fcntl(F_SETLK) to flock() to make the locking a bit saner. Previously, with fcntl, multiple processes in the Erlang VM could acquire a lock on the same leveldb instance, which could obviously lead to conflicts. However, a side effect of the change to flock() is that when a vnode crashes, its resources can still be locked by the previous process when the vnode restarts, and that results in this message:

	2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]
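
To illustrate the difference between the two locking styles described above, here is a minimal sketch in Python (not the leveldb code itself). It opens the same file twice inside one OS process and tries to take an exclusive lock through each descriptor. The helper name second_lock_succeeds and the temporary LOCK file are made up for the example. On Linux I would expect the fcntl-style lock to be granted both times, since those locks belong to the process, and the flock() lock to be refused the second time with the same "Resource temporarily unavailable" error shown in the log.

```python
import fcntl
import os
import tempfile


def second_lock_succeeds(lock_fn):
    """Open one file twice in this process and try to lock it through both fds.

    lock_fn is fcntl.lockf (fcntl-style record lock) or fcntl.flock.
    Returns True if the second, non-blocking lock attempt is granted.
    """
    # Hypothetical lock file in a throwaway directory, for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "LOCK")
    first = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    second = os.open(path, os.O_RDWR)
    try:
        lock_fn(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            lock_fn(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError:
            # EAGAIN / EWOULDBLOCK: the lock is already held elsewhere.
            return False
    finally:
        os.close(second)
        os.close(first)


fcntl_result = second_lock_succeeds(fcntl.lockf)
flock_result = second_lock_succeeds(fcntl.flock)
print("fcntl-style lock granted twice in one process:", fcntl_result)
print("flock() lock granted twice in one process:", flock_result)
```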

Currently we do not attempt to wait or retry the vnode restart, and this can cause the node to crash. I can understand you being a little nervous, but we are aware of this and are taking steps on two fronts to address it. First, as Bryan mentioned previously, we're looking at fixing the error conditions that crash the vnode when they really should not. Second, we're looking at a way to add some retry logic for the case where the vnode does crash and the resources are still locked. Thanks for the report!
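
As a rough idea of what retry logic around a held lock could look like, here is a small Python sketch. It is only an illustration under my own assumptions, not the fix planned for Riak: the function name lock_with_retry, the attempt count, and the linear backoff are all hypothetical. It retries a non-blocking flock() while the error is the "Resource temporarily unavailable" case and gives up after a fixed number of attempts.

```python
import errno
import fcntl
import os
import time


def lock_with_retry(path, attempts=5, delay=0.1, sleep=time.sleep):
    """Take an exclusive flock() on path, retrying while someone else holds it.

    Returns an open file descriptor that holds the lock. Re-raises the
    OSError if the lock is still held after the last attempt, or at once
    for any other kind of error.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    for attempt in range(1, attempts + 1):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as err:
            # EAGAIN/EWOULDBLOCK is "Resource temporarily unavailable".
            busy = err.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
            if not busy or attempt == attempts:
                os.close(fd)
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
```

The sleep parameter is only there so the wait can be swapped out when experimenting; a caller would normally just use lock_with_retry(path) and close the returned descriptor to release the lock.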

Kelly


