Bad MapReduce job brings the Riak to a screeching halt?

Brad Heller brad at cloudability.com
Thu Aug 30 18:57:43 EDT 2012


Hey Kelly, Bryan,

Thanks for the replies. Good to hear this is being worked on! And sorry I didn't elaborate on "crashed." In this instance, "crashed" meant "stopped taking connections on the HTTP interface." I didn't check whether the Beam processes died (I think they did, as load decreased).

I bumped my ulimit -n based on previous suggestions and that seemed to help. If/when I run into this again I will indeed post more details!

Thanks,
Brad

On Aug 30, 2012, at 3:16 PM, Kelly McLaughlin <kelly at basho.com> wrote:

> 
> On Aug 29, 2012, at 9:07 PM, Brad Heller <brad at cloudability.com> wrote:
>> 
>> So my question is: Why did this completely kill Riak? This makes me pretty nervous--a bug in our app has the potential to bring down the ring! Is there anything we can do to protect against this?
>> 
> 
> Riak 1.2 had a lot of changes to leveldb, and one of those was a change to using flock() instead of fcntl(F_SETLK) to try to make the locking a bit saner. Previously, using fcntl(), multiple processes in the Erlang VM could get a lock on the same leveldb instance, and this could obviously lead to some conflicts. However, a result of the change to flock() is that when the vnode crashes, the resources can still be locked by the previous process, and this results in this message:
> 
> 	2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]
> 
> Currently we do not attempt to wait or retry the vnode restart, and this can cause the node to crash. I can understand you being a little nervous, but we are aware of this and are taking steps on two fronts to address it. First, as Bryan mentioned previously, we're looking at fixing the error conditions that cause the vnode to crash when they really should not. Second, we're looking at a way to add some retry logic for when the vnode does crash and the resources are still locked. Thanks for the report!
> 
> Kelly
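
For anyone reading this thread later: the locking difference Kelly describes can be reproduced outside Riak. The following is only a minimal sketch in Python, not Riak or leveldb code. It assumes a Linux-like system with a local filesystem, and the file names and helper functions are made up for illustration. It opens the same file twice inside one OS process and tries to lock it through both descriptors, once with fcntl()-style record locks and once with flock().

```python
import errno
import fcntl
import os
import tempfile


def try_lock(fd, use_flock):
    """Attempt a non-blocking exclusive lock; return None on success, else the errno."""
    try:
        if use_flock:
            # BSD-style lock, tied to the open file description.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        else:
            # POSIX record lock (an fcntl() call underneath), owned by the OS process.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return None
    except OSError as exc:
        return exc.errno


def compare_lock_semantics():
    """Open the same file twice in one OS process and lock it through both descriptors."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, use_flock in (("fcntl", False), ("flock", True)):
            # Hypothetical lock file; one per locking style so the two do not interact.
            path = os.path.join(tmp, name + ".LOCK")
            first = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            second = os.open(path, os.O_RDWR)
            try:
                results[name] = (try_lock(first, use_flock), try_lock(second, use_flock))
            finally:
                os.close(first)
                os.close(second)
    return results


if __name__ == "__main__":
    for name, (_, second) in compare_lock_semantics().items():
        outcome = "granted" if second is None else os.strerror(second)
        print(f"{name}: second lock attempt in the same process -> {outcome}")
```

With fcntl()-style locks, the second attempt is granted because the lock belongs to the OS process as a whole, which is how several Erlang processes inside one VM could all "hold" the same lock. With flock(), the second attempt is refused with EAGAIN/EWOULDBLOCK, which is the errno behind the "Resource temporarily unavailable" text in the log line above.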

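The retry idea Kelly mentions could look roughly like the sketch below. This is not the fix Basho shipped or planned, only an illustration of "wait and retry while the lock is still held" in Python. The function name, the parameters, and their defaults are all assumptions.

```python
import errno
import fcntl
import os
import time


def acquire_lock_with_retry(path, attempts=5, initial_delay=0.05, backoff=2.0):
    """Open `path` and take an exclusive flock(), retrying while someone else holds it.

    Returns the locked file descriptor. Re-raises the last OSError once the
    attempts are exhausted, or immediately for any error other than "lock busy".
    All names and defaults here are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            busy = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
            if not busy or attempt == attempts:
                os.close(fd)
                raise
            # The previous holder may still be shutting down; wait, then try again.
            time.sleep(delay)
            delay *= backoff
```

A caller would keep the returned descriptor open for as long as it needs the lock and close it to release. The exponential backoff is one design choice among several; a fixed delay or a deadline-based loop would serve the same purpose.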



