Riak crash on node restarts

Jeremy Raymond jeraymond at gmail.com
Thu Dec 1 17:47:45 EST 2011


I haven't had a chance to track down all the logs, ring state, etc. and reproduce this issue, but I've updated my deploy script to just reload the mapred module while Riak is running rather than bouncing the nodes.

I did, however, recently upgrade from Riak 1.0.0 to 1.0.2 and saw the same 3rd node in the cluster go down when I brought the first node down for the rolling upgrade. Has anyone else seen this type of thing happen?
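
For anyone scripting a rolling restart or upgrade like this, here is a minimal Python sketch of the "wait until riak-admin transfers reports nothing pending" step mentioned further down the thread. It is only an illustration under assumptions: the function names are made up, and the 'waiting to handoff' marker is my assumption about what the command prints for pending transfers, so check it against the output of your own Riak version.

```python
import subprocess
import time


def transfers_settled(output):
    """Return True when the given `riak-admin transfers` output shows nothing pending.

    Assumption: pending handoffs appear on lines containing
    'waiting to handoff'; adjust the marker for your Riak version.
    """
    return "waiting to handoff" not in output


def run_transfers():
    # Runs the real CLI; `riak-admin transfers` is the command named in the thread.
    result = subprocess.run(
        ["riak-admin", "transfers"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def wait_until_settled(fetch=run_transfers, timeout_s=600, interval_s=10,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until transfers settle; return False if timeout_s elapses first."""
    deadline = clock() + timeout_s
    while True:
        if transfers_settled(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A deploy script could call wait_until_settled() after each node comes back and stop the rollout (rather than touching the next node) when it returns False. The fetch, sleep, and clock parameters are there only so the loop can be exercised without a running cluster.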

On 2011-11-18, at 5:06 PM, Leonid Riaboshtan wrote:

> Well, it depends on your database size and on a large variety of other things. Our database is about 40 GB of pure data (n_val 3 on most of the data), and it usually takes 5-10 minutes for handoffs to complete on a 256-vnode ring. Handoff concurrency is set to 1, by the way. It seems strange to me for handoffs to run for several hours, but maybe I'm wrong.
> 
> Somewhat off-topic, sorry:
> About your way of reloading mapred Erlang scripts with a node restart: I don't think it's a good idea, because handoffs take a lot of cluster time, and starting a node after a crash is quite problematic under load too. So it would be really great to have a way to reload Erlang mapred code the way JavaScript mapred can be reloaded, with something like erlang_reload (there is js_reload; is there one for Erlang?). I'm using Riak on a production service, and when a node goes down it's better to keep it down until the load is gone and then safely bring it back up (Riak is really good at fault tolerance; you simply don't notice it).
> 
> On Fri, Nov 18, 2011 at 3:55 PM, Jeremy Raymond <jeraymond at gmail.com> wrote:
> Something else I tried, to give the cluster more time to settle, was to wait between node updates until riak-admin transfers reported no pending transfers. I've had cases where the transfers didn't complete even after at least a couple of hours of waiting. What would be a typical amount of time for pending transfers to complete?
> 
> --
> Jeremy
> 
> 
> 
> On Fri, Nov 18, 2011 at 6:48 AM, Jeremy Raymond <jeraymond at gmail.com> wrote:
> Hello,
> 
> I'll set up my deploy script to capture this information and send you the info off-list (probably sometime next week).
> 
> --
> Jeremy
> 
> On 2011-11-15, at 1:16 PM, Jon Meredith wrote:
> 
>> Hi Joel,
>> 
>> That's not a message I'd expect to see on a clean restart.  We'll need some more information to diagnose it.  Next time it crashes, could you provide the contents of your ring file (you can just grab the most recent one out of /var/lib/riak/ring - location may vary depending on your platform) and it would be very helpful if you could modify your deploy script to capture the file list for the leveldb directory on *all* of your nodes immediately before you bounce riak to do the update.   When it crashes, the console.log from all the nodes would also be useful.  If any of those files contain sensitive information, please contact me off list.
>> 
>> BR, Jon
>> 
>> On Tue, Nov 15, 2011 at 6:48 AM, Jeremy Raymond <jeraymond at gmail.com> wrote:
>> I'm using Riak 1.0.1 and I have a script that deploys updates to each of my 3 nodes to update the Erlang mapred modules. What I do is stop a node, deploy the new mapred modules, restart the node, wait for the riak_kv service to start, then move on to the next node. Sometimes when I do this, one of the nodes that is not the one currently being updated will go down. Each time this has happened so far, it's been the same node that goes down (the last one). I see this error in the logs:
>> 
>> [error] Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO error: /var/lib/riak/leveldb/913438523331814323877303020447676887284957839360/MANIFEST-000002: No such file or directory"}
>> 
>> If I manually restart the node, things go back to normal. Any ideas on what's going on? I've attached the error log.
>> 
>> --
>>  Jeremy
>> 
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 
>> 
>> 
>> 
>> -- 
>> Jon Meredith
>> Platform Engineering Manager
>> Basho Technologies, Inc.
>> jmeredith at basho.com
>> 
> 
