Riak crash on node restarts

Jeremy Raymond jeraymond at gmail.com
Fri Nov 18 21:08:10 EST 2011


My db is only a few gigs. You're right, it may be better to just load the
updated mapred module without bouncing the node, since we can reload an
individual module without bringing Riak down.

--
Jeremy

On Nov 18, 2011, at 5:06 PM, Leonid Riaboshtan <perfecthumanorama at gmail.com>
wrote:

Well, it really depends on your database size and on a large variety of
other things. Our database is about 40 GB of pure data (n_val 3 on most of
the data), and it usually takes 5-10 minutes for handoffs to complete on a
256-vnode ring. Handoff concurrency is set to 1, btw. I guess it's strange
for handoffs to go on for several hours, but maybe I'm wrong.

Somewhat off-topic, sorry:
About your way of reloading Erlang mapred modules with a node restart: I
guess it's not really a good idea, because handoffs take a lot of cluster
time. And actually, starting a node after a crash is quite problematic under
load too. So it would be really great to have a way to reload Erlang mapred
code the way JavaScript mapred code can be reloaded, with something like
erlang_reload (like js_reload; is there one for Erlang?). I'm using Riak on
a production service, and when a node goes down it's better to keep it down
until the load is gone and then safely bring it back up (Riak is really good
at fault tolerance; you simply don't notice it).

On Fri, Nov 18, 2011 at 3:55 PM, Jeremy Raymond <jeraymond at gmail.com> wrote:

> Something else I tried, to give the cluster more time to settle, was to
> wait until riak-admin transfers reported no pending transfers between
> updating nodes. I've had cases where the transfers didn't complete even
> after a couple of hours of waiting. What would be a typical amount of time
> for pending transfers to complete?
>
> --
> Jeremy
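
The wait-between-nodes step described above could be sketched in a deploy script roughly as follows. This is only a minimal polling helper under stated assumptions: `wait_until` and `check` are hypothetical names, and how `check` decides that `riak-admin transfers` reports nothing pending is left to the caller, since the thread does not show that command's output.

```python
import time


def wait_until(check, timeout_s, interval_s=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses.

    check is a caller-supplied callable (hypothetical here), e.g. one that
    runs `riak-admin transfers` and inspects the result.
    Returns True if check() succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A script could call `wait_until(no_pending_transfers, timeout_s=1800)` before moving to the next node and abort the rollout if it returns False, rather than continuing while handoffs are still running.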
>
>
>
> On Fri, Nov 18, 2011 at 6:48 AM, Jeremy Raymond <jeraymond at gmail.com> wrote:
>
>> Hello,
>>
>> I'll set up my deploy script to capture this information and send you the
>> info off-list (probably sometime next week).
>>
>> --
>> Jeremy
>>
>> On 2011-11-15, at 1:16 PM, Jon Meredith wrote:
>>
>> Hi Joel,
>>
>> That's not a message I'd expect to see on a clean restart.  We'll need
>> some more information to diagnose it.  Next time it crashes, could you
>> provide the contents of your ring file (you can just grab the most recent
>> one out of /var/lib/riak/ring - location may vary depending on your
>> platform) and it would be very helpful if you could modify your deploy
>> script to capture the file list for the leveldb directory on *all* of your
>> nodes immediately before you bounce riak to do the update.   When it
>> crashes, the console.log from all the nodes would also be useful.  If any
>> of those files contain sensitive information, please contact me off list.
>>
>> BR, Jon
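
One way a deploy script might capture the leveldb file list requested above, immediately before bouncing a node, is sketched here. It is an illustration only: `capture_file_list` and the output format are hypothetical, and the data directory has to be passed in because its location varies by platform.

```python
import os
import time


def capture_file_list(data_dir, out_path):
    """Write one 'size<TAB>relative/path' line per file under data_dir.

    Returns the sorted list of relative paths that were recorded.
    """
    entries = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full = os.path.join(root, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
            entries.append((os.path.relpath(full, data_dir), size))
    entries.sort()
    with open(out_path, "w") as out:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        out.write("# captured %s from %s\n" % (stamp, data_dir))
        for rel, size in entries:
            out.write("%d\t%s\n" % (size, rel))
    return [rel for rel, _size in entries]
```

Run on every node just before the stop step, for example `capture_file_list("/var/lib/riak/leveldb", "/tmp/leveldb-files.txt")`, this leaves a snapshot that can be compared against the path named in a later db_open error.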
>>
>> On Tue, Nov 15, 2011 at 6:48 AM, Jeremy Raymond <jeraymond at gmail.com> wrote:
>>
>>> I'm using Riak 1.0.1 and I have a script that deploys updates to each of
>>> my 3 nodes to update the Erlang mapred modules. What I do is stop a node,
>>> deploy the new mapred modules, restart the node, wait for the riak_kv
>>> service to start, then move on to the next node. Sometimes when I do this
>>> one of the nodes that is not the current one being updated will go down.
>>> Each time this has happened thus far it's been the same node that will go
>>> down (the last one). I see this error in the logs:
>>>
>>> [error] Failed to start riak_kv_eleveldb_backend Reason: {db_open,"IO
>>> error:
>>> /var/lib/riak/leveldb/913438523331814323877303020447676887284957839360/MANIFEST-000002:
>>> No such file or directory"}
>>>
>>> If I manually restart the node, things go back to normal. Any ideas on
>>> what's going on? I've attached the error log.
>>>
>>> --
>>>  Jeremy
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users at lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>>
>> --
>> Jon Meredith
>> Platform Engineering Manager
>> Basho Technologies, Inc.
>> jmeredith at basho.com
>>
>>
>>
>