Severe problems when adding a new node

John Axel Eriksson john at insane.se
Wed Nov 9 04:50:54 EST 2011


To illustrate ONE problem we have (another problem is that the data returned is sometimes garbage):

john@app-001:~$ curl -I http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
HTTP/1.1 200 OK
Vary: Accept-Encoding
Transfer-Encoding: chunked
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Last-Modified: Mon, 20 Dec 2010 16:23:11 GMT
Date: Wed, 09 Nov 2011 09:28:00 GMT
Content-Type: application/postscript
Connection: close

OK, good: according to Riak, it exists.


john@app-001:~$ curl -O http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
  
Nothing was saved to disk.

john@app-001:~$ curl -i http://localhost:8098/luwak/a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1
HTTP/1.1 200 OK
Vary: Accept-Encoding
Transfer-Encoding: chunked
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Last-Modified: Mon, 20 Dec 2010 16:23:11 GMT
Date: Wed, 09 Nov 2011 09:34:37 GMT
Content-Type: application/postscript
Connection: close

john@app-001:~$

Just an empty response. Seriously, how does this happen? Repeating the request several times yields the same result, so there doesn't seem
to be any read repair going on.

Is there nothing we can do to get Riak into a consistent state again (other than going through all 40 000 files and trying to determine
which ones aren't there anymore or are just garbage…)?
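
A sweep like that could at least be scripted. The following is only a minimal Python sketch under stated assumptions: the list of Luwak keys is assumed to come from our own metadata, the base URL is the one used in the curl calls above, and the names fetch_body_length, classify and audit are made up for illustration. It only separates missing and empty responses from non-empty ones; it cannot tell whether a non-empty body is garbage.

```python
import http.client
import urllib.error
import urllib.request


def fetch_body_length(url, timeout=30):
    """GET url and return (status, body_bytes) without keeping the body in memory."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            total = 0
            while True:
                chunk = resp.read(65536)
                if not chunk:
                    break
                total += len(chunk)
            return resp.status, total
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 404 end up here.
        return err.code, 0
    except (http.client.HTTPException, OSError):
        # Broken chunked transfer, timeout, refused connection: status 0 as a marker.
        return 0, 0


def classify(status, length):
    """Map one response to 'ok', 'empty', 'missing' or 'error'."""
    if status == 404:
        return "missing"
    if status != 200:
        return "error"
    if length == 0:
        return "empty"
    return "ok"


def audit(base_url, keys, fetch=fetch_body_length):
    """Group keys by the classification of their GET response."""
    report = {}
    for key in keys:
        status, length = fetch(f"{base_url}/{key}")
        report.setdefault(classify(status, length), []).append(key)
    return report


if __name__ == "__main__":
    # Assumption: in practice the keys would be read from the application's metadata.
    keys = ["a5bbc21f0bcfcea4d51c4eedbc9ee5596b4cc6f1"]
    for state, found in audit("http://localhost:8098/luwak", keys).items():
        print(state, len(found))
```

The fetch function is passed in as a parameter so the grouping logic can be tried against canned responses before pointing it at a cluster that is already under stress.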

John


On 8 Nov 2011, at 11:35, John Axel Eriksson wrote:

> Thanks for the emails detailing this issue - private and to the list. I've got a question for the list on our situation:
> 
> As stated we did an upgrade from 0.14.2 to 1.0.1 and after that we added a new node to our cluster. This
> really messed things up and nodes started crashing. In the end I opted to remove the added node and after
> quite a short while things settled down. The cluster is responding again. What we see now are corrupted files.
> 
> We've tried to determine how many of them there are, but it's been difficult. What we know is that there ARE
> corrupted files (or at least files returned in an inconsistent state). I was wondering if there is anything we can do to get
> the cluster into a proper state again without having to manually delete everything that's corrupted. Is it possible that
> the data is actually there but not returned in a proper state by Riak? I think it's only the larger files stored in Luwak
> that have this problem.
> 
> John
> 
> 
> On 29 Oct 2011, at 01:03, John Axel Eriksson wrote:
> 
>> I've got the utmost respect for developers such as yourselves (Basho), and we've had great success using Riak - we have been using it
>> in production since 0.11. We've had our share of problems with it during this whole time, but none as big as this. I can't understand why
>> this wasn't posted somewhere using the blink tag and big red bold text. I mean, if I try to fsck a mounted disk in use in Linux I get:
>> 
>> "WARNING!!!  The filesystem is mounted.   If you continue you ***WILL***
>> cause ***SEVERE*** filesystem damage."
>> 
>> I understand why I don't get a warning like that when trying to run "riak-admin join riak@my.node.com" on Riak 1.0.1, but damage much like
>> what it describes does happen.
>> 
>> Having a bug such as this, without disclosing it more openly than an entry in a bug tracking system, goes against the whole idea of Riak
>> being an ops dream: a distributed, fault-tolerant system. I don't want to be afraid of adding nodes to my cluster, but that is the result of this bug and
>> the lack of communication about it. The 1.0.1 release should have been pulled, in my opinion.
>> 
>> To sum it up, this was a nightmare for us; I didn't get much sleep last night and I woke up in hell. All of that (corrupted data, downtime and lost customer
>> confidence) could have been avoided by better communication.
>> 
>> I don't want to be too hard on you fine people at Basho. You provide a really great system in Riak and I understand what you're aiming for, but if
>> anything as bad as this ever happens in the future, you might want to communicate it better and consider pulling the release.
>> 
>> Thanks,
>> John
>> 
>> 
>> On 28 Oct 2011, at 17:51, Kelly McLaughlin wrote:
>> 
>>> John,
>>> 
>>> It appears you've run into a race condition with adding and leaving nodes that's present in 1.0.1. The problem happens during handoff and can cause bitcask directories to be unexpectedly deleted. We have identified the issue and we are in the process of correcting it, testing, and generating a new point release containing the fix. In the meantime, we apologize for the inconvenience and irritation this has caused. 
>>> 
>>> Kelly
>>> 
>>> 
>>> On Oct 28, 2011, at 9:14 AM, John Axel Eriksson wrote:
>>> 
>>>> Last night we did two things. First we upgraded our entire cluster from riak-search 0.14.2 to 1.0.1. This process went
>>>> pretty well and the cluster was responding correctly after this was completed.
>>>> 
>>>> In our cluster we have around 40 000 files stored in Luwak (we also have about the same number of keys, or more, in Riak, which is mostly
>>>> the metadata for the files in Luwak). The files range in size from around 50 KB to around 400 MB, though most of them are pretty small. I
>>>> think we're up to a total of around 30 GB now.
>>>> 
>>>> Anyway, upon adding a new node to the now 1.0.1 cluster, I saw the beam.smp processes on all the servers, including the new one, taking
>>>> up almost all available CPU. It stayed in this state for around an hour, and the cluster was slow to respond and occasionally timed out. During the
>>>> process Riak crashed on random nodes from time to time and I had to restart it. After about an hour things settled down. I added this
>>>> new node to our load balancer so it too could serve requests. When testing our apps against the cluster we still got lots of timeouts, and something
>>>> seemed very, very wrong.
>>>> 
>>>> After a while I did a "riak-admin leave" on the node that was added (kind of a panic move, I guess). Around 20 minutes after I did this, the cluster started
>>>> responding correctly again. All was not well though: files seemed to be corrupted (not sure what percentage, but it could be 1% or more). I have no idea how
>>>> that could happen, but files that we had accessed before now contained garbage. I haven't thoroughly researched exactly WHAT garbage they contain, but
>>>> they're not in a usable state anymore. Is this something that could happen under any circumstances in Riak?
>>>> 
>>>> I'm afraid of adding a node at all now since it resulted in downtime and corruption when I tried it. I checked and rechecked the configuration files and really - they're
>>>> the same on all the nodes (except for vm.args where they have different names of course). Has anyone ever seen anything like this? Could it somehow be related to
>>>> the fact that I did an upgrade from 0.14.2 to 1.0.1 and maybe an hour later added a new 1.0.1 node?
>>>> 
>>>> Thanks for any input!
>>>> 
>>>> John
>>> 
>> 
> 




