Multiple disks

Nico Meyer nico.meyer at adition.com
Wed Mar 23 19:56:13 EDT 2011


Hi Greg,

I don't think the vnodes will always die. I have seen situations 
(disk full, the filesystem becoming read-only due to device errors, 
corrupted bitcask files after a machine crash) where the vnode did not 
crash, but get and/or put requests returned errors.
Even if the process crashes, it will just be restarted, possibly over 
and over again.
Also, the handoff logic only operates at the level of a whole node, not 
individual vnodes, which makes monitoring and detecting disk failures 
very important.
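
For example, a crude probe of each configured bitcask data root can catch 
the case where the filesystem is full or read-only while the vnode itself 
keeps running. This is only a hypothetical sketch, not something that 
ships with riak:

    -module(disk_probe).    %% hypothetical helper, sketch only
    -export([check_data_root/1]).

    %% Write and read back a sentinel file under the data root, so a full
    %% or read-only filesystem shows up even though the vnode stays alive.
    check_data_root(Root) ->
        Probe = filename:join(Root, ".disk_probe"),
        case file:write_file(Probe, <<"ok">>) of
            ok ->
                case file:read_file(Probe) of
                    {ok, <<"ok">>} -> ok;
                    Other          -> {error, {read_failed, Other}}
                end;
            {error, Reason} ->
                {error, {write_failed, Reason}}
        end.

Run periodically against every data root, something like this at least 
flags a sick disk even when the vnode never crashes.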

We were also thinking about how to use multiple disks per node, but it's 
not a very pressing problem for us: we have a lot of relatively small 
entries (~1000 bytes), so the RAM used by bitcask's keydir becomes a 
problem long before we can even fill one disk.
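
As a rough back-of-the-envelope (assuming something on the order of 40 
bytes of in-memory keydir overhead per key, which varies with the bitcask 
version):

    %% 1 billion ~1000-byte values with 20-byte keys (illustrative numbers)
    NumKeys   = 1000000000,
    KeySize   = 20,
    Overhead  = 40,                              %% assumed keydir bytes per key
    RamBytes  = NumKeys * (KeySize + Overhead),  %% ~60 GB of RAM for the keydir
    DiskBytes = NumKeys * 1000.                  %% only ~1 TB of disk for the values

So the keydir eats tens of gigabytes of RAM before a single multi-terabyte 
disk is anywhere near full, let alone several of them.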

Cheers,
Nico


On 23.03.2011 23:50, Greg Nelson wrote:
> Hi Joe,
>
> With a few hours of investigation today, your patch is looking
> promising. Maybe you can give some more detail on what you did in your
> experiments a few months ago?
>
> What I did was set up an Ubuntu VM with three loopback file systems. Then I
> built Riak 0.14.1 with your patch, configured as you described to spread
> bitcasks across the three disks. I ran a single node, and it correctly spread
> partitions across the disks.
>
> I then corrupted the file system on one of the disks (by zeroing out the
> loop device), and did some more GETs and PUTs against Riak. In the logs
> it looks like the vnode processes that had bitcasks on that disk died,
> as expected, and the other vnodes continued to operate.
>
> I need to do a bit more investigation with more than one node, but given
> how well it handled this scenario, it seems like we're on the right track.
>
> Oh, one thing I noticed: while Riak starts up, if there's a bad disk
> then it will shut down (the whole node) at this line:
>
> https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103
>
> That makes sense, but I'm wondering if it's possible to let the node
> start anyway, since some of its vnodes would still be able to open their
> bitcasks just fine. I wonder if it's as simple as removing that line?
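>
> To illustrate what I mean, here is a hypothetical sketch (not the actual
> riak_kv_bitcask_backend code) of turning the node-wide stop into a
> per-vnode error:
>
> -module(backend_sketch).   %% hypothetical module, sketch only
> -export([start_backend/2]).
>
> start_backend(Partition, DataRoot) ->
>     Dir = filename:join(DataRoot, integer_to_list(Partition)),
>     case bitcask:open(Dir, [read_write]) of
>         {error, Reason} ->
>             %% the linked line currently shuts the whole node down at
>             %% this point; returning the error instead would fail only
>             %% the vnode whose bitcask could not be opened
>             {error, Reason};
>         Ref ->
>             {ok, Ref}
>     end.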
>
> Greg
>
> On Tuesday, March 22, 2011 at 9:54 AM, Joseph Blomstedt wrote:
>
>> You're forgetting how awesome riak actually is. Given how riak is
>> implemented, my patches should work without any operational headaches
>> at all. Let me explain.
>>
>> First, there was the one issue from yesterday: my initial patch didn't
>> always reuse the same bitcask directory for a given partition on the same
>> node. I've fixed that in a newer commit:
>> https://github.com/jtuple/riak_kv/commit/de6b83a4fb53c25b1013f31b8c4172cc40de73ed
>>
>> Now, about how this all works in operation.
>>
>> Let's consider a simple scenario under normal riak. The key concept
>> here is to realize that riak's vnodes are completely independent, and
>> that failure and partition ownership changes are handled through
>> handoff alone.
>>
>> Let's say we have an 8-partition ring with 3 riak nodes:
>> n1 owns partitions 1,4,7
>> n2 owns partitions 2,5,8
>> n3 owns partitions 3,6
>> ie: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)
>>
>> Each node runs an independent vnode for each partition it owns, and
>> each vnode sets up its own bitcask (notation below is node/partition):
>>
>> vnode 1/1: {n1-root}/data/bitcask/1
>> vnode 1/4: {n1-root}/data/bitcask/4
>> ...
>> vnode 2/2: {n2-root}/data/bitcask/2
>> ...
>> vnode 3/6: {n3-root}/data/bitcask/6
>>
>> Reads/writes are routed to the appropriate vnodes and to the
>> appropriate bitcasks. Under failure, hinted handoff comes into play.
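>>
>> As a side note, here is the routing idea in a few lines of illustrative
>> Erlang (this is not the riak_core API): a preference list is the next N
>> partitions clockwise from the position a key hashes to on the ring.
>>
>> -module(ring_sketch).   %% illustrative sketch only
>> -export([preflist/3]).
>>
>> preflist(Key, Ring, N) ->
>>     NumPartitions = length(Ring),
>>     Start = erlang:phash2(Key, NumPartitions),
>>     [lists:nth(((Start + I) rem NumPartitions) + 1, Ring)
>>      || I <- lists:seq(0, N - 1)].
>>
>> With the ring above, a key that hashes near partition 1 gets the
>> preference list [1,2,3], which is the case used below.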
>>
>> Let's have a write to preflist [1,2,3] while n2 is down or split off by
>> a network partition. Since n2 is down, riak will send the write meant for
>> partition 2 to another node, let's say n3. n3 will spawn a new vnode for
>> partition 2, which is initially empty:
>>
>> vnode 3/2: {n3-root}/data/bitcask/2
>>
>> and write the incoming value to the new bitcask.
>>
>> Later, when n2 rejoins, n3 will eventually engage in handoff and send
>> all (k,v) pairs in its data/bitcask/2 to n2, which writes them into its
>> own data/bitcask/2. After handing off the data, n3 will shut down its 3/2
>> vnode and delete the bitcask directory {n3-root}/data/bitcask/2.
>>
>> Under node rebalancing / ownership changes, a similar sequence occurs.
>> For example, if a new node n4 takes ownership of partition 4, then n1
>> will hand off its data to n4 and then shut down its vnode and delete
>> its {n1-root}/data/bitcask/4.
>>
>> If you take the above scenario, and change all the directories of the
>> form:
>> {NODE-root}/data/bitcask/P
>> to:
>> /mnt/DISK-N/NODE/bitcask/P
>>
>> and allow DISK-N to be any randomly chosen directory in /mnt, then the
>> scenario plays out exactly the same, provided that riak always selects
>> the same DISK-N for a given P on a given node (the choice doesn't need
>> to match across nodes, since vnodes are independent). My new commit
>> handles this. A simple configuration could be:
>>
>> n1-vars.config:
>> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
>>                               "/mnt/bitcask/disk2/n1",
>>                               "/mnt/bitcask/disk3/n1"]}}
>>
>> n2-vars.config:
>> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
>>                               "/mnt/bitcask/disk2/n2",
>>                               "/mnt/bitcask/disk3/n2"]}}
>> (...etc...)
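>>
>> Conceptually, the only requirement is that the chosen root is a pure
>> function of the partition index on a given node. A minimal sketch of such
>> a stable pick (simplified, and not necessarily what the commit above does
>> verbatim):
>>
>> -module(root_pick).   %% hypothetical module, sketch only
>> -export([select_data_root/2]).
>>
>> select_data_root(Partition, {random, Roots}) ->
>>     %% a stable "random" pick: the same partition always maps to the
>>     %% same root on this node
>>     lists:nth(erlang:phash2(Partition, length(Roots)) + 1, Roots);
>> select_data_root(_Partition, Root) when is_list(Root) ->
>>     %% a plain single-directory bitcask_data_root keeps working as before
>>     Root.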
>>
>> There is no inherent need for symlinks, or for pre-creating any links per
>> partition index. riak already creates and deletes
>> partition bitcask directories on demand. If a disk fails, then all
>> vnodes with bitcasks on that disk fail in the same manner as a disk
>> failure under normal riak. Standard read repair, handoff, and node
>> replacement apply.
>>
>> -Joe
>>
>> On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular
>> <siculars at gmail.com> wrote:
>>> Yeah, my original message just highlighted the standard RAID levels 0, 1,
>>> and 5 that most people/hardware should know/be able to support. There are
>>> better options, and RAID 10 would be one of them.
>>>
>>>
>>> @siculars on twitter
>>> http://siculars.posterous.com
>>> Sent from my iPhone
>>> On Mar 22, 2011, at 8:43, Ryan Zezeski <rzezeski at gmail.com> wrote:
>>>
>>> On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular
>>> <siculars at gmail.com> wrote:
>>>>
>>>> Save your ops dudes the headache and just use RAID 5 and be done
>>>> with it.
>>>
>>> Depending on the number of disks available, I might even argue for running
>>> software RAID 10 for better throughput and less chance of data loss (as long
>>> as you can afford to cut your available storage in half on every machine).
>>> It's not too hard to set up on modern Linux distros (mdadm); at least I was
>>> doing it 5 years ago and I'm no sysadmin.
>>> -Ryan



