Multiple disks

Greg Nelson grourk at dropcam.com
Wed Mar 23 18:50:33 EDT 2011


Hi Joe,

With a few hours of investigation today, your patch is looking promising. Maybe you can give some more detail on what you did in your experiments a few months ago?

What I did was set up an Ubuntu VM with three loopback file systems. I then built Riak 0.14.1 with your patch, configured as you described to spread across the three disks. I ran a single node, and it correctly spread partitions across the disks.
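
The placement behaviour being tested here (pick one of the configured directories for a partition, then keep using it on later starts) can be sketched roughly as follows. This is Python rather than the patch's Erlang, and `pick_data_root` and `open_partition` are hypothetical names, not functions from riak_kv; the patch may well implement the lookup differently.

```python
import os
import random
import tempfile


def pick_data_root(partition, roots, rng=random):
    """Return the root directory that should hold this partition's bitcask.

    Hypothetical helper: if any root already contains a directory for the
    partition, reuse that root; otherwise choose one of the roots at random.
    """
    for root in roots:
        if os.path.isdir(os.path.join(root, str(partition))):
            return root
    return rng.choice(roots)


def open_partition(partition, roots, rng=random):
    """Create the partition directory if needed and return its path."""
    root = pick_data_root(partition, roots, rng)
    path = os.path.join(root, str(partition))
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `open_partition` twice for the same partition with the same list of roots returns the same path, which is the property that matters across a node restart.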

I then corrupted the file system on one of the disks (by zeroing out the loop device), and did some more GETs and PUTs against Riak. In the logs it looks like the vnode processes that had bitcasks on that disk died, as expected, and the other vnodes continued to operate.

I need to do a bit more investigation with more than one node, but given how well it handled this scenario, it seems like we're on the right track.

Oh, one thing I noticed is that while Riak starts up, if there's a bad disk then it will shut down (the whole node), at this line:

https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103

That makes sense, but I'm wondering if it's possible to let the node start since some of its vnodes would be able to open their bitcasks just fine. I wonder if it's as simple as removing that line?
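
To make the question concrete, the startup behaviour being asked about could look roughly like the Python sketch below: try every partition, record the ones whose directory cannot be opened, and carry on with the rest. It is only an illustration under stated assumptions. `start_vnodes` is a hypothetical name, writing a marker file stands in for opening a bitcask, and nothing here shows whether removing the linked line in riak_kv would actually be sufficient.

```python
import os
import tempfile


def start_vnodes(partition_dirs):
    """Try to open every partition directory; return (started, failed).

    partition_dirs maps a partition id to the directory its data lives in.
    A failure on one partition is recorded instead of aborting the startup.
    """
    started, failed = {}, {}
    for partition, path in partition_dirs.items():
        try:
            os.makedirs(path, exist_ok=True)
            # Writing a marker file stands in for opening the bitcask.
            with open(os.path.join(path, "open.marker"), "w") as fh:
                fh.write("ok")
            started[partition] = path
        except OSError as err:
            failed[partition] = err
    return started, failed
```

A caller could log the `failed` map and keep serving the partitions in `started`, leaving the missing ones to the usual read repair and handoff.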

Greg
On Tuesday, March 22, 2011 at 9:54 AM, Joseph Blomstedt wrote:
> You're forgetting how awesome riak actually is. Given how riak is
> implemented, my patches should work without any operational headaches
> at all. Let me explain.
> 
> First, there was the one issue from yesterday. My initial patch didn't
> reuse the same partition bitcask on the same node. I've fixed that in
> a newer commit:
> https://github.com/jtuple/riak_kv/commit/de6b83a4fb53c25b1013f31b8c4172cc40de73ed
> 
> Now, about how this all works in operation.
> 
> Let's consider a simple scenario under normal riak. The key concept
> here is to realize that riak's vnodes are completely independent, and
> that failure and partition ownership changes are handled through
> handoff alone.
> 
> Let's say we have an 8-partition ring with 3 riak nodes:
> n1 owns partitions 1,4,7
> n2 owns partitions 2,5,8
> n3 owns partitions 3,6
> i.e.: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)
> 
> Each node runs an independent vnode for each partition it owns, and
> each vnode will set up its own bitcask:
> 
> vnode 1/1: {n1-root}/data/bitcask/1
> vnode 1/4: {n1-root}/data/bitcask/4
> ...
> vnode 2/2: {n2-root}/data/bitcask/2
> ...
> vnode 3/6: {n3-root}/data/bitcask/6
> 
> Reads/writes are routed to the appropriate vnodes and to the
> appropriate bitcasks. Under failure, hinted handoff comes into play.
> 
> Let's have a write to preflist [1,2,3] while n2 is down/split. Since
> n2 is down, riak will send the write meant for partition 2 to another
> node, let's say n3. n3 will spawn a new vnode for partition 2 which is
> initially empty:
> 
> vnode 3/2: {n3-root}/data/bitcask/2
> 
> and write the incoming write to the new bitcask.
> 
> Later, when n2 rejoins, n3 will eventually engage in handoff, and send
> all (k,v) in its data/bitcask/2 to n2, which writes them into its
> data/bitcask/2. After handing off data, n3 will shut down its 3/2
> vnode and delete the bitcask directory {n3-root}/data/bitcask/2.
> 
> Under node rebalancing / ownership changes, a similar event occurs.
> For example, if a new node n4 takes ownership of partition 4, then n1
> will hand off its data to n4 and then shut down its vnode and delete
> its {n1-root}/data/bitcask/4.
> 
> If you take the above scenario, and change all the directories of the form:
> {NODE-root}/data/bitcask/P
> to:
> /mnt/DISK-N/NODE/bitcask/P
> 
> and allow DISK-N to be any randomly chosen directory in /mnt, then the
> scenario plays out exactly the same provided that riak always selects
> the same DISK-N for a given P on a given node (across nodes doesn't
> matter, vnodes are independent). My new commit handles this. A simple
> configuration could be:
> 
> n1-vars.config:
> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
> "/mnt/bitcask/disk2/n1", "/mnt/bitcask/disk3/n1"]}}
> n2-vars.config:
> {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
> "/mnt/bitcask/disk2/n2", "/mnt/bitcask/disk3/n2"]}}
> (...etc...)
> 
> There is no inherent need for symlinks, or needing to pre-create any
> initial links per partition index. riak already creates and deletes
> partition bitcask directories on demand. If a disk fails, then all
> vnodes with bitcasks on that disk fail in the same manner as a disk
> failure under normal riak. Standard read repair, handoff, and node
> replacement apply.
> 
> -Joe
> 
> On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular <siculars at gmail.com> wrote:
> > Ya, my original message just highlighted the standard 0,1,5 that most
> > people/hardware should know/be able to support. There are better options and
> > 10 would be one of them.
> > 
> > 
> > @siculars on twitter
> > http://siculars.posterous.com
> > Sent from my iPhone
> > On Mar 22, 2011, at 8:43, Ryan Zezeski <rzezeski at gmail.com> wrote:
> > 
> > 
> > 
> > On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular <siculars at gmail.com>
> > wrote:
> > > 
> > > Save your ops dudes the headache and just use raid 5 and be done with it.
> > 
> > Depending on the number of disks available I might even argue running
> > software RAID 10 for better throughput and less chance of data loss (as long
> > as you can afford to cut your avail storage in half on every machine). It's
> > not too hard to set up on modern Linux distros (mdadm); at least I was doing
> > it 5 years ago and I'm no sys admin.
> > -Ryan
> 