Multiple disks

Joseph Blomstedt Joseph.Blomstedt at gmail.com
Tue Mar 22 12:54:35 EDT 2011


You're forgetting how awesome riak actually is. Given how riak is
implemented, my patches should work without any operational headaches
at all. Let me explain.

First, there was the one issue from yesterday. My initial patch didn't
reuse the same partition bitcask on the same node. I've fixed that in
a newer commit:
https://github.com/jtuple/riak_kv/commit/de6b83a4fb53c25b1013f31b8c4172cc40de73ed

Now, about how this all works in operation.

Let's consider a simple scenario under normal riak. The key concept
here is that riak's vnodes are completely independent, and that
failure and partition ownership changes are handled through handoff
alone.

Let's say we have an 8-partition ring with 3 riak nodes:
n1 owns partitions 1,4,7
n2 owns partitions 2,5,8
n3 owns partitions 3,6
i.e.: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)

Each node runs an independent vnode for each partition it owns, and
each vnode will set up its own bitcask:

vnode 1/1: {n1-root}/data/bitcask/1
vnode 1/4: {n1-root}/data/bitcask/4
...
vnode 2/2: {n2-root}/data/bitcask/2
...
vnode 3/6: {n3-root}/data/bitcask/6
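
To make the layout concrete, here is a minimal Python sketch of the
example ring and the bitcask directory each vnode would use. It is a
toy model only: the helper names (bitcask_dir, owned_partitions,
vnode_dirs) are hypothetical, and the 1..8 partition numbers are the
simplified ones used in this example, not real riak partition indexes.

```python
# Toy model of the example ring: partition -> owning node.
RING = {1: "n1", 2: "n2", 3: "n3", 4: "n1", 5: "n2", 6: "n3", 7: "n1", 8: "n2"}


def bitcask_dir(node, partition):
    """Directory the vnode for `partition` on `node` would use (hypothetical helper)."""
    return f"{{{node}-root}}/data/bitcask/{partition}"


def owned_partitions(node):
    """Partitions that `node` owns according to the ring."""
    return sorted(p for p, owner in RING.items() if owner == node)


def vnode_dirs(node):
    """One independent bitcask directory per owned partition."""
    return {p: bitcask_dir(node, p) for p in owned_partitions(node)}


if __name__ == "__main__":
    for node in ("n1", "n2", "n3"):
        for partition, path in vnode_dirs(node).items():
            print(f"{node} partition {partition}: {path}")
```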

Reads/writes are routed to the appropriate vnodes and to the
appropriate bitcasks. Under failure, hinted handoff comes into play.

Let's have a write to preflist [1,2,3] while n2 is down/split. Since
n2 is down, riak will send the write meant for partition 2 to another
node, let's say n3. n3 will spawn a new vnode for partition 2 which is
initially empty:

vnode 3/2: {n3-root}/data/bitcask/2

and write the incoming data to the new bitcask.

Later, when n2 rejoins, n3 will eventually engage in handoff, and send
all (k,v) in its data/bitcask/2 to n2, which writes them into its
data/bitcask/2. After handing off data, n3 will shut down its 3/2
vnode and delete the bitcask directory {n3-root}/data/bitcask/2.
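
The write-then-handoff sequence above can be sketched in a few lines
of Python. This is a simplified illustration, not riak's actual
implementation: the Node class and the write function are
hypothetical, each bitcask is modelled as an in-memory dict, and the
fallback node is passed in explicitly rather than chosen by riak.

```python
class Node:
    """Toy node: partition -> dict standing in for that partition's bitcask directory."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.bitcasks = {}

    def put(self, partition, key, value):
        # The vnode and its bitcask are created on demand, initially empty.
        self.bitcasks.setdefault(partition, {})[key] = value

    def handoff(self, partition, target):
        # Send every (k, v) to the target, then drop the local bitcask,
        # which models shutting down the vnode and deleting its directory.
        for key, value in self.bitcasks.pop(partition, {}).items():
            target.put(partition, key, value)


def write(partition, owner, fallback, key, value):
    """Send the write to the owner, or to the fallback node if the owner is down."""
    target = owner if owner.up else fallback
    target.put(partition, key, value)
    return target


if __name__ == "__main__":
    n2, n3 = Node("n2"), Node("n3")
    n2.up = False
    write(2, n2, n3, "k", "v")   # lands in n3's bitcask for partition 2
    n2.up = True
    n3.handoff(2, n2)            # n2 now holds the data, n3's copy is gone
    print(n2.bitcasks, n3.bitcasks)
```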

Under node rebalancing / ownership changes, a similar event occurs.
For example, if a new node n4 takes ownership of partition 4, then n1
will hand off its data to n4 and then shut down its vnode and delete
its {n1-root}/data/bitcask/4.

If you take the above scenario, and change all the directories of the form:
{NODE-root}/data/bitcask/P
to:
/mnt/DISK-N/NODE/bitcask/P

and allow DISK-N to be any randomly chosen directory in /mnt, then the
scenario plays out exactly the same, provided that riak always selects
the same DISK-N for a given P on a given node (the choice across nodes
doesn't matter, since vnodes are independent). My new commit handles
this. A simple configuration could be:

n1-vars.config:
{bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
                              "/mnt/bitcask/disk2/n1",
                              "/mnt/bitcask/disk3/n1"]}}.
n2-vars.config:
{bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
                              "/mnt/bitcask/disk2/n2",
                              "/mnt/bitcask/disk3/n2"]}}.
(...etc...)
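
The one property the scenario depends on is that a node keeps using
the same disk for a partition once it has picked one. Below is a
minimal Python sketch of one way to get that property. It is an
illustration under assumptions, not the code in the commit above: the
names pick_data_root and open_partition_dir are hypothetical, and
"reuse whichever root already holds a directory for the partition,
otherwise pick at random" is an assumed strategy.

```python
import os
import random


def pick_data_root(roots, partition, rng=random):
    """Return the root that already holds `partition`, else a randomly chosen root."""
    for root in roots:
        if os.path.isdir(os.path.join(root, str(partition))):
            return root  # partition already lives on this disk: reuse it
    return rng.choice(roots)  # first use of this partition on this node


def open_partition_dir(roots, partition):
    """Create the partition directory on demand and return its path."""
    path = os.path.join(pick_data_root(roots, partition), str(partition))
    os.makedirs(path, exist_ok=True)
    return path
```

Because the choice is recovered from what is already on disk, no extra
bookkeeping is needed: once handoff deletes the partition directory, a
later vnode for the same partition is free to land on any disk again.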

There is no inherent need for symlinks, or for pre-creating any
links per partition index. riak already creates and deletes
partition bitcask directories on demand. If a disk fails, then all
vnodes with bitcasks on that disk fail in the same manner as a disk
failure under normal riak. Standard read repair, handoff, and node
replacement apply.

-Joe

On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular <siculars at gmail.com> wrote:
> Ya, my original message just highlighted the standard 0,1,5 that most
> people/hardware should know/be able to support. There are better options and
> 10 would be one of them.
>
>
> @siculars on twitter
> http://siculars.posterous.com
> Sent from my iPhone
> On Mar 22, 2011, at 8:43, Ryan Zezeski <rzezeski at gmail.com> wrote:
>
>
>
> On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular <siculars at gmail.com>
> wrote:
>>
>>  Save your ops dudes the headache and just use raid 5 and be done with it.
>>
>
> Depending on the number of disks available I might even argue running
> software RAID 10 for better throughput and less chance of data loss (as long
> as you can afford to cut your avail storage in half on every machine).  It's
> not too hard to setup on modern Linux distros (mdadm); at least I was doing
> it 5 years ago and I'm no sys admin.
> -Ryan



