Linux MD Devices in a ZFS Pool

Can you use MD devices in a ZPool? (tl;dr)

Yes. Yes you can, and it works as you would expect: create the underlying MDs first, then create the zpool containing them. It seems to work quite well. The real question is why you would do this.

The situation that got me here:

I was building a media server combining purchased equipment with leftovers from the spare parts bin. The main goal was to create a large filesystem to share out, with at least some resiliency. None of the data was irreplaceable, but we didn't want to lose it all to a single failed drive, so one giant RAID 0 stripe was out. The challenge was to efficiently use a pile of different-sized hard drives in that one large filesystem. The drives in question were one 6TB, three 4TB, and three 2TB.

The failed approaches:

I am a big Ceph fan, and one of the guys suggested a single host Ceph cluster. I found some references on the topic so figured it was worth a try. That ended up being a bit tricky (crushmap changes are necessary before creating any pools) and it performed poorly. Maybe a topic for a later post if anyone is that much of a glutton for punishment (or needs to lab test something).

I am also a big ZFS fan and assumed it would allow you to assemble several stripes into a raidz1. You certainly can assemble mirrors into a striped pool (RAID 10 style); I have a few systems running with that arrangement. Alas, you cannot do the same with stripes inside a raidz.
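For comparison, the mirrors-into-a-stripe arrangement that ZFS does support looks like this (pool name and device names here are illustrative, not from the build below):

```shell
# Two mirrored pairs, striped together by the pool: RAID 10 style.
# ZFS accepts this happily.
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
```

There is simply no analogous syntax for putting ZFS-level stripes inside a raidz vdev; raidz members must be whole devices (or files), which is what pushed me toward MD.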

So, MD with ZFS is [the|an] answer.

My next idea was to create three stripes out of the 4TB and 2TB drives, giving us three 6TB MDs, then build the pool out of those plus the lone 6TB drive. (I did make an attempt at creating the stripes in ZFS itself, but ZFS would not accept them as raidz1 members. Mirrors would have worked, but this application valued space efficiency over performance.) In hindsight it seems so obvious. First, we created the underlying MDs. The disk assortment was:

[0:0:0:0] disk ATA Hitachi HDS72302 A5C0 /dev/sda  <-- 2TB
[0:0:1:0] disk ATA WDC WD40EZRZ-00W 0A80 /dev/sdb  <-- 4TB
[0:0:2:0] disk ATA ST32000641AS CC13 /dev/sdc      <-- 2TB
[0:0:3:0] disk ATA WDC WD40EZRZ-00W 0A80 /dev/sdd  <-- 4TB
[0:0:4:0] disk ATA WDC WD40EZRZ-00G 0A80 /dev/sde  <-- 4TB
[0:0:5:0] disk ATA ST2000DM001-1CH1 CC27 /dev/sdf  <-- 2TB
[0:0:6:0] disk ATA HGST HDN726060AL T517 /dev/sdg  <-- 6TB
[1:0:0:0] disk ATA SSD2SC120G1SA754 4B /dev/sdh  <-- Boot Drive

So creating the first MD, pairing the 2TB /dev/sda with the 4TB /dev/sdb, is as simple as:

# mdadm --create /dev/md0 --level=stripe --raid-devices=2 /dev/sda /dev/sdb

MD RAID 0 happily combines unequal members and uses the full capacity of each, so the result is a 6TB device. Repeat with the other 2TB/4TB pairs for /dev/md1 and /dev/md2, then pool them:
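Spelled out, using the device names from the listing above (the exact 2TB/4TB pairings here are my assumption; double-check yours against lsscsi before running anything destructive):

```shell
# Pair each remaining 2TB drive with a 4TB drive.
mdadm --create /dev/md1 --level=stripe --raid-devices=2 /dev/sdc /dev/sdd
mdadm --create /dev/md2 --level=stripe --raid-devices=2 /dev/sdf /dev/sde

# Sanity check: each array should report roughly 6TB.
cat /proc/mdstat
```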

# zpool create tank raidz /dev/md0 /dev/md1 /dev/md2 /dev/sdg
# zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    md0     ONLINE       0     0     0
	    md1     ONLINE       0     0     0
	    md2     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0

errors: No known data errors

And there we have it: a raidz1 pool with roughly 16TiB of usable space (four 6TB members, minus one member's worth of parity). So far it has performed well, or at least as well as we have demanded of it. Each MD, being a plain stripe, dies if either of its member disks fails, but no single disk loss puts the pool's data in jeopardy, since raidz1 tolerates the loss of one vdev member.
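One caveat worth noting: the MD arrays must assemble at boot before ZFS tries to import the pool, or it will come up degraded. On Debian-flavored systems that means recording the arrays in mdadm.conf and rebuilding the initramfs (file paths vary by distro, so treat these as a sketch):

```shell
# Record the running arrays so they assemble automatically at boot.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Rebuild the initramfs so early boot knows about them.
update-initramfs -u
```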

Ceph: key for mgr.HOST exists but cap mds does not match

Somewhere along the line, maybe during the upgrade to Luminous, one of my larger Ceph clusters got borked up. Everything was running fine, but my two dedicated MDSes, which also act as MONs, weren't running the MGR daemon. Easy enough to fix with ceph-deploy:

$ ceph-deploy mgr create HOST
[HOST][INFO ] Running command: sudo ceph --cluster ceph --name client.bootstrap-mgr --keyring /var/lib/ceph/bootstrap-mgr/ceph.keyring auth get-or-create mgr.HOST mon allow profile mgr osd allow * mds allow * -o /var/lib/ceph/mgr/ceph-HOST/keyring
[HOST][ERROR ] Error EINVAL: key for mgr.HOST exists but cap mds does not match
[HOST][ERROR ] exit code from command was: 22
[ceph_deploy.mgr][ERROR ] could not create mgr
[ceph_deploy][ERROR ] GenericError: Failed to create 1 MGRs

It took a little digging, but the solution wasn't too difficult to find. First, we compare the auth caps of a working MGR to those of our troubled host.

$ ceph auth get mgr.HOST
exported keyring for mgr.HOST
 key = [REDACTED]==
 caps mon = "allow profile mgr"
$ ceph auth get mgr.OTHER_HOST_THAT_WORKS
 key: [REDACTED]==
 caps: [mds] allow *
 caps: [mon] allow profile mgr
 caps: [osd] allow *

The troubled host does have the "allow profile mgr" cap for mon, but it has no caps for mds or osd at all, and mds is exactly where the auth command bombed out. Manually setting the caps feels like a reasonable fix. Remember that setting caps overwrites all previous caps; it is not additive, so include every cap the daemon needs.

$ ceph auth caps mgr.HOST mon 'allow profile mgr' mds 'allow *' osd 'allow *'
updated caps for mgr.HOST
$ ceph auth get mgr.HOST
exported keyring for mgr.HOST
 key = [REDACTED]==
 caps mds = "allow *"
 caps mon = "allow profile mgr"
 caps osd = "allow *"

Now ceph-deploy mgr create works as expected!
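To confirm the daemon actually came up afterward, a quick check (run from any node with admin credentials):

```shell
# Show which mgr is active (and any standbys).
ceph mgr stat

# Or eyeball the cluster summary; the mgr line should list the host.
ceph -s
```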