Encrypted ZFS-on-root won't boot when storage drives are plugged in, request assistance

[edit] For anyone just arriving at this thread, I have gone through and clarified some things where I was definitely more confused than I am now.

I ran into a problem yesterday (1, 2) wherein I had to get a new disk controller to build a storage system. This is a super-budget, but mission-critical rescue of a lab's image data. The machine is a Dell Precision T7400. Because that's all we have. And I could use some help.

1) The new disk controller is messing with my encrypted zfs-on-root disk assignments (specifically, da20 becomes ada4 and da22 becomes ada5 [edit: those prior assignments appear to come from nowhere; geom disk list reports them even for drives that have never been in this computer], and p4 on both refuses to decrypt), so I've been advised to learn about labels, which I have been doing, though I haven't found ZFS specifically addressed. I can boot when the additional drives aren't plugged in, so I think I can fix it if it's fixable, but Google suggests people smarter than me have run into problems here. And the system has nothing else on it, so, as a practical matter, I'm fine reinstalling the OS. Thus, first question, before I commit unknown hours: can you use labels to fix the problem of a zfs-on-root that refuses to decrypt after additional drives are plugged in? Either way, any additional suggestions?

2) I need a minimum of 12TB of storage, but certainly I should set up a brand new array with more than I need right now :) I have lots of brand new 6TB disks. I have a new disk controller (a Siig card based on the Marvell 88SE9230) with 4 connectors, and the motherboard has 7 ports that are apparently capped at 2TB per drive, some possibly at 1TB (per the spec sheet); 2 of those are committed to the OS disks, leaving 5 available but not terribly useful. Would it be better to

a) set up a RAIDZ1 with 4 drives (~14 TB using this calculator) or
b) would a PCIe controller run in a PCIe2 slot (it seems to fit...)? [edit: yes it does work] If so I could do
1. RAIDZ1 with 5 drives (~19 TB)
2. RAIDZ2 with 5 drives (~14 TB)
3. RAIDZ1 with 6 drives (~23 TB)
4. RAIDZ2 with 6 drives (~19 TB)
5. [edit:] 3 mirrors of 2 (~16.3 TB)
Laying the options out, I think RAIDZ2 with 5 drives looks like my best option if it will work, with RAIDZ1 with 4 as the fallback. One saving grace is that I actually have 10 of these drives, and the plan is to migrate the dataset to a new machine that won't arrive until long after the institute kicks our project off their array (they are forcing several users out to get down to their own required minimum of free storage ... they have a 400TB problem :)). But that also means I would like to set aside at least half the drives (5) for that new machine.
 
Your immediate issue is your data rescue, so I'm going to totally ignore the second question in your thread.

It sounds like your issue is with the decryption rather than ZFS. For your encryption I imagine you are using GELI. It sounds like you used the FreeBSD installer to configure your partitioning rather than doing it by hand*. Regarding labels, whilst they are invaluable when moving disks around, I have not been able to get them to play nicely with mounting GELI-encrypted partitions at boot. However, the last time I tried was with FreeBSD 9.2, so things may have moved on.

Boot the FreeBSD installation CD/USB stick, select the Live CD option and log in (root, no password). Could you then post:
  • Output of gpart show
  • If you are able to identify and mount the appropriate partition, the contents of /boot/loader.conf
* I will confess here that I set up my filesystem by hand, so I'm not familiar with the layout the FreeBSD installer provides.
 
Ok, will do in the AM, but to be clear, this system I'm building is bare iron, no data. The 12 TB of data is on the institute's array. This system I'm building is the life preserver.
 
grep2grok, apologies that the start of my reply yesterday was unintentionally a little abrupt. While you gather the information let me offer some thoughts on the second part of your question.

I noticed the specification sheet for your system says "Chassis supports up to five internal drives in a SATA boot plus four SATA drive configuration (5.0 TB maximum storage capacity)". Will this chassis limit prevent you from using 6 drives? Do you know where this 5TB limit comes from and whether it refers to the total capacity across all drives?

Regarding RAIDZ configuration, are you concerned about read and/or write performance? Will the machine be regularly monitored for disk failures and do you need the additional fault tolerance of RAIDZ2? Or is the spare capacity more important?

Do the Xeon processors you have support the AES-NI instruction set? If not, then using GELI for encryption is likely to mean your performance takes quite a hit. If so, make sure that support is either compiled into the kernel (if you have compiled your own kernel) or that you load the kernel module with the following line in /boot/loader.conf:
Code:
aesni_load="YES"
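If you're not sure whether the CPUs have AES-NI, one quick check (a generic suggestion, nothing specific to your hardware) is to look for AESNI among the CPU feature flags recorded at boot. If it returns nothing, the CPUs don't advertise the instructions and GELI will fall back to software encryption.
Code:
# grep -i aesni /var/run/dmesg.boot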
 
First, really appreciate your help. Seriously.

Will this chassis limit prevent you from using 6 drives? Do you know where this 5TB limit comes from and whether it refers to the total capacity across all drives?

It currently has 6 drives in it, and has, at one point, been happy to have all those drives (just the SAS controller was under-reporting their size, which is why I got the PCIe controller). [edit: turns out that since these disks came from MyBooks they were pre-formatted with MBR and a 700GB partition, so MBR's 2^32-sector limit was the problem] The motherboard has seven SATA ports built in (SATA0, SATA1, SATA2, HDD0, HDD1, HDD2, HDD3), the "HDD"s being on the built-in SAS controller, which shows 2 TB each, so I think that "5 TB limit" is Dell underpromising, but the 2TB/drive limit is on the SAS (an LSI 1068e), which I'm no longer using.

Regarding RAIDZ configuration, are you concerned about read and/or write performance? Will the machine be regularly monitored for disk failures and do you need the additional fault tolerance of RAIDZ2? Or is the spare capacity more important?

I think I'm going to try RAIDZ1 with 5 if I can get it running, using the 4 on my controller, and the three SATA ports on the motherboard.

Code:
# gpart show
=>       34  976773101  ada0  GPT  (466G)
         34          6        - free -  (3.0K)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-zfs  (2.0G)
    4196352    4194304     3  freebsd-swap  (2.0G)
    8390656  968382464     4  freebsd-zfs  (462G)
  976773120         15        - free -  (7.5K)

=>       34  976773101  ada1  GPT  (466G)
         34          6        - free -  (3.0K)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-zfs  (2.0G)
    4196352    4194304     3  freebsd-swap  (2.0G)
    8390656  968382464     4  freebsd-zfs  (462G)
  976773120         15        - free -  (7.5K)

Code:
# cat /boot/loader.conf
geli_ada0p4_keyfile0_load="YES"
geli_ada0p4_keyfile0_type="ada0p4:geli_keyfile0"
geli_ada0p4_keyfile0_name="/boot/encryption.key"
geli_ada1p4_keyfile0_load="YES"
geli_ada1p4_keyfile0_type="ada1p4:geli_keyfile0"
geli_ada1p4_keyfile0_name="/boot/encryption.key"
aesni_load="YES"
geom_eli_load="YES"
geom_eli_passphrase_prompt="YES"
vfs.root.mountfrom="zfs:zroot/ROOT/default"
kern.geom.label.gptid.enable="0"
zpool_cache_load="YES"
zpool_cache_type="/boot/zfs/zpool.cache"
zpool_cache_name="/boot/zfs/zpool.cache"
zfs_load="YES"

The processors are Xeon E5420s, and I will skip the OS encryption this time for multiple reasons!

Right now, I can't find my thumb drive that had FreeBSD on it, so I can only boot into the real system (two attempts to make another thumb drive have failed, I think because I can't get the EFI partitions off these damn SanDisk thumb drives). I will leave it up though, so I can ssh in as needed today.
 
Great. Based on your first post, I'm assuming the gpart(8) output is from booting without the additional drives plugged in.

If you want to keep the encryption you can:
  1. Boot without the additional drives plugged in
  2. Change the GELI settings in loader.conf(5) to use the designations the drives will become when you add the additional drives
  3. Shut down
  4. Install the additional drives
  5. Start up the machine, hopefully with everything working.
Note that if your machine doesn't start with the new configuration you will need your FreeBSD installation media to rescue it so if you need to download it again to write the image to a new USB memory stick you should do that first!

In terms of the changes to /boot/loader.conf, you said in your first post:
da20 becomes ada4, da22 becomes ada5, and p4 on both refuses to decrypt
However in your gpart(8) and /boot/loader.conf I see only ada0 and ada1 so I'm a bit confused about what the new disks' device assignments will be.
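If in doubt, booting the live environment with all the drives connected and running the following (generic commands, nothing controller-specific) will show exactly which adaX device each disk ends up as, plus any partition labels:
Code:
# camcontrol devlist
# gpart show -l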
The important lines are:
Code:
geli_ada0p4_keyfile0_load="YES"
geli_ada0p4_keyfile0_type="ada0p4:geli_keyfile0"
geli_ada0p4_keyfile0_name="/boot/encryption.key"
geli_ada1p4_keyfile0_load="YES"
geli_ada1p4_keyfile0_type="ada1p4:geli_keyfile0"
geli_ada1p4_keyfile0_name="/boot/encryption.key"
At a minimum you need to change the critical settings "ada0p4:geli_keyfile0" and "ada1p4:geli_keyfile0" to match the new drive assignments. If it were me I would also change the variable names to something sensible, like matching the GPT partition labels, though the names in the variables can be anything you like. If your new drives' assignments were ada98 and ada99 your changes could look something like:
Code:
geli_encrypted0_keyfile0_load="YES"
geli_encrypted0_keyfile0_type="ada98p4:geli_keyfile0"
geli_encrypted0_keyfile0_name="/boot/encryption.key"
geli_encrypted1_keyfile0_load="YES"
geli_encrypted1_keyfile0_type="ada99p4:geli_keyfile0"
geli_encrypted1_keyfile0_name="/boot/encryption.key"
When creating another ZFS pool for your data drives, I agree with what you have been advised already; use an identifier that isn't the device assignment when you add devices to the pool. If you are using raw disks that could be the disk ID. If using GPT partitions (what I would recommend) it could be the partition label. For example to create your new pool, rather than typing zpool create newpool /dev/ada123, you would type zpool create newpool /dev/gpt/mypartitionlabel.
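For example, sketched with the same placeholder names as above (ada98, newpool, mypartitionlabel):
Code:
# gpart create -s gpt ada98
# gpart add -t freebsd-zfs -l mypartitionlabel ada98
# zpool create newpool /dev/gpt/mypartitionlabel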
 
I reinstalled but I will make the zpool as you suggest. Can I ask you one last question?

I'm going to serve the storage over iSCSI. Let's say my zpool is 'tank'. Can I point ctld(8) directly at tank or do I need to make volumes (and folders?) first? The data is currently in two folders on the institute's array (mounted over one iSCSI target/initiator connection), 'images' and 'images2'. So, would it be better to make one zvol 'data' with folders 'images' and 'images2', or two zvols 'images' and 'images2'?
 
I'm afraid I set up my network shares with a mixture of NFS and Samba before a native iSCSI target was available on FreeBSD (from 10.0 I believe), so I don't have any experience with it, though I see there is a section on iSCSI in the FreeBSD handbook. Someone else reading this on the forum may be able to offer more experienced guidance.

Slightly confusingly, if tank is a ZFS pool, that pool has a dataset also referred to as tank. I always set the canmount property to off for datasets at the root of a pool as it's not possible to manipulate them (specifically to delete them) in the same way as other datasets. As I understand it tank itself could never be a ZFS volume; a volume must be contained within a pool. So for iSCSI sharing you will need to create at least one volume within your pool.
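As a minimal sketch (the volume name and size here are just examples):
Code:
# zfs set canmount=off tank
# zfs create -V 500G tank/myvolume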

Regarding creating directories, a ZFS volume behaves very differently from a dataset in that it is a block device that needs to be formatted with a filesystem. You will therefore need to format your ZFS volume as whatever filesystem your client machines understand before adding directories and files.

I would consider a few things before choosing whether to create one or two shares, though none of this is specific to iSCSI:
  1. Why were two shared directories rather than one created originally?
  2. Will my users be confused by a change to the existing configuration?
  3. Will any of my users' scripts, logon scripts, shortcuts to recently used files etc break with a configuration change?
  4. Are the access rights on images and images2 identical or is the separation required for security?
  5. Do I want to reserve specific space for each file share, or should there be flexibility in the relative sizes?
 
I'm told that as long as the UNC path is valid, the app should be fine. If I change the path, they can run a script to update all the database entries.

So my goal is to totally preserve the UNC path. My plan is to create the target volume, point ctld at it, have the app server mount it, format it, and copy the two folders (I already tested that this much works with a small test machine while waiting for procurement to get drives). Then unmount the original target and map the new target to the same mount point (I see no reason this wouldn't work; the app server is running Windows Server 2003). Hopefully this is transparent to the app.

I believe, based on dates and conversations, the two folders are a legacy of running out of local space around 2008 and migrating this dataset to the institute's array, and all new storage after the migration went to images2.

Also, I would like to use ZFS to encrypt the data at rest, and I understand that I should always have ZFS compression on. Do you foresee any way I could cause myself a problem down the line by doing either of those?
 
ZFS on FreeBSD doesn't support encryption natively. However, it is possible to create an encrypted container using GELI (see the geli(8) man page) and then create the ZFS pool on top of that encrypted container, or several of them. When you did your first install, the installer did all of that for you. Setting it up by hand obviously adds a degree of complexity.

For a server you also need to consider key management: being required to enter a password at the console on every reboot isn't very practical, and storing the keys on the encrypted disk defeats the objective -- a bit like having a lock on your front door but leaving the key in it. Consider what you are actually concerned about (disk loss, theft, or disks not being sanitised before disposal?) -- disk encryption is not the answer to all security issues. I briefly discussed this as part of a HOWTO I posted on the forum: Thread howto-freebsd-10-1-amd64-uefi-boot-with-encrypted-zfs-root-using-geli.51393.

Would I encrypt a disk in a laptop travelling all over the world? Absolutely. Would I encrypt a disk of an Internet-facing e-commerce server in a physically secure room which I controlled? Provided I had procedures for securely destroying that disk at end-of-life, then probably not -- I would be more concerned about remote attacks than someone breaking in to steal the hard disk out of the server.
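For illustration only, a hand-rolled GELI-backed pool looks roughly like this (the partition ada2p1, the key path and the pool name are placeholders, and geli(8) has options such as sector size and authentication that deserve reading about first):
Code:
# dd if=/dev/random of=/root/data.key bs=64 count=1
# geli init -s 4096 -K /root/data.key /dev/ada2p1
# geli attach -k /root/data.key /dev/ada2p1
# zpool create securepool /dev/ada2p1.eli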

As I mentioned in my earlier post, there are performance implications. If your CPU supports AES-NI instructions, the hardware encryption and decryption is faster than software, which would use many more CPU instructions to do the same thing. Note that the AES-NI instructions only cover encryption and decryption, not integrity verification (disabled by default with GELI) which is achieved through relatively slow hashing functions. Since you are setting up an iSCSI network share, your network speed may be the limiting factor anyway.

I definitely recommend using compression with ZFS as compression/decompression is fast relative to disk read/write and especially encryption/decryption. However, you could run some benchmarks on your hardware to see what works best for you.
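Once some real data is on the pool, something like this (the dataset name is just an example) will show the ratio compression is actually achieving:
Code:
# zfs get compression,compressratio mypool/mydataset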

FreeBSD is a solid platform. I am however cautious with new features that may not yet have been widely used. For me, iSCSI would fall into that category and I would want to test it before relying on it for transmitting critical data.
 
I'm not sure where you are, but PM me a city and next time I'm there I will buy you beer!

Ok, so I will not encrypt, but will compress. Next, just to be clear, I'm using whole SATA disks. It looks like they ship with a partition? [edit: this should have been a clue to the 2TB "limit" I saw earlier]
Code:
# ls /dev/diskid
DISK-WD-WX11D943583E DISK-WD-WX11D94358J5 DISK-WD-WX11D9435R0Z DISK-WD-WX11DA49H8PV
DISK-WD-WX11D943583Es1 DISK-WD-WX11D94358J5s1 DISK-WD-WX11D9435R0Zs1 DISK-WD-WX11DA49H8PVs1

So, should I use the disks or the partitions?

Code:
# zpool create tank raidz /dev/diskid/DISK-WD-WX11D943583E /dev/diskid/DISK-WD-WX11D94358J5 /dev/diskid/DISK-WD-WX11D9435R0Z /dev/diskid/DISK-WD-WX11DA49H8PV

I ask because when I googled, none other than a FreeBSD Foundation admin got burned by using the whole disks! (Mobile right now and can't find the link, will update in an hour or so)
 
The short answer is that I recommend using partitions.

I had exactly the same question about whether to use whole disks or partitions when I first started using ZFS. You will have seen the web is full of people asking the question. Fortunately, FreeBSD is awesome when it comes to documentation and since then an excellent section on ZFS has been added to the FreeBSD handbook.

From the zpool(8) page:
ZFS can use individual slices or partitions, though the recommended mode of operation is to use whole disks.
However, there is no explanation as to why this is recommended. Indeed, the new section 19.8. ZFS Features and Terminology of the handbook says:
On FreeBSD, there is no performance penalty for using a partition rather than the entire disk. This differs from recommendations made by the Solaris documentation.
What made my decision for me was dvl@'s blog entry ZFS: do not give it all your HDD. He points out that the zpool(8) man page also says of the command zpool replace [-f] pool device [new_device]:
Replaces old_device with new_device.
[...]
The size of new_device must be greater than or equal to the minimum
size of all the devices in a mirror or raidz configuration.

[...]
This form of replacement is useful after an existing disk has failed and has been physically replaced.
[...]
This means that if you are using a whole disk in a ZFS pool and it fails, the replacement cannot be even slightly smaller than the smallest existing disk. Learning from other people's bad experiences seems like a good idea so I've adopted dvl@'s method of using partitions and leaving a small amount of unused space at the end of each drive. I've only had to replace one disk to date which went absolutely fine. That was in a mirror not a RAIDZ configuration though.
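As a rough sketch of that approach (device name, label and partition size are placeholders; the point is simply to leave a little space unallocated at the end of each disk):
Code:
# gpart create -s gpt ada2
# gpart add -t freebsd-zfs -l mypool_disk2 -a 1m -s 5580G ada2
# zpool create mypool raidz /dev/gpt/mypool_disk0 /dev/gpt/mypool_disk1 /dev/gpt/mypool_disk2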

Another advantage that shouldn't be underrated is that if a disk ever goes astray during server maintenance, a partitioned and labelled disk is easily identifiable, even if it ends up plugged into a system running a different operating system. I would expect that a raw ZFS disk would not be recognised by Windows. OS X would probably even offer to "initialize" it for you -- goodbye data!

Beer is always welcome :beer: :)
 
For asteriskRoss: I sprinkled some edits into my posts above to clarify points of previous ignorance.

Ok, I decided to go with 3 mirrors, using 6 matched disks and two matched Siig controllers so that even the controllers can't be a single point of failure. I decided to go with whole disks because, after discussing it with my colleague, we're going to use this as the definitive array, so I have 4 spare disks, and they match, at least down to the last seven digits; in some cases down to the last 2 digits. And they all came off the same pallet. And I was already late for dinner :oops:

controller 1
|- ada0: /dev/diskid/DISK-WD-WX11D9435R0Z
|- ada1: /dev/diskid/DISK-WD-WX11DA49H8PV
|- ada2: /dev/diskid/DISK-WD-WX11D943577S

controller 2
|- ada3: /dev/diskid/DISK-WD-WX11D94358J5
|- ada4: /dev/diskid/DISK-WD-WX11D943583E
|- ada5: /dev/diskid/DISK-WD-WX11DA40HNK7


First I cleared out all the existing partition tables (these are 6TB Western Digital Green drives ripped out of MyBooks that were on sale at Costco...:confused:) e.g.:
Code:
# gpart show ada5
=>        63  4294967232  ada5  MBR  (5.5T)
          63         193        - free -  (97K)
         256  1465122048     1  ntfs  (699G)
  1465122304  2829844991        - free -  (1.3T)

# gpart delete -i1 ada5
ada5s1 deleted
# gpart destroy ada5
ada5 destroyed
# gpart create -s gpt ada5
ada5 created
# gpart show ada5
=>         34  11720979566  ada5  GPT  (5.5T)
           34  11720979566        - free -  (5.5T)
and made my pool
Code:
# zpool create tank \
mirror /dev/diskid/DISK-WD-WX11D9435R0Z /dev/diskid/DISK-WD-WX11D94358J5 \
mirror /dev/diskid/DISK-WD-WX11DA49H8PV /dev/diskid/DISK-WD-WX11D943583E \
mirror /dev/diskid/DISK-WD-WX11D943577S /dev/diskid/DISK-WD-WX11DA40HNK7

# zpool status
  pool: tank
state: ONLINE
  scan: none requested
config:

    NAME                             STATE     READ WRITE CKSUM
    tank                             ONLINE       0     0     0
     mirror-0                       ONLINE       0     0     0
       diskid/DISK-WD-WX11D9435R0Z  ONLINE       0     0     0
       diskid/DISK-WD-WX11D94358J5  ONLINE       0     0     0
     mirror-1                       ONLINE       0     0     0
       diskid/DISK-WD-WX11DA49H8PV  ONLINE       0     0     0
       diskid/DISK-WD-WX11D943583E  ONLINE       0     0     0
     mirror-2                       ONLINE       0     0     0
       diskid/DISK-WD-WX11D943577S  ONLINE       0     0     0
       diskid/DISK-WD-WX11DA40HNK7  ONLINE       0     0     0

errors: No known data errors
and check how much storage we have:
Code:
# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   16.3T   300K  16.3T         -     0%     0%  1.00x  ONLINE  -
zroot   460G  1.19G   459G         -     0%     0%  1.00x  ONLINE  -
We want to keep 10% in emergency backup reserve, so we have about 14.6T to work with. But there are other needs and we want to make efficient use of space, so let's use compression.
Code:
# zfs create -V 14T tank/data
# zfs set compression=lz4 tank/data
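Equivalently, the compression property can be set in the same command that creates the volume:
Code:
# zfs create -o compression=lz4 -V 14T tank/data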
Ok, so now we need to set up iSCSI in /etc/ctl.conf (no auth while I get this set up; no, this isn't on the internet)
Code:
auth-group ag0 {
    chap user secret
}

portal-group pg0 {
    discovery-auth-group no-authentication
    listen 0.0.0.0
    listen [::]
}

target iqn.2015-10.com.example.storage:data {
    auth-group no-authentication
    portal-group pg0

    lun 0 {
        path /dev/zvol/tank/data
        size 14T
    }
}
Add ctld_enable="YES" to /etc/rc.conf, run service ctld start, and it's running!
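A couple of quick sanity checks that the target daemon is up and listening on the default iSCSI port (3260):
Code:
# service ctld status
# sockstat -4l | grep 3260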

And that's it! The Windows Server 2003 iSCSI initiator finds it and formats it. I still don't know how Windows calculates GB, so I had to reformat the first partition to make sure my target's new partition was bigger than the existing target's old partition, but that was trivial.

Now, despite ostensibly having 1 Gbps Ethernet along the entire route, I'm only getting 11.7 MB/s of actual throughput (per systat -ifstat 1). I originally asked to put this box right next to the app server so we could cut out hops, but the IT guys thought I'd be fine. Seems it's not so fine.
 
I'm pleased you're up and running.

[...] using 6 matched disks[...]. And they all came off the same pallet.
Matched disks are great, though be quick if one fails, as the others from the same batch may not be far behind :)

Since the issue you have now (network/iSCSI/ZFS performance) is very different from the original issue in the thread title I suggest starting a new thread. You are more likely to attract useful input from others. In the meantime, it would be worth narrowing down where the bottleneck is by, for example, trying another network service to test the speed. You could use scp(1) to copy a large file or a utility like benchmarks/iperf from ports.
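For example, a rough iperf sketch (the server address is a placeholder); compare the figure it reports with what you see from the iSCSI traffic:
Code:
server# iperf -s
client# iperf -c <server address> -t 30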
 