This started out as a "why isn't this working?" question, but after a full day of trial and error, I think it might help more people to discuss "what's the best approach in today's world? (FreeBSD 12.2)"
How should disks (or vdevs) be identified when creating ZFS pools in 2021?
(And, implicitly, which conventions are obsolete and should be avoided?)
i.e.

zpool create {pool} [raidz[123]] {what goes here?...}

TL;DR: is the current wisdom still "wrap your ZFS vdevs in labelled GPT partitions", or is there another alternative that I've missed?
What I picked up over the years:
1. Don't bother formatting disks for ZFS; just give ZFS the whole disk.
Rationale: ZFS takes care of everything.

2. Don't use /dev/da[0-9]+.
Rationale: device numbers can change over time (esp. with USB devices etc.), making it difficult to identify which device number refers to which physical disk when things go wrong.
Even so, this is what the FreeBSD Handbook uses (for simplicity?). The examples in section 20.3.6 at least show the "<random number> UNAVAIL ... was /dev/da0" problem, but they don't go into how to identify which physical hardware /dev/da0 refers to.

3. Do use /dev/diskid/<id>.
Rationale: diskids are hardware-derived (typically from the serial number) and thus always remain consistent.
They seemed like an elegant 1:1 mapping between hardware and software.
This no longer seems to be true (see below).
4. Alternatively, use gpart to label a partition on each disk and then use /dev/gpt/<label>.
Rationale: gpart stores your label in the partition table on disk, so these labels are also semi-permanent (i.e. until you re-format the disk).
Also, you can set the label to anything you like, e.g. drive-bay numbers or short, easily identifiable IDs which you can stick on the outside of each drive.
5. Alternatively, you could use /dev/gptid/<id>.
Rationale: they should also be consistent, but I've never used them; they don't appear to relate to anything found on the devices I have, and they are anything but human-friendly. Where do they even come from?

6. Finally, regardless of how you create your pool, ZFS will find the vdevs if they are there.
ZFS apparently scans all devices in /dev/... when looking for pools to import.
This means that it will find and mount your pools no matter what you do, but the name of each vdev may be very different from the name you used when creating the pool.
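For concreteness, option #4 might look like this on a blank disk. This is only a sketch; da0, the label "bay1" and the pool name "tank" are placeholders:

```shell
# Sketch of option #4 (da0, "bay1" and "tank" are placeholder names).
gpart create -s gpt da0                # write a fresh GPT to the disk
gpart add -t freebsd-zfs -l bay1 da0   # one freebsd-zfs partition, labelled "bay1"
zpool create tank /dev/gpt/bay1        # build the pool from the stable label
```

zpool status should then report the vdev as gpt/bay1, regardless of which daN number the disk gets on any given boot.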
I've been using #1 and #3 for the last 10 years and it's been great.
Whenever a disk had a problem, zpool status said "OFFLINE ... was /dev/diskid/<id>" and I knew exactly which drive to clear, online or replace, and all was well. (I labelled each disk with the last few characters of the serial number, which matched the diskid.)
The thing that triggered this excursion for me was that I bought some SSDs the other day to replace my old HDDs...
However, whereas my 10-year-old HDDs each had permanent diskids, the SSDs are showing very strange behaviour.
The new drives are SATA SSDs from two different manufacturers, each encased in an individual USB/SATA enclosure, also from two different manufacturers.
This gives me 4 possible case/SSD combinations, which I spent the last day trying out in order to get stable IDs that I could label the disks with...
What I have discovered with my new SSDs and/or USB enclosures:
- The Delock 42617 SSD cases all(!) show up with the same diskid, namely DISK-000000123DE9 (123?! really?!), so diskids are useless when using these USB enclosures.
=> #3 above is obsolete: don't use /dev/diskid/... any more!
(Or the diskid mechanism needs to be updated to cope with these bizarre devices; whack-a-mole, anyone?)
- Also, the serial numbers reported in /var/log/messages are only sometimes related to the actual serial number on the outside of the disk, and are often quite different (e.g. a completely different prefix, or the last character might be a number instead of a letter). This also seems to be the same, in my view broken, mechanism used by diskinfo (see below).
- Turning OFF diskids by adding kern.geom.label.disk_ident.enable=0 to /boot/loader.conf at least removes the diskid confusion.
- I tried using /dev/da[0-9]+, but was quickly able to produce situations where it was not obvious which physical disk was in trouble.
- The numerical IDs provided by ZFS are of absolutely no help! (e.g. zpool status -g)
- Setting up a GPT partition with a name derived from the serial number seems to be the way to go, but getting the actual serial number was not obvious (to me at least).
- I eventually found:
label=$( camcontrol identify da0 | sed -n 's/.*serial number.*\(.\{4\}\)$/\1/p' )
which works nicely.
- I thought the new SSDs also introduced a problem that corrupted the 'secondary GPT table' during boot, but after much experimentation this seems to be a hardware issue on one of my machines, as I was not able to reproduce this behaviour on my second, identical machine (of course I started on the one that had problems... Murphy... grrr).
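To sanity-check that sed pattern without touching real hardware, you can feed it a fabricated "serial number" line in the style of camcontrol's output (the serial here is made up):

```shell
# Check the sed extraction against a made-up "camcontrol identify"-style line.
sample='serial number          S3Z9NB0K123456X'
label=$( printf '%s\n' "$sample" | sed -n 's/.*serial number.*\(.\{4\}\)$/\1/p' )
# $label now holds the last 4 characters of that line.
echo "$label"
```

The greedy .* means the capture group grabs exactly the last four characters of the line, which is what ends up on the sticker.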
Conclusion
This all seems quite hacky, non-obvious and brittle, just to get stable, consistent device names.
Using GPT labels seems to be the only way to go, but it also feels like individually wrapping bananas in plastic, since ZFS would be quite happy to take the whole drive. But how do I tell zpool which physical piece of hardware I'm referring to, in a way that will still be consistent in 5-10 years' time, after any number of reboots, relocations, motherboard replacements etc.?
So, is the current wisdom still "wrap your ZFS vdevs in GPT partitions", or is there another alternative that I've missed?
Thanks in advance,
Jauh
PS: Here are some of the useful commands I found to help debug my situation:
- Obviously, /var/log/messages is the first place to look, e.g.
grep -i '\(boot\|geom\|da[0-9]*:\|usb\)' /var/log/messages
- gpart status and glabel status to see which disk is mounted where (only works with GPT-formatted disks).
- geom -t also gives a nice overview of your storage devices.
- camcontrol identify <device> for reading detailed information such as serial number, make and model, as well as several SMART parameters.
- usbconfig list to see which USB devices are available. (Sidenote: sometimes, after rebooting, mine only get HIGH speed (480 Mbps) instead of SUPER speed (5.0 Gbps); worth keeping an eye on.)
- diskinfo -v <device> and diskinfo -s <device> also show information about the disks, like their serial number, but they also pick up the 000000123DE9 IDs fudged by the Delock 42617 cases; in other words, don't rely on this!
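One more snippet I ended up with: a quick loop (FreeBSD-only, and the /dev/daN naming is an assumption about your setup) that prints each da disk together with whatever serial camcontrol reports, so you can compare them against the stickers:

```shell
# Sketch: map each da(4) disk to the serial line camcontrol reports.
# Assumes disks appear as /dev/daN; partition nodes like da0p1 are skipped.
for dev in /dev/da[0-9]*; do
    case "$dev" in
        (*p[0-9]*) continue ;;    # skip GPT partition entries
    esac
    printf '%s: ' "${dev#/dev/}"
    camcontrol identify "$dev" | grep -i 'serial number'
done
```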