Solved: move ZFS pool to disks by ID?

Currently I have set up the OS on the last two disks, da6 and da7, but it appears that on the HPE DL380 Gen9 the disk order may change, since these are the two disks in the rear bays. If I add 2 new disks to the front, the disk order will change (the last 2 disks will become da8/da9). Is there a way to change the pool to access the disks by ID instead?

Code:
$ zpool status
  pool: zdata
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    zdata       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        da0     ONLINE       0     0     0
        da1     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        da2     ONLINE       0     0     0
        da3     ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        da4     ONLINE       0     0     0
        da5     ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        da6p4   ONLINE       0     0     0
        da7p4   ONLINE       0     0     0

errors: No known data errors
 
GPT labels are applied to partitions, not whole disks.
A partition is going to be smaller than the whole disk.
When you attach a mirror VDEV it needs to be at least as large as the original VDEV.
Therefore you can't just break each mirror, and re-attach a smaller partition using a GPT label.
I don't know of any way to shrink a ZFS vdev, without destroying it.
It's possible that shrinking a VDEV may be added as a "new feature" some time in the future.
But I can't see a way to do what you want with da0 through da5, without starting again.

You could switch the zroot pool to use labels, because it's already using partitions.
It might be possible to do it without re-silvering.
But I have never done that, and I would have to test the process.
In any event, this should work:
  1. detach one side of the mirror;
  2. label partition 4 of the free disk appropriately using gpart modify;
  3. re-attach the free disk to the mirror, specifying the GPT label; and
  4. when re-silvering is complete, repeat for the other side of the mirror.
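A minimal sketch of those steps, assuming the zroot layout shown above (partition index 4 on da6/da7; the label names zroot0/zroot1 are only illustrative):

Code:
zpool detach zroot da7p4                 # 1. detach one side of the mirror
gpart modify -i 4 -l zroot1 da7          # 2. label partition 4 of the now-free disk
zpool attach zroot da6p4 gpt/zroot1      # 3. re-attach by GPT label
# 4. once re-silvering completes, repeat with da6p4 and a label such as zroot0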
 
You can glabel(8) whole disks. Labeled disks create device nodes under /dev/label, and those devices can be used in zpool(8) operations. Example:

Synopsis: glabel label [-v] name dev
Code:
glabel label  da8  /dev/da8
glabel label  da9  /dev/da9
zpool create newpool mirror label/da8 label/da9
Any description for the disks can be used as the name.

The kernel module geom_label.ko must be set to load automatically in /boot/loader.conf so the pool on the labeled disks is imported automatically.
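For reference, loading it at boot is a single line (a minimal sketch of /boot/loader.conf):

Code:
# /boot/loader.conf
geom_label_load="YES"    # load geom_label.ko at boot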

Whole disks already present in a pool can also be labeled: temporarily remove each one from the pool, label it, then reintroduce it to the pool by its label name.
 
Whole disks already present in a pool can also be labeled: temporarily remove each one from the pool, label it, then reintroduce it to the pool by its label name.

I'm not sure what you are suggesting here, but bear in mind that a labelled device is one sector smaller than the raw disk device.
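The size difference is easy to verify with diskinfo(8); a quick sketch, assuming a spare disk da8 labeled as in the example above:

Code:
diskinfo -v /dev/da8 | grep mediasize         # raw disk
diskinfo -v /dev/label/da8 | grep mediasize   # labeled provider, one sector smaller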
 
bear in mind that a labelled device is one sector smaller than the raw disk device.
If you are worried this could have an effect on a pool's health, apparently it doesn't:

Test environment: VirtualBox VM
  • EFI enabled
  • FreeBSD system 13.1-RELEASE-p3 - ZFS 2 disks mirror (AHCI controller)
  • Data pool (zdata) - 6 disks raid10 (raid 1+0 n x 2-way mirrors) (LsiLogic SAS controller)
  • 'zdata' pool created with da* device names, some data copied from the FreeBSD system to 'zdata'; afterwards each device, one after another, was detached from the pool, glabel(8)'ed, and attached back to the pool by its label name.
zpool-status(8) doesn't report any errors.

Some disks labeled: [screenshot: zpool-disk-labels-01.png]

All disks labeled: [screenshot: zpool-disk-labels-02.png]

After scrub: [screenshot: zpool-disk-labels-03.png]


Note:
The kernel module geom_label.ko must be set to load automatically in /boot/loader.conf so the pool on the labeled disks is imported automatically.
That module is not needed if a GENERIC kernel is used; it is compiled in:

src/sys/amd64/conf/GENERIC
Code:
options     GEOM_LABEL        # Provides labelization
 
If you are worried this could have an effect on a pool's health, apparently it doesn't:
A zpool-scrub(8) examines "all data in the specified pools to verify that it checksums correctly". There is no assertion that hitherto unused sectors will be examined.

The manual for glabel(8) is pretty clear that:

A label can be set up on a GEOM provider in two ways: “manual” or “automatic”. When using the “manual” method (glabel create), no metadata are stored on the devices, so a label has to be configured by hand every time it is needed. The “automatic” method (glabel label) uses on-disk metadata to store the label [stored in a provider's last sector] and detect it automatically in the future.

It's not sensible to assign the last sector of a provider to both ZFS and GEOM. Doing so may corrupt your disk any time ZFS decides to use the last sector. I think it's profoundly unwise to use glabel to retrofit a permanent label on a whole disk without first shrinking the ZFS provider by one sector.
 
Currently I have set up the OS on the last two disks, da6 and da7, but it appears that on the HPE DL380 Gen9 the disk order may change, since these are the two disks in the rear bays. If I add 2 new disks to the front, the disk order will change (the last 2 disks will become da8/da9). Is there a way to change the pool to access the disks by ID instead?

The visual representation from zpool status has nothing to do with the actual metadata ZFS uses to reassemble a pool at boot.
Actually, you can change which labels are shown in the zpool status output via the sysctls 'kern.geom.label.disk_ident.enable', 'kern.geom.label.gptid.enable' and 'kern.geom.label.gpt.enable'. E.g. I have in /boot/loader.conf on my storage hosts:
Code:
kern.geom.label.disk_ident.enable=0
kern.geom.label.gptid.enable=0

So zpool status will show only gpt labels (if present).
The nice thing: even if set to 0, the others are used as fallback - so if no gpt label is present, the disk label is shown; if that also isn't set, the disk identifier is shown:
Code:
  pool: jails
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:10:02 with 0 errors on Fri Nov  4 01:10:02 2022
config:

        NAME        STATE     READ WRITE CKSUM
        jails       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da2     ONLINE       0     0     0

errors: No known data errors

  pool: stor1
 state: ONLINE
  scan: scrub repaired 0 in 0 days 08:29:31 with 0 errors on Tue Nov  1 13:44:31 2022
config:

        NAME                    STATE     READ WRITE CKSUM
        stor1                   ONLINE       0     0     0
          raidz1-0              ONLINE       0     0     0
            label/cca255027d59  ONLINE       0     0     0
            label/cca255027f11  ONLINE       0     0     0
            da8                 ONLINE       0     0     0
        logs
          mirror-1              ONLINE       0     0     0
            gpt/slog-IN896a9    ONLINE       0     0     0
            gpt/slog-IN89db9    ONLINE       0     0     0
        cache
          gpt/l2arc-IN896a9     ONLINE       0     0     0
          gpt/l2arc-IN89db9     ONLINE       0     0     0

errors: No known data errors

  pool: stor2
 state: ONLINE
  scan: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        stor2               ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            label/VRJX09DK  ONLINE       0     0     0
            label/VRK0M8BK  ONLINE       0     0     0
          mirror-1          ONLINE       0     0     0
            label/VRK0M91K  ONLINE       0     0     0
            label/VRK1JT4K  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:04:45 with 0 errors on Tue Nov  1 05:05:45 2022
config:

        NAME                 STATE     READ WRITE CKSUM
        zroot                ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            gpt/zfs-IN89dca  ONLINE       0     0     0
            gpt/zfs-IN2566d  ONLINE       0     0     0

errors: No known data errors

I'm usually using GPT labels for partitioned drives and glabels when using whole disks. The 'jails' pool is rather ancient and I used no labels back when I created that pool (or I just forgot), so only the disk identifier is shown.

If you move the disks around or put them into another host, the pools are correctly reassembled no matter in what order the disks are connected/recognized. All those labels don't matter to ZFS - they are merely for us humans to (hopefully) identify the correct disk to rip out (sesutil show is very helpful in that regard...)
 
I had a raidz2 pool not so long ago consisting of 8 disks all identified by /dev/da*.
I also looked for solutions, and it's actually not that hard if you want to use something else that doesn't change between reboots. What I did was export the pool, then import it again while specifying "-d /dev/diskid/" to zpool import; this changes the /dev/da* entries to /dev/diskid/* entries. This approach works great if you gave ZFS the full disk to use (no GPT headers / partitions / labels).
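For reference, the sequence looks roughly like this (a sketch; 'storage' is the pool name from the output below, and it assumes this is not the pool you boot from):

Code:
zpool export storage
zpool import -d /dev/diskid/ storage   # look up devices under /dev/diskid on import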

Code:
[/home/dries]$ zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 03:42:20 with 0 errors on Tue Oct 25 07:04:05 2022
config:

        NAME                      STATE     READ WRITE CKSUM
        storage                   ONLINE       0     0     0
          raidz2-0                ONLINE       0     0     0
            diskid/DISK-W6A1XYMY  ONLINE       0     0     0
            diskid/DISK-W6A1WL8Y  ONLINE       0     0     0
            diskid/DISK-W6A1XDXW  ONLINE       0     0     0
            diskid/DISK-W6A1XB8T  ONLINE       0     0     0
            diskid/DISK-W6A1XBHT  ONLINE       0     0     0
            diskid/DISK-W6A1XZ7B  ONLINE       0     0     0
            diskid/DISK-W6A1M7A2  ONLINE       0     0     0
            diskid/DISK-W6A1M7ET  ONLINE       0     0     0

If you boot from the pool in question, you will have to boot from USB to export and import your root pool. In that case I would opt for GPT labels, as you never let ZFS use the full disk anyway: you have a UEFI partition, possibly a swap partition, etc. Label the needed partitions with "gpart modify -i 4 -l zroot1 /dev/da6" and "gpart modify -i 4 -l zroot2 /dev/da7". Then boot from USB, zpool export zroot, zpool import -d /dev/gpt/ zroot.
 
I've mailed the author of glabel(8) to ask him whether retroactively added on-disk metadata does indeed represent a danger to ZFS.

He was also involved in porting Solaris ZFS to FreeBSD, together with Kirk McKusick.
Please share the answer with us, if any!

I'd rather use raw disks than partitions for storage, to keep things simple. However, I don't find it trivial navigating the pros & cons of each approach (or understanding why "FreeBSD Mastery: Advanced ZFS" leans towards the use of GPT labels while mentioning in the first book's intro that using raw disks has advantages). Thanks!
 
veg I ran across this http://www.freebsddiary.org/zfs-with-gpart.php shortly after it was written and have been partitioning my ZFS devices ever since.
To me it boils down to "not all <insert size of device> drives are the same size". When you need to replace a provider, ZFS wants the replacement to be the same size or bigger; you can't put a smaller one in. So explicitly creating a partition of, say, 99% of the disk (assuming a whole disk, no boot partitions) makes it easy to keep things the same size.
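Something along these lines, for example (a sketch; the disk, label name and size are illustrative - pick a size comfortably below the disk's capacity):

Code:
gpart create -s gpt da8
gpart add -t freebsd-zfs -a 1m -l data-slot8 -s 3900G da8
# then refer to the partition by its label, e.g.
# zpool attach zdata da0 gpt/data-slot8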
I think using the raw device may have implications for physically moving drives to other machines, some good some not so good.

But above is my opinion, it works for me.
 
I don't find it trivial navigating the pros & cons of each approach ...
It's a tradeoff.

Objective measurable advantages of using raw disks:
  • You get a tiny bit more capacity. I think the overhead for GPT is in the dozens of KB; compared to the TB capacity of modern disks, this is irrelevant.
  • You might get a tiny bit more performance, because somewhere in the kernel the byte offsets need to be translated from partition to device. That's a memory access and an addition, or about a ns. Compared to the ms to dozens of ms it takes to do a disk IO, it's also irrelevant.
The rest is unmeasurable.
  • If you use raw disks only, you never have to learn the gpart command, which is "brain efficiency". In practice, if you manage a system large enough to have multiple disks and to be using whole disks, you have to know a lot of disk management stuff (gpart, camcontrol, zpool, ...) anyway.
  • As discussed above, if you use disks in redundant groups (RAID or mirroring), during disk replacement you have to make sure the new volume is the same size as or larger than the old one. Partitions give you the flexibility to accomplish that, while still using new, larger disks efficiently.
  • And the final one completely trumps all the others: By using gpart, you can put human-readable string labels on the partitions. You don't need to remember that "this is the old Hitachi with the green dot on the label", and you don't need to write down disk serial numbers to figure out which data is which, you can just read the partition table and have the information right there. For example, the partition label might say (quite abbreviated): This is RalphBSz's disk, it is the Hitachi bought in 2016Q3, this partition is used in the Zpool that creates the /home filesystem using RAID-Z2, and that other partition is the /temp_backup volume that's not RAIDed.
In my (not at all humble) opinion, the advantage of making the disk drive self-describing trumps everything else.
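For instance, a self-describing partition label along those lines might be set and read back like this (a sketch; the disk and the label text are made up, and GPT labels are limited to 36 characters):

Code:
gpart modify -i 1 -l ralphbsz-hitachi-2016Q3-home-rz2 da3
gpart show -l da3     # read the labels back from the partition table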
 
It's a tradeoff.
[…]
In my (not at all humble) opinion, the advantage of making the disk drive self-describing trumps everything else.
Thanks a bunch for that detailed write-up, ralphbsz; a very much appreciated recap with interesting details.

One last thing, if I may: is glabel(8)-ing raw disks any cause for concern, or a truly inferior option, if the main point is to make disks self-describing (assuming one is not too concerned about difficulty finding same-size or larger disks in the foreseeable future)?
 
My opinion: if you are going to use raw disks, glabeling is a good idea. Do it first, then use that label when creating the zpools.
 
This issue has been nagging at me for some time.

So I have been reading about GEOM design and Classes.

GEOM Classes are broadly used to stack software modules between storage hardware managed by the kernel ("disks") and what is presented to "user" space.

So a simple example would be to start with two physical disks. These may be presented as a mirror using the GEOM MIRROR class. The flexibility of the concept is demonstrated by the ability to encrypt the mirror simply by inserting a GEOM ELI Class (geli(8)) into the stack.
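As a concrete illustration of such a stack (a sketch only; the disk names ada1/ada2 and the mirror name gm0 are made up):

Code:
gmirror load                      # load the GEOM MIRROR class (geom_mirror.ko)
gmirror label -v gm0 ada1 ada2    # two disks become the provider /dev/mirror/gm0
geli init /dev/mirror/gm0         # GEOM ELI class layered on top of the mirror provider
geli attach /dev/mirror/gm0
newfs /dev/mirror/gm0.eli         # the encrypted provider is what the file system consumes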

So the physical disk driver is a "provider". Within the GEOM Class stack, each module is a "consumer" of the module below it and a "provider" to the module above it, ad nauseam.

Many (but not all) of the GEOM Classes are enumerated in GEOM(4). Well known ones include disks, partitions, memory disks, concats, stripes, geli, mirrors, multipaths, ZFS VDEVs, labels, ...

GEOMs are both extensible (anyone, following the rules, can write a new one), and topologically agnostic (can be stacked arbitrarily). They are an exceptionally flexible and powerful design feature of FreeBSD.

Each GEOM Class requires a kernel module to be loaded to implement the Class.

For each GEOM Class, the metadata of the Class are located in the last sector of the provider (and thus must fit in it). Not all GEOM Classes require permanent metadata, but most do.

Since GEOMs can be stacked, it follows that, as we traverse the GEOM stack upwards from the kernel, successive providers will each (usually) consume the last sector of the previous provider for storing metadata.

Since GEOMs can be stacked arbitrarily, it follows that we can't possibly know how many "last" sectors will be consumed by an unknown (and possibly significant) number of providers. But we must assume that each provider may consume one more sector, and thus the sector count presented will generally drop by one with each step up the GEOM Class stack.

Is it OK to retrofit an "automatic" (permanent) glabel(8) to a "disk" (provider) already in use by a file system? We know that the GEOM LABEL Class will want to use the last sector of the provider for storage of metadata. But this space is in the possession of the file system.

What if we also decide to encrypt the file system using the GEOM ELI Class? Now we need ANOTHER last sector!

My guess is that a GEOM consumer is entitled to use all the sectors offered by the provider, simply because it cannot guess how many extra providers may be added to the Class stack at some time in the future. That would prevent retrofitting extra Classes into an existing Class stack (because the last sector is owned by the consumer, and not by a new provider).

If my guess is right, the stack of GEOM modules can't, in general, be changed after they are initially established, because the metadata would be trashed. However, there are exceptions since some Classes don't necessarily keep metadata (the "manual" glabel(8) is an example).

I suspect that if I wrote a file system, I might not use the last sector for a number of reasons... leaving room for the GEOM metadata for a retrospective disk label being one of them.

So I still don't have an answer to the question above... But I do at least know why I would be acutely reluctant to do it.
 