ZFS 2x3tb as one raidz member

robot468 · Jul 15, 2022

Jose said:
I just realized that he's trying to BIOS boot from a GPT partition. GEOM mirrors are incompatible with GPT partitions. I wonder if gconcat(8) has the same limitation? Perhaps he should try zfsboot(8).

I tried in vmware: root partition on zfs mirror of gpt partitions on two separate small disks(emulating ssd or sd card).main pool with /usr, /var/ and etc on 20gb+10gb+10gb, gconcat 10gb+10gb, gpt scheme on two 20gb disks (native and gconcat'ed), zfs stripe of 2x20gb gpt partition.
gtpzfsboot on 2 small disks. it works.

Why do you think GEOM mirrors is incompatible with BIOS?

Jose · Jul 15, 2022

robot468 said:
Why do you think GEOM mirrors is incompatible with BIOS?

Because the Handbook says so:

gmirror(8) stores one block of metadata at the end of the disk. As GPT partition schemes also store metadata at the end of the disk, mirroring entire GPT disks with gmirror(8) is not recommended. MBR partitioning is used here because it only stores a partition table at the start of the disk and does not conflict with the mirror metadata.

Chapter 19. GEOM: Modular Disk Transformation Framework

In FreeBSD, the GEOM framework permits access and control to classes, such as Master Boot Records and BSD labels, through the use of providers, or the disk devices in /dev.

docs.freebsd.org

robot468 · Jul 15, 2022

Jose said:
Because the Handbook says so:

Ah, ok. gmirror incompatible with gpt, not BIOS. I will use zfs mirror(if I do), not gmirror. gconcat, I think, has no such problems. The resulting block device apparently does not contain service sectors, which are written to the source devices.

mer · Jul 15, 2022

I think that "order of operations" matters a lot here with gpt and gmirror. Michael W Lucas, FreeBSD Mastery: Storage Essentials actually goes through this. Basically:

gpart create -s gpt dev1
gpart create -s gpt dev2
At this point you have marked dev1 and dev2 with gpt schemes but have not yet created any partions.

gmirror load
gmirror label RootMirror dev1
gmirror insert RootMirror dev2
Now you have a gmirror of dev1 and dev2 but no partitions.

gpart create -s gpt mirror/RootMirror To create a gpt partitioning scheme on the mirror.
gpart add -t freebsd-boot -l boot -s 512K mirror/RootMirror
add more partions here.
What you've done is create the mirror and then partition on top of the mirror device. That takes care of any label in the last block issue .
Of course that only applies to the gmirror class, not gconcat or anything else.

Jose · Jul 16, 2022

mer said:
I think that "order of operations" matters a lot here with gpt and gmirror.

Yup, and it looks like this is exactly backwards.

robot468 · Jul 16, 2022

Jose said:
Yup, and it looks like this is exactly backwards.

I don't really know what you're talking about, but I checked with myself

Code:

# gmirror status
      Name    Status  Components
mirror/gm0  COMPLETE  vtbd1 (ACTIVE)

# diskinfo -v mirror/gm0
mirror/gm0
        512             # sectorsize
        21474835968     # mediasize in bytes (20G)
        41943039        # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
        No              # TRIM/UNMAP support
        Unknown         # Rotation rate in RPM

# diskinfo -v vtbd1
vtbd1
        512             # sectorsize
        21474836480     # mediasize in bytes (20G)
        41943040        # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset
                        # Disk descr.
        BHYVE-96F6-C625-B109    # Disk ident.
                        # Attachment
        No              # TRIM/UNMAP support
        Unknown         # Rotation rate in RPM

# gpart show mirror/gm0
=>      40  41942960  mirror/gm0  GPT  (20G)
        40       216              - free -  (108K)
       256  41942528           1  freebsd-zfs  (20G)
  41942784       216              - free -  (108K)

The size of the mirror device is one sector smaller than the original disk. So you can easily create a partition table on the mirror device and there will be no conflicts.

mer · Jul 16, 2022

Jose said:
Yup, and it looks like this is exactly backwards.

It looks backwards from what most people would consider normal, but it makes sense. gmirror class passes command down to the underlying device, but is presenting a modified view to the layers above it. As robot468 points out the mirror is one sector smaller, so gmirror is protecting it's metadata.

I've never done it, so I clearly stated my source and it was exactly the situtation being talked about (gpt and gmirror not being compatible) so I put it out there.

robot468 · Jul 16, 2022

Back to the subject of the post. I was thinking that there is another difference between the options with one 8x6Tb raidz2 pool and two pools (keep the old one and make a new one) - IOPS. Two pools would give me 2xIOPS if I correctly split the i/o operations. After all, the disadvantage of large pools is that everything(space, r/w bandwidth) grows except IOPS.

Jose · Jul 16, 2022

robot468 said:
I don't really know what you're talking about, but I checked with myself

What does gpart show report for the devices that make up the mirror?

mer said:
It looks backwards from what most people would consider normal, but it makes sense. gmirror class passes command down to the underlying device, but is presenting a modified view to the layers above it. As robot468 points out the mirror is one sector smaller, so gmirror is protecting it's metadata.

No. The GPT partition is on top of the gmirror device. What you suggested was to create two GPT partitions and then create a gmirror on top of the devices (not the partitions) that contain those partitions. That will not work, if the handbook is to be believed.

robot468 · Jul 16, 2022

Jose said:
What does gpart show report for the devices that make up the mirror?

I have not checked and have already deleted this test stand. I suspect it will show that one of the backup gpt metadata is corrupt.

I think I understand what you mean. That I won't be able to boot from one of the devices that make up the mirror (because at this moment geom_mirror.ko is not loaded and we don't have a mirror as such)?

This seems wrong to me, gpt stores only the backup of its metadata in the last sector. It will probably warn that the backup is corrupted, but it will work. I can check it if you insist.

mer · Jul 16, 2022

Jose said:
What you suggested was to create two GPT partitions and then create a gmirror on top of the devices (not the partitions) that contain those partitions.

That is not what the commands are doing.

mer said:
gpart create -s gpt dev1
gpart create -s gpt dev2
At this point you have marked dev1 and dev2 with gpt schemes but have not yet created any partions.

Like I said, this is directly from the book, my supposition on this step is perhaps it makes it easier for GEOM to taste the devices.

The steps that I next say create partitions on top of the gmirror, which is the way around the gpt/gmirror last sector thing.
Creating a mirror AFTER gpt partitioning of course causes issues, but that's not what is happening.

Booting a gmirror is similar to booting a zfs mirror. All devices in the mirror get boot blocks, they all get "loader", loader.conf has a line that says to load gmirror. BIOS finds a bootable device, one of the devices in the mirror, starts booting off it. That device loads the gmirror module, which causes the mirror to come up and by the time the system hits single user the mirror is complete. One typically can't do anything with the system before single user anyway so not having the mirror complete in the loader is typically a non issue. There are probably cases where you would want the mirror complete in the loader but I think those are rare. If the BIOS sees say ada0 and ada1 as bootable devices, and they are configured as a mirror, you should be able to stop in the BIOS, select either ada0 or ada1 as the boot device and boot the system correctly.
I know it works that way for a ZFS mirror (I've physically tested it) so I'm assuming gmirror would do the same thing.

Jose · Jul 16, 2022

robot468 said:
I have not checked and have already deleted this test stand. I suspect it will show that one of the backup gpt metadata is corrupt.

I think I understand what you mean. That I won't be able to boot from one of the devices that make up the mirror (because at this moment geom_mirror.ko is not loaded and we don't have a mirror as such)?

This seems wrong to me, gpt stores only the backup of its metadata in the last sector. It will warn that the backup is corrupted, but it will work. I can check it if you insist.

Suit yourself. I'm using MBR partitions on my gmirror. The whole point of using mirrors is fault tolerance. I'm not OK with losing metadata backups in a setup that's supposed to be fault tolerant.

Then again, what you've done is what the Handbook explicitly says won't work. It could be the Handbook is wrong. It's happened before.

Jose · Jul 16, 2022

mer said:
Like I said, this is directly from the book, my supposition on this step is perhaps it makes it easier for GEOM to taste the devices.

With all due respect to Mr. Lucas, I think the Handbook should be authoritative. That said, it is sometimes wrong (see above).

mer said:
gpart create -s gpt dev1
gpart create -s gpt dev2
At this point you have marked dev1 and dev2 with gpt schemes but have not yet created any partions.

~~The man page for gpart(8) claims otherwise:~~

create Create a new partitioning scheme on a provider given by
provider. The scheme to use must be specified with the -s
scheme option.

Edit: Heh, I should actually read what I quote. How is this "scheme" stored on disk? Are you sure it doesn't write any metadata for it? Metadata whose backup could be overwritten by gmirror?
Edit2: The Handbook claims it is precisely this "scheme" that stores metadata at the end of the disk

robot468 · Jul 16, 2022

Jose said:
Suit yourself. I'm using MBR partitions on my gmirror. The whole point of using mirrors is fault tolerance.

ok, see attachmets. I installed system on mirror with installer. It boots and works. Not even a warning.

Jose said:
I'm not OK with losing metadata backups in a setup that's supposed to be fault tolerant.

You will NOT loosing anything. Warnings may appear because the system is reading the source device and not the mirror during boot.

Jose · Jul 16, 2022

robot468 said:
ok, see attachmets. I installed system on mirror with installer. It boots and works. Not even a warning.

It works for now. If that's good enough for you, fair enough.

robot468 said:
You will NOT loosing anything. Warnings may appear because the system is reading the source device and not the mirror during boot.

As long as you don't lose the GPT metadata at the beginning of the disk, sure. Once you do you'll be screwed because the backup was overwritten by gmirror. Or not 'cause you have a backup of it on the second disk of the mirror. I guess then you could re-partition the failed disk and rebuild the mirror.

Or you could just use MBR partitions and not worry about it.

robot468 · Jul 16, 2022

Jose said:
As long as you don't lose the GPT metadata at the beginning of the disk, sure. Once you do you'll be screwed because the backup was overwritten by gmirror. Or not 'cause you have a backup of it on the second disk of the mirror. I guess then you could re-partition the failed disk and rebuild the mirror.

You don't understand the scheme. All copies of the metadata are in place. Imagine you have an image of disk1.img with a gpt table, it has all the metadata copies, everything is fine. I take that file and append my 512 bytes to the end of that file, where I keep the information that that file is part of the mirror.

If you show this file to a program that expects it to be just a gpt disk image, it will give you a warning because it looks in the last sector and sees something other than what it expects. But that does not mean that the metadata is missing somewhere.

UPD: This is of course correct if you first create a mirror and then create a gpt table on it.

Jose · Jul 16, 2022

robot468 said:
You don't understand the scheme.

I'm pretty sure I do.

16 KB (minus 512 bytes) from the end	Secondary GPT table - It is byte-for-byte identical to the Primary table. Used mainly for recovery operations.
Last 512 bytes	Secondary GPT Header - Contains the Unique Disk GUID, Location of the Secondary Partition Table, No. of possible entries in the partition table, CRC32 checksums of itself and the Secondary Partition Table, Location of the Primary GPT Header. This header can be used to recover GPT info in case the primary header is corrupted.

GUID Partition Table - ParabolaWiki

wiki.parabola.nu

robot468 said:
I take that file and append my 512 bytes to the end of that file, where I keep the information that that file is part of the mirror.

~~So you destroy the secondary GPT header.~~ How are you going to make an additional 512 bytes magically appear at the end of a physical disk?

robot468 said:
If you show this file to a program that expects it to be just a gpt disk image, it will give you a warning because it looks in the last sector and sees something other than what it expects. But that does not mean that the metadata is missing somewhere.

Yes, it does. ~~It means you've overwritten the copy of the metadata that's used for recovery.~~ All GPT utilities expect to find the recovery data in the last 512 bytes. By moving that up, you've made it so it will not be found, and the GPT scheme on that disk is not recoverable. That's why it warns you.

Again, I'm not sure this is actually happening. Maybe gmirror has been fixed to respect GPT secondary headers and the Handbook has not been updated.

robot468 · Jul 16, 2022

Jose said:
How are you going to make an additional 512 bytes magically appear at the end of a physical disk?

I won't.
Instead I will create a gpt table one sector less.

When you create a mirror on a device with size N sectors, the mirror device appears with size N-1 sectors, I wrote about it above. And gpart works with the device where the last sector is actually the penultimate sector on the physical device.

Jose · Jul 16, 2022

robot468 said:
I won't.
Instead I will create a gpt table one sector less.

When you create a mirror on a device with size N sectors, the mirror device appears with size N-1 sectors, I wrote about it above. And gpart works with the device where the last sector is actually the penultimate sector on the physical device.

Yup, you're right. I checked on my mirror and it's one sector less than the devices it's made up from. It does look like gmirror was fixed to work with GPT partitions and the Handbook was not updated.

Jose · Jul 16, 2022

robot468 said:
Back to the subject of the post. I was thinking that there is another difference between the options with one 8x6Tb raidz2 pool and two pools (keep the old one and make a new one) - IOPS. Two pools would give me 2xIOPS if I correctly split the i/o operations. After all, the disadvantage of large pools is that everything(space, r/w bandwidth) grows except IOPS.

Yup. Very in-depth discussion here:

How I Learned to Stop Worrying and Love RAIDZ

The popularity of OpenZFS has spawned a great community of users, sysadmins, architects and developers, contributing a wealth of advice, tips…

www.delphix.com

robot468 · Jul 16, 2022

Jose said:
Yup. Very in-depth discussion here:

They suggest building a pool of multiple raidz vdevs? I think the problem with this approach is that if I lose one raidz, I lose the whole pool.

mer · Jul 16, 2022

robot468 said:
They suggest building a pool of multiple raidz vdevs? I think the problem with this approach is that if I lose one raidz, I lose the whole pool.

I did not read the link, but I think it depends on what type of pool. If a stripe, then I think losing one raidz would affect the whole thing, but what does it take to lose a complete raidz vdev? That's multiple hardware failures on the single vdev.

I think one can take things to the extreme:
Could you take a bunch of disks, create mirror pairs, then create a raidz out of those mirror pairs while at the same time using another bunch of disks, create mirror pairs, create the same raidz and then mirror the two raidz?
I don't know, it hurts my head to think about it and I'm not sure what it would give you.

In the end I think a lot of what is "best" is defined by the workload. Generically, "more vdevs give better IOPs", different vdevs bias towards read perf, write perf, space, redundancy (how many you can lose before it doesn't work). I think searching around you can find some good links talking about that. Basically "If I have 8 disks to arrange in a zpool what configuration will give me the best read performance with good redundancy".
I don't know the answer to that but I'm sure it's out there (in at least 12 different configurations).

Jose said:
How are you going to make an additional 512 bytes magically appear at the end of a physical disk?

Jose, I did not mean to imply that Lucas was more authoratative than the Handbook but in practice he's actually doing a lot of the examples in his books so he may be more accurate than the Handbook. Enough said, I'm not debating that any more.
The quote above, robot468 is referring specifically to a disk image file so its trivial to append. But: the spirit of it is "take two devices, create a gmirror with them, then use gpart to create partitions on the mirror device"
The stacking there has the gpart commands not acting on the physical devices, but rather on the GEOM mirror. The GEOM mirror modifies what the gpart command sees as the last sector, or at least it should be modifying it.

Jose · Jul 16, 2022

mer said:
Jose, I did not mean to imply that Lucas was more authoratative than the Handbook but in practice he's actually doing a lot of the examples in his books so he may be more accurate than the Handbook. Enough said, I'm not debating that any more.

Hey no worries, man. I apologize for my tone above. This stuff is confusing.

mer said:
The quote above, robot468 is referring specifically to a disk image file so its trivial to append. But: the spirit of it is "take two devices, create a gmirror with them, then use gpart to create partitions on the mirror device"
The stacking there has the gpart commands not acting on the physical devices, but rather on the GEOM mirror. The GEOM mirror modifies what the gpart command sees as the last sector, or at least it should be modifying it.

I still think there's a problem here, but I need to do more research. I can take it to another thread if your and robot468's patience with me is at an end.

Jose · Jul 16, 2022

robot468 said:
They suggest building a pool of multiple raidz vdevs? I think the problem with this approach is that if I lose one raidz, I lose the whole pool.

This is true of any ZFS pool with any kind of vdev. The vdev layer is responsible for redundancy. I personally love the clean separation of concerns in the ZFS architecture. Maybe you have to suffer with LVM on Linux for a while to truly appreciated it.

robot468 · Jul 16, 2022

Jose said:
I still think there's a problem here, but I need to do more research. I can take it to another thread if your and [FONT=monospace]robot468[/FONT]'s patience with me is at an end.

No, I am ready to wait for your arguments

Jose said:
Maybe you have to suffer with LVM on Linux for a while to truly appreciated it.

Every time I touch Linux (out of necessity) I feel pain

ZFS 2x3tb as one raidz member

Attachments