ZFS special device on shared drive

I have two SSD drives, ada0 and ada1, 7.68TB each

Can I install FreeBSD using ZFS and use ada0/ada1 as a mirror vdev, carving out 500GB for the OS while leaving the rest available for use as a ZFS special device?

So for example, I imagine what I want to do is partition the drives in two: ada[0|1]p1 (500GB) and ada[0|1]p2 (7.1TB).

Then I would create a ZFS mirror based on ada[0|1]p1 for the OS and a special mirror vdev based on ada[0|1]p2.

Is this doable, or does the special device want the entire drive?
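In commands, the layout I'm imagining is roughly this (just a sketch, not tested; sizes and labels are placeholders and I've left out the boot/EFI and swap partitions):

  # GPT scheme plus two partitions per SSD; repeat for ada1 with labels os1/meta1
  gpart create -s gpt ada0
  gpart add -t freebsd-zfs -s 500G -l os0 ada0
  gpart add -t freebsd-zfs -l meta0 ada0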
 
It might be easier to comment if we know what you are trying to achieve.

A VDEV can be a whole disk, or any partition on any disk. Because of that, ZFS gives you plenty of latitude to combine VDEVs in sub-optimal ways.

As a general rule, it's not a great idea to create multiple partitions on a "disk" and assign those partitions to VDEVs in different pools because it can create uncoordinated competition for access to the "disk".

If you want a root mirror with swap, a pair of small consumer grade SSDs can be had for less than US$50.
 
I only have 2 slots for SSD drives in my chassis.

With those two slots I have to create mirror vdevs that will host the OS/swap & special device
 
SirDice: the "ZFS special allocation class" (I think). Basically put your metadata and "small allocations" on a separate device from the main devices making up a pool.

This link may have been posted by Alain De Vos in the OP's other thread


Based on what I've read and what Alain De Vos said in your other thread, the "ZFS special" device stores metadata and other small allocations (one reference I came across called it the "special allocation class"). The intention, as I understood it, is to put it on a separate, faster device than the rest of the zpool to get a performance boost. The downside: if the special device fails you lose the whole pool, so you want to give the zfs-special the same level of redundancy as the zpool it serves: a mirror gets a mirror, RAID-Z3 gets RAID-Z3, etc. (that's according to a couple of the articles I came across).
What do you intend to attach your zfs-special vdev to? Hopefully not your boot VDEVs. Maybe another zpool? If I recall correctly, didn't your other thread talk about a system with some 32 disks or something?

My opinion only (it's your system, so feel free to ignore it), but I would not create a ZFS special for a boot VDEV, especially not on the same physical devices. Mirror the root pool vdevs? Sure, I do that; lots of people do.
If you are going to use the second partition on ada0 and ada1 as a zfs-special vdev and attach it to a pool on other devices, it may work, but think about data recovery if you lose the special. Chances are at that point you've also lost your OS mirror, so you are already rebuilding the system, but now, since the pool's metadata was on the devices you are rebuilding, you may not be able to import the pool that had the special attached. It sounds like if you explicitly remove the special, its contents actually get flushed back to the main devices, but that's not going to happen in a failure situation.
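For reference, attaching a mirrored special to an existing pool is a one-liner; something like this (pool name and partitions are placeholders):

  # add a mirrored special allocation class vdev to an existing pool
  zpool add tank special mirror ada0p2 ada1p2
  # an explicit "zpool remove" of that vdev should migrate its data back to the
  # main vdevs, but only on pools whose top-level vdev layout supports device removal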
 
Basically put your metadata and "small allocations" on a separate device from the main devices making up a pool.
Yes, but why the need to store 7 TB of metadata? There's only a 500GB OS pool, that's not going to use up 7TB worth of metadata.
 
Yes, but why the need to store 7 TB of metadata? There's only a 500GB OS pool, that's not going to use up 7TB worth of metadata.
Going by memory because I'm too lazy to look for it: the OP (last1) had a thread a week or two ago where I think he referenced something about a zpool with 32 devices or something, and the rest of what I typed in #6 was basically asking him what he wants to use it for.

As for your question: yes, I agree that if he were setting up a zfs-special for the OS pool it would be a waste, a bad idea. If he's going to use it for something else, well, I've seen references that for "normal" workloads metadata is roughly 0.5-1% of the pool size. So if the pool is 700-1400TB, then maybe the 7TB is needed.
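If he has an existing pool with a similar file mix, something like this gives a rough idea of how much space metadata actually takes (going from memory on the flags, and it walks the whole pool, so it can take a long time):

  # print block statistics with a per-type breakdown; the metadata categories
  # in the summary are roughly what would land on a special vdev
  zdb -bb tank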

The OP I think is leaving out a lot of critical information.

EDIT:
I was wrong about a different thread by the OP. I think he is confusing this with a device node like /dev/ada0; a lot of people have referred to those as "special" devices. He talks about "I have a mirror, where can I dd from".

This is his other thread I was thinking of:
 
Oops, I didn't mention these little details. This is for a backup server with 36 slots for HDDs (400TB total storage) + 2 slots for SSDs.

The 400TB actually stores upwards of 500 million small files, so the 7TB of metadata is warranted; I'm not storing any small files on the special device, just the metadata.

On the 2 SSDs I want to store the OS and the metadata for the big pool (not the OS pool's metadata). Hence the initial question: whether I can partition the 2 SSDs, mirror one partition for the OS, and use the second one for the special device.
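From what I've read, metadata-only is the default behaviour for a special vdev; small file blocks only go to it if you opt in per dataset with the special_small_blocks property (pool name is a placeholder):

  # 0 (the default) keeps only metadata on the special vdev; a value such as 16K
  # would also send blocks of that size or smaller to it
  zfs get special_small_blocks tank
  zfs set special_small_blocks=0 tank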
 
What's this "special device" you keep mentioning? Why do you need 7 TB of metadata? Metadata of what exactly?

As others have said, it dramatically increases pool operation speed, at least in my case. I run backup jobs where I rsync many large directories of small files. The backup job went from 15-16h to just under 2h with the metadata stored on SSD.
 
As others have said, it dramatically increases pool operation speed
It was a rhetorical question; I was wondering if you understood what it was.
Oops, I didn't mention these little details. This is for a backup server with 36 slots for HDDs (400TB total storage) + 2 slots for SSDs.
That's the detail that was missing. It made little sense to have 7TB of metadata for a 500GB pool.
 
On the 2 SSDs I want to store the OS and the metadata for the big pool (not the OS pool's metadata). Hence the initial question: whether I can partition the 2 SSDs, mirror one partition for the OS, and use the second one for the special device.
Ok, that makes sense.
In theory, yes, I think you can do that.
What I'm not sure of is the wisdom of doing that. It may have performance implications, it may not; I just don't know. I would also be cautious about the data recovery aspect if you lose the mirror pair. Yes, you'd have to lose the whole mirror, and since the OS is on it you'd be reinstalling anyway, but it's worth deciding up front how important the data is.
 
Ok, that makes sense.
In theory, yes, I think you can do that.
What I'm not sure of is the wisdom of doing that. It may have performance implications, it may not; I just don't know. I would also be cautious about the data recovery aspect if you lose the mirror pair. Yes, you'd have to lose the whole mirror, and since the OS is on it you'd be reinstalling anyway, but it's worth deciding up front how important the data is.

That's what I'm worried about as well, but I see no alternative.

This is the chassis I'm working with: https://www.supermicro.com/products/archive/chassis/sc847be1c-r1k28lpb

You mentioned a 3-way mirror, but I see no way of doing that unless I place a third SSD into a 3.5" adapter.

I'm also thinking now about these performance implications. Theoretically the OS partition will very rarely be accessed, mainly for writing logs I guess, but it's indeed an unknown. I thought someone might have done something similar.
A wild idea would be to boot the OS via USB and dedicate the SSD drives to the special vdev, but I've heard stories of USB ports freezing, etc. I don't want any trouble.
 
I have two SSD drives, ada0 and ada1, 7.68TB each

Can I install FreeBSD using ZFS and use ada0/ada1 as a mirror vdev, carving out 500GB for the OS while leaving the rest available for use as a ZFS special device?

So for example, I imagine what I want to do is partition the drives in two: ada[0|1]p1 (500GB) and ada[0|1]p2 (7.1TB).

Then I would create a ZFS mirror based on ada[0|1]p1 for the OS and a special mirror vdev based on ada[0|1]p2.

Is this doable, or does the special device want the entire drive?
Yes. Create a geom_mirror of ada0p1/ada1p1 and allocate ada0p2 and ada1p2 to the ZFS mirror. Let ZFS manage its own mirror and put your UFS filesystems on the geom_mirror partitions.
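A rough sketch of that approach, assuming the ada0p1/ada1p1 and ada0p2/ada1p2 partitions already exist (label name is made up, untested):

  # mirror the first partitions with geom_mirror and put UFS on top for the OS
  gmirror label -v os ada0p1 ada1p1
  echo 'geom_mirror_load="YES"' >> /boot/loader.conf
  newfs -U /dev/mirror/os
  # the second partitions then go to ZFS, e.g. as the mirrored special vdev shown earlier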
 
I'm also thinking now about these performance implications. Theoretically the OS partition will very rarely be accessed, mainly for writing logs I guess, but it's indeed an unknown.
I'm assuming that there is little to no user login on this; it's basically a big honking data server that is accessed remotely, so I agree that the OS portion should be relatively quiet.
The 3-way mirror was based more on the conversation around "backing up" and using dd. This thread to me takes that out of the equation.
Right now it seems like your requirements are "remotely accessed 400TB with lots of small files, need as much performance as possible, desire to have redundant boot devices".
Is that a fair summary?

If so, I'd look at the system as 2 different parts:
Booting/OS
Performance data

Boot devices: a mirror is good, but one could also install on a USB stick and then duplicate it, so that if the device fails you just plug another one in. As for "I don't want to have any troubles", well, we can't guarantee that, can we. I look at this as "get back in service as quickly as possible".
Assuming this machine has gobs of memory, what about booting from USB into a memory image (the Linux initramfs idea) and basically running the OS from a RAM disk? That takes the USB out of the picture except for the booting itself, so in theory it stays safe longer?

The Data Pool. I think dedicating physical devices to the "special", and not trying to split them into OS and special, is better in the long run. If you do have physical room to make the special into a 3-way mirror (even with an adaptor), I think that would be better. Just keep in mind (this is a generalization) that writes to a mirror complete roughly at the speed of the slowest device and reads at the speed of the fastest, so keep the devices the same or close in specs.

I've never done this type of thing so I have no hands-on experience, but I have read a lot and followed interesting discussions about it, so take everything I write as opinion/with a grain of salt.
 
Gotcha, ok. I think then I might just put a small SSD into a 3.5" adapter and take up one of the HDD slots as the boot device.
I can live with that because I didn't plan on using all 36 bays anyway; I wanted to have two hot spares, but I can make do with just one.

So then I'd have: 2 mirrored SSDs fully dedicated to the special vdev, 34 HDDs for the pool, 1 hot-spare HDD, and 1 SSD in a 3.5" adapter as the boot device. That last one is not mirrored, but I use high-quality SSDs and, coupled with low usage, I don't expect it to fail; even if it does, I can easily reinstall the OS.

How does this sound?
 
I have looked after production systems with comparable capacities on several occasions, and have some observations.

You don't mention system availability requirements. I would never build a production system without a root mirror. The down-time and personal stress you would get from having to replace and rebuild a dead root on such a significant system is simply not worth it. Just a little bit of finger trouble can cause enormous grief. And it's not just about you. Even if you are super confident, your team mates, heirs, and successors won't thank you!

I don't really have a problem with your original idea of partitioning the large SSDs for a root mirror and a special device mirror. The O/S will do very little I/O. However, I would look very carefully at the spec of the SSDs used. I expect that the special device would warrant "Enterprise Class".

Your special device for metadata requires at least as much redundancy as you have in the 400 TB data pool(s).

With 400 TB of data, the end-to-end data movement capability needs special attention. In particular, the bandwidth of your backup regime should be tested. Make sure that the daily backups complete in less than a day, weekly backups in less than a week, etc. If backups impact production performance (and they will), your backup windows will be even smaller. It's tricky to test significant production load on host, network, and backup servers. Your system may warrant a variety of dedicated hardware components just for backups.

One hot spare in 36 slots is a big worry, especially if all the spindles are purchased in a single batch (because they will fail in clusters). Consider turning on ZFS compression (it's not a significant overhead), and allocating more slots to hot spares. I would think that your original plan of two spares would be a bare minimum.
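For example (pool and device names are placeholders), compression is a per-dataset property and spares can be added at any time:

  # lz4 is cheap on CPU and is inherited by child datasets
  zfs set compression=lz4 tank
  # add two hot spares to the pool
  zpool add tank spare da34 da35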

You have not mentioned the redundancy scheme for your data pool(s). There are big trade-offs in capacity and write speed depending on what you choose (striped mirrors vs. RAIDZn).
 
I have looked after production systems with comparable capacities on several occasions, and have some observations.

You don't mention system availability requirements. I would never build a production system without a root mirror. The down-time and personal stress you would get from having to replace and rebuild a dead root on such a significant system is simply not worth it. Just a little bit of finger trouble can cause enormous grief. And it's not just about you. Even if you are super confident, your team mates, heirs, and successors won't thank you!

I don't really have a problem with your original idea of partitioning the large SSDs for a root mirror and a special device mirror. The O/S will do very little I/O. However, I would look very carefully at the spec of the SSDs used. I expect that the special device would warrant "Enterprise Class".

Your special device for metadata requires at least as much redundancy as you have in the 400 TB data pool(s).

With 400 TB of data, the end-to-end data movement capability needs special attention. In particular, the bandwidth of your backup regime should be tested. Make sure that the daily backups complete in less than a day, weekly backups in less than a week, etc. If backups impact production performance (and they will), your backup windows will be even smaller. It's tricky to test significant production load on host, network, and backup servers. Your system may warrant a variety of dedicated hardware components just for backups.

One hot spare in 36 slots is a big worry, especially if all the spindles are purchased in a single batch (because they will fail in clusters). Consider turning on ZFS compression (it's not a significant overhead), and allocating more slots to hot spares. I would think that your original plan of two spares would be a bare minimum.

You have not mentioned the redundancy scheme for your data pool(s). There are big trade-offs in capacity and write speed depending on what you choose (striped mirrors vs. RAIDZn).

Wow, thanks for the comprehensive reply!

I am using Intel DC S4610 7.68TB SSD drives. I've had very good experiences with them.

The pool will be mirrored vdevs, so basically RAID 10.

So you're suggesting slicing the drives vs. a dedicated single boot drive. Hard decision!

Have you already run such a setup?
 
I am using Intel DC S4610 7.68TB SSD drives. I've had very good experiences with them.
Intel DC is good. Expensive, but good!
The pool will be mirrored vdevs, so basically RAID 10.
OK. Best performance. Requires an even number of slots.
So you're suggesting slicing the drives vs. a dedicated single boot drive. Hard decision!
Is it possible to get a motherboard with a pair of NVMe M.2 SSD slots? That gives you two extra "disk slots". You could use these just for boot, but I'd also consider using them for the special device, as their performance would potentially be far superior to a SATA SSD.
Have you already run such a setup?
I have worked on sites with thousands of Linux systems, mostly virtual, but some very large physical machines. However, I have no experience with ZFS in a large corporate or government setting. I do have a ZFS server at home. Most of the large physical systems I looked after ran Linux with XFS file systems.

Edit: If NVMe on the motherboard is not an option, then I'd consider:
  • 2 x 2.5" Intel DC S4610 provisioning O/S and special device for metadata (both ZFS managed mirrors);
  • 34 x 3.5" hot-swap SAS/SATA drive bays for 17 x ZFS mirrored VDEVs; and
  • 2 x 3.5" hot-swap SAS/SATA drive bays housing hot spares.
And, if you don't already own the disks, I'd look to purchase them in at least two different tranches. Then record the serial numbers and mirror one tranche against the other.
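To make that concrete, the data pool creation would look roughly like this; device names are made up, and the pairing should follow the serial-number/tranche split mentioned above (a sketch only, not tested):

  # build a vdev list of 17 mirrored pairs from da0..da33 (hypothetical names)
  vdevs=""
  i=0
  while [ $i -lt 34 ]; do
      vdevs="$vdevs mirror da$i da$((i + 1))"
      i=$((i + 2))
  done
  # mirrored special vdev on the SSD partitions, plus two hot spares
  zpool create tank $vdevs \
      special mirror ada0p2 ada1p2 \
      spare da34 da35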
 