ZFS Mixing HDD and SSD in a vdev

Back in the days of gvinum it was not possible to mix SCSI and SATA disks in a mirror - it would produce strange kernel errors and simply not work. With ZFS that became possible - but I also remember writing a note-to-self back then that mixing HDD and SSD in a mirror does not work either.

Recently somebody told me that this does work, so I considered my experience obsolete - but now I have tried it again.
I have an SSD with quite a few partitions. Each of them is separately geli encrypted, but they keep growing in number, and by now I find that annoying. I would like a single geli container for all of them except swap. That seems to work with kern.geom.part.gpt.allow_nesting=1.
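Roughly what I have in mind is the following - just a sketch, with placeholder partition names instead of my real layout:

    # allow a partition table on top of a nested provider
    sysctl kern.geom.part.gpt.allow_nesting=1
    # one big geli container on a single partition
    geli init -s 4096 /dev/ada0p4
    geli attach /dev/ada0p4
    # a GPT inside the attached container, carved into the former partitions
    gpart create -s gpt ada0p4.eli
    gpart add -t freebsd-zfs -s 100G ada0p4.eli
    gpart add -t freebsd-ufs ada0p4.eli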

One of the partitions concerned is part of a 3-way raidz. So I temporarily replaced it with an equally sized partition on a mechanical disk - and got a checksum error on the mechanical one! During the replace back I saw two more of them. And finally, when the operation was complete and I did a final scrub, a dozen more appeared on the changed partition. Nothing in the system log, nothing in SMART.
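The procedure itself was nothing special, roughly this (pool name and labels are placeholders for mine):

    # swap the SSD partition for an equally sized HDD partition
    zpool replace tank gpt/ssd-part gpt/hdd-temp
    zpool status -v tank        # first checksum error appeared here
    # later, swap it back and verify
    zpool replace tank gpt/hdd-temp gpt/ssd-part
    zpool scrub tank
    zpool status -v tank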

Now I am checking both the SSD array and the mechanical disk in all possible ways, and there are no errors.
This must be something ZFS does internally because it doesn't like the mixture.

So I stay with the conclusion that such a mixture is not recommendable, not even for temporary operations.
 
I concur. In my case the spinning disk (an SSHD, I reckon) was drawing so much excess current that it blew the power supply of the Supermicro server. Before then, the server was hosting a mirrored zpool of SSDs and there was a need to increase its storage capacity. The mirrored zpool of SSHDs, added as extra storage next to the existing mirrored zpool of SSDs, worked fine for only a day or so. The OS was on the SSDs, so the machine would boot successfully and be usable for a day or so, then it would hang. I cannot remember seeing out-of-swap messages on the console; memory was not a problem. I ended the expedition a few days later. The reason was that a warm reset would take forever; the mirrored zpool on the SSHDs (luckily data only) was too sluggish to come up. The SSDs themselves failed not long after.

However, I can report success with mixing a mirrored zpool of NVMes (Intel DC P4511, running the host OS), a mirrored zpool of SSDs and a mirrored zpool of enterprise HDDs (Seagate Exos) all in the same server. Of course, I changed the power supply in the mini-ITX case to a more robust one.
 
I wonder if access time comes into play here.
Roughly: "Mirrors complete a write operation at the pace of the slowest device but complete a read operation at the pace of the fastest."
Raidz, I think, would always be at the mercy of the slowest device.
Maybe something internal doesn't like that the devices in the vdev are not all roughly the same speed.
 
This information is going to be very important to me later, as I'm intending to replace/extend my raidz (of spinning rust platters) with SSDs. I was originally intending to just slowly (over a couple of days) replace them one at a time while leaving the system live (apart from the shutdowns to remove and insert the new drives). Considering I know my current HDDs are on life support as it is, I may instead either reconstruct my raidz setup (luckily there isn't much on it, so that would be the fastest route) or do all the replacing at once while booted from a live CD.
 
do all the replacing at once while booted from a live CD.
If there is enough "room" (power connectors, data connectors, etc.) maybe duplicate the configuration, replicate the data, export the old pool, change the mountpoints on the new one and reboot.
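Something along these lines - pool and device names are made up, and mountpoints/bootfs will of course depend on your layout:

    # build the new pool on the new devices
    zpool create newtank raidz gpt/ssd0 gpt/ssd1 gpt/ssd2
    # replicate everything from the old pool
    zfs snapshot -r oldtank@migrate
    zfs send -R oldtank@migrate | zfs receive -u -F newtank
    # retire the old pool, then adjust mountpoints on the new one as needed
    zpool export oldtank
    zfs set mountpoint=/ newtank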
 
I know right now there isn't enough room to duplicate the pool, unless I get a controller to handle the other drives (which would give more weight to reconstructing the dataset). Right now everything is in the initial planning stages, so I have time to consider all possibilities.

Right now my pool is using roughly 50 GB (intentionally kept small, as I knew ahead of time the drives needed to be replaced), so it is small enough that copying the data directly or sending snapshots to an external source is doable. Worst case, I have the option of rebuilding the important data from backups (the actual OS/programs can easily be reinstalled, so they aren't considered important).
 
I know right now there isn't enough room to duplicate the pool, unless I get a controller to handle the other drives (which would give more weight to reconstructing the dataset). Right now everything is in the initial planning stages, so I have time to consider all possibilities.

Right now my pool is using roughly 50 GB (intentionally kept small, as I knew ahead of time the drives needed to be replaced), so it is small enough that copying the data directly or sending snapshots to an external source is doable. Worst case, I have the option of rebuilding the important data from backups (the actual OS/programs can easily be reinstalled, so they aren't considered important).
I'd be inclined to have an external device with ZFS that you can zfs send | receive the original dataset(s) to, create the new pool with the new devices, and then reverse the zfs send | receive to pull from the external device.
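Roughly like this (pool and device names purely illustrative):

    # single-disk pool on the external device
    zpool create ext /dev/da0
    zfs snapshot -r tank@move
    zfs send -R tank@move | zfs receive -u ext/tank
    # destroy and recreate tank on the new SSDs, then pull everything back
    zfs send -R ext/tank@move | zfs receive -u -F tank
    zpool export ext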
 
I wonder if access time comes into play here.
Roughly: "Mirrors complete a write operation at the pace of the slowest device but complete a read operation at the pace of the fastest."
Raidz, I think, would always be at the mercy of the slowest device.
Maybe something internal doesn't like that the devices in the vdev are not all roughly the same speed.
Very much so, it looks. Out of curiosity I repeated the operation with the next piece of the raidz and a different (and slower) mechanical drive - and this time got three dozen errors.
 
I wonder if access time comes into play here.
Roughly: "Mirrors complete a write operation at the pace of the slowest device but complete a read operation at the pace of the fastest."
Raidz, I think, would always be at the mercy of the slowest device.
As a zeroth-order approximation, what you wrote is somewhat correct.

For mirror writes, you have to wait for BOTH copies to be written; if one is consistently slower, you will always wait for it. Ergo, the slowest one wins.

For mirror reads, it depends on how the storage layer schedules reads. Simple example: it sends each read to a random device, or it does round-robin. In that case, average performance will be halfway between the two. Better example: there is enough workload that both disks are kept continuously busy; the storage layer sends new requests to whichever disk has the shorter queue of pending requests. In that case, performance will be dominated by the faster disk. But if there are no queues (if the workload isn't intense enough to keep both disks busy), how will the storage layer know which disk is faster? That's really hard. It could try to remember it, but that requires keeping track of performance metrics and storing them (at least in memory, better on disk), and then having a pretty complex algorithm to make decisions based on those metrics, while making sure to never make bad decisions. So the real answer is: read performance will be somewhere between a 50:50 split across the two drives and the speed of the faster one, but it may have a very large spread.

For RAID-Z (or other parity-based RAID schemes, or in general any form of disk encoding that uses more than two disks), the above argument applies in the write case. In the read case, for the most part no such performance optimization is possible.
 
I know right now there isn't enough room to duplicate the pool, unless I get a controller to handle the other drives (which would give more weight to reconstructing the dataset). Right now everything is in the initial planning stages, so I have time to consider all possibilities.

Right now my pool is using roughly 50 GB (intentionally kept small, as I knew ahead of time the drives needed to be replaced), so it is small enough that copying the data directly or sending snapshots to an external source is doable. Worst case, I have the option of rebuilding the important data from backups (the actual OS/programs can easily be reinstalled, so they aren't considered important).
I have been through all that. However, the high-water mark of the pool was about 10 TB. Like you, I was thinking out loud about how to copy the pool out and then back in again.

It was Andriy who encouraged me to consider the long-term plan for backups, because that drove my solution. Up until then I had split the pool into file systems and had a splintered approach to getting (the most valuable) data off-site 4 TB at a time.

I changed my plan and bought two 12 TB disks. I had hot-swap bays available, so they were SATA. USB3 would also have worked, and there are good quality PCIe USB3 controllers, if you need them.

I was able to send the entire tank to a new 12 TB mirror, and send it back again, in safety (without a multitude of external cables, converters, power supplies, concats, and attendant risk).

I now have one complete 12 TB copy of the pool off-site at all times, and another on-site ready to re-create and rotate.

The convenience and security of a good long-term backup plan justified the outlay...
 
I have often heard Allan Jude say that he mixes HDDs and SSDs, using the latter as a "cache" vdev.

In this article an SSD is used as an L2ARC cache:

Yes, that is a common thing to do. The same applies to mixing SSD and NVMe, where the NVMe serves as the cache/log.
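Adding such devices to an existing pool is straightforward, for example (device labels invented here):

    # SSD partition as L2ARC (cache vdev)
    zpool add tank cache gpt/ssd-cache
    # mirrored NVMe partitions as a SLOG (log vdev)
    zpool add tank log mirror gpt/nvme0-log gpt/nvme1-log
    zpool status tank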
 
I have often heard Allan Jude say that he mixes HDDs and SSDs, using the latter as a "cache" vdev.
Yes, this is normal. That's why I wrote "mixing ... in a vdev". The cache is a separate vdev (as are log and special devices) and is intended to be on faster media.
 