ZFS: Can disks in a pool be replaced with smaller drives?

I currently have a system with 12x 5TB 2.5" HDDs. Due to a batch of cheap/faulty SATA cables, one of these drives failed hard (and was replaced), and a couple of the drives are now showing bad sectors. I've replaced the cables and the system is a lot more stable, but since I'll inevitably have to swap out these drives with replacements I'm considering moving to SSDs. However, I can't find any 5 or 6 TB drives on the market, and the 7+TB drives are way out of my budget.

So I'm wondering, if I keep the overall amount of data on the pool below a certain threshold, would it be possible for me to slowly migrate the pool from 12x 5TB disks to 12x 4TB disks? I'm thinking of replacing one per month: pull one, resilver, wait 30 days, repeat.
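
For reference, each monthly swap would be something like the following (pool name and device names here are just placeholders):

# take the old 5TB disk out of service and pull it
zpool offline tank ada3
# after installing the 4TB replacement in the same slot
zpool replace tank ada3
# then wait for the resilver to finish before touching the next disk
zpool status -v tank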

Thoughts? Suggestions? Warnings?
 
if you are a gambling man
buy 15 4TB ssds
partition 3 of them into 4x 1TB partitions each
so you have 12 whole disks and 12 partitions
create 12 gconcat devices out of 1 disk and 1 partition each
replace each spinning rust drive with a frankenstein solid state drive
....
....
forget where you heard of this advice
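
(If anyone actually wants to try this at home, it would go roughly like this, with every device name below made up and the pool assumed to be called tank:)

# carve three of the SSDs into four 1TB partitions each
gpart create -s gpt ada32
gpart add -t freebsd-zfs -s 1T ada32    # run four times per disk, for ada32-ada34
# glue one whole 4TB SSD and one 1TB partition into a ~5TB provider
gconcat label franken0 ada20 ada32p1
# swap it in for one of the old 5TB spinners and let it resilver
zpool replace tank ada0 concat/franken0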
 
If you chose to configure mirror vdevs, you can add new mirrors of smaller disks and then remove the old mirrors one by one. But keep in mind that this might take a lot of time due to the rebalancing each vdev removal triggers; especially with spinning rust and SATA, each removal job might take a full day.
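
A per-mirror round would look roughly like this (pool and device names are made up):

# add a new mirror vdev built from two of the smaller disks
zpool add tank mirror ada20 ada21
# evacuate one of the old mirrors; ZFS copies its data onto the remaining vdevs
zpool remove tank mirror-0
# the removal runs in the background, watch it with
zpool status -v tank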

If you opted for raidz, then NO: those vdevs can't be removed or scaled down in any way. That's why you should always go for mirrors unless you have a *very* good reason not to...

OTOH: currently the "sweet spot" of cost per TB for spinning drives seems to be at the ~8-10TB models while 4TB drives are becoming rather scarce nowadays and cost way more per TB. So instead of 12x4TB I'd go for 6x8TB or even 4x12TB, create a new pool and just zfs send|recv all datasets onto the new pool...
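
That migration would be something along these lines, assuming the old pool is called tank, the new one newtank, and a recursive snapshot:

# snapshot every dataset on the old pool
zfs snapshot -r tank@migrate
# replicate all datasets, snapshots and properties onto the new pool
zfs send -R tank@migrate | zfs recv -F newtank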


EDIT: sorry, I totally overlooked that you want to go for SSDs.
I'm currently also migrating my home server from a collection of 10 3-4TB SAS HDDs of various age, some older SATA SSDs and 2 NVMes to all flash...

That mentioned "cost/TB sweet spot" seems to be sitting at 2TB for higher-endurance consumer SSDs now, but if you are willing to trade some endurance for space, this goes up to 4TB or even 8TB.
If you don't have a lot of I/O to that pool, the Samsung 870 QVO 8TB are quite cheap at ~75-80EUR/TB, and with 2.88PB TBW they should have 'OK' endurance for a low-load pool in a home server. At 4TB the Transcend SSD230S are a bargain at ~60EUR/TB, and with 2.24PB TBW they have almost twice the per-TB endurance of the Samsung drives.
Given that I've used several Transcend SSD230s and MTE220s and they were pretty great endurance-wise, I'd definitely go for those. They might not be the fastest, but in a pool with multiple vdevs they can still easily saturate a 10Gbit link, and they offer by far the highest endurance rating in that price segment.

Another option might be to watch out for a good deal on used enterprise SSDs that haven't seen much writes. But it will be hard to find something >2TB for a better per-TB-price than those Transcend SSDs.

While you *could* add/remove single vdevs on a mirror pool to migrate from HDD to SSD, I wouldn't do that, as providers/vdevs with vastly different I/O characteristics might (will) lead to very weird and unexpected behaviour of the pool. Given my experiences with dying disks that "only" showed increased latency and still dragged the whole pool into an unusable state, I'd expect similar behaviour when some providers suddenly have 1/100th of the latency and IOPS capabilities that are orders of magnitude higher. This won't go well with how ZFS tries to spread and queue load across vdevs...
So still: Just create a new pool and send|recv
 
forget where you heard of this advice
Too funny. You also forgot the disclaimer: "I can neither confirm nor deny that you will not lose any data."

If you chose to configure mirror vdevs, you can add new mirrors of smaller disks and then remove the old mirrors one by one. But keep in mind that this might take a lot of time due to the rebalancing each vdev removal triggers; especially with spinning rust and SATA, each removal job might take a full day.
Hmm. I know replacing with larger devices works fine. With smaller ones, I think success may depend on how full they are.
In general I think replacing with smaller devices is not recommended.
 
Hmm. I know replacing with larger devices works fine. With smaller ones, I think success may depend on how full they are.
Of course you can't shrink a pool to a lower capacity than already used (and always leave ~15-20% headroom for housekeeping). I assumed that was implicit...
But if his pool of 12x5TB (=~30TB usable for mirrors) holds only ~20TB of data, he could migrate (or send|receive for that matter) to a pool of 12x4TB (=~24TB).
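
Rough numbers, assuming 6 two-way mirrors and rounding generously:

12 x 5TB as 6 two-way mirrors -> 6 x 5TB = ~30TB usable
12 x 4TB as 6 two-way mirrors -> 6 x 4TB = ~24TB usable
~80-85% of 24TB -> ~19-20TB of data fits, just barely within the headroom rule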
 
if you are a gambling man
buy 15 4TB ssds
partition 3 of them into 4x 1TB partitions each
so you have 12 whole disks and 12 partitions
create 12 gconcat devices out of 1 disk and 1 partition each
replace each spinning rust drive with a frankenstein solid state drive
....
....
forget where you heard of this advice
The problem with this "solution" is that a gconcat device dies with whichever of its parts fails, and if one of the partitioned disks goes, it takes four concat devices with it at once and the whole pool goes down the drain.
 
Sorry, should have provided further information:

It is a single pool configured in raidz2. Home use, mostly read traffic once files are initially stored. Total size currently shows ~54T, with ~32T allocated and ~22T free.

Adding more disks isn't really an option, the machine is at capacity physically. I think I have maybe 1 SATA port free. Space is at a premium too: the drives are all 2.5", and I don't think I'd be able to physically fit fewer (but larger) 3.5" drives in there.

So… it looks like I'll be replacing with spinny things until the price of >5TB SSDs drops significantly!

Thanks everyone for the advice!
 
Also, thanks for the information, everyone. I'm currently busy redoing my home backup system too, and this time I won't do raidz2.
 
Due to a batch of cheap/faulty SATA cables, one of these drives failed hard (and was replaced), and a couple of the drives are now showing bad sectors.
Note that bad cables won't cause bad sectors. They can certainly corrupt the data being written to the disk, but they will not damage the actual sectors on the platters themselves (which is what bad sectors are). Bad blocks are going to happen, even on brand spanking new disks, but all modern harddisks have a spare bit of space, and the drive's firmware will automatically remap those bad blocks to it. However, at some point that spare space is going to be full. Then bad blocks cannot be remapped anymore, and you end up with a bunch of "Offline uncorrectable" blocks.
 
Damn, if I had a setup like that, I'd want to organize the datasets so that no single dataset is bigger than a physical disk. Since it looks like OP's best option is 4TB SSDs, that should be about the size for the datasets. Then it will be possible to move the datasets around like that sliding tile puzzle game.
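
Per dataset, that shuffle would be something like this (dataset and pool names below are made up):

# move one dataset at a time to whichever pool currently has room
zfs snapshot -r tank/photos@move
zfs send -R tank/photos@move | zfs recv newpool/photos
# once the copy is verified, free the space on the source
zfs destroy -r tank/photos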
 
Note that bad cables won't cause bad sectors. This can certainly corrupt the data being written to the disk but it will not damage the actual sectors on the platters itself (which is what bad sectors are).
That needs to be sorted out: bad cables, power fluctuations, etc. will produce Current_Pending_Sector counts, and probably other fluffy things, and will make extended selftests fail. These are bad sectors in the sense that they were badly written, not in the sense that the magnetic surface is bad. They usually go away when you rewrite them; they cannot go away from mere reading, which is all a selftest does. When you write them, Reallocated_Sector_Ct will either go up (a real bad sector) or not go up (a badly written sector).

Bad blocks are going to happen, even on brand spanking new disks. But all modern harddisks have a spare bit of space, the drive's firmware will automatically map those bad blocks to this spare bit of space. However, at some point in time that spare bit of space is going to be full. Then bad blocks cannot be remapped anymore, and you end up with a bunch of "Offline uncorrectable" blocks.
Quite a lot of bad sectors fit into the spare space, probably a couple of thousand. And if it actually does get full, many disks can do a FORMAT UNIT (all Ultrastars can, and all SCSI disks can anyway).
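
To tell the two cases apart, smartmontools does the job; the device name below is a placeholder:

# kick off an extended selftest, then look at the counters afterwards
smartctl -t long /dev/ada0
# badly written sectors show up as Current_Pending_Sector and clear once rewritten,
# genuinely bad ones bump Reallocated_Sector_Ct
smartctl -A /dev/ada0 | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'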
 