ZFS Seagate Ironwolf doesn't like ZFS ...

I had a really strange effect when moving/enlarging a pool. So I repeated the process to be sure:

Copying from da4 to ada0 takes 1 hour 19 minutes:

Code:
# zpool status bm
  pool: bm
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 14 18:32:00 2025
        916G scanned at 0B/s, 123G issued at 197M/s, 916G total
        27.9G resilvered, 13.46% done, 01:08:32 to go
config:

        NAME                STATE     READ WRITE CKSUM
        bm                  ONLINE       0     0     0
          raidz1-0          ONLINE       0     0     0
            ada2.elip3      ONLINE       0     0     0
            replacing-1     ONLINE       0     0     0
              da4.elip3     ONLINE       0     0     0
              ada0p9.elip3  ONLINE       0     0     0  (resilvering)
            ada7.elip3      ONLINE       0     0     0
            ada8.elip3      ONLINE       0     0     0

  pool: bm
 state: ONLINE
  scan: resilvered 220G in 01:19:26 with 0 errors on Mon Jul 14 19:51:26 2025
config:

        NAME              STATE     READ WRITE CKSUM
        bm                ONLINE       0     0     0
          raidz1-0        ONLINE       0     0     0
            ada2.elip3    ONLINE       0     0     0
            ada0p9.elip3  ONLINE       0     0     0
            ada7.elip3    ONLINE       0     0     0
            ada8.elip3    ONLINE       0     0     0

Copying back from ada0 to da4 takes more than 5 hours:

Code:
# zpool status bm
  pool: bm
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 14 20:23:26 2025
        916G scanned at 0B/s, 34.3G issued at 43.0M/s, 916G total
        6.14G resilvered, 3.74% done, 05:49:58 to go
config:

        NAME                STATE     READ WRITE CKSUM
        bm                  ONLINE       0     0     0
          raidz1-0          ONLINE       0     0     0
            ada2.elip3      ONLINE       0     0     0
            replacing-1     ONLINE       0     0     0
              ada0p9.elip3  ONLINE       0     0     0
              da4.elip3     ONLINE       0     0     0  (resilvering)
            ada7.elip3      ONLINE       0     0     0
            ada8.elip3      ONLINE       0     0     0

  pool: bm
 state: ONLINE
  scan: resilvered 220G in 05:13:53 with 0 errors on Tue Jul 15 01:37:19 2025
config:

        NAME            STATE     READ WRITE CKSUM
        bm              ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            ada2.elip3  ONLINE       0     0     0
            da4.elip3   ONLINE       0     0     0
            ada7.elip3  ONLINE       0     0     0
            ada8.elip3  ONLINE       0     0     0

The problem is that da4 rarely wants to write more than 5 MB/sec:

Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    2     98      0      0  0.000     98   3901   20.7  100.5| da4

ada0 is a 15-year-old WDC WD5000AAKS-00A7B2 Caviar Blue
da4 is a brand-new Seagate IronWolf ST4000VN006-3CW104

The problem is specific to this pool. Other pools copy to da4 quite normally. Regular write access and sequential writes with dd also work as expected.

There is no extraordinary data in smartctl. And zpool iostat reports basically what is already visible in the gstat output above: lots of requests take about 20 ms to fulfil, without apparent reason.
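For a closer look at those ~20 ms, zpool iostat can also print histograms (a sketch; the -w and -r options as in current OpenZFS, with bm the pool from this thread and an arbitrary 5-second interval):

```shell
# Per-vdev I/O latency histograms for pool bm, refreshed every 5 seconds
zpool iostat -w bm 5
# Request-size histograms, to see the 32k/64k writes the resilver issues
zpool iostat -r bm 5
```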

But then, there is this message, and it describes exactly the same behaviour, observed with multiple devices, and specific to this make and model:

So the question is: what is wrong with these drives? In what way could one possibly hose a disk design to obtain such behaviour?

It is probably not the usual SMR, because the drive handles sequential writes of 500 GB nicely at 132 MB/sec. Also, there are dozens of interviews with Seagate engineers where they say again and again that Seagate will never sell IronWolf as SMR. Apparently they have found something different that is not exactly SMR but still allows them to make cheaper drives...
 
The problem is specific to this pool. Other pools copy to da4 quite normally. Regular write access and sequential writes with dd also work as expected.
Are the other pools also using encrypted vdevs?
Other pools using the same configuration/geometry (raidz1)?
RAIDZ1, that's striping, yes? Basically the disks are concatenated, so the size of the zpool is roughly the sum of all vdevs?
Is it possible that the process of resilvering winds up waiting on acks from a device in the pool so that is affecting it?
Edit:
Looking at your data again, I see you're saying that da4 is acting like SMR when it is the target (basically a write load) but is reasonable when it's the source (read load).
 
Are the other pools also using encrypted vdevs?
Other pools using the same configuration/geometry (raidz1)?
RAIDZ1, that's striping, yes? Basically the disks are concatenated, so the size of the zpool is roughly the sum of all vdevs?
RAIDZ-1: sum of all vdevs minus one disk (the parity)
Everything is GELI-encrypted here, and it is a heterogeneous mixture of (lots of) pools with different configs.

Is it possible that the process of resilvering winds up waiting on acks from a device in the pool so that is affecting it?
No. One would see that in gstat. There is only the 20 ms delay on the IronWolf visible, and that is likewise visible in zpool iostat.
So what I can say is: ZFS sends out a write request to the disk. That request is 32k or 64k. And the disk takes (more often than not) 20 ms before it returns. It quite likely does NOT do continuous seeks, as those would usually increase the temperature.
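That 20 ms per request is by itself enough to account for the observed rate. A quick back-of-the-envelope sketch (assuming 64k requests; with the queue length of 2 seen in the gstat sample, the ~50 req/s per outstanding slot roughly doubles to the observed ~98 ops/s):

```shell
# 20 ms per request => 50 requests/s per outstanding request; at 64 KiB each:
ms_per_req=20
req_per_s=$((1000 / ms_per_req))        # 50
kib_per_s=$((req_per_s * 64))           # 3200 KiB/s, i.e. ~3.1 MiB/s
echo "${req_per_s} req/s -> ${kib_per_s} KiB/s"
```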

Looking at your data again, I see you're saying that da4 is acting like SMR when it is the target (basically a write load) but is reasonable when it's the source (read load).
Well, I don't say that (yet). The guys at servethehome did say it.

The disk behaves okay on read, and it also behaves okay on large sequential write (like copying a partition with dd)! It seems also to behave okay on random write. So far the problem appears only with the specific access pattern of resilver. Now, our resilver first scans and then issues, and the issue phase should behave almost like sequential write. Something about this "almost" must be special, triggering a misbehaviour of the disk.

I was not successful when trying to trigger that same misbehaviour with crafted write patterns. One would need to look exactly at what write commands are issued by resilver, and in what way these are special.

In the interim there are two things of interest:
  1. somebody deciding to run these disks in a NAS (for which they are marketed) might face a problem when a resilver is needed, which then - due to yet unknown conditions - might take a day or longer, and
  2. I am curious what exactly is happening inside these disks, i.e. what part of the design is causing this.
 
It's a 5400 rpm disk with low IOPS.
That part is fine with me, I tend to have thermal issues, and that piece is running nicely cool.
This does not explain why there is 130 MB/sec on pure sequential write, but 5 MB/sec on resilver.

Try an IronWolf Pro or Exos, or if you prefer WD, the Gold series is not bad.
8.7 watts instead of 3.95. :(
Also, that pool holds the backup of database redo logs for eventual point-in-time recovery - a high-performance disk would be a waste of money.
 
It's expected. 98 random writes per second at 32K per request gives an expected throughput of 3.06 MB/s.

The IOPS for a 5400 rpm disk at 4K are ~57, so it's not bad for your disk to do almost double that.
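The 3.06 MB/s figure is just IOPS times request size; a quick check (98 ops/s and the 32 KiB request size are the numbers from the gstat sample above):

```shell
# Expected throughput = observed write IOPS x request size
iops=98
req_kib=32
kib_per_s=$((iops * req_kib))
echo "${kib_per_s} KiB/s"    # 3136 KiB/s = ~3.06 MiB/s
```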
 
It's expected. 98 random writes per second at 32K per request gives an expected throughput of 3.06 MB/s.

The IOPS for a 5400 rpm disk at 4K are ~57, so it's not bad for your disk to do almost double that.

Okay, then please explain why this applies only to new drives, while those built 15 years ago achieve performance faster by more than an order of magnitude.
 
Because all old disks were 512n at 7200 rpm, with smaller capacity and many platters and heads. The smaller platter gives better response time.
 
Because all old disks were 512n at 7200 rpm, with smaller capacity and many platters and heads. The smaller platter gives better response time.
No, they weren't. Most Ultrastar and Deskstar models were also available as "CoolSpin" variants at 5700 rpm, like this one (which is doing well at 119270 hours of operation, with no such resilver problems):

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS5C1010CLA382
Serial Number: JC0911HX0Z0BDH
LU WWN Device Id: 5 000cca 363cda483
Firmware Version: JC4OA3MA
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5700 rpm

Anyway, if the change from 512n to 4K sectors results in a performance degradation from ~100 MB/sec down to ~3 MB/sec, why would anybody want that?
 
There's no HDD that does 100 MB/s at 4K write size. 4K at QD1 is under 1 MB/s.

For example, an enterprise disk has ~12 MB/s (IOPS, 32K QD20),
and the ST4000VN006 has 8 MB/s (IOPS, 32K QD20).

Just grab some disks from different categories and put them through a quick test with CrystalDiskMark. You can compare a standard 2.5'' SATA laptop disk with a 2.5'' SAS disk and 3.5'' disks like those for standard desktop, NAS, surveillance, enterprise and DC use. That will give you a general idea of the difference the technology makes: SMR/CMR, number of platters, RPM, sector size, etc.
 
I suspect that it is RAID-Z related behavior, because
my HDD works pretty well - tested as a single-drive ZFS pool on a very old machine (AMD X2 from around 2006, PCIe-to-AHCI add-on card), FreeBSD 14.3:

Drive details from dmesg:
Bash:
$ dmesg | grep -e ada0 -e ahci0:
ahci0: <ASMedia ASM116x AHCI SATA controller> mem 0xfcefe000-0xfcefffff,0xfcefc000-0xfcefdfff irq 18 at device 0.0 on pci3
ahci0: AHCI v1.31 with 24 6Gbps ports, Port Multiplier not supported
ada0 at ahcich1 bus 0 scbus9 target 0 lun 0
ada0: <ST4000VN006-3CW104 SC60> ACS-3 ATA SATA 3.x device
ada0: Serial Number [REDACTED]
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 3815447MB (7814037168 512 byte sectors)

ZFS configuration:
Bash:
$ zpool status

  pool: ziw2
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        ziw2        ONLINE       0     0     0
          ada0p4    ONLINE       0     0     0

errors: No known data errors

gstat output while unpacking a 16 GB .ova (.tar) on the same drive gives a nice ratio - even 80 MB/s read + 80 MB/s write at the same time:

Bash:
$ gstat -B -a -I 1s

dT: 1.004s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name                                                     
    6    105     44  43903   68.2     61  62182   47.2   99.5  ada0                                                     
    6    105     44  43903   68.2     61  62182   47.2   99.5  ada0p4                                                   
dT: 1.004s  w: 1.000s                                                                                                   
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name 
    5    140     64  65178   50.1     77  69532   31.1  100.5  ada0                                                     
    5    140     64  65178   50.1     77  69532   31.1  100.5  ada0p4                                                   
dT: 1.017s  w: 1.000s                                                                                                   
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name 
    4     90     39  39298   25.5     51   9955   12.5   33.8  ada0 
    4     90     39  39298   25.5     51   9955   12.5   33.8  ada0p4                                                   
dT: 1.003s  w: 1.000s                                                                                                   
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name                 
    7    183     90  91843   55.9     92  85736   38.2  167.2  ada0                                                     
    7    183     90  91843   55.9     92  85736   38.2  167.2  ada0p4

Sometimes I even use that drive on USB 3.0 to SATA adapter to backup both FreeBSD and Linux machines to ZFS on that drive.
 
Just grab some disks from different categories and put them through a quick test with CrystalDiskMark.
I don't have Windows. I dumped that in 1990.

You can compare a standard 2.5'' SATA laptop disk with a 2.5'' SAS disk and 3.5'' disks like those for standard desktop, NAS, surveillance, enterprise and DC use. That will give you a general idea of the difference the technology makes: SMR/CMR, number of platters, RPM, sector size, etc.
Not interested. I have all my disks since 1998 still running in the system, 21 in total, and I have a very good idea about the various designs and their behaviour.
 
I suspect that it is RAID-Z related behavior
Probably yes, or maybe it is also blocksize-related.

I would have filed this under "random unintelligible weirdness", but since the servethehome guys describe the exact same behaviour with the same model, it became more relevant - because then it should be somehow reproducible.
I currently do not have the time to analyze and log the exact write calls of this operation (resilver is a kernel operation; it cannot simply be logged with truss).

gstat output while unpacking 16GB .ova (.tar) on same drive gives nice ratio - even 80MB/s read + 80MB/w at same time:
I agree! There is no strange behaviour except during resilver, and during resilver only for some pools.

The difficulty is: when you need to resilver, you might already be in some trouble - and you do not know beforehand whether the performance impact happens on your pool.

On occasion I might investigate this further. Anyway, thanks for your feedback!
 
Can you post the output of:
camcontrol identify ada0
and for da4
Yes. But I won't, because it wouldn't clarify anything further. Tech specs of the drives concerned are available online, people have been warned, and in case I get really bored I will figure out what exactly is going on.
 
camcontrol will tell whether the write cache is enabled on the disk or not. Tech specs usually list 02h as the default, but some drives have 82h as the default.
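A quick way to check without posting the whole identify dump (a sketch; da4 is the drive in question, and the exact wording of the feature line may vary by camcontrol version):

```shell
# Show whether the drive's volatile write cache is enabled
camcontrol identify da4 | grep -i "write cache"
```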
 
Do all the pools have the proper 'ashift' for your 4K drives?
If the 'ashift' is too small, this would result in bad performance, especially on a resilver.
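A quick way to check (a sketch; bm is the pool from the postings above, zdb needs the pool imported, and 4K-sector drives want an ashift of at least 12):

```shell
# Show the ashift recorded for the vdevs of pool bm
zdb -C bm | grep ashift
# Newer OpenZFS also exposes it as a pool property
zpool get ashift bm
```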
 
So the question is: what is wrong with these drives? in what way could one possibly hose a disk design to obtain such behaviour?
You probably just have a misunderstanding of how ZFS resilvering works and, thus, wrong expectations about its performance.
Resilvering is not a disk-to-disk copy. At least, not for RAIDZ and not without the device_rebuild feature.
 
You probably just have a misunderstanding of how ZFS resilvering works and, thus, wrong expectations about its performance.
Resilvering is not a disk-to-disk copy. At least, not for RAIDZ and not without the device_rebuild feature.
Well, if I have dozens of disks, and all of them fulfil my expectations except one, then I think the problem is rather with that one disk than with my expectations.

Also, it doesn't matter what resilvering is, because the two-step approach of scanning and issuing should result in almost sequential access. This subtle "almost" being the trigger point for a difference between 3 MB/sec and 100 MB/sec is quite enough evidence. (Once upon a time there was a sysctl switch to revert to the non-sequential one-pass resilver, and I thought about trying that, but then wasn't willing to wait another seven hours.)
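For the record, that switch still seems to exist in OpenZFS as the zfs_scan_legacy tunable, visible on FreeBSD as a vfs.zfs sysctl (name per OpenZFS 2.x, not verified on this particular system):

```shell
# Revert to the old one-pass (non-sorted) resilver instead of scan+issue
sysctl vfs.zfs.scan_legacy=1
# ...and back to the default two-phase behaviour
sysctl vfs.zfs.scan_legacy=0
```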

Then again, device_rebuild is something entirely different and not of concern here.
In short: you're not helpful.
The thing that would be helpful is some dtrace testpoints to obtain a list of issued writes (size and offset) during a resilver.
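A starting point might be the generic io provider (a sketch, not verified here; field names per FreeBSD's /usr/lib/dtrace/io.d translator, and the exact dev_name string may need adjusting for your system):

```shell
# Print size and starting block of every I/O issued to da4
# (run as root; Ctrl-C to stop)
dtrace -n 'io:::start /args[1]->dev_name == "da4"/
    { printf("%d bytes @ blkno %d", args[0]->b_bcount, args[0]->b_blkno); }'
```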
 