ZFS zpool scrub triggers hardware errors that disappear after reboot

Generally my ZFS RaidZ2 pool (6 disks) works just fine. But more often than not I get errors as shown below when I do a scrub (which is twice a month), and then the disks end up unavailable. Smartd is of course running and does a short test twice a week and a long test twice a month.
/dev/da0 -a -o on -S on -n standby -s (S/../../(1|4)/01|L/../(07|22)/./02) -W 10,35,35 -p -m foo@bar.foo
But smartd has never logged any warnings or errors, and smartctl -a /dev/da0 shows nothing either. So I tend to reboot the server, and then the disk just works fine again.
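
For reference, the -s schedule above decodes as follows (smartd uses the T/MM/DD/d/HH format; -W is DIFF,INFO,CRIT):

Code:
# -s (S/../../(1|4)/01|L/../(07|22)/./02)
#    S/../../(1|4)/01    short self-test on weekdays 1 and 4 (Mon/Thu) at 01:00
#    L/../(07|22)/./02   long self-test on the 7th and 22nd of the month at 02:00
# -W 10,35,35            log temperature changes of 10C or more, info at 35C,
#                        critical at 35C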

But something is wrong, any idea what?

Code:
Apr 25 02:00:24 freebsd kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 00 bc 0c 21 f0 00 00 08 00
Apr 25 02:00:24 freebsd kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Apr 25 02:00:24 freebsd kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Apr 25 02:00:24 freebsd kernel: (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Apr 25 02:00:24 freebsd kernel: (da0:mps0:0:0:0): Retrying command (per sense data)
Apr 25 02:05:52 freebsd kernel:         (da0:mps0:0:0:0): READ(10). CDB: 28 00 24 48 39 10 00 00 08 00 length 4096 SMID 340 Command timeout on target 0(0x0009) 60000 set, 60.189811576 elapsed
Apr 25 02:05:52 freebsd kernel: mps0: Sending abort to target 0 for SMID 340
Apr 25 02:05:52 freebsd kernel:         (da0:mps0:0:0:0): READ(10). CDB: 28 00 24 48 39 10 00 00 08 00 length 4096 SMID 340 Aborting command 0xfffffe000269e8e0
Apr 25 02:05:52 freebsd kernel: mps0: Finished abort recovery for target 0
Apr 25 02:05:52 freebsd kernel: mps0: Unfreezing devq for target ID 0
Apr 25 02:05:52 freebsd kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 00 24 48 39 10 00 00 08 00
Apr 25 02:05:52 freebsd kernel: (da0:mps0:0:0:0): CAM status: Command timeout
Apr 25 02:05:52 freebsd kernel: (da0:mps0:0:0:0): Retrying command, 3 more tries remain
 
I would start by looking for overheating of the disks or power supply problems. A scrub really works the disks hard. The "unit attention" you are getting indicates that the disk has been reset. That could point to a power supply problem. It could also mean that the disk stopped responding and the LSI/Broadcom/Avago card had to force-reset it after a 60 second timeout.

Do you know how to monitor disk temperatures? smartctl can do that. Power supply is harder: a voltmeter is the zeroth step, a storage scope is better (*), but to really do it you need a power quality analyzer, which is something few people have. If you can, I would try just swapping out the disk drive power supply, or splitting the disks onto two power supplies.

While you are at it, make sure your SCSI HBA is also getting well cooled. The LSI cards can use an enormous amount of power, and I once had one of them burn itself up (we had low-level monitoring software running on it; under full workload the card reported a die temperature of 106 degrees C, then never talked to us again, and when we opened the smelly computer we found that the PC board had turned brown). So adding a little cooling fan right on the card might help debug the problem.

(*) Footnote: Famous quote from "The Soul of a New Machine": An oscilloscope is what cavemen used to debug fire. The reason for the quote was that the engineers building that computer wanted management to buy them a logic analyzer.
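
On the temperature point, a minimal sketch of how one could watch the pool disks during a scrub (assuming six SATA drives da0-da5 that report the Temperature_Celsius attribute; adjust the device list to your system):

Code:
# Print the current temperature of each pool member.
for d in da0 da1 da2 da3 da4 da5; do
  printf '%s: ' "$d"
  smartctl -A /dev/$d | awk '/Temperature_Celsius/ {print $10 " C"}'
done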
 
smartd is monitoring my HD temps (-W 10,35,35), and smartctl -a doesn't show any errors either. I found this bug(?) PR 224496 and tried the suggested "fix":
sysctl kern.cam.da.default_timeout=90
sysctl kern.cam.ada.default_timeout=60

And did another zpool scrub, which didn't give any issues. So let's see how far this gets me.
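
For anyone else trying this: values set with sysctl(8) don't survive a reboot, so to make the workaround permanent the lines can go in /etc/sysctl.conf, roughly like this:

Code:
# /etc/sysctl.conf -- raise CAM command timeouts (workaround from PR 224496)
kern.cam.da.default_timeout=90    # SCSI/SAS-attached disks (da*)
kern.cam.ada.default_timeout=60   # ATA-attached disks (ada*)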
 
Welcome to the club!
I have been tracking this problem for more than a year already. Here is the main thread: https://forums.freebsd.org/threads/kernel-ahcich-timeout-in-slot.51868/page-3#post-423317
In a nutshell, my current understanding is that the SATA driver in the kernel has a bug, and certain SATA controllers in some circumstances trigger this error. In my case, when I scrub, after a couple of minutes the whole ZFS pool is detached because some of the VDEVs are destroyed and then recreated by the kernel. The bug has been reported and appears as early as 11.2-RELEASE, maybe even earlier.
I see that your controller is SCSI, which leads me to think that SCSI has the same or a similar problem.
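
If you want to double-check which driver your HBA is attached to, something like this works (a sketch; the exact output varies by system):

Code:
# LSI SAS HBAs show up as mps/mpr; onboard SATA controllers as ahci.
pciconf -lv | grep -B 3 -i -E 'sas|sata'
dmesg | grep -E '^(mps|mpr|ahci)'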

Generally my ZFS RaidZ2 pool (6 disks) works just fine. But more often than not I get errors when I do a scrub (which is twice a month), and then the disks end up unavailable. ...
But something is wrong, any idea what?
 
I can't remember having these issues in 11.x, but OK, that could be wishful thinking. BTW, I'm using a flashed IBM ServeRAID M1015 with standard SATA drives. Currently a mix of 2TB WD Greens being phased out for 4TB Toshiba N300s. No clue if these are SMR, but a quick google didn't say so.
 
... No clue if these are SMR, but a quick google didn't say so.
I only asked because the workaround you're using specifically cited SMR drives as the problem. Looks like there's a bug with the controller driver, then. This worries me 'cause I'm using a SAS9207-8i for my ZFS array. All my drives are CMR, though. I managed to buy some WD reds before the switch, and I have a WD Gold and a Seagate Ironwolf in there too. I haven't had any problems, but the array is not under heavy use yet.
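
For what it's worth, here is one way to see whether a drive at least advertises zoned (SMR) behavior; note that drive-managed SMR models usually don't advertise it, so a negative result isn't conclusive:

Code:
# Host-aware/host-managed zoning shows up in the drive's identify info;
# drive-managed SMR typically reports nothing here.
smartctl -i /dev/da0 | grep -i -E 'zoned|rotation'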
 
...
I found this bug(?) PR 224496 and tried the suggested "fix":
sysctl kern.cam.da.default_timeout=90 ...

Interesting. If this is true, then what you're saying is that with the old default (30 second) timeout, the IO stack gets timeouts on disk operations, but with a longer timeout (60 or 90 seconds) it does not. I find that believable. As surprising as it may sound, under intense workloads individual CMR disk drives can have IOs that validly take up to 2 seconds. This is not an error condition, nor does it involve the drive having to correct for read errors; it is just an effect of intense workload: if you queue dozens of IOs in a drive, then even though each IO takes 10-20ms on average, occasionally one will have to wait for a hundred or so others to complete. That was measured several years ago, on 3 and 4 TB 7200 RPM nearline drives. With modern 10+ TB drives at 5400 RPM, it could easily take longer. And with SMR, it could take significantly longer. Now, I'm not saying that *A LOT* of IOs take that long; only that such long IOs do happen (if you have a system with tens of thousands of drives, each continuously busy at roughly 100 IOps, then this will happen regularly). To quote an old colleague: "In a sufficiently large system, the unlikely happens all the time, and the impossible happens occasionally".
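
To put rough numbers on that (a back-of-the-envelope sketch, not a measurement):

Code:
# Assume a drive averaging ~15 ms per IO with a deep queue.
# If ~100 IOs happen to be queued ahead of yours:
#   100 x 15 ms = 1.5 s of queueing delay
# before your IO even starts -- approaching the 2 s latencies mentioned
# above, with no error, retry, or recovery involved.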

So it is possible that the combination of several factors means the old 30s timeout is no longer sufficient: larger (and therefore slower) drives; SMR drives being on average somewhat slower and occasionally a lot slower; ZFS being very good at throwing a lot of IOs at the disks at once, in particular when scrubbing; ZFS's IO pattern in general being biased towards seeks (which drives are slow at); and the CAM layer having become more efficient in FreeBSD 12.x (meaning more IOs get to the drive).

And on the LSI/Broadcom/Avago adapters, the effect of a command timeout is pretty heavyweight: the adapter and the whole SCSI stack get reset, all IOs in flight get cancelled, the application layer above (the ZFS file system) has to deal with all the cancellations, all IOs have to be restarted, and oodles of scary messages show up in the logs. And with error handling often being the weakest link in the software stack, this tends to trigger bugs, so it is best avoided.
 
It's been a few months since I applied the "fix":
sysctl kern.cam.da.default_timeout=90
sysctl kern.cam.ada.default_timeout=60

and I have had no more errors in my zpool scrubs. So something seems to be wrong, maybe in the drivers, but I'll gladly accept this workaround.
 