OK, so since the last update I have tried a different motherboard (with no change in results) and installed Linux on the machine. Linux turned out to be of limited use: the ZFS versions were different, so I could not import the pools. In the end all I did was some random I/O testing and CPU calculations. I saw similar errors in dmesg, but no drives dropping off the bus like in FreeBSD.
However, I have to say it was not a proper like-for-like comparison, as I could not load the ZFS arrays or run the same jobs as on the FreeBSD machine.
So I moved on to the next thing, which was another complete disassembly and reassembly of the machine. I cleaned all the contact pins, fitted new fans throughout, checked everything, and re-wired the power feeds to the drives. I then reinstalled FreeBSD and rebuilt the "storagefast" array as a raid-z3 for extra redundancy.
With a clean install, the drives no longer randomly fall out of the zpools. Instead the entire pool hangs while "Retrying command" and similar errors fill up my message logs. This happens every minute or so and greatly slows down the array. Unlike the earlier errors, which occurred under high load, this now happens all the time regardless of load.
As it was always the same drive that was showing the error, I decided to manually offline this drive to see if the problem went away. The problem did go away, but then re-appeared on another drive that had shown no errors before. I then offlined that drive, only for the errors to move to a third drive that had also shown no errors before. At this point I can't offline any more drives: ZFS won't let me, as there are not enough drives left for the pool to keep functioning.
I am scratching my head here, because I have never seen what appears to be a hardware fault jump around as if it were a software fault. I have tried different SATA channels and cables, but the errors just keep moving from drive to drive.
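To check whether the errors really do move between drives, rather than going by eye, the log lines can be tallied per device and target. This is only a minimal sketch: the function name `tally_cam_errors` and the inline sample are made up for illustration, and it assumes the messages keep the `(daN:mpsN:bus:target:lun):` prefix seen in my logs.

```python
import re
from collections import Counter

# Matches the CAM peripheral prefix used in the kernel messages, e.g.
# "(da1:mps0:0:19:0): CAM status: ...". search() is used so that a
# syslog timestamp/host prefix in front of it does not matter.
CAM_STATUS = re.compile(r"\((da\d+):(mps\d+):\d+:(\d+):\d+\): CAM status: (.+)")


def tally_cam_errors(lines):
    """Count 'CAM status' lines per (device, target, status text)."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line)
        if match:
            device, _controller, target, status = match.groups()
            counts[(device, int(target), status.strip())] += 1
    return counts


# Hypothetical sample; in practice, feed in the lines of the saved log.
sample = [
    "(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fd c8 00 00 38 00",
    "(da1:mps0:0:19:0): CAM status: CCB request completed with an error",
    "(da1:mps0:0:19:0): Retrying command, 3 more tries remain",
    "(da1:mps0:0:19:0): CAM status: SCSI Status Error",
]

for (device, target, status), n in tally_cam_errors(sample).most_common():
    print(f"{device} tgt {target}: {n} x {status}")
```

Running this over the log before and after offlining a drive would show whether the counts shift to a different `da` device or target number.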
The errors are the same as before, with the below on eternal repeat:
Code:
mps0: Controller reported scsi ioc terminated tgt 19 SMID 1711 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fd c8 00 00 38 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fe 80 00 00 50 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 07 59 1a f8 00 00 a0 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 19 SMID 1271 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 20 00 00 b0 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 b0 00 00 08 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c6 20 00 00 f8 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
(da1:mps0:0:19:0): READ(10). CDB: 28 00 17 5f 65 90 00 00 08 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:19:0): Retrying command (per sense data)
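For anyone wanting to see which part of the disk each failed read was aimed at, the CDB bytes in the log can be decoded. In a READ(10) CDB, byte 0 is the opcode (0x28), bytes 2 to 5 are the logical block address and bytes 7 to 8 are the transfer length in blocks, both big-endian. The sketch below is only illustrative, and the helper name `decode_read10` is my own:

```python
def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks requested.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8
    return lba, blocks


# CDBs copied from the log above.
cdbs = [
    "28 00 12 66 fd c8 00 00 38 00",
    "28 00 12 66 fe 80 00 00 50 00",
    "28 00 07 59 1a f8 00 00 a0 00",
    "28 00 13 da c7 20 00 00 b0 00",
]

for cdb in cdbs:
    lba, blocks = decode_read10(cdb)
    print(f"LBA {lba} ({lba:#x}), {blocks} blocks")
```

If the failing LBAs are spread all over the disk, as they appear to be here, that would point away from a single bad region on the platters, though it proves nothing on its own.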