ZFS Random drive detachments from host

Your blanket indictment of WD is unjustified and over the top. Certainly they have had their share of customer-relations horror stories, but so have other vendors. And on average, the quality of their enterprise drives is no worse than that of their competitors Seagate and Toshiba, probably actually better. For a publicly visible example, look at Backblaze's failure statistics: both WD and HGST (a division of WD) branded drives beat Seagate significantly in reliability, with WD-branded drives the undisputed leader in that metric.
 
I've had lots of these issues, where a mechanical disk just suddenly disconnects from the SATA port, without any prior messages and without anything recorded in SMART.
It mainly concerns Seagate and WD 7k2 disks.
My impression is that these issues are related to the power supply. With a normal consumer PSU, whose cables carry two power connectors in series, I was not able to run a Seagate disk on the second connector: it would usually go offline at the first serious access (ZFS tasting the partitions).
After removing the stock power wiring and inserting hand-crafted wiring made from thick measurement wire, things got quite a bit better. There are still occasional problems, and I assume these are related to the SATA power connector, because there were never any problems with those drives plugged into a backplane, as long as the backplane power is properly supplied.

Sure, running standard consumer equipment with 18 disks (13 mechanical) is probably not what one is supposed to do or what would work out of the box. But one major problem I could identify is the wiring: I once measured a 0.5 V loss on the 12 V wire between the first and second connector. This is not related to the actual wattage of the PSU; I tried various ones, and a fully populated backplane would just not start, staggered or not, until I completely replaced the wiring.
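(For a sense of scale: if a spinning drive draws roughly 1.5–2 A from the 12 V rail, a 0.5 V drop implies something like 0.25–0.3 Ω of resistance in the wiring and connectors, which is a lot for a power path, and spin-up draws considerably more.)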

Conclusion: enterprise PSUs have their justification. But then, there is not that much magic in there, so you can either pay a few hundred bucks or DIY.
 
look at Backblaze's failure statistics
https://www.backblaze.com/blog/backblaze-drive-stats-for-2022/ for anyone curious. I admit, WD's stats are looking really good, at least for some models.

To add to the disconnect issue: you mentioned you replaced the wiring; does that include any splitters that might be in there?

If you run with just those drives on the main controller, do the disconnects occur as well? Are there any fans or other equipment you could unplug, which might cause voltage spikes or fluctuations? Might not be worth the hassle, I suppose, if you're getting a new, larger controller anyways.

I also once had issues with L-shaped connectors on SATA cables, but that probably was a one-off thing and I assume does not apply to you since you switched those cables already.
 
I don't give much attention to "$drive manufacturer sucks" type stuff simply because throughout my life and discussions with others, every single HD manufacturer apparently sucks to someone. I've met people who have collectively sworn off and recommended every single HD manufacturer out there. IMO it seems there are not "bad HD manufacturers" as much as there are "bad batches" of hard drives from time to time.

Likewise, recommending enterprise drives for a home server NAS that sees low-duty-cycle access to videos/movies/photos and documents seems like serious overkill to me.

As for my situation, well, I've installed the new HBA and re-added one drive, which (after 2 days of resilvering) is back in the pool, and everything seems to be working OK with three drives. I'm going to add the fourth and final one now, so fingers crossed it works.
 
Alas, it was not to be. When I added the fourth drive, two drives detached and re-attached during the resilver:

Code:
mps0: Controller reported scsi ioc terminated tgt 29 SMID 1372 loginfo 31110d00
(da11:mps0:0:29:0): WRITE(10). CDB: 2a 00 02 ae bd c8 00 01 00 00
(da11:mps0:0:29:0): CAM status: CCB request completed with an error
(da11:mps0:0:29:0): Retrying command, 3 more tries remain
(da11:mps0:0:29:0): WRITE(10). CDB: 2a 00 02 ae be d0 00 01 00 00
(da11:mps0:0:29:0): CAM status: CCB request completed with an error
(da11:mps0:0:29:0): Retrying command, 3 more tries remain
mps0: mpssas_prepare_remove: Sending reset for target ID 29
da11 at mps0 bus 0 scbus6 target 29 lun 0
da11: <ATA TOSHIBA HDWG11A 0603>  s/n 21N0A001FATG detached
(da11:mps0:0:29:0): WRITE(10). CDB: 2a 00 02 ae be d0 00 01 00 00
(da11:mps0:0:29:0): CAM status: CCB request aborted by the host
(da11:mps0:0:29:0): Error 5, Periph was invalidated
mps0: No pending commands: starting remove_device
(da11:mps0:0:29:0): WRITE(10). CDB: 2a 00 02 ae bd c8 00 01 00 00
(da11:mps0:0:29:0): CAM status: CCB request aborted by the host
(da11:mps0:0:29:0): Error 5, Periph was invalidated
mps0: Controller reported scsi ioc terminated tgt 24 SMID 1938 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 24 SMID 1163 loginfo 31110d00
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f ec 38 00 00 00 80 00 00
mps0: Controller reported scsi ioc terminated tgt 24 SMID 1821 loginfo 31110d00
(da8:mps0:0:24:0): CAM status: CCB request completed with an error
(da8:mps0:0:24:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 24 SMID 878 loginfo 31110d00
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f eb b0 00 00 00 80 00 00
mps0: Controller reported scsi ioc terminated tgt 24 SMID 1512 loginfo 31110d00
(da8:mps0:0:24:0): CAM status: CCB request completed with an error
(da8:mps0:0:24:0): Retrying command, 3 more tries remain
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f ea a8 00 00 00 80 00 00
da8: <ATA TOSHIBA HDWG11A 0603>  s/n X0R0A00BFATG detached
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f ec 38 00 00 00 80 00 00
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f eb b0 00 00 00 80 00 00
mps0: No pending commands: starting remove_device
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): READ(16). CDB: 88 00 00 00 00 03 7c 0f ea a8 00 00 00 80 00 00
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): READ(10). CDB: 28 00 04 09 f9 80 00 07 b8 00
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): READ(10). CDB: 28 00 04 09 f1 c0 00 07 b8 00
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): READ(10). CDB: 28 00 04 09 ea 08 00 07 b8 00
(da8:mps0:0:24:0): CAM status: CCB request aborted by the host
(da8:mps0:0:24:0): Error 5, Periph was invalidated
(da8:mps0:0:24:0): Periph destroyed
(da11:mps0:0:29:0): Periph destroyed
da8 at mps0 bus 0 scbus6 target 29 lun 0
da8: <ATA TOSHIBA HDWG11A 0603> Fixed Direct Access SPC-4 SCSI device
da8: Serial Number 21N0A001FATG
da8: 600.000MB/s transfers
da8: Command Queueing enabled
da8: 9537536MB (19532873728 512 byte sectors)
da11 at mps0 bus 0 scbus6 target 24 lun 0
da11: <ATA TOSHIBA HDWG11A 0603> Fixed Direct Access SPC-4 SCSI device
da11: Serial Number X0R0A00BFATG
da11: 600.000MB/s transfers
da11: Command Queueing enabled
da11: 9537536MB (19532873728 512 byte sectors)

The drives I added were da9 and da11, but da8 and da11 detached here, so it is not specific to slot or drive.

So the Saga continues :rolleyes: At least with the HBA I actually get errors reported now.

At this point I will probably revisit the PSU/power as a potential cause, despite the fact that I saw no voltage drops when I measured, and that the entire system draws no more than 300W of the 750W the PSU can provide.
I don't have an "enterprise" spec PSU, and I probably won't be able to get hold of one easily, so for the moment I will try to hook the drives up to a second ATX PSU. That way they will have all the power they need, and I will see if the issues go away.
 
So you're seeing this issue on both the built in(?) controller and the LSI HBA? That would point me to either motherboard/memory/cpu and/or PSU issues.
What does smartctl -a /dev/da8 say about the drive (smartmontools)? I would also keep an eye on temperatures overall; LSI HBAs run hot and need active cooling. That wouldn't explain the issues you're seeing on the built-in(?) controller though; but different drives, so something else might be going on.
 
So you're seeing this issue on both the built in(?) controller and the LSI HBA?
Correct

That would point me to either motherboard/memory/cpu and/or PSU issues.

I don't think so. For one, the motherboard/memory/CPU have been working fine throughout, even with all CPUs pegged at 100% for days on end while doing some computational work. Secondly, there are 10 other drives attached to this machine (2 on the built-in controller, 12 on the HBA), and all of them work flawlessly.

The problem is only with these 4 drives, which is why I thought it had something to do with the drives, cables or the slots themselves. Power issues were next in line, the thinking being that as these are 3.5" disk drives their higher power draw might cause issues, but then I would expect all 12 drives to randomly drop off if power quality degraded.

At this point I will next try running the 4 drives on a separate ATX PSU, to take power out of the equation (or find out that it is the root cause); after that I am not sure what to try next.

One thing I did notice is that in the last couple of months (while waiting for delivery of the new HBA) I didn't bother re-adding the drives and left the array as it was. It ran with 2 drives this entire time without anything dropping off. I added the third and it was fine, but when I added the fourth I started getting detachments.

If the separate PSU does not solve the issue and I can't think of anything else, I may well destroy the array and rebuild it with three drives: either raidz1, so I keep the space at the cost of redundancy, or raidz2, with 10TB less available.
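Just to sketch the two layouts I have in mind (pool and device names below are placeholders, not my actual configuration):

Code:
zpool create tank raidz1 da8 da9 da10    # ~20TB usable, survives one drive failure
zpool create tank raidz2 da8 da9 da10    # ~10TB usable, survives two drive failures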


What does smartctl -a /dev/da8 say about the drive (smartmontools)?

I didn't see anything untoward on da8, but on da11 I did see this:

Code:
Error 63 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 e0 00 00 00 40  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 e0 00 00 00 40 00      00:01:08.239  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      00:01:08.239  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:01:08.239  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:01:08.239  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00      00:01:08.239  SET FEATURES [Enable SATA feature]

Error 62 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 60 00 00 00 40  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 60 00 00 00 40 00      00:01:07.909  READ FPDMA QUEUED
  60 01 58 40 00 00 40 00      00:01:07.897  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      00:01:07.897  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:01:07.897  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:01:07.897  SET MULTIPLE MODE

Error 61 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 20 40 00 00 40  Error: ICRC, ABRT at LBA = 0x00000040 = 64

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 01 20 40 00 00 40 00      00:01:07.564  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      00:01:07.564  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:01:07.564  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:01:07.564  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00      00:01:07.564  SET FEATURES [Enable SATA feature]

Error 60 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 0f 02 00 40  Error: ICRC, ABRT at LBA = 0x0000020f = 527

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 00 02 00 40 00      00:01:07.284  READ DMA
  ef 02 00 00 00 00 40 00      00:01:07.284  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:01:07.284  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:01:07.284  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00      00:01:07.284  SET FEATURES [Enable SATA feature]

Error 59 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 0f 02 00 40  Error: ICRC, ABRT at LBA = 0x0000020f = 527

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 00 02 00 40 00      00:01:06.956  READ DMA
  ef 02 00 00 00 00 40 00      00:01:06.955  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:01:06.955  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:01:06.955  SET MULTIPLE MODE
  ef 10 02 00 00 00 40 00      00:01:06.955  SET FEATURES [Enable SATA feature]

63 errors in total; the visible ones are posted above. Not sure if these are a root cause or a symptom of something else. I checked the other drives and da10 has errors as well. Does anyone know what could be the cause of this?
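In the meantime I'll also keep an eye on SMART attribute 199 (UDMA_CRC_Error_Count) across the four drives, assuming they all expose it, with something like:

Code:
# foreach i ( 8 9 10 11 )
foreach? smartctl -a /dev/da$i | grep -E 'Serial Number|UDMA_CRC_Error_Count'
foreach? end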

I would also keep an eye on temperatures overall; LSI HBAs run hot and need active cooling. That wouldn't explain the issues you're seeing on the built-in(?) controller though; but different drives, so something else might be going on.

If the HBA were overheating, would I not see random issues across all 12 drives, not just these 4? Also, originally this problem only occurred on the internal SATA controller. I replaced the 8-port HBA with a 16-port one so I could hook these drives up to it as well, thinking the problem was with the internal SATA controller. Looks like that wasn't the root cause in the end.
 
What's the FreeBSD version you're running here? ( freebsd-version -kru). Just to be complete, you are running bare metal (no vm)?
 
Did you disable the power saving features on the disk?
I made no changes to any of the disks. Whatever settings they came with when I bought them have remained (I don't think I've ever changed the power saving features on a disk before).

What's the FreeBSD version you're running here? ( freebsd-version -kru). Just to be complete, you are running bare metal (no vm)?

Code:
# freebsd-version  -kru
13.1-RELEASE-p6
13.1-RELEASE-p6
13.1-RELEASE-p7

It is bare metal.
 
Did you search Google for WDTLER and WDIDLE3?

Long story short: when a disk is trying to read a slow/bad sector, it has a predefined timeout before reporting the sector as bad and reallocating it.
During that time, if it is queried by the controller or issued a command and doesn't respond, it gets kicked out, because a RAID controller doesn't wait 30 seconds for a disk to respond; it marks the disk bad and removes it from the RAID volume. The same goes for a disk that is idle / spun down. So try disabling the spin-down and reducing the TLER to 7 or 6 seconds.

Seagate uses the same mechanism, error recovery control (ERC), on their enterprise disks; it limits this timeout so the disk can report and reallocate the bad sector without being kicked out of the RAID volume.
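For what it's worth, on drives that support SCT Error Recovery Control you can usually query and change that timeout with smartmontools (whether a given drive supports it is something to check first), roughly like this:

Code:
# query the current ERC setting (only if the drive supports SCT ERC):
smartctl -l scterc /dev/da11
# set read and write recovery timeouts to 7 seconds (values are in 100 ms units):
smartctl -l scterc,70,70 /dev/da11

Note that on most drives this setting does not survive a power cycle, so it has to be reapplied at boot.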
 
Did you search Google for WDTLER and WDIDLE3?

Long story short: when a disk is trying to read a slow/bad sector, it has a predefined timeout before reporting the sector as bad and reallocating it.
During that time, if it is queried by the controller or issued a command and doesn't respond, it gets kicked out, because a RAID controller doesn't wait 30 seconds for a disk to respond; it marks the disk bad and removes it from the RAID volume. The same goes for a disk that is idle / spun down. So try disabling the spin-down and reducing the TLER to 7 or 6 seconds.

Seagate uses the same mechanism, error recovery control (ERC), on their enterprise disks; it limits this timeout so the disk can report and reallocate the bad sector without being kicked out of the RAID volume.

I did not; this is the first time I've even considered that the drive configuration could be an issue. As it was, these drives have worked fine since they were first installed in 2018, hence I never thought a config change would be needed. Plus, a long time ago when I fiddled with drive configs I managed to brick the drive, and since then I don't like to fiddle with drive controller settings.

What you say makes sense, however: if the drives are wearing out, then eventually I would start seeing these issues even if no configs were changed. So I've searched now; WDTLER and WDIDLE3 seem to be specific to WD drives only. Can this be an issue with non-WD drives as well? The two that just dropped out are Toshiba drives, and the WD Red I have in there has no errors and is working fine. Considering I am already running 2 drives down, fiddling with one of the drives that isn't having issues is not something I am comfortable doing. If I lose a third drive, I lose the array.

My Toshiba drives are "N300 NAS" drives, so they should already come with power management heavily reduced, if not disabled. A quick search says the "hdparm -B" and "hdparm -S" options can disable power management on them if it isn't already done, but that is a Linux utility. Is there an equivalent on FreeBSD that allows me to set drive parameters?
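For reference, the Linux invocations I mean are along these lines (sdX being a placeholder):

Code:
hdparm -B 255 /dev/sdX    # disable Advanced Power Management
hdparm -S 0 /dev/sdX      # disable the standby (spin-down) timer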
 
So you are mixing and matching the following drives:
May 15 05:05:52 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH2P detached

May 17 09:52:15 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> s/n ZS517X0M detached

da11: <ATA TOSHIBA HDWG11A 0603> s/n 21N0A001FATG detached

Western Digital WD Red Plus 3.5 10TB 7200rpm 256MB SATA3
WD101EFBX

Seagate IronWolf ST10000NE0008 10 TB
ST10000NE0008

Toshiba N300 10TB
HDWG11A

The probability of having the same issue with 3 different drives from 3 different manufacturers is very low. Did you test them on another system, and are there any power settings regarding SATA in the BIOS of the current one?
 
So you are mixing and matching the following drives:


Western Digital WD Red Plus 3.5 10TB 7200rpm 256MB SATA3
WD101EFBX

Seagate IronWolf ST10000NE0008 10 TB
ST10000NE0008

Toshiba N300 10TB
HDWG11A

The probability of having the same issue with 3 different drives from 3 different manufacturers is very low. Did you test them on another system, and are there any power settings regarding SATA in the BIOS of the current one?
Yes, I run a mix to avoid "bad batches" of drives, and to avoid all the drives having similar MTBFs, which would make them more likely to fail around the same time (most likely when resilvering).

Yes, I have tested them on another system. I also put them in a Linux machine and ran "badblocks -w" on them, with no errors, drop-outs or other issues.
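(For completeness, the run was along the lines of the following, with -w being the destructive write test, so only for drives holding no data you care about:)

Code:
badblocks -wsv /dev/sdX    # write-mode test, show progress (-s), verbose (-v); sdX is a placeholder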
 
Are there any pending or reallocated sectors in the SMART status?
Doesn't look like there are:

Code:
 # foreach i ( 8 9 10 11 )
foreach? smartctl -a /dev/da$i | grep Sector
foreach? end
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 
I got another catastrophic I/O failure on the pool today, so I took advantage of the reboot and have now fitted a new ATX PSU just to power the four drives. I'll re-run and see if the problem goes away.
 
Well two days in, it is a mixed bag.

The good news is that the hard drives have been behaving very well since I put them on their own ATX PSU. I have had no drop outs, and was even able to do a full zfs scrub without issue. It seems fine at the moment.

However, in the last two days I had two SSDs fault out with errors on the other pool:
Code:
  pool: storagefast
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 0B in 01:08:14 with 0 errors on Tue Aug  1 01:08:14 2023
config:

	NAME         STATE     READ WRITE CKSUM
	storagefast  DEGRADED     0     0     0
	  raidz1-0   DEGRADED     0     0     0
	    da2      ONLINE       0     0     0
	    da0      ONLINE       0     0     0
	    da6      ONLINE       0     0     0
	    da5      FAULTED    227   885     0  too many errors
	  raidz1-1   DEGRADED     0     0     0
	    da1      ONLINE       0     0     0
	    da7      ONLINE       0     0     0
	    da4      ONLINE       0     0     0
	    da3      FAULTED     20 2.06K     0  too many errors

This pool has had no problems up till now. It is also a different failure mode from the hard drives, in that these didn't drop off; they just kept generating errors until ZFS faulted them. I am not sure if this is just an unfortunate coincidence, or whether moving the hard drives off the main PSU has uncovered a new problem...

Edit:

Getting a lot of errors like this:

Code:
mps0: Controller reported scsi ioc terminated tgt 17 SMID 757 loginfo 31120303
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 45 9a 28 00 00 d0 00 
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 3 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 4d 44 58 00 00 80 00 
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 3 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 45 9a 28 00 00 d0 00 
(da1:mps0:0:17:0): CAM status: SCSI Status Error
(da1:mps0:0:17:0): SCSI status: Check Condition
(da1:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:17:0): Retrying command (per sense data)
(da1:mps0:0:17:0): READ(10). CDB: 28 00 14 1c b0 88 00 00 68 00 
(da1:mps0:0:17:0): CAM status: SCSI Status Error
(da1:mps0:0:17:0): SCSI status: Check Condition
(da1:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:17:0): Retrying command (per sense data)

da1 is an SSD, but it is not one of the two that have faulted so far. As these are raidz1 vdevs, another failure will trash the array, so I am currently backing up what I can before I dig further.
 
This strange behaviour is making me think that I may have a PSU that is failing in an odd way, causing spurious errors to manifest.

Some people in this thread mentioned "enterprise" level PSUs, which may be more reliable. Which ones are available in ATX format? Most of the ones I know of have proprietary form factors and connectors, which precludes them from working with standard components.
 
Does SMART say anything unusual about da1, da3, or da5?

Given your circumstances, the PSU is one of the very first things I would have switched out.

Have you ruled out over-temperature as an issue?
 
(da1:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Either that device had a short power interruption (where short could mean: milliseconds), or its HBA performed a bus reset. The latter seems very unlikely, and you would probably see lines in dmesg if the HBA resets the devices. I suspect your power situation isn't good enough yet.
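A quick way to check for the latter is to grep the kernel messages for controller-initiated resets, something like:

Code:
dmesg | grep -i -E 'reset.*target|bus reset'

If the HBA were resetting devices, lines similar to the mps0 "Sending reset for target ID" messages posted earlier should show up there.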
 
Does SMART say anything unusual about da1, da3, or da5?

No, Smartctl reports all drives healthy with no errors.

Given your circumstances, the PSU is one of the very first things I would have switched out.

Well, given all the tests I did (including monitoring the power lines), it didn't seem likely to be a power issue, especially as I had changed nothing on the machine power-wise. It still has the same power consumption as when I first built it, so I didn't expect that to become an issue.

Likewise, I've never seen a PSU fail like this. Usually they just die, either entirely, or a single voltage rail starts faltering. I saw neither here, even with a scope attached.

The PSU itself is 750W, and my UPS reports the machine does not exceed 310W under full load, so it isn't like I am running the PSU beyond its capabilities either.

Have you ruled out over-temperature as an issue?

Temperature of the SSDs is between 35 and 43°C, which is within the usual range for summer and no different from the rest of the SSDs in the pool.
 
Either that device had a short power interruption (where short could mean: milliseconds), or its HBA performed a bus reset. The latter seems very unlikely, and you would probably see lines in dmesg if the HBA resets the devices. I suspect your power situation isn't good enough yet.

Thanks. I'm surprised I would have more power issues after actually removing load from the PSU (as it no longer powers the hard drives). The second ATX PSU is generic and does not have enough connectors for the SSD pool as well as the hard drives, so I can't easily switch the SSDs over to see if the problem goes away.
 
Well, given all the tests I did (including monitoring the power lines), it didn't seem likely to be a power issue, especially as I had changed nothing on the machine power-wise.
I do appreciate that you have looked at just about everything carefully, including the power supply.

By component switching, you appear to have eliminated controllers, backplanes, data cables, power cables, and individual disks.

Your symptoms are compatible with a power "glitch".

So it would be sensible to drill down on external power, UPS, PSU, and motherboard power regulation.

Maybe there's an outside chance of a faulty UPS (e.g. could it be switching to bypass mode?).

I'd replace the PSU just to eliminate it as the cause. The worst outcome is that you get a spare PSU.

Faulty power regulation on the motherboard would fit the symptoms. That's where I would go next.
 