ZFS Random drive detachments from host

Hi all,

I have been grappling with a problem with my ZFS array for the last year or so, and it has now reached the point where I have run out of ideas about what may be wrong, so I am posting in the hope that someone has some advice.

I have a machine with 14 drives, configured like so:

4x 10TB HDD in ZFS pool (storage)
1x 1TB HDD in ZFS pool (root fs)
8x 240GB SSD in ZFS pool (fast storage)
1x 6TB HDD as UFS filesystem

The 8x SSDs are attached to an HBA, and the other 6 drives are on the motherboard's internal SATA controller.

My issue is with the 4x 10TB array. When first built it worked fine, but after a year or so I found that drives would be marked "REMOVED" in the zpool. Looking at the messages I would see the following:

Code:
May 15 05:05:52 Mnemosyne kernel: ada1 at ahcich4 bus 0 scbus2 target 0 lun 0
May 15 05:05:52 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH2P detached
May 15 05:06:04 Mnemosyne kernel: (ada1:ahcich4:0:0:0): Periph destroyed
May 15 05:06:04 Mnemosyne kernel: ada1 at ahcich4 bus 0 scbus2 target 0 lun 0
May 15 05:06:04 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> ACS-2 ATA SATA 3.x device
May 15 05:06:04 Mnemosyne kernel: ada1: Serial Number VCPTJH2P
May 15 05:06:04 Mnemosyne kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 15 05:06:04 Mnemosyne kernel: ada1: Command Queueing enabled
May 15 05:06:04 Mnemosyne kernel: ada1: 9537536MB (19532873728 512 byte sectors)

It looks like the drive just detached, and was then destroyed and re-added a few seconds later. There are no other errors. Originally it would not happen often, once every few weeks, but it became more and more frequent. By the end, a drive would drop off a few seconds after being re-added, and I would end up with multiple drives detaching before the array managed to resilver, resulting in data corruption.

In an effort to resolve the problem:
- I checked all the drives' SMART status with smartctl; all came back healthy
- I pulled drives that were failing and ran a badblocks test on them to see if they would drop off under sustained I/O or log any errors (no issues found; see the example after this list)
- I cleaned and re-seated the SATA and power connectors
- I replaced the SATA cables
- I swapped the SATA cables around between the 6 drives to see if the detachments followed a certain cable/port
- I bought two new sets of 10TB drives from different manufacturers, thinking I had a faulty batch of drives despite the SMART status and testing.
- I bypassed the drive caddy and connected the drives directly, thinking there was a fault with the backplane
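
For reference, this is the sort of badblocks run I mean (badblocks comes from the e2fsprogs port; the -w mode shown here is the destructive write-and-verify pass, only appropriate for a drive that is already out of the pool):

Code:
# sketch only -- device name is an example; -w overwrites the whole disk
badblocks -wsv /dev/ada1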

The only bits I did not touch are:

- Trying a different SATA controller: For one, it is onboard and I have no expansion slots for another controller. For another, there are 6 drives attached to that controller, and the other two drives have never detached on me; they have been rock solid. Likewise, moving the cables around did not cause the drop-offs to move with them, which is what I would have expected if there were a problem with specific ports.

- Trying a different PSU: There are 14 drives connected to this PSU. If it were related to the PSU failing, I would expect random drop-offs across all the drives (not to mention general system instability); however, I only see it with the four mentioned.

At this point I have run out of ideas; there is not much else I can think of to work out what the actual problem is. Has anyone seen something like this before?
 
What is in the log *before* the lines you pasted? Any errors or other information from that drive? Some BIOSes have various settings for SATA power management; try switching that off (if it exists).

(adding) Please also try booting in verbose mode (boot -v) to see if that would provide any hints on what exactly is happening.
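
(On the FreeBSD side there is a matching knob; a minimal /boot/loader.conf sketch, assuming the ahci(4) pm_level hint applies to your setup, with the channel number taken from your log:)

Code:
# sketch only -- assumes the ahci(4) hint.ahcich.X.pm_level tunable applies here
# 0 disables interface power management on that channel; repeat per channel
hint.ahcich.4.pm_level="0"
# make every boot verbose, equivalent to boot -v
boot_verbose="YES"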
 
It looks like the drive just detached, and was then destroyed and re-added a few seconds later.
Could be an issue with the backplane/port-extender. I once had a server with a broken backplane; drives kept randomly dropping off the bus and attaching again.
 
No errors whatsoever, and no other related messages that I can see. The messages before it are just general logs. Here are fresh detachments, followed by me re-adding the drive to the pool once it re-attaches:

Code:
root@Mnemosyne:~ # tail -f /var/log/messages
May 17 09:52:15 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> s/n ZS517X0M detached
May 17 09:53:02 Mnemosyne kernel: (ada4:ahcich9:0:0:0): Periph destroyed
May 17 09:53:02 Mnemosyne kernel: ada4 at ahcich9 bus 0 scbus5 target 0 lun 0
May 17 09:53:02 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> ACS-3 ATA SATA 3.x device
May 17 09:53:02 Mnemosyne kernel: ada4: Serial Number ZS517X0M
May 17 09:53:02 Mnemosyne kernel: ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 17 09:53:02 Mnemosyne kernel: ada4: Command Queueing enabled
May 17 09:53:02 Mnemosyne kernel: ada4: 9537536MB (19532873728 512 byte sectors)
May 17 09:53:02 Mnemosyne ZFS[95072]: vdev state changed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:53:02 Mnemosyne ZFS[95644]: vdev is removed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:54:00 Mnemosyne ZFS[6893]: vdev state changed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:54:26 Mnemosyne kernel: ada4 at ahcich9 bus 0 scbus5 target 0 lun 0
May 17 09:54:26 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> s/n ZS517X0M detached
May 17 09:55:02 Mnemosyne kernel: (ada4:ahcich9:0:0:0): Periph destroyed
May 17 09:55:02 Mnemosyne kernel: ada4 at ahcich9 bus 0 scbus5 target 0 lun 0
May 17 09:55:02 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> ACS-3 ATA SATA 3.x device
May 17 09:55:02 Mnemosyne kernel: ada4: Serial Number ZS517X0M
May 17 09:55:02 Mnemosyne kernel: ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 17 09:55:02 Mnemosyne kernel: ada4: Command Queueing enabled
May 17 09:55:02 Mnemosyne kernel: ada4: 9537536MB (19532873728 512 byte sectors)
May 17 09:55:02 Mnemosyne ZFS[43976]: vdev state changed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:55:02 Mnemosyne ZFS[44498]: vdev is removed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:56:00 Mnemosyne ZFS[54513]: vdev state changed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:56:23 Mnemosyne kernel: ada4 at ahcich9 bus 0 scbus5 target 0 lun 0
May 17 09:56:23 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> s/n ZS517X0M detached
May 17 09:57:02 Mnemosyne kernel: (ada4:ahcich9:0:0:0): Periph destroyed
May 17 09:57:02 Mnemosyne kernel: ada4 at ahcich9 bus 0 scbus5 target 0 lun 0
May 17 09:57:02 Mnemosyne kernel: ada4: <ST10000NE0008-1ZF101 SN03> ACS-3 ATA SATA 3.x device
May 17 09:57:02 Mnemosyne kernel: ada4: Serial Number ZS517X0M
May 17 09:57:02 Mnemosyne kernel: ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 17 09:57:02 Mnemosyne kernel: ada4: Command Queueing enabled
May 17 09:57:02 Mnemosyne kernel: ada4: 9537536MB (19532873728 512 byte sectors)
May 17 09:57:02 Mnemosyne ZFS[63557]: vdev state changed, pool_guid=5865384698702913917 vdev_guid=979139601836451568
May 17 09:57:02 Mnemosyne ZFS[64131]: vdev is removed, pool_guid=5865384698702913917 vdev_guid=979139601836451568


Today seems to be a bad day; the drives are dropping every few seconds, so the logs are filled with the above repeating. It seems odd that power management would be that aggressive, but I will reboot the machine and check the BIOS settings (not that I have changed them in years).
 
Could be an issue with the backplane/port-extender. I once had a server with a broken backplane; drives kept randomly dropping off the bus and attaching again.

Yes, I suspected that as well at first, which is why I bypassed the caddy and its backplane and connected the drives directly for testing. Still had the same issue.
 
- I checked all the drives' SMART status with smartctl; all came back healthy

If you are referring to that "health self-assessment result" - this is completely worthless. I've never seen any other value than "PASSED", even for drives that were on their very last breath and losing sectors left and right.
Only look at the attributes and their values.
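
(If it helps, the raw attribute table is the part worth reading; a sketch with a placeholder device name - attribute 199 in particular is the one that climbs on cabling/signal problems:)

Code:
# device name is just an example
smartctl -A /dev/ada1
# attributes worth watching for this kind of fault:
#   5  Reallocated_Sector_Ct
# 197  Current_Pending_Sector
# 199  UDMA_CRC_Error_Count   (increments on cable/connector/signal errors)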

Because you already mentioned the PSU: Is this an actual server or a repurposed desktop system with a standard ATX PSU? Those have multi-rail designs that can barely support any real load on single rails and are mostly designed to support high-wattage CPUs and GPUs, but only very low loads e.g. on 5V for peripherals like disks. There's a reason why servers have single-rail PSUs that just have a single secondary voltage.
I'd try replacing that PSU or lowering the load by e.g. temporarily disconnecting the backplane or disks holding that 8-disk pool. If those are enterprise(-ey) SSDs, they usually have some form of PLP or at least some small caps and due to the lower power requirements they might handle small power drops better than spinning disks. You could/should also monitor the voltages to see if there are any irregularities or sudden drops. Usually the BMC takes care of that - but on desktop hardware you may have to resort to using multimeters depending on how the rails are distributed - e.g. if the mainboard gets its own rail, the voltages measured by the board are *not* the same as what e.g. your drives are getting.
 
In addition to what Cracauer said ...

- Trying a different PSU: There are 14 drives connected to this PSU. If it were related to the PSU failing, I would expect random drop-offs across all the drives (not to mention general system instability); however, I only see it with the four mentioned.
Maybe those 4 drives are just more sensitive to power fluctuations?

Any way you can move some of the drives to a separate power supply? Like putting a second case next to this server and using some power splitters and extenders? At this point you have no other logical explanation, and few options, so try something.
 
If you are referring to that "health self-assessment result" - this is completely worthless. I've never seen any other value than "PASSED", even for drives that were on their very last breath and losing sectors left and right.
Only look at the attributes and their values.
I did look at the attributes/values as well; nothing untoward. Plus I bought three new 10TB drives, stuck them in there, and they also keep dropping off. I now have a Toshiba N300, a WD Red, and a Seagate IronWolf Pro all showing the same failure mode.
Because you already mentioned the PSU: Is this an actual server or a repurposed desktop system with a standard ATX PSU? Those have multi-rail designs that can barely support any real load on single rails and are mostly designed to support high-wattage CPUs and GPUs, but only very low loads e.g. on 5V for peripherals like disks. There's a reason why servers have single-rail PSUs that just have a single secondary voltage.
I'd try replacing that PSU or lowering the load by e.g. temporarily disconnecting the backplane or disks holding that 8-disk pool. If those are enterprise(-ey) SSDs, they usually have some form of PLP or at least some small caps and due to the lower power requirements they might handle small power drops better than spinning disks. You could/should also monitor the voltages to see if there are any irregularities or sudden drops. Usually the BMC takes care of that - but on desktop hardware you may have to resort to using multimeters depending on how the rails are distributed - e.g. if the mainboard gets its own rail, the voltages measured by the board are *not* the same as what e.g. your drives are getting.

This is my home lab, so not a purpose-built server. The case is a 4U server case, and came with built-in backplanes and caddies. The rest are desktop components (with the exception of the LSI HBA card): an AMD Ryzen 9 CPU, an ASUS PRIME X570-PRO motherboard, and 64GB of RAM.

The PSU is a Thermaltake, unfortunately mounted such that I can't read the model number, but from memory it is the M750W. It should be able to sustain 62A on the 12V line and 25A on the 5V line.

Monitoring is very limited unfortunately; the motherboard only provides CPU voltages. However, I do have a dual-channel oscilloscope, so I have connected that up to the same bus that powers the 10TB drives and am now waiting for a drive to drop off again.

With everything working as it should:
- The 5V line swings between 5.0 and 4.8V
- The 12V line swings between 12.6 and 12.2V

I will keep monitoring. I've also been counting the number of times the drives have dropped out of the ZFS array; from 9:50 AM UTC until now I have had 109 drop-outs:

Code:
Total drive removals: 109
Breakdown by drive:
 23 ada1p1
 86 ada4p1

Only two of the four drives keep dropping off, and one almost 4 times more often than the other. Not sure if that points to anything specific. Both drives are the IronWolf ones, which replaced the two Toshiba N300s that were exhibiting the same issue.
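
For what it's worth, the counting is nothing sophisticated, roughly a grep over /var/log/messages like the following (the patterns are approximations of what I match on):

Code:
# rough sketch of the counting -- patterns and log path are approximate
grep -c 'vdev is removed' /var/log/messages
# per-device breakdown from the kernel detach lines:
grep 'detached' /var/log/messages | awk '{print $6}' | sort | uniq -c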
 
At this point I would exchange the 8-port controller for a 16-port and hook up the 10 TB drives there.
If I had one kicking around I would have done that already :)

However I don't, and before I go and buy one (and wait a few months for it to arrive), I thought I would ask here to make sure I've ruled out anything else that I may not have thought of (especially if it turns out to be something small/silly I have overlooked). It is bad enough that I have bought so many 10TB drives chasing this problem down as it is 🤣
 
In addition to what Cracauer said ...


Maybe those 4 drives are just more sensitive to power fluctuations?

Any way you can move some of the drives to a separate power supply? Like putting a second case next to this server and using some power splitters and extenders? At this point you have no other logical explanation, and few options, so try something.
Possibly. I guess it would depend on whether I see any voltage drops or similar on the power lines when I next get a drop-off. If things point to the PSU, I may try wiring up a secondary PSU just for the drives and see if that solves the issue. If it does, I'll go shopping for a new PSU with the capabilities I need.
 
With a very similar problem, I realised that I could troubleshoot nothing unless I got in some spare parts to allow swapping things around. So I sourced a new LSI controller, new "octopus" fan-out SATA cables, and a new PSU.

I put in the new PSU first, as it was the least disruptive change. There was no improvement.

I had to make some room on the PCIe bus (I took out an Ethernet controller) before I installed the second LSI controller. I then started moving disks across to it, on new SATA cables, one at a time.

The cause was eventually traced to one bad 4x fan-out SATA cable from the original LSI controller. It took a lot of accurate book-keeping to figure that out.

When the fault-finding was done, I had a second LSI controller with new SATA cables. That allowed me to attach each side of each mirror to a different controller. So, significantly improved redundancy, and money well spent.

My new power supply is also an improvement. I chose a good quality one, and over-spec'd it. Since its fan only starts under significant load, and then runs at variable speed, it's usually very quiet.

I now have a spare power supply for fault finding, and plenty of spare SATA ports in the server.
 
Unixnut,
Is your server connected to the outlets directly or through a UPS? In my house, where I also have my home laboratory, power surges are common (they are "invisible to the eye"; even household appliances are not sensitive to them). I had similar problems before I connected my server to a UPS.
I have an Eaton Powerware 9130:

Code:
[lanin@freebsd ~]$ upsc pw9130@localhost
battery.charge: 100
battery.runtime: 1706
battery.type: PbAc
device.mfr: EATON Powerware
device.model: 9130 1000VA-T
device.serial: PL342A5660
device.type: ups
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.synchronous: auto
driver.version: 2.8.0
driver.version.data: MGE HID 1.46
driver.version.internal: 0.47
driver.version.usb: libusb-1.0.0 (API: 0x1000102)
input.frequency: 49.9
input.transfer.high: 276
input.transfer.low: 140
input.voltage: 223.0
input.voltage.nominal: 220
outlet.1.delay.shutdown: 65535
outlet.1.delay.start: 0
outlet.1.desc: PowerShare Outlet 1
outlet.1.id: 2
outlet.1.status: on
outlet.1.switchable: yes
outlet.2.delay.shutdown: 65535
outlet.2.delay.start: 0
outlet.2.desc: PowerShare Outlet 2
outlet.2.id: 3
outlet.2.status: on
outlet.2.switchable: yes
output.current: 1.30
output.frequency: 49.9
output.frequency.nominal: 50
output.voltage: 219.0
output.voltage.nominal: 220
ups.beeper.status: enabled
ups.delay.shutdown: 20
ups.delay.start: 30
ups.firmware: 0130
ups.load: 28
ups.load.high: 102
ups.mfr: EATON Powerware
ups.model: 9130 1000VA-T
ups.power: 287
ups.power.nominal: 1000
ups.productid: ffff
ups.realpower: 254
ups.serial: PL342A5660
ups.status: OL
ups.temperature: 25.9
ups.test.result: Done and passed
ups.timer.shutdown: -1
ups.timer.start: -1
ups.vendorid: 0463

To be honest, I don't have a server, but rather a workstation. It has:

Code:
[lanin@freebsd ~]$ sudo camcontrol devlist
<WDC WD6003FRYZ-01F0DB0 01.01H01> at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD4003FRYZ-01F0DB0 01.01H01> at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD4002FYYZ-01B7CB1 01.01M03> at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD4002FYYZ-01B7CB0 01.01M02> at scbus3 target 0 lun 0 (pass3,ada3)
<WDC WD4000FYYZ-01UL1B3 01.01K04> at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD4003FRYZ-01F0DB0 01.01H01> at scbus5 target 0 lun 0 (pass5,ada5)
<WDC WD6003FRYZ-01F0DB0 01.01H01> at scbus6 target 0 lun 0 (pass6,ada6)
<PLEXTOR DVDR PX-891SA 1.04> at scbus7 target 0 lun 0 (pass7,cd0)
<AHCI SGPIO Enclosure 2.00 0001> at scbus8 target 0 lun 0 (pass8,ses0)
<WDC WUH721816AL5204 C232> at scbus12 target 4 lun 0 (pass9,da0)
<WDC WUH721816AL5204 C680> at scbus12 target 5 lun 0 (pass10,da1)
My PSU is a Corsair HX1200. I used to have a Thermaltake 1500, which is much worse than the Corsair.
 
Thanks all for the ideas.

So, I have been running with the scope connected, and despite all the drive detachments there has not been a single voltage drop on either the 12V or 5V bus.

As such, it looks like it is not PSU-related. My machine is on a UPS, so it has as clean a power input as I can manage at home.

As for drop-offs, since my last update:

Code:
Total drive removals: 733
Breakdown by drive:
 51 ada1p1
  1 ada3p1
681 ada4p1

I noticed two things:

1. While, given enough time, all four drives will probably detach, it is not evenly distributed; one drive detaches a lot more than the others.
2. The detachments are correlated with I/O on the pool. If the pool is idle or lightly used it is fine, while the higher the utilisation, the faster the drives drop off.

I am beginning to suspect that this may be a very odd failure of a single channel of the controller, which may be causing some kind of interference on the other channels.

I am currently backing up the pool, and once that is done I will disconnect the ada4 channel, then run the pool degraded and see if the other drop-offs stop. I will keep the drive powered up so that the power draw stays roughly the same.
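
(The plan is just to offline that vdev cleanly before pulling its data cable; something like the following, assuming the pool is still named "storage":)

Code:
# sketch -- pool and vdev names assumed from earlier output
zpool offline storage ada4p1
# and later, to bring it back in and resilver:
zpool online storage ada4p1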

If the faults go away, then it is a good sign that I need a new controller.
 
So it's not the power supply, because of the oscilloscope readings. It's not the disks (or things associated with the disks), because it hits all disks, which is statistically highly unlikely. It is not the data cables, because you already replaced them. It could be the power cables, or any other component you haven't replaced yet. Or the painful hypothesis: it could be any one component (like a faulty disk), combined with bad error handling anywhere in the driver stack. For example, a low-level communication error that leaves the NEXT I/O vulnerable, and the next I/O is often not on the same drive.

The intriguing thing is that it is correlated with I/O activity. Crazy idea: write a small piece of code that makes the disks insanely busy, for example a small program that reads a 4K sector at a random place on the disk 100 times per second; this will use up roughly 100% of the disk's seek capacity. Now run this program in various combinations: by itself (while the normal workload is completely quiesced) on one disk, multiple copies with a quiesced workload on all spinning disks, and then one or multiple copies while the normal workload (even a scrub) is progressing. Obviously this takes effort (an hour to write the program, hours to perform the tests), and it disrupts the workload. Maybe this would be capable of elevating the problem to the level where it becomes easy to diagnose.
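
(A crude sh sketch of that idea; the device node and block count are examples, and on a spinning disk each iteration is one seek plus a dd invocation, so the loop naturally sits near that ~100/s rate without explicit pacing:)

Code:
#!/bin/sh
# random-read hammer -- sketch only; adjust DEV and BLOCKS to the disk under test
DEV=/dev/ada4
BLOCKS=2441609216      # number of 4K blocks on a 10TB drive (19532873728 * 512 / 4096)
while :; do
    # 32-bit random number from the kernel, reduced to a valid 4K block index
    R=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
    dd if="$DEV" of=/dev/null bs=4k count=1 skip=$((R % BLOCKS)) 2>/dev/null
done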
 
Ah, I forgot to mention, I replaced the power cables as well. The backplane provides power to the drives, so when I bypassed the backplane, I used plain SATA power cables direct to the drives for testing.

I could write a program to stress-test the drives (or find one that already exists; I seem to remember Solaris had such a utility back in the day, and I am sure someone has written such a thing for other OSes). However, so far just doing the backup has confirmed it.

When the backup is doing a zfs dump to the backup disk, the drives are dropping off the array. When it switches to verifying the backup (zfs restore to /dev/null), the array is idle, and no drives drop off. The zfs dump is sustained streaming I/O, and that is enough to cause problems.
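
(For anyone wanting to reproduce the load: a zfs send of a recursive snapshot to a file on the backup disk gives the same kind of sustained streaming reads; the pool, snapshot, and path names below are placeholders.)

Code:
# sketch -- names are placeholders
zfs snapshot -r storage@backup
zfs send -R storage@backup > /mnt/backup/storage.zfs
# the verify pass only reads the backup disk, leaving the pool idle:
zstream dump /mnt/backup/storage.zfs > /dev/null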

Unfortunately, during the backup I managed to lose three drives and ended up with the pool being suspended, so I spent the weekend trying to recover the pool.

One thing I did this time is interleave the internal SATA controller with the HBA. Now two channels of the "storage" pool are on the HBA, and two channels of the "storagefast" pool are on the internal SATA controller.

I rebuilt the array with three drives, and it resilvered overnight with no errors at all. Then this morning I re-added "ada4", and shortly into the resilver one of the drives on the HBA dropped off.

Not sure what to make of this. If ada4 is the channel causing problems, I would expect it to only cause problems on its own controller, not affect the HBA (which has run the SSDs for years without a single hiccup).

One good thing, though: the HBA is more verbose with its error messages. While the drives would just detach without error on the internal SATA controller, here I actually got some error messages:

Code:
(da6:mps0:0:43:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:43:0): Retrying command (per sense data)
(da6:mps0:0:43:0): READ(16). CDB: 88 00 00 00 00 04 8c 3f fa 38 00 00 00 10 00 00 
(da6:mps0:0:43:0): CAM status: SCSI Status Error
(da6:mps0:0:43:0): SCSI status: Check Condition
(da6:mps0:0:43:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:43:0): Error 5, Retries exhausted
(da6:mps0:0:43:0): READ(16). CDB: 88 00 00 00 00 04 8c 3f fc 38 00 00 00 10 00 00 
(da6:mps0:0:43:0): CAM status: SCSI Status Error
(da6:mps0:0:43:0): SCSI status: Check Condition
(da6:mps0:0:43:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:43:0): Error 5, Retries exhausted
(da6:mps0:0:43:0): READ(6). CDB: 08 00 02 38 10 00 
(da6:mps0:0:43:0): CAM status: SCSI Status Error
(da6:mps0:0:43:0): SCSI status: Check Condition
(da6:mps0:0:43:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps0:0:43:0): Retrying command (per sense data)


Unfortunately as far as I can see, it does not give a clue as to the underlying fault, just "cause not reportable", but at least I am seeing some error messages now.
 
What HBA are you using anyway?
IIRC, with SAS2 HBAs the interoperability with SATA was way more unstable/uncertain than with newer ones. The same goes for old SATA drives, which often don't play well over SAS, especially when connected through a port extender.
Given that the drives *only* disconnect under load, I'd still try replacing the PSU (ideally with a proper server-rated one or at least some single-rail design) and also closely monitor the HBA temperatures, or just replace it. I've only had a failing HBA once, and it was only throwing I/O errors/timeouts on single drives while all the others were fine. HBAs hide lots of things (including errors) behind firmware - not as bad as RAID controllers, but they still try to lie as much as possible to the OS...
 
Given that the drives *only* disconnect under load, I'd still try replacing the PSU (ideally with a proper server-rated one or at least some single-rail design) and also closely monitor the HBA temperatures or just replace it.
Sadly, those are pretty much the only options left: replace the power supply, replace the HBA, and see what happens.

Other people are capable of running lots of disks under heavy load, so it must be something specific to your system. And given that it affects many disks, it's probably not the individual disks.
 
Your issue may have something to do with ZFS in conjunction with the HBA. Why is it necessary? The RAID array can be safely assembled in software alone.
 
What HBA are you using anyway?
IIRC, with SAS2 HBAs the interoperability with SATA was way more unstable/uncertain than with newer ones. The same goes for old SATA drives, which often don't play well over SAS, especially when connected through a port extender.
Given that the drives *only* disconnect under load, I'd still try replacing the PSU (ideally with a proper server-rated one or at least some single-rail design) and also closely monitor the HBA temperatures, or just replace it. I've only had a failing HBA once, and it was only throwing I/O errors/timeouts on single drives while all the others were fine. HBAs hide lots of things (including errors) behind firmware - not as bad as RAID controllers, but they still try to lie as much as possible to the OS...

I am using an LSI SAS2008, and I can say it has worked flawlessly ever since I installed it. The drives dropping off have all been on the internal SATA controller (not sure what chip it uses; dmidecode doesn't specify, and it is integrated into the motherboard).

As a final test I swapped two channels between the HBA and the internal SATA controller to see what happens, and I had two drives drop off (one internal, one on the HBA), except the HBA one actually gave me errors (see above).

I have no objection to replacing the PSU if that is what it takes. What PSUs are in standard ATX format, but "server rated"?
Most PSUs of that type I know of are proprietary in pinout and/or dimensions for specific vendor servers. The best PSUs I could find are "extreme gamer" types which provide high currents, but may not be single-rail.

Sadly, that's pretty much the only options left: Replace power supply, replace HBA. See what happens.

Other people are capable of running lots of disks under heavy load, so it must be something specific to your system. And given that it affects many disks, it's probably not the individual disks.

I agree, that is pretty much all I have left, and of course this is not a common issue; I am sure most people can run this setup (and more) without problems.

I've now bought a new 16-port HBA (an LSI 9201-16i) to replace the 8-port one, and then all the drives will be connected to it. Arrival is expected sometime by the end of July, so I will have to wait until then.
 
Your issue may have something to do with ZFS in conjunction with the HBA. Why is it necessary? The RAID array can be safely assembled in software alone.

An HBA is not a RAID card. It is just a plain controller, giving you a bunch of channels to attach disks to. All my disks are given to ZFS as raw drives (raidz2 on the hard drives, striped raidz1 on the SSDs).
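
(For concreteness, the two layouts are roughly what you would get from commands like these; the device names are illustrative:)

Code:
# sketch -- device names are placeholders
zpool create storage raidz2 ada1 ada2 ada3 ada4
zpool create storagefast raidz1 da0 da1 da2 da3 raidz1 da4 da5 da6 da7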

I have to comment on how impressive ZFS is. Despite these ongoing problems for over a year, including multiple pool failures, I have yet to lose any data. Every time, it was able to recover everything without restoring from backup. I can't think of any other RAID system I have ever used that showed this much resiliency in the face of flaky hardware.
 
Well, I've finally received the new HBA card, so I will try to install it and see if things improve.
 
use enterprise grade disk
I believe this excludes Western Digital by default. By now, they have established a pretty good history of doing things that screw over their customers.


Sure, almost all manufacturers have their share of bad stories thanks to supply chain issues and such, but usually they handle their communications better IMO. I personally won't touch a WD drive any time soon. Not even with a ten foot pole.
 