ZFS Random drive detachments from host

Unixnut, perhaps your storage controller is overheating?
You can check its temperature with mpsutil show adapter

Thanks! That is a very useful tool. Unfortunately it reports "Temperature: Unknown/Unsupported" for me.

I had originally thought that the controller might be overheating, so I added a fan for active cooling to its heatsink but it made no difference.

It does seem unlikely though that it would take 2 weeks for the controller to overheat to the point where it malfunctions, but the lack of actual temperature monitoring does make this a valid area of further investigation. I will see if I can attach an external temperature probe to the SAS card's heatsink and see what it says.
 
Well, a quick update. I had a run finally complete with the reduced number of CPUs in use. I now know it takes approx a month to finish the calculations.

I then re-tried the calculations, but using all 32 CPUs and setting low priority with idprio.
However, this did not help. Within 24 hours three drives had dropped off and the ZFS pool was suspended, so I had to reboot.
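
For reference, "low priority" here means the compute jobs were launched under idprio, roughly as in the sketch below (the job command is just a placeholder):

Code:
# Run the compute job in the idle scheduling class at the lowest priority (31),
# so normal system activity and I/O should always be able to preempt it.
idprio 31 ./compute_job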

It seems to me that setting a lower priority does not help the situation.

I now await the temperature probe I bought. Once received, I will check the temperature of the HBA under load and update here.
 
Unixnut, perhaps your storage controller is overheating?
You can check its temperature with mpsutil show adapter
So, the probe arrived today. I hooked it up to the heatsink of the HBA and ran the load on the system up to 48 in order to trigger failures as fast as possible.
First drive (a hard disk) dropped off after 14 mins with the HBA heatsink temperature at 45.7°C.
Second drive (an SSD) dropped off after 17 mins with the HBA heatsink temperature at 47.3°C.
Third drive (a hard disk) dropped off after 23 mins with the HBA heatsink temperature at 48.1°C.
Fourth drive (an SSD) dropped off after 26 mins with the HBA heatsink temperature at 49.5°C.

I kept the system running at full tilt for two hours, until the point when one more drive failure would result in loss of a zpool. Maximum temperature on the HBA was 50.5°C (hovering between 50°C and 47°C). Max CPU temp was 74°C.

Unfortunately the system temp is not available on FreeBSD (unless someone knows a command?) but the SSDs were between 39°C and 44°C.
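
(For anyone wanting to reproduce these readings: per-core CPU temperatures and drive temperatures can usually be read as sketched below, assuming the coretemp(4) or amdtemp(4) module is loaded and smartmontools is installed; the device name is a placeholder. This still does not give a motherboard/system sensor reading.)

Code:
# CPU core temperature (load amdtemp instead on AMD systems).
kldload coretemp
sysctl dev.cpu.0.temperature
# Drive temperature via SMART (sysutils/smartmontools).
smartctl -a /dev/da1 | grep -i temperature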

I hope this helps. The HBA is a bit warm but nothing excessive in my opinion. FYI this issue has been seen on two HBAs and on the internal SATA controller so far.
 
So, to test the overheating hypothesis a bit more, I took the cover off the server; temperatures dropped by around 10°C on both the HBA and the CPU.

I then ran a backup job and a bunch of other tasks to provide IO and high load on the machine. Within 48 hours two SSDs dropped off.

I then put the cover back on; temperatures increased, but there was no increase in dropped drives while idle. When I simulated load again, another drive (an HDD) dropped off.

From my observations, whatever the issue is, it responds to an increase in system load more than to an increase in system temperature.
 
I've seen this exact problem on systems that were marginal in power. Some of that batch had it, some did not.

I started on some code to not actually detach the drives for a second or two on the off chance they'd come back quickly. But the bad batch was retired before I was done.

At least I fixed the panics that would go along with these events.
 
Nope, that was replaced a while ago.

At this point the only things I've not replaced are the Motherboard/CPU and the OS itself.
Our bad batch tested good, but was just marginal enough... but the bad batch had lots of confounding issues: a marginal PSU, marginal cables and bad thermals, which all acted together to cause very short brownouts that the drives hated...
 
I've seen this exact problem on systems that were marginal in power. Some of that batch had it, some did not.

I started on some code to not actually detach the drives for a second or two on the off chance they'd come back quickly. But the bad batch was retired before I was done.

At least I fixed the panics that would go along with these events.
Our bad batch tested good, but was just marginal enough... but the bad batch had lots of confounding issues: a marginal PSU, marginal cables and bad thermals, which all acted together to cause very short brownouts that the drives hated...
Yes, power issues were suspected earlier on, however:

Interestingly, I have had no panics at all. Apart from the drives dropping off, FreeBSD has been rock solid, computing away for weeks at a time at high load without issue. The only problem is that when enough drives drop off, the zpool gets suspended and I need to reboot, but the OS itself does not show any problems beyond the printed errors.

Neither do I get any CPU errors or segfaults, which I would expect to happen if there were a power issue. Normally at the margins of power draw I would expect all kinds of odd errors/panics, as different parts of the system get affected at different times by the brownouts. Earlier I did voltage tests and saw no drop in voltage across the drives' 12V and 5V lines of the kind I would expect to see in a brownout situation.

The failure modes are interesting as well. Sometimes the drives just drop off and I can re-attach them. Other times an entire channel seems to get into a confused state and nothing except a hard turn-off/wait/turn-on cycle will reset them.
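
When re-attaching does work, it is roughly the sequence sketched below (pool and device names are placeholders):

Code:
# Ask CAM to re-probe the buses so the dropped disk shows up again as a da device.
camcontrol rescan all
# Bring the device back into the pool; ZFS will resilver it.
zpool online mypool da5
zpool status mypool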

This happened during the last backup, where it trashed the backup drive while it was in the caddy. I took advantage of this and tried all kinds of other drives in the caddy, each of which would result in I/O errors on FreeBSD when I tried to access them. Even a soft reboot did not fix that, hence the hard reset was required.

At this point I only have two things I've not changed as mentioned. I am currently taking advantage of the working pools to do a full backup, after which I may try installing Linux on this machine, importing the zpools and then thrashing the machine for testing. If drives drop off there too then we can rule out an issue with FreeBSD.
 
Ok, so since the last update I have tried both a different motherboard (with no change in results) and installing Linux on the machine; however, Linux turned out to be of limited use. The ZFS versions were different, so I could not import the pools. In the end all I did was some random IO testing and CPU calculations. I saw similar errors in dmesg, but no drives dropped off the bus like they do under FreeBSD.

However I have to say it was not a proper like-for-like comparison, as I could not load the ZFS arrays or execute the same jobs as on the FreeBSD machine.

So I moved to the next thing, which was another complete disassembly and reassembly of the machine. I cleaned all the contact pins, fitted new fans throughout, checked everything, re-wired the power feeds to the drives, and reinstalled FreeBSD and rebuilt the "storagefast" array as a raid-z3 for extra redundancy.
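
For reference, the rebuilt pool is a raidz3 layout created roughly as sketched below (assuming a single vdev; the da device names are placeholders for the drives in the array):

Code:
# Illustrative only -- a single raidz3 vdev tolerates the loss of any three members.
zpool create storagefast raidz3 da1 da2 da3 da4 da5 da6 da7 da8
zpool status storagefast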

With the clean install, the drives no longer randomly fall out of the zpools. Instead the entire pool hangs while "Retrying command" and similar errors fill up my message logs. This happens every minute or so, greatly slowing down the array. Unlike the earlier errors, which occurred at high load, this now happens all the time regardless of load.

As it was always the same drive showing the error, I decided to manually offline that drive to see if the problem went away. The problem did go away, but then reappeared on another drive that had shown no errors before. I then offlined that drive, only for the errors to move to a third drive that had also shown no errors before. At this point I can't offline any more drives; ZFS won't let me, as there would not be enough drives left for the pool to keep functioning.
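
The offlining itself is just the standard zpool operation, along the lines below (pool and device names are placeholders):

Code:
# Take the suspect drive out of service; the pool keeps running degraded.
zpool offline storagefast da1
# And to bring it back later once it looks healthy again:
zpool online storagefast da1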

I am scratching my head here, because I have never seen what appears to be a hardware fault jump around as if it were a software fault. I have tried different SATA channels and cables, but it seems the errors just move around all the time.

The errors are the same as before, with the output below repeating endlessly:
Code:
mps0: Controller reported scsi ioc terminated tgt 19 SMID 1711 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fd c8 00 00 38 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fe 80 00 00 50 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 07 59 1a f8 00 00 a0 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 19 SMID 1271 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 20 00 00 b0 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 b0 00 00 08 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c6 20 00 00 f8 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
(da1:mps0:0:19:0): READ(10). CDB: 28 00 17 5f 65 90 00 00 08 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:19:0): Retrying command (per sense data)
 
I've had similar issues for a while too. I replaced the controller as well as the cables.

I also paid a bit more attention towards the cable path because I've got multiple fans on the side of the housing which may have also added interference.

Sometimes it's a combination of multiple things, unfortunately.
 
I've had similar issues for a while too. I replaced the controller as well as the cables.

I also paid a bit more attention towards the cable path because I've got multiple fans on the side of the housing which may have also added interference.

Sometimes it's a combination of multiple things, unfortunately.

Yes, I've gone down the route of changing the controller and cables, as well as the drives, PSU and motherboard, and I have removed the backplane.

Interesting point you mention about the fans. I noticed when I tested with the other motherboard that the drop-offs had reduced, only to increase again after I reassembled the case. I suspected that perhaps the fans (powerful ball-bearing ones) were causing some RFI.

It was tricky to prove, however, because pushing the machine to its limits with no fans meant it shut down due to overheating before any drives would drop off.

However, in the latest rebuild I replaced all the ball-bearing fans with normal PC fans, which are smaller and quieter. I had to cut large holes into the case and reduce the number of drive slots (from 8 to 4 3.5" slots) to fit bigger fans, in order to keep enough airflow going through the case. Temperatures are actually 5-10°C lower across the board now that I've de-restricted the internal airflow.

Unfortunately it has not resolved the problem, but the drop-outs have reduced to about 1-2 per day. I have a script that auto re-adds drives where possible, so as to extend the time before pools get suspended, but I have yet to find a root cause.
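
A minimal sketch of that kind of auto re-add loop is below; it is not the exact script, and the pool name and interval are placeholders:

Code:
#!/bin/sh
# Periodically try to bring dropped devices back into the pool(s).
POOLS="storagefast"          # adjust to the pools being watched
while :; do
    # Re-probe the buses so any dropped disk can reappear as a da device.
    camcontrol rescan all >/dev/null 2>&1
    for pool in $POOLS; do
        # Find vdev members that ZFS reports as dropped or unavailable.
        zpool status "$pool" | \
            awk '$2 ~ /REMOVED|UNAVAIL|OFFLINE|FAULTED/ {print $1}' | \
            while read -r dev; do
                logger -t auto-readd "trying to online $dev in $pool"
                zpool online "$pool" "$dev"
            done
    done
    sleep 60
done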

As you say, it may well be a combination of multiple things, but I've just noticed we're coming up on the one-year anniversary of my opening this thread (the problem itself is older than that now), and quite frankly I'm running out of ideas on what else to try.
 
Unfortunately it did not last. Things got progressively worse until I could not go a full day without 4 drives dropping off the SSD array and the pool suspending.

So I decided to scrap the 8-drive SSD array. I replaced it with 4x 1TB WD Red HDDs in raidz1. That only lasted 2 days before I lost the array to the same issue. So at least I know it is not related to the SSDs themselves.

Having given it more thought, I realised that while I have bought a new HBA to replace the old one as part of debugging, they are both from the same manufacturer (LSI). Perhaps there is something about the LSI controllers that doesn't like interfacing with this motherboard/UEFI BIOS?

So to test this I've bought a generic 6-port SATA card. Coupled with the 6 ports on the motherboard, I have 12 in total, enough to cover the new, reduced-size arrays (8 drives total). I will give that a try and see if things improve.
 
Perhaps there is something about the LSI controllers that doesn't like interfacing with this motherboard/UEFI BIOS?
That sounds very unlikely. Once the OS is running, the UEFI and BIOS are out of the way, and you only have the following moving parts: the CPU (Intel or AMD), the PCI bus (not very much can go wrong here), the OS and its drivers (we know FreeBSD and trust it), and the LSI card itself.

BUT ...

So to test this I've bought a generic 6-port SATA card. Coupled with the 6 ports on the motherboard, I have 12 in total, enough to cover the new, reduced-size arrays (8 drives total). I will give that a try and see if things improve.
Given that we don't have a plausible explanation for your problems, changing some arbitrary part and seeing whether it helps is the best idea for now.
 