ZFS Random drive detachments from host

Unixnut, perhaps your storage controller is overheating?
You can check its temperature with mpsutil show adapter
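
For context, mpsutil(8) talks to mps(4)-driven LSI/Broadcom SAS2 HBAs (mprutil(8) is the equivalent for mpr(4) SAS3 controllers), and whether a temperature is actually reported depends on the controller firmware. A quick look at the adapter and the drives it currently sees is roughly:

  # adapter details, including firmware version and (if supported) temperature
  mpsutil show adapter
  # list the devices the HBA can currently see, handy when a drive drops off
  mpsutil show devices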

Thanks! That is a very useful tool. Unfortunately it reports "Temperature: Unknown/Unsupported" for me.

I had originally thought that the controller might be overheating, so I added a fan to its heatsink for active cooling, but it made no difference.

It does seem unlikely, though, that it would take two weeks for the controller to overheat to the point where it malfunctions, but the lack of actual temperature monitoring does make this a valid area for further investigation. I will see if I can attach an external temperature probe to the SAS card's heatsink and see what it says.
 
Well, a quick update. A run finally completed with the reduced number of CPUs in use, so I now know it takes approximately a month to finish the calculations.

I then re-tried the calculations, this time using all 32 CPUs but setting low priority with idprio.
However, this did not help: within 24 hours three drives had dropped off and the ZFS pool was suspended, so I had to reboot.

It seems to me that lowering the priority does not help the situation.

I now await the temperature probe I bought. Once received, I will check the temperature of the HBA under load and update here.
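
For anyone following along, a low-priority run of this kind looks something like the sketch below. The job name is a placeholder; idprio(1) takes an idle-class priority from 0 to 31, and idle-class processes only get CPU time when nothing in the normal classes wants it.

  # run the compute job (placeholder name) in the idle scheduling class
  idprio 31 ./compute_job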
 
So, the probe arrived today. I hooked it up to the heatsink of the HBA and ran the system load up to 48 in order to trigger failures as quickly as possible.
First drive (a hard disk) dropped off after 14 mins with the HBA heatsink temperature at 45.7°C.
Second drive (an SSD) dropped off after 17 mins with the HBA heatsink temperature at 47.3°C.
Third drive (a hard disk) dropped off after 23 mins with the HBA heatsink temperature at 48.1°C.
Fourth drive (an SSD) dropped off after 26 mins with the HBA heatsink temperature at 49.5°C.

I kept the system running at full tilt for two hours, up to the point where one more drive failure would have resulted in the loss of a zpool. The maximum temperature on the HBA was 50.5°C (hovering between 47°C and 50°C), and the maximum CPU temperature was 74°C.
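
As an aside, pool health during a stress run like this can be watched with something along these lines (the polling interval is arbitrary); zpool status -x only prints pools that are degraded, faulted, or otherwise unhealthy:

  # poll for unhealthy pools once a minute
  while true; do zpool status -x; sleep 60; done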

Unfortunately, the system temperature is not available on FreeBSD (unless someone knows a command?), but the SSDs were between 39°C and 44°C.
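
For reference, the per-drive temperatures quoted above can be read with smartmontools; the device name below is an example and the attribute name varies by vendor:

  # report SMART attributes for the first SATA disk and pick out the temperature
  smartctl -A /dev/ada0 | grep -i temperature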

I hope this helps. The HBA is a bit warm, but nothing excessive in my opinion. FYI, this issue has been seen on two HBAs and on the internal SATA controller so far.
 
So, to test the overheating hypothesis a bit more, I took the cover off the server; temps dropped around 10°C on both the HBA and the CPU.

I then ran a backup job and a bunch of other tasks to generate I/O and high load on the machine. Within 48 hours, two SSDs dropped off.

I then put the cover back on; temps increased, but there was no increase in dropped drives while idle. When I simulated load again, another drive (an HDD) dropped off.

From my observations, whatever the issue is, it responds to an increase in system load more than to an increase in system temperature.
 
I've seen this exact problem on systems that were marginal on power. Some of that batch had it; some did not.

I started on some code to not actually detach the drives for a second or two, on the off chance they'd come back quickly, but the bad batch was retired before I was done.

At least I fixed the panics that would go along with these events.
 
Nope, that was replaced a while ago.

At this point the only things I've not replaced are the motherboard/CPU and the OS itself.
 
Our bad batch tested good, but was just marginal enough... and it had lots of confounding issues: a marginal PSU, marginal cables, and bad thermals, which all acted together to cause very short brownouts that the drives hated...
 
Yes, power issues were suspected earlier on, however:

Interestingly, I have had no panics at all. Apart from the drives dropping off, FreeBSD has been rock solid, computing away for weeks at a time under high load without issue. The only problem is that when enough drives drop off, the zpool gets suspended and I need to reboot, but the OS itself does not show any problems beyond the printed errors.

Nor do I get any CPU errors or segfaults, which I would expect if there were a power issue. Normally, at the margins of power draw, I would expect all kinds of odd errors/panics as different parts of the system are affected at different times by the brownouts. Earlier, I did voltage tests and saw none of the drops on the drives' 12V and 5V lines that I would expect to see in a brownout situation.

The failure modes are interesting as well. Sometimes the drives just drop off and I can re-attach them; other times an entire channel seems to get into a confused state, and nothing except a hard power-off/wait/power-on cycle will reset it.
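
For context, "re-attaching" in the easy case is usually a sequence along these lines; the pool and device names are placeholders, and the exact steps used here aren't spelled out in the thread:

  # ask CAM to rescan the buses so the device node comes back
  camcontrol rescan all
  # bring the device back into the pool and clear the accumulated error counters
  zpool online tank da3
  zpool clear tank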

This happened during the last backup, where it trashed the backup drive while it was in the caddy. I took advantage of this and tried all kinds of other drives in the caddy, each of which would produce I/O errors on FreeBSD if I tried to access them. Even a soft reboot did not fix that, hence the hard reset being required.

At this point, as mentioned, there are only two things I've not changed. I am currently taking advantage of the working pools to do a full backup, after which I may try installing Linux on this machine, importing the zpools, and then thrashing the machine for testing. If drives drop off there too, then we can rule out an issue with FreeBSD.
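
For the record, moving the pools to a Linux install is normally just an export/import, provided the OpenZFS version there supports the pool's feature flags. The pool name below is a placeholder:

  # on FreeBSD, before shutting down
  zpool export tank
  # on Linux: scan for importable pools, then import by name
  zpool import
  zpool import tank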
 