Solved Random ZFS drive detachments from host

Andriy · Feb 23, 2024

Unixnut perhaps your storage controller overheats?
You can check its temperature with mpsutil show adapter

Unixnut · Feb 23, 2024

Andriy said:
Unixnut perhaps your storage controller overheats?
You can check its temperature with mpsutil show adapter

Thanks! That is a very useful tool. Unfortunately it reports "Temperature: Unknown/Unsupported" for me.
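
For anyone wanting to log this over time, here is a minimal sketch of wrapping the check in a script. It assumes only what is shown in this thread: that `mpsutil show adapter` prints a `Temperature:` line and that the value reads "Unknown/Unsupported" when the card has no sensor. The function names are made up for illustration, and the format of a real reading is not assumed, so the raw text is returned as-is.

```python
import subprocess


def parse_adapter_temperature(report):
    """Return the text after 'Temperature:', or None if absent or unsupported."""
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Temperature":
            value = value.strip()
            # The thread shows "Unknown/Unsupported" when the HBA has no sensor.
            if not value or value.startswith("Unknown"):
                return None
            return value
    return None


def read_adapter_temperature():
    """Run `mpsutil show adapter` (FreeBSD, typically as root) and parse it."""
    result = subprocess.run(
        ["mpsutil", "show", "adapter"],
        capture_output=True, text=True, check=True,
    )
    return parse_adapter_temperature(result.stdout)


if __name__ == "__main__":
    reading = read_adapter_temperature()
    print(reading if reading is not None else "no temperature sensor reported")
```

A `None` result here corresponds to the "Unknown/Unsupported" case, which is when an external probe is the only option.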

I had originally thought that the controller might be overheating, so I added a fan for active cooling to its heatsink but it made no difference.

It does seem unlikely though that it would take 2 weeks for the controller to overheat to the point where it malfunctions, but the lack of actual temperature monitoring does make this a valid area of further investigation. I will see if I can attach an external temperature probe to the SAS card's heatsink and see what it says.

Unixnut · Mar 5, 2024

Well, a quick update. I had a run finally complete with the reduced number of CPUs in use. I now know it takes approx a month to finish the calculations.

I then re-tried the calculations, but using all 32 CPUs and setting low priority with idprio.
However this did not help. Within 24 hours three drives had dropped off and the ZFS pool was suspended, so I had to reboot.

It seems to me setting priority does not help the situation.

I now await the temperature probe I bought. Once received I will check the temperature of the HBA when under load and update here.

Unixnut · Mar 5, 2024

Andriy said:
Unixnut perhaps your storage controller overheats?
You can check its temperature with mpsutil show adapter

So, the probe arrived today. I hooked it up to the heatsink of the HBA and ran the load on the system up to 48 in order to trigger failures as fast as possible.
First drive (a hard disk) dropped off after 14 mins with the HBA heatsink temperature at 45.7°C.
Second drive (an SSD) dropped off after 17 mins with the HBA heatsink temperature at 47.3°C
Third drive (a hard disk) dropped off after 23 mins with the HBA heatsink temperature at 48.1°C
Fourth drive (an SSD) dropped off after 26 mins with the HBA heatsink temperature at 49.5°C

I kept the system running at full tilt for two hours, until the point when one more drive failure would result in loss of a zpool. Maximum temperature on the HBA was 50.5°C (hovering between 50°C and 47°C). Max CPU temp was 74°C.

Unfortunately the system temp is not available on FreeBSD (unless someone knows a command?) but the SSDs were between 39°C and 44°C.

I hope this helps. The HBA is a bit warm but nothing excessive in my opinion. FYI this issue has been seen on two HBAs and on the internal SATA controller so far.

Unixnut · Mar 13, 2024

So to test the overheating hypothesis a bit more, I took the cover off the server, temps dropped around 10°C on the HBA and CPU.

I then ran a backup job and a bunch of other tasks to provide IO and high load on the machine. Within 48 hours two SSDs dropped off.

I then put the cover back on, temps increased but there was no increase in dropped drives when idle. When I simulated load again another drive (HDD) dropped off.

From my observations, whatever the issue is it responds to an increase in system load more than it does to an increase in system temperature.

sko · Mar 13, 2024

are you still using that desktop PSU?

Unixnut · Mar 13, 2024

sko said:
are you still using that desktop PSU?

Nope, that was replaced a while ago.

At this point the only things I've not replaced are the Motherboard/CPU and the OS itself.

bsdimp · Mar 15, 2024

I've seen this exact problem on systems that were marginal in power. Some of that batch had it some did not.

I started on some code to not actually detach the drives for a second or two on the off chance they'd come back quickly. But the bad batch was retired before i was done.

At least i fixed the panics that would go along with these events

bsdimp · Mar 15, 2024

Unixnut said:
Nope, that was replaced a while ago.

At this point the only things I've not replaced are the Motherboard/CPU and the OS itself.

Our bad batch tested good, but was just marginal enough... but the bad batch had lots of confounding issues: marginal psu, marginal cables and bad thermals which all acted together to cause very short brown outs that the drives hated...

Unixnut · Mar 15, 2024

bsdimp said:
I've seen this exact problem on systems that were marginal in power. Some of that batch had it some did not.

I started on some code to not actually detach the drives for a second or two on the off chance they'd come back quickly. But the bad batch was retired before i was done.

At least i fixed the panics that would go along with these events

Our bad batch tested good, but was just marginal enough... but the bad batch had lots of confounding issues: marginal psu, marginal cables and bad thermals which all acted together to cause very short brown outs that the drives hated...

Yes, power issues were suspected earlier on, however:

Interestingly I have had no panics at all. Apart from the drives dropping off FreeBSD has been rock solid, computing away for weeks at a time at high load without issue. The only problem is that when enough drives drop off the zpool gets suspended and I need to do a reboot, but the OS itself does not show any problems beyond the printed errors.

Neither do I get any CPU errors or segfaults, which I would expect to happen if there was a power issue. Normally at the margins of power draw I would expect all kinds of odd errors/panics, as different parts of the system get affected at different times by the brownouts. Earlier I did voltage tests and saw no drop in voltages across the drives' 12V and 5V lines that I would expect to see in a brownout situation.

The failure modes are interesting as well. Sometimes the drives just drop off and I can re-attach them. Other times an entire channel seems to get into a confused state and nothing except a hard turn-off/wait/turn-on cycle will reset them.

This happened during the last backup, where it trashed the backup drive while in the caddy. I took advantage of this and tried all kinds of other drives in the caddy, each of which would result in I/O errors on FreeBSD if I tried to access them. Even a soft reboot did not fix that, hence the hard reset required to fix.

At this point I only have two things I've not changed as mentioned. I am currently taking advantage of the working pools to do a full backup, after which I may try installing Linux on this machine, importing the zpools and then thrashing the machine for testing. If drives drop off there too then we can rule out an issue with FreeBSD.

Unixnut · May 6, 2024

Ok, so since the last update I have tried both a different motherboard (with no change in results) and installing Linux on the machine. However, Linux turned out to be of limited use: the ZFS versions were different, so I could not import the pools. In the end all I did was some random IO testing and CPU calculations. I saw similar errors in dmesg, but no drives dropping off the bus like in FreeBSD.

However I have to say it was not a proper like-for-like comparison, as I could not load the ZFS arrays or execute the same jobs as on the FreeBSD machine.

So I moved to the next thing, which was another complete disassembly and reassembly of the machine. I cleaned all the contact pins, fitted new fans throughout, checked everything, re-wired the power feeds to the drives, and reinstalled FreeBSD and rebuilt the "storagefast" array as a raid-z3 for extra redundancy.

With a clean install, now the drives no longer randomly fall out of the zpools. Instead the entire pool hangs, while I get the "error retrying command" and similar errors filling up my message logs. This happens every minute or so, greatly slowing down the array speed. Unlike the errors before which occurred at high load, this happens all the time now regardless of load.

As it was always the same drive that was showing the error, I decided to manually offline this drive to see if the problem went away. The problem did go away but then re-appeared on another drive that had no errors before. I then offlined that drive, only for the errors to move to a third drive that also showed no errors before. At this point I can't offline any more drives, ZFS won't let me as there are not enough drives left for the pool to keep functioning.

I am scratching my head here, because I have never seen what appears to be a HW fault jump around as if it was a software fault. I have tried different SATA channels and cables, but it seems the errors just move around all the time.

The errors are the same as before, the below on eternal repeat:

Code:

mps0: Controller reported scsi ioc terminated tgt 19 SMID 1711 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fd c8 00 00 38 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fe 80 00 00 50 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 07 59 1a f8 00 00 a0 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 19 SMID 1271 loginfo 31080000
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 20 00 00 b0 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c7 b0 00 00 08 00
(da1:mps0:0:19:0): CAM status: CCB request completed with an error
(da1:mps0:0:19:0): Retrying command, 3 more tries remain
(da1:mps0:0:19:0): READ(10). CDB: 28 00 13 da c6 20 00 00 f8 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:19:0): Retrying command (per sense data)
(da1:mps0:0:19:0): READ(10). CDB: 28 00 17 5f 65 90 00 00 08 00
(da1:mps0:0:19:0): CAM status: SCSI Status Error
(da1:mps0:0:19:0): SCSI status: Check Condition
(da1:mps0:0:19:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:19:0): Retrying command (per sense data)
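
Since the errors move from drive to drive, it can help to tally them per device rather than eyeball the log. Below is a rough sketch that does that for lines in the format shown above, and also decodes the READ(10) CDB bytes (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte block count in bytes 7-8). The function names and the `SAMPLE_LOG` list are illustrative; the sample lines are copied from the log above, and the regular expressions are only assumed to fit this particular message format.

```python
import re
from collections import Counter

# Matches lines such as "(da1:mps0:0:19:0): CAM status: SCSI Status Error".
CAM_STATUS = re.compile(r"^\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+)$")
# Matches the command line that precedes each status line.
READ10_CDB = re.compile(r"READ\(10\)\. CDB: ((?:[0-9a-fA-F]{2} ?){10})")

SAMPLE_LOG = [
    "(da1:mps0:0:19:0): READ(10). CDB: 28 00 12 66 fd c8 00 00 38 00",
    "(da1:mps0:0:19:0): CAM status: CCB request completed with an error",
    "(da1:mps0:0:19:0): READ(10). CDB: 28 00 07 59 1a f8 00 00 a0 00",
    "(da1:mps0:0:19:0): CAM status: SCSI Status Error",
]


def tally_cam_errors(lines):
    """Count (device, CAM status) pairs so that drifting faults stand out."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.match(line.strip())
        if match:
            device, _controller, status = match.groups()
            counts[(device, status)] += 1
    return counts


def decode_read10(line):
    """Return (lba, blocks) for a READ(10) CDB line, or None for other lines."""
    match = READ10_CDB.search(line)
    if not match:
        return None
    cdb = bytes.fromhex(match.group(1))
    if cdb[0] != 0x28:  # 0x28 is the READ(10) opcode
        return None
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


if __name__ == "__main__":
    for (device, status), count in sorted(tally_cam_errors(SAMPLE_LOG).items()):
        print(f"{device}: {count} x {status}")
    for line in SAMPLE_LOG:
        decoded = decode_read10(line)
        if decoded:
            print(f"READ(10) lba={decoded[0]} blocks={decoded[1]}")
```

Fed the whole of /var/log/messages, the tally would show whether the failing LBAs and devices cluster anywhere. In the excerpt above the reads are small (the first one is 0x0038 = 56 blocks) and spread across the disk, which fits a transport problem better than a bad region on one drive.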

cmoerz · May 10, 2024

I've had similar issues for a while too. I replaced the controller as well as the cables.

I also paid a bit more attention towards the cable path because I've got multiple fans on the side of the housing which may have also added interference.

Sometimes it's a combination of multiple things, unfortunately.

Unixnut · May 11, 2024

cmoerz said:
I've had similar issues for a while too. I replaced the controller as well as the cables.

I also paid a bit more attention towards the cable path because I've got multiple fans on the side of the housing which may have also added interference.

Sometimes it's a combination of multiple things, unfortunately.

Yes, I've gone down the route of changing the controller and cables, as well as the drives, PSU, motherboard and removed the backplane.

Interesting thing you mentioned about the fans. I noticed when I tested with the other motherboard that the drops had reduced, only to increase after I reassembled the case. I suspected that perhaps the fans (powerful ball bearing ones) were causing some RFI.

It was tricky to prove however because to push the machine to its limits with no fans meant it shut down due to overheating before any drives would drop off.

However in the latest rebuild, I replaced all the ball bearing fans with normal PC fans which are smaller and quieter. I had to cut large holes into the case and reduce the number of slots (from 8 to 4 3.5" slots) to fit bigger fans in order to keep enough airflow going through the case. Temperatures are actually 5-10°C lower across the board now that I've de-restricted the internal airflow.

Unfortunately it has not resolved the problem, but the drop outs have reduced to about 1-2 per day. I have a script that auto re-adds drives if possible so as to extend the period before I get pools suspended, however I have yet to find a root cause.
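
The actual re-add script is not shown in this thread, so the following is only a sketch of how such a script might look, not the one in use here. It assumes the usual `zpool status` layout (a `pool:` line, then a `NAME STATE ...` table) and uses `zpool online <pool> <device>` to bring a member back. The choice of which states count as "dropped", the function names, and the sample text are all assumptions made for illustration.

```python
import subprocess

# States treated as "dropped". OFFLINE is deliberately excluded so that drives
# taken offline by hand are left alone; this choice is an assumption.
DROPPED_STATES = {"REMOVED", "UNAVAIL"}
# Prefixes of vdev/group rows that are not individual disks.
GROUP_PREFIXES = ("raidz", "mirror", "draid", "spare", "replacing",
                  "logs", "cache", "special")

# Illustrative status text, not captured from the machine in this thread.
SAMPLE_STATUS = """\
  pool: storagefast
 state: DEGRADED
config:

    NAME          STATE     READ WRITE CKSUM
    storagefast   DEGRADED     0     0     0
      raidz3-0    DEGRADED     0     0     0
        da0       ONLINE       0     0     0
        da1       REMOVED      0     0     0
        da2       OFFLINE      0     0     0

errors: No known data errors
"""


def find_dropped(status_text):
    """Return [(pool, device), ...] for member disks in a dropped state."""
    dropped, pool, in_config = [], None, False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "pool:" and len(fields) > 1:
            pool, in_config = fields[1], False
        elif fields[0] == "errors:":
            in_config = False
        elif fields[:2] == ["NAME", "STATE"]:
            in_config = True  # device table starts on the next line
        elif in_config and len(fields) >= 2:
            name, state = fields[0], fields[1]
            if name == pool or name.startswith(GROUP_PREFIXES):
                continue  # pool and vdev rows are not disks
            if state in DROPPED_STATES:
                dropped.append((pool, name))
    return dropped


def readd_dropped(dry_run=True):
    """Try to bring dropped disks back online; only prints when dry_run is set."""
    text = subprocess.run(["zpool", "status"], capture_output=True,
                          text=True, check=True).stdout
    for pool, device in find_dropped(text):
        cmd = ["zpool", "online", pool, device]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)  # may fail if the disk is gone


if __name__ == "__main__":
    print(find_dropped(SAMPLE_STATUS))
```

Run from cron every few minutes, something like this only buys time before a pool suspends. It does nothing when the disk has vanished from the bus entirely, and it hides how often drops occur unless each re-add is logged.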

As you say, it may well be a combination of multiple things, but as I've just noticed we're coming up to the 1-year anniversary of me opening this thread (while the problem itself is older than 1 year now), and quite frankly I'm running out of ideas on what else to try.

Unixnut · May 20, 2024

Unfortunately it did not last. Things got progressively worse until I could not go a full day without 4 drives dropping off the SSD array and the pool suspending.

So I decided to scrap the 8 drive SSD array. I replaced it with 4x1TB WD Red HDDs in raidz1. That only lasted 2 days until I lost the array to the same issue. So at least I know it is not related to the SSDs themselves.

Having given it more thought, I realised that while I have bought a new HBA to replace the old one as part of debugging, they are both from the same manufacturer (LSI). Perhaps there is something about the LSI controllers that doesn't like interfacing with this motherboard/UEFI BIOS?

So to test I've bought a generic 6-port SATA card. Coupled with the 6-ports on the motherboard I have 12 in total, enough to cover the new reduced size arrays (8 drives total). I will give that a try and see if things improve.

ralphbsz · May 20, 2024

Unixnut said:
Perhaps there is something about the LSI controllers that doesn't like interfacing with this motherboard/UEFI BIOS?

That sounds very unlikely. Once the OS is running, the UEFI and BIOS are out of the way, and you only have the following moving parts: The CPU (Intel or AMD), the PCI bus (not very much can go wrong here), the OS and its drivers (we know FreeBSD and trust it), and the LSI card itself.

BUT ...

Unixnut said:
So to test I've bought a generic 6-port SATA card. Coupled with the 6-ports on the motherboard I have 12 in total, enough to cover the new reduced size arrays (8 drives total). I will give that a try and see if things improve.

Given that we don't have a plausible explanation for your problems, changing some arbitrary part and seeing whether it helps is the best idea for now.

Unixnut · Sep 10, 2024

Well, after my last post a couple of weeks later the 6-port SATA card arrived and I set about replacing the HBA with it.

To my surprise the machine booted up and all the problems went away. The reason I didn't reply since May is because I didn't believe it. So I wanted to be absolutely sure that the problem had gone away for good.

As such I hammered the machine non stop over the summer, where outside temps here exceed 40°C. I only had the machine die on two occasions, both of them when the CPU exceeded 105°C, which I think is an internal over-temp cutoff.

After 4 months of stress testing the machine remained rock solid. I've not had a single error message, and I've not had a single drive drop off any of the arrays. The performance of the drives is also indistinguishable from what I saw with the LSI HBAs, as I can get a sustained 1GB/s from the SSD array and around 600MB/s from the HDD array (random IOPS performance not tested though).

As such I am happy to mark this saga resolved.

I don't quite understand why the LSI HBAs had such problems. I picked them because they were recommended online for JBOD setups, well supported on FreeBSD and ZFS.

I bought two of them, one brand new and one used from different suppliers, so it is unlikely I bought a faulty unit (unless I am so unlucky to get two units with the exact same fault in a row)

Yet in the end it turns out they were the problem, despite being multiples more expensive than the cheap SATA card I replaced them with (which cost me less than €30 from AliExpress, including shipping).

The irony is that after years of building cheap home servers from generic parts, I finally decided to splurge on a more professional machine, rack mounted with expensive HBAs and backplanes, only in the end to replace all that with generic parts again. I guess a lesson learned for me, I will stick with generic parts in future.

nerozero · Sep 10, 2024

Let me add one thing I recently did on a machine that had been sitting aside because of these random disconnects.
For testing I booted from my live USB stick with FreeBSD 14.1 and upgraded ZFS to the latest version. Weird, but I then tested this machine for 5 days under active disk load and saw 0 drive disconnects.

nerozero · Sep 10, 2024

Unixnut said:
Well, after my last post a couple of weeks later the 6-port SATA card arrived and I set about replacing the HBA with it.

Could you please share the model of that card ?

Unixnut · Sep 10, 2024

nerozero said:
Could you please share the model of that card ?

It has no model, or even a make printed on it. It is a generic card I bought off Aliexpress, the cheapest one I could find. It looks like this one: https://www.aliexpress.com/item/1005003599256025.html

PMc · Sep 10, 2024

Unixnut said:
It has no model, or even a make printed on it. It is a generic card I bought off Aliexpress, the cheapest one I could find. It looks like this one: https://www.aliexpress.com/item/1005003599256025.html

Hm, I see. And what does FreeBSD think it is? (/var/run/dmesg.boot)

Unixnut · Sep 10, 2024

PMc said:
Hm, I see. And what does FreeBSD think it is? (/var/run/dmesg.boot)

My current controllers:

Code:

ahci0: <AMD KERNCZ AHCI SATA controller> mem 0xfb500000-0xfb5007ff irq 29 at device 0.0 on pci7
ahci1: <AMD KERNCZ AHCI SATA controller> mem 0xfb400000-0xfb4007ff irq 30 at device 0.0 on pci8
ahci2: <ASMedia ASM116x AHCI SATA controller> mem 0xfc082000-0xfc083fff,0xfc080000-0xfc081fff irq 54 at device 0.0 on pci9

Given the AMD motherboard, I'm going to say it is the ASMedia ASM116x in that list.
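
The same deduction can be scripted for boxes with more controllers. This is a small sketch that pulls the `ahciN: <description>` probe lines out of /var/run/dmesg.boot and filters out the ones that look onboard. The "description starts with the board vendor's name" rule is just the heuristic used above, not a reliable test, and the function names are made up. The sample lines are the three from the output above.

```python
import re

# Matches probe lines such as "ahci2: <ASMedia ASM116x AHCI SATA controller> mem ...".
AHCI_PROBE = re.compile(r"^(ahci\d+): <([^>]+)>")

SAMPLE_DMESG = """\
ahci0: <AMD KERNCZ AHCI SATA controller> mem 0xfb500000-0xfb5007ff irq 29 at device 0.0 on pci7
ahci1: <AMD KERNCZ AHCI SATA controller> mem 0xfb400000-0xfb4007ff irq 30 at device 0.0 on pci8
ahci2: <ASMedia ASM116x AHCI SATA controller> mem 0xfc082000-0xfc083fff,0xfc080000-0xfc081fff irq 54 at device 0.0 on pci9
"""


def list_ahci_controllers(dmesg_text):
    """Map each ahciN unit to the description the driver printed at boot."""
    controllers = {}
    for line in dmesg_text.splitlines():
        match = AHCI_PROBE.match(line.strip())
        if match:
            # Keep the first description seen for each unit.
            controllers.setdefault(match.group(1), match.group(2))
    return controllers


def add_on_cards(controllers, onboard_vendor="AMD"):
    """Units whose description does not start with the board vendor's name."""
    return {unit: desc for unit, desc in controllers.items()
            if not desc.startswith(onboard_vendor)}


if __name__ == "__main__":
    # On a live system, read /var/run/dmesg.boot instead of the sample text.
    for unit, desc in add_on_cards(list_ahci_controllers(SAMPLE_DMESG)).items():
        print(unit, desc)
```

With the sample above this leaves only ahci2, the ASMedia ASM116x.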

Charlie_ · Sep 10, 2024

Is this.

SA3026 6-port PCIe X4 SATA Expansion Card, Including SATA Cables and 1:5 SATA Splitter Power Cable

6 Port PCIe X4 SATA Adapter Card: PCI-Express X1 extends 6 x SATA III 6Gbps ports, so PC can access 6 x SATA drivers at the same time. You can setup a storage pool with 6 x SATA disks, or...

www.glotrends-store.com

The company appears to be located in Hong Kong.

PMc · Sep 10, 2024

Charlie_ said:
The company appears to be located in Hong Kong.

It's one of the ASMedia chips. I was looking into these, because when I did run out of SATA ports on the mainboard, I would have needed just some simple PCIe-to-SATA controller, not the elaborate LSI SAS with its own CPU onboard. But nobody could really tell if these pieces would work decently for unix, and everybody says that LSI SAS is the only safe bet.
Some of the ASMedia do use port-multipliers, and there are reports that they do NOT work well with simultaneous access, but this one seems to be a native 6-way, even with lightshow

Unixnut · Sep 10, 2024

PMc said:
It's one of the ASMedia chips. I was looking into these, because when I did run out of SATA ports on the mainboard, I would have needed just some simple PCIe-to-SATA controller, not the elaborate LSI SAS with its own CPU onboard. But nobody could really tell if these pieces would work decently for unix, and everybody says that LSI SAS is the only safe bet.
Some of the ASMedia do use port-multipliers, and there are reports that they do NOT work well with simultaneous access, but this one seems to be a native 6-way, even with lightshow

The irony, considering how much headache the LSI boards gave me in the end. Not sure the onboard CPU is that beneficial if you are just using it as a HBA, it isn't like it has much calculation to do. All of the heavy lifting is done on the main CPU, which nowadays is very powerful.

The last generic SATA card I used was a Silicon Image PCI, which worked well for years but having upgraded to the new machine with far more drives and PCI-E, I decided to go with LSI for the reasons you mentioned. Turns out in this case it was a mistake. I only picked this ASMedia card because it was the cheapest and I wanted something to test with, I didn't expect it to work as well as it has.

The lightshow is nice, but unlike the LSI it has no header pins so I could not connect up the front channel LEDs like I had before. Perhaps in future I will solder some connectors on, but right now I'm just glad the damn thing is working and I don't really want to touch it.

diizzy · Sep 10, 2024

There have been multiple reports of the ASM1166 controllers running just fine in general, I don't have any issues with mine either.