Solved Random ZFS drive detachments from host

I like the UPS suggestion. I once had a UPS with a bad contact, and it crashed computers instead of keeping them running.

If you're here in California, my advice would be to buy a new UPS. You'll need it anyway. We call our power company "Pacific Gas and sometimes Electric", or "Parttime Gas and Electric", or "Pacific Gas and Extortion". The most insulting one (not politically correct) is "Pakistani Gas and Electric".

If you are in a civilized country (like most parts of Europe), just try running with the UPS.
 
Well, I lost a third SSD and the storagefast pool failed. I took out the SSDs and tested them like I did the HDDs; they seem to have no issues.

So I guess it is power related, although I don't understand why the SSDs (which ran faultlessly throughout) would suddenly develop problems when I moved the HDDs over to their own ATX PSU. If anything, the reduction in load on the original PSU should have made things better. It is as if the problem just jumped to the next set of available drives once the originals were moved away.

I hadn't considered the UPS a potential cause. While the 10TB drives were working fine on the second ATX PSU, that PSU was plugged directly into the mains rather than the UPS. I've now bypassed the UPS and put all the drives back on the original PSU to see if the problems go away.

I have a second 750W PSU that is new, but it is a no-name brand rather than the Thermaltake I currently have installed. I am not sure its 12V/5V current output is enough for everything I am running, but if the UPS bypass does not resolve the situation I will go with the PSU replacement.
 
While I wait and see if the system works, I made note of the PSU and my power draw.

The PSU is the Thermaltake Smart SE 730W (https://uk.thermaltake.com/smart-se-730w.html): 730W sustained, 830W peak. It has a single 12V rail rated at 57A and a single 5V rail rated at 20A.

My 3.5" drives draw 2.8A at 5V and 3.96A at 12V. The 2.5" SSDs don't have current listed, so I took the current draw from a 2.5" HDD at 0.55A, giving 4.4A.

In total 7.2A at 5V (36W) and 3.96A at 12V (48W).

Motherboard is an AsusTEK PRIME X570-PRO with an AMD Ryzen 9 5950X. Looking online it seems at full load this setup would draw 88W.

A total of 84W is used for the drives and 88W for the CPU/motherboard, leaving 558W to run the rest of the system (fans, ancillaries, etc.), so I don't think I am reaching the limits of the PSU's ability to supply power.

What I do notice is that the PSU is heavily weighted towards the 12V line. Being a gaming PSU, I guess the logic is to power current-hungry GPUs. I only have 20A on the 5V line, and 7.2A of that is taken by the drives, leaving 12.8A (64W) for the motherboard, fans, etc.
The UPS tells me that at full power the system draws between 280 and 300W, meaning about 128W is used for ancillaries. If all 128W were on the 5V line, we could be running out of power there. Even if only half of the ancillary power were on the 5V line, it could cause issues.
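The per-rail arithmetic above can be sketched in a few lines of Python. This is only a restatement of the post's own estimates (label ratings, the assumed 0.55A per SSD, the 88W CPU/motherboard figure and the upper 300W UPS reading), not a measurement, and the variable names are made up for illustration.

```python
# Rough per-rail power budget using the figures quoted in the post.
# All numbers are the post's estimates, not measurements.

RAIL_LIMIT_A = {"5V": 20.0, "12V": 57.0}   # PSU label ratings
RAIL_VOLTS = {"5V": 5.0, "12V": 12.0}

# Estimated drive current per rail, in amps.
drive_load_a = {
    "5V": 2.8 + 8 * 0.55,   # HDDs plus eight SSDs at the assumed 0.55 A each
    "12V": 3.96,            # HDDs only
}

def rail_summary(rail):
    """Return (amps used, watts used, amps of headroom) for one rail."""
    used = drive_load_a[rail]
    return used, used * RAIL_VOLTS[rail], RAIL_LIMIT_A[rail] - used

for rail in ("5V", "12V"):
    used, watts, spare = rail_summary(rail)
    print(f"{rail}: {used:.2f} A used ({watts:.1f} W), {spare:.2f} A spare")

# Power the UPS reading leaves unaccounted for, using the upper 300 W figure
# and the 88 W CPU/motherboard estimate.
drives_w = sum(rail_summary(rail)[1] for rail in RAIL_VOLTS)
unaccounted_w = 300 - 88 - drives_w
print(f"unaccounted: {unaccounted_w:.1f} W")
```

The unrounded 12V figure is 47.5W rather than 48W, so the totals differ from the post by a fraction of a watt.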

The generic PSU I have is actually more balanced, with 38A on the 5V line and 37A on the 12V line, assuming the listed numbers are true. Its wiring is so thin compared to the Thermaltake's that I would be surprised if it can carry more than 15A.
 
And while I was writing the above post, the system failed. The storage pool lost a drive and two others gave checksum errors, so it got suspended. The SSD storagefast pool resilvered just fine and now has no errors.

So the problem has migrated back to the HDDs, and bypassing the UPS did not help.

I guess my next step is to replace the PSU, which I will now do. It still doesn't make sense though: if my hypothesis above is correct and we are running out of power on the 5V line, then removing the HDDs should have made the situation better for the remaining SSDs, but the problem just migrated across instead.

Still, it is the only piece apart from the motherboard I've not yet replaced, and I have a spare, so will give it a go and see if the problem goes away.
 
Well, the new PSU is in and the machine is booted up. I will now resilver/scrub all the pools and see if things have improved.
 
No luck, still getting drive dropouts and faults. I'm going to try bypassing the UPS again with the new PSU and see if things improve.
 
Well... I stopped getting dropouts on the "storage" array so things were looking good.
However the test came to an abrupt end when the PSU exploded. Proper loud bang and smoke filled the room. I guess the ratings printed on the PSU did not match its abilities.

So I've put the Thermaltake back in and thankfully looks like the other components survived unscathed.

This brings me to the question of replacement PSUs. I've looked at the other Thermaltake PSUs and they all provide 20A on the 5V line. From their 550W to their 1200W models, the variance in power is all on the 12V line, so buying a more powerful version is of no benefit to me.

Likewise, I see the same with Corsair PSUs; they all seem to top out at 20A on the 5V line (I guess gaming PCs are always in need of more power on the 12V line).

If my issue is due to running out of power on the 5V line, what PSUs should I look for? Any specific makes or models that cater to ATX-based servers?
 
The 2.5" SSDs don't have current listed, so I took the current draw from a 2.5" HDD at 0.55A, giving 4.4A.
Enterprise SSDs consume a lot of power.
This is my 1DWPD 2.5" SATA SSD THNSN8480PCSE.
Code:
Supply Voltage  5.0 V ±5 %
Power Consumption (Operating)   4.5W Typ.
Power Consumption (Ready)       1.2 W Typ.
If my issue is due to running out of power on the 5V line, what PSUs should I look for? Any specific makes or models that cater to ATX-based servers?
I did a search and this is what I found.
 
Well... I stopped getting dropouts on the "storage" array so things were looking good.
However the test came to an abrupt end when the PSU exploded. Proper loud bang and smoke filled the room. I guess the ratings printed on the PSU did not match its abilities.
You may want to have a look at this thread.

This brings me to the question of replacement PSUs. I've looked at the other Thermaltake PSUs and they all provide 20A on the 5V line. From their 550W to their 1200W models, the variance in power is all on the 12V line, so buying a more powerful version is of no benefit to me.
I don't think you need more watts. I recently measured my machine (ASUS Xeon/EP, 18 disks, 13 of them spinning); it ran stable for years on a pimpwired(*) 350W supply (until the supply degraded from running at its limits). But measuring, I found the main consumption comes from six old SCSI 10k disks. (This is a museum; I'm keeping all the old stuff to see how long it will work.) I don't think a modern server HDD would eat more than these.

Enterprise SSDs consume a lot of power.
This is my 1DWPD 2.5" SATA SSD THNSN8480PCSE.
Code:
Supply Voltage  5.0 V ±5 %
Power Consumption (Operating)   4.5W Typ.
Power Consumption (Ready)       1.2 W Typ.
That is something I do not yet have in my zoo. These may indeed create ugly ripple-loads on the 5V rail.

I did a search and this is what I found.
That doesn't look bad. No modular nonsense (that just adds resistance). And apparently thick wires.

(*) Pimpwiring is how I solved my HDD disconnect issue. But you should do that only if you're very confident in your high-voltage skills and high-amp soldering experience (this is the 350W; the 500W got a lot thicker wires, but I'm now out of stock of these):
[Attached image: IMG_20230903_041927.jpg]
 

Thanks, that was a nice distraction. I actually opened up the blown PSU. The 15A fuse has gone, and looks like the transistors on the high tension side have burnt. I suspect they shorted out, which then caused the fuse to blow. The rest of the PSU looks pristine, which makes sense considering it is new. So I may desolder the transistors and see if I have equivalent (or higher power) ones in my collection, in which case I will try to resurrect it.

I don't think you need more watts. I recently measured my machine (ASUS Xeon/EP, 18 disks, 13 of them spinning); it ran stable for years on a pimpwired(*) 350W supply (until the supply degraded from running at its limits). But measuring, I found the main consumption comes from six old SCSI 10k disks. (This is a museum; I'm keeping all the old stuff to see how long it will work.) I don't think a modern server HDD would eat more than these.

Well, it is the best idea I've got at the moment. The fact that the problem went away for the HDDs on a second ATX PSU seemed to point in this direction. Then the fact that the problem moved to the SSDs made me think it was the 5V line, because the SSDs don't use the 12V line.

After that, the approximate calculations above hinted that I may be hitting the power limits of the 5V line. After all, in addition to the HDDs and SSDs, I have the HBA, network and video cards all drawing power, and they could be making use of the 5V line.

(*) Pimpwiring is how I solved my HDD disconnect issue. But you should do that only if you're very confident in your high-voltage skills and high-amp soldering experience (this is the 350W; the 500W got a lot thicker wires, but I'm now out of stock of these): [Attachment 16935]

When I originally built this machine, I bought a modular 650W PSU for it, and then crimped my own connectors using 250V/13A rated copper cable, from the 6-pin modular socket on the PSU to the system. It worked a treat for many years until the PSU got damaged by a power surge (which is why I bought the UPS afterwards).
The new 730W PSU was also modular, but annoyingly while they all use the same 6-pin connector, the pinouts differ so I had to rewire it.

I preferred modular PSUs because it was neater (which was better for airflow and cooling), and I could crimp my own cables using thicker wire. Before it usually involved me cutting off PSU connectors and wiring them to blocks, from which my wiring would carry to the system. I never considered modular PSUs would have a significant increase in resistance though (at least any more than blocks would).
 
Enterprise SSDs consume a lot of power.
This is my 1DWPD 2.5" SATA SSD THNSN8480PCSE.
Code:
Supply Voltage  5.0 V ±5 %
Power Consumption (Operating)   4.5W Typ.
Power Consumption (Ready)       1.2 W Typ.

Interesting, as my SSD array faulted again I will see about taking one out and looking at the model number. It is not an enterprise drive, so I would expect its power consumption to be lower, but best to check to be sure.

I did a search and this is what I found.

Thanks for this! I was searching for "high power" or "heavy duty" ATX PSUs and not really getting anywhere. It seems the term "Industrial ATX" is what I am looking for.

That one does look nice, but there is only one distributor where I live, and they only deal with B2B, they don't want to sell to those who are not registered companies, so can't buy it here. I will see if I can find someone who will ship it to me, or an equivalent model I can get here.

It is nice to see that they actually give you the current output in the specs, amazing how many of the "gamer" PSUs don't have that in the spec sheet, you have to dig around to find the manual and see if it says in there.


EDIT: My SSDs are "Kioxia" LTC10Z240G, which is apparently a new brand that Toshiba sells its SSDs under.

This link says they draw a max of 1.6W. As I have 8 of them that is 12.8W, or about 2.6A at 5V
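As a quick cross-check of that figure, here is a small Python sketch that redoes the 5V estimate with the 1.6W maximum in place of the earlier 0.55A guess. The HDD current and the 20A rail rating are the values quoted earlier in the thread; the constant names are illustrative only.

```python
# Revised 5 V estimate using the 1.6 W maximum quoted for each SSD.
SSD_COUNT = 8
SSD_MAX_W = 1.6          # per-drive maximum from the linked spec
HDD_5V_A = 2.8           # HDD 5 V draw quoted earlier in the thread
RAIL_5V_LIMIT_A = 20.0   # Thermaltake 5 V rating quoted earlier

ssd_a = SSD_COUNT * SSD_MAX_W / 5.0
total_5v_a = HDD_5V_A + ssd_a
print(f"SSDs: {ssd_a:.2f} A, drives total: {total_5v_a:.2f} A "
      f"of {RAIL_5V_LIMIT_A:.0f} A")
```

On these numbers the drives account for a little over a quarter of the 5V rating, which is lower than the earlier 7.2A estimate.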
 
Well, an update. I finally bought and installed a PSU with 35A on the 5V line yesterday, but unfortunately it has not resolved the issue.

So far the four hard drives are working fine, but the bus-reset and faulted drives issue is still affecting the SSD array. I mean the issue is no longer about the drives randomly dropping off, but actually being marked "faulted".

In some ways this is worse, because before I could re-add dropped drives and resilver them online. So far, however, I have found no way to clear and resilver a faulted drive. My only options are to reboot the machine (which rebuilds the array to a clean state), or just run the array with more and more drives being faulted until I get an I/O suspension. EDIT: I worked out that you can offline then online the drives, followed by "zpool clear", to bring them back into use.
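The offline, online, clear sequence from the EDIT could be wrapped roughly as follows. This is a sketch, not the author's procedure: the helper names are invented, the pairing of pool "storagefast" with device "da1" is a guess for illustration, and by default it only prints the commands rather than running them.

```python
import subprocess

def recovery_commands(pool, device):
    """Return the offline/online/clear sequence described in the post."""
    return [
        ["zpool", "offline", pool, device],
        ["zpool", "online", pool, device],
        ["zpool", "clear", pool, device],
    ]

def recover(pool, device, dry_run=True):
    """Print each command, and run it only when dry_run is False."""
    for cmd in recovery_commands(pool, device):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# The pool/device pairing here is an assumption for illustration only.
recover("storagefast", "da1")
```

Keeping the dry run as the default makes it easy to review the commands before letting anything touch a degraded pool.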

Not sure what else I can try at this point. Apart from the motherboard, I've now replaced every single other component in the system.

The error messages FWIW:

[etc....]
(da1:mps0:0:17:0): Retrying command, 3 more tries remain
(da1:mps0:0:17:0): READ(10). CDB: 28 00 0b d0 a5 e8 00 00 30 00
(da1:mps0:0:17:0): CAM status: Command timeout
(da1:mps0:0:17:0): Retrying command, 3 more tries remain
(da1:mps0:0:17:0): READ(10). CDB: 28 00 0b d0 a4 38 00 00 f0 00
(da1:mps0:0:17:0): CAM status: SCSI Status Error
(da1:mps0:0:17:0): SCSI status: Check Condition
(da1:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:17:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1432 loginfo 31080000
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1581 loginfo 31080000
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 57 90 00 00 18 00
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1909 loginfo 31080000
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
mps0: Controller reported scsi ioc terminated tgt 17 SMID 122 loginfo 31080000
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 57 a8 00 00 18 00
mps0: Controller reported scsi ioc terminated tgt 17 SMID 2022 loginfo 31080000
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
mps0: Controller reported scsi ioc terminated tgt 17 SMID 498 loginfo 31080000
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
mps0: Controller reported scsi ioc terminated tgt 17 SMID 948 loginfo 31080000
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 57 c0 00 00 18 00
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1295 loginfo 31080000
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1319 loginfo 31080000
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
mps0: Controller reported scsi ioc terminated tgt 17 SMID 1836 loginfo 31080000
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 57 d8 00 00 18 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 30 60 68 00 00 10 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(6). CDB: 0a 08 48 00 b0 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 58 18 00 00 28 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 57 f0 00 00 28 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 58 40 00 00 28 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): WRITE(10). CDB: 2a 00 0f 1a 58 68 00 00 28 00
(da1:mps0:0:17:0): CAM status: CCB request completed with an error
(da1:mps0:0:17:0): Retrying command, 2 more tries remain
(da1:mps0:0:17:0): READ(10). CDB: 28 00 0b d0 a4 38 00 00 f0 00
(da1:mps0:0:17:0): CAM status: SCSI Status Error
(da1:mps0:0:17:0): SCSI status: Check Condition
(da1:mps0:0:17:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
(da1:mps0:0:17:0): Retrying command (per sense data)
[etc...]
(da1:mps0:0:17:0): CAM status: SCSI Status Error
(da1:mps0:0:17:0): SCSI status: Check Condition
(da1:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mps0:0:17:0): Error 6, Retries exhausted
(da1:mps0:0:17:0): Invalidating pack
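Long logs like this are easier to compare across drives when the CAM status lines are tallied per device. The following Python sketch does that; the regular expression is an assumption based only on the line format visible above, and the sample lines are copied from this log.

```python
import re
from collections import Counter

# Matches lines such as:
# (da1:mps0:0:17:0): CAM status: Command timeout
CAM_STATUS = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): CAM status: (.+?)\s*$")

def tally_cam_status(lines):
    """Count (device, CAM status) pairs in an iterable of kernel log lines."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

sample = [
    "(da1:mps0:0:17:0): READ(10). CDB: 28 00 0b d0 a5 e8 00 00 30 00",
    "(da1:mps0:0:17:0): CAM status: Command timeout",
    "(da1:mps0:0:17:0): Retrying command, 3 more tries remain",
    "(da1:mps0:0:17:0): CAM status: SCSI Status Error",
    "(da1:mps0:0:17:0): CAM status: CCB request completed with an error",
    "(da1:mps0:0:17:0): CAM status: CCB request completed with an error",
]

for (device, status), n in tally_cam_status(sample).most_common():
    print(f"{device}  {n:3d}  {status}")
```

Because the pattern is searched for anywhere in the line, it should also work on syslog lines that carry a timestamp and hostname prefix.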
 
Any updates for this thread?

I ask because I'm having similar issues on almost all types of machines. The issue never happens on primary/boot disks, but randomly on all the others. When I move a "problematic" drive to the primary/boot slot, that drive no longer has any issues, but the previously primary drive, now the n-th drive, does. Old hard drives (5 to 7 years old) do not have this issue.
This also happens on SSDs, on Dell server RAIDs, you name it. I never found a valid answer to the issue. I have a strong feeling that this is a FreeBSD drive handling issue.
 
Well after the problems persisted I dismantled the server and rebuilt it using a generic 4U rackmount case. So now the server has no backplane, just a straight connection between the HBA and the drives. I also wiped and reinstalled the OS on it with the latest FreeBSD and reconfigured everything.

It worked fine for a couple of weeks, then the problems started happening again. Slowly at first, the odd drive every two weeks or so. Then it started occurring more and more often. Now we are back to drives dropping off every hour on average.

I wrote a script that runs in cron every hour to auto re-add the drives, but even with this eventually the problem gets so bad that multiple drives end up dropping out within an hour and I end up with suspended zpools, requiring a hard system reset.
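For readers wanting to try something similar, here is a minimal Python sketch of how an hourly job might pick out dropped drives from `zpool status` output. It is not the author's script, which is not shown in the thread; the function name is invented, and the sample table is copied from a later post in this thread.

```python
def devices_needing_online(status_text, states=("REMOVED", "FAULTED")):
    """Return device names whose STATE column is one of `states`.

    Expects the config table printed by `zpool status`, with the columns
    NAME STATE READ WRITE CKSUM.
    """
    devices = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[1] in states:
            devices.append(fields[0])
    return devices

# Sample table copied from a later post in this thread.
sample = """\
       NAME        STATE     READ WRITE CKSUM
       zroot       DEGRADED     0     0     0
         mirror-0  DEGRADED     0     0     0
           ada0p4  ONLINE       0     0     0
           ada1p4  REMOVED      0     0     0
"""

for dev in devices_needing_online(sample):
    # A cron job could run this command instead of printing it.
    print(f"zpool online zroot {dev}")
```

Logging each device name returned here, with a timestamp, would also give the kind of per-drive failure record mentioned below.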

After a reset, the problem goes away for a few days, but then once again gradually gets worse and worse the longer the system is running, until the inevitable happens.

Interestingly, this issue never happened on my boot drives either, in any incarnation of the problem. It only happened on the other drives, randomly (including the HDDs). I presumed this is because the boot drives don't get much use. After the initial OS load they only get occasional log writes.

My script keeps a log of failed drives, and failures are distributed across the drives, so it isn't a particular "bad drive" causing this from what I can see. Likewise taking the drives out and testing them shows the drives are fine.

From the rest of this thread you can see that I've pretty much replaced every component except the Motherboard/CPU and apart from this issue I have had no other problems with the system (and I do drive the CPU's at full load for weeks at a time crunching numbers). If it was CPU/MB issues I would expect other issues to crop up outside of the storage system.

At this point I've spent far more time and money on this problem than I am happy with, and as this is my home lab rather than a production system I've just accepted that I will have to live with it.

The script keeps the system running for a week or so before multiple-drive drops crash things. A hard reset is still manual, but in future I may automate that somehow as well.

Unfortunately I don't have a solution to the problem. I could not be sure if it was OS/Software related because it seemed I was the only one having this issue, so figured it was something about my specific set up. If you have the same symptoms across multiple different HW and configurations then perhaps there is something up with FreeBSD itself.

I still can't pinpoint what exactly. All I can say is that from what I can see, the failure rate is correlated with activity. The more activity on a zpool, the faster drives drop off. Case in point:

- My SSDs drop off the most as all my VM's and jails use the SSD pool for their back-end storage, meaning that has the highest activity.

- My slow storage HDDs are second most active as the VMs/Jails sometimes push/pull data to them, and the HDD pool is also shared via NFS to the network. I see drop outs here, but at a lower rate.

- I have had no failures on my OS root disks, but as mentioned above those have virtually no activity on them.
 
Regarding the root/boot disks: I had a system with only two similar drives in a mirror, so basically two identical partition tables, one of them dedicated to ZFS. The issue ALWAYS happened with the second drive, NEVER with the primary drive. I recall I was able to set drive priorities in the BIOS, so the system recognizes the disks in the order defined there. After switching the disks, ADA0 and ADA1 swapped places: still no issues with ADA0, but lots of detach events for ADA1. This is definitely a FreeBSD bug, simply because I have tested almost all other possibilities. The only thing I noticed is that for some drives this can be fixed by disabling EPC ("Extended Power Condition"). See my post. But for most drives this has no effect, especially for SSDs.
 
Yes I remember that post, that was one of the things I tried before I created this topic. I did a search to make sure I had tried everything others had tried before creating a new topic.

One question about the machines you have experienced this failure mode on. Are they under consistently high load? Specifically CPU bound load?
 
Are they under consistently high load? Specifically CPU bound load?
No. I have a 24-core CPU in a Dell server on one machine and 32-core CPUs in three more similar servers. Average monthly load is ~8.2-12.7%. All of them are bhyve VM hosts.
I also had a couple of standalone mail servers built on an Asus motherboard with 8th-gen Intel i5 CPUs. Storage: ZFS mirror, root on ZFS, two hard drives.
All of them had similar issues with some drives.

About a week ago I replaced a 4TB hard drive in a server built on a 12th-gen Intel i9 CPU. Right after replacing the drive I started experiencing random detaches of the new one (ST4000VX016-3CV104). The good news is that the solution in my previous post solved the issue: a week in, no issues.
 
Glad to hear your issue is sorted. That is good news!

The fact you mentioned you were having the same issue on differing hardware combinations and thought it may be related to the OS got me thinking.

My last error message above shows that the starting error lines are "CAM status: Command timeout", from where the other errors seem to cascade. Up until now I had assumed this was due to hardware issues. Assuming instead that the timeouts were due to OS issues, I thought about what could be the culprit.

What has come to my mind is the kernel is not responding fast enough to the requests. I would have expected the kernel to schedule important events like handling I/O promptly, but what if it didn't? You would get "command timeouts" just like I see in my messages above, simply because the kernel did not context switch fast enough to handle the request. What may then happen is the HW and OS end up in an inconsistent state, resulting in the OS just dropping the drive.

I run a lot of calculations on this machine for extended periods of time, up until now I used all 32 CPUs for the calculations (on a lower priority than normal), relying on the OS to schedule the other tasks/processes correctly.

To test my hypothesis I have re-started the calculations using only 28 CPUs, leaving 4 idle for the OS. Since I did this last week the error messages have gone as have the drop outs.
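One way to make that reservation explicit is to pin the job to a subset of CPUs. The post only says 28 of 32 CPUs were used; doing it with FreeBSD's cpuset(1) is an assumption here, and "./crunch" is a hypothetical stand-in for the calculation job. This Python sketch only builds the command line.

```python
import os

def worker_cpu_list(total_cpus, reserved):
    """Return a cpuset-style CPU list that leaves `reserved` CPUs free."""
    if not 0 <= reserved < total_cpus:
        raise ValueError("reserved must be in [0, total_cpus)")
    last = total_cpus - reserved - 1
    return "0" if last == 0 else f"0-{last}"

def pinned_command(cmd, total_cpus=None, reserved=4):
    """Prefix `cmd` with cpuset(1) so it stays off the reserved CPUs."""
    total = total_cpus or os.cpu_count() or 1
    return ["cpuset", "-l", worker_cpu_list(total, reserved)] + list(cmd)

# "./crunch" is a hypothetical stand-in for the calculation job.
print(" ".join(pinned_command(["./crunch"], total_cpus=32)))
```

Simply starting fewer worker processes, as described in the post, has a similar effect on load but does not stop the scheduler from placing those workers on any CPU.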

It is too early for me to declare victory, but I wanted to see if your machines were similarly loaded and share this idea in case it helps you too, but turns out you solved it another way :)
 
Glad to hear your issue is sorted. That is good news!
Unfortunately this is the case just for that type of hard drive. For a lot of other brands that method had zero effect.

Here is an example I did today with an old machine standing next to me. This is a basic desktop PC with 2x 4TB Seagate Exos hard drives and FreeBSD root on a ZFS mirror. Both hard drives pass the self-test successfully (they are one year old).
At the moment this is just a PC with literally no services running: 0% CPU load, 0% hard drive load, just ssh and htop.

here is the pool status:
sh:
       NAME        STATE     READ WRITE CKSUM
       zroot       DEGRADED     0     0     0
         mirror-0  DEGRADED     0     0     0
           ada0p4  ONLINE       0     0     0
           ada1p4  REMOVED      0     0     0


I can see the ada1 disk is listed in the devices. Then I tell zfs that this drive is present and online: zpool online zroot ada1p4
This is what I got in messages:

Code:
### zpool online zroot ada1p4 ->
Feb  9 10:03:12 old ZFS[49528]: vdev state changed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:03:15 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb  9 10:03:15 old kernel: ada1: <ST4000NM0035-1V4107 TN05> s/n XXXXXXX detached
Feb  9 10:03:17 old kernel: (ada1:ahcich1:0:0:0): Periph destroyed
Feb  9 10:03:17 old ZFS[49532]: vdev state changed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:03:17 old ZFS[49536]: vdev is removed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:03:25 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb  9 10:03:25 old kernel: ada1: <ST4000NM0035-1V4107 TN05> ACS-3 ATA SATA 3.x device
Feb  9 10:03:25 old kernel: ada1: Serial Number XXXXXXX
Feb  9 10:03:25 old kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Feb  9 10:03:25 old kernel: ada1: Command Queueing enabled
Feb  9 10:03:25 old kernel: ada1: 3815447MB (7814037168 512 byte sectors)


### zpool online zroot ada1p4 ->
Feb  9 10:03:47 old ZFS[49550]: vdev state changed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:04:00 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb  9 10:04:00 old kernel: ada1: <ST4000NM0035-1V4107 TN05> s/n XXXXXXX detached
Feb  9 10:04:06 old kernel: (ada1:ahcich1:0:0:0): Periph destroyed
Feb  9 10:04:06 old ZFS[49555]: vdev state changed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:04:06 old ZFS[49559]: vdev is removed, pool_guid=6104843082126424794 vdev_guid=9671843129628471665
Feb  9 10:04:09 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
Feb  9 10:04:09 old kernel: ada1: <ST4000NM0035-1V4107 TN05> ACS-3 ATA SATA 3.x device
Feb  9 10:04:09 old kernel: ada1: Serial Number XXXXXXX
Feb  9 10:04:09 old kernel: ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
Feb  9 10:04:09 old kernel: ada1: Command Queueing enabled
Feb  9 10:04:09 old kernel: ada1: 3815447MB (7814037168 512 byte sectors)

As you can see, there are no error messages at all. The hard drive just falls off and instantly comes back online, while the first drive (ada0) continues working normally. I have no explanation for this. I found a lot of posts describing similar events with no reasonable explanation, so I have lost hope.
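To put a number on how quickly the drive returns, the detach and re-attach timestamps in the log above can be paired up. This Python sketch assumes syslog-style lines in the format shown, assumes both events of a pair fall on the same day, and uses an invented function name; the sample lines are copied from the log.

```python
from datetime import datetime

def detach_gaps(lines, device="ada1"):
    """Yield seconds between each detach and the next re-attach of `device`."""
    detached_at = None
    for line in lines:
        parts = line.split()
        # parts[5] is the device field, e.g. "ada1" or "ada1:".
        if len(parts) < 7 or parts[5].rstrip(":") != device:
            continue
        stamp = datetime.strptime(parts[2], "%H:%M:%S")
        if parts[-1] == "detached":
            detached_at = stamp
        elif parts[6] == "at" and detached_at is not None:
            yield (stamp - detached_at).total_seconds()
            detached_at = None

sample = [
    "Feb  9 10:03:15 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0",
    "Feb  9 10:03:15 old kernel: ada1: <ST4000NM0035-1V4107 TN05> s/n XXXXXXX detached",
    "Feb  9 10:03:17 old kernel: (ada1:ahcich1:0:0:0): Periph destroyed",
    "Feb  9 10:03:25 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0",
    "Feb  9 10:04:00 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0",
    "Feb  9 10:04:00 old kernel: ada1: <ST4000NM0035-1V4107 TN05> s/n XXXXXXX detached",
    "Feb  9 10:04:09 old kernel: ada1 at ahcich1 bus 0 scbus1 target 0 lun 0",
]

print(list(detach_gaps(sample)))
```

Run over a longer stretch of /var/log/messages, the same pairing would show whether the gap stays roughly constant, which might help tell a link reset apart from a drive that is slow to spin back up.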
 
Unfortunately this is the case just for that type of hard drive. For a lot of other brands that method had zero effect.

[snip]

As you can see, there are no error messages at all. The hard drive just falls off and instantly comes back online, while the first drive (ada0) continues working normally. I have no explanation for this. I found a lot of posts describing similar events with no reasonable explanation, so I have lost hope.

Unfortunately while our symptoms are similar, it looks like your root cause is very different to mine. You may be better off starting a separate topic about it?

My issue seems to be load related. While the system load stayed below 32 (as it's a 32-core system), the system worked fine (for 2 weeks) with no errors. However, the backup process kicked off yesterday, pushing the load to 41. Since then I started getting errors again, and within 24 hours I had 4 drives drop off (both SSD and HDD), including the UFS backup drive (trashing the backup attempt).
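The threshold being described is simply the load average compared against the CPU count. A minimal Python sketch of that check, using the figures from this post and an invented function name, could look like this:

```python
import os

def load_exceeds_cpus(load_1min, cpu_count):
    """True when the 1-minute load average is above the CPU count."""
    return load_1min > cpu_count

# Figures from the post: 32 logical CPUs, load of 41 during the backup.
print(load_exceeds_cpus(41, 32))

# On a live Unix system the same check could read the current values.
try:
    current_load = os.getloadavg()[0]
    print(current_load, load_exceeds_cpus(current_load, os.cpu_count() or 1))
except (AttributeError, OSError):
    print("load average not available on this platform")
```

A backup wrapper could use a check like this to delay its start until the load drops, although whether that avoids the drive drops is exactly the hypothesis still being tested here.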

This is a new data point as it means whatever the issue is, it is not related to ZFS but something on a lower level. Also interestingly the channel the UFS drive is on is effectively dead. I have tried pulling out and re-inserting the drive. FreeBSD registers the peripheral and assigns it a dev node. It also shows up in "camcontrol", however if I try to access it (e.g. via fsck) it gives I/O errors:

Code:
$# fsck /dev/da13 
I/O error reading 0
I/O error reading 0

LOOK FOR ALTERNATE SUPERBLOCKS? no

$# mount /dev/da13 /mnt/backup/
mount: /dev/da13: Input/output error

$#

While I get the following console errors:

Code:
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): Retrying command (per sense data)
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): READ(6). CDB: 08 00 67 00 40 00 
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): CAM status: SCSI Status Error
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): SCSI status: Check Condition
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Feb 22 12:06:28 Mnemosyne kernel: (da13:mps0:0:42:0): Error 5, Retries exhausted

I checked the drive on another machine and it works absolutely fine, so it is not an issue with the drive.

At this point I will probably have to restart the OS in order for the channel to be accessible again. However I want my long running calculations to finish before I attempt a restart so am currently waiting.

It feels like a bug in FreeBSD to be honest, possibly in the way it interacts with my HW. I state this simply because any Unix OS should be able to correctly schedule load in excess of number of CPUs and sustain that for an extended period of time, and I have had other BSD boxes handle that situation without fault.

In the end I may have no choice but to split this system into two, one for storage and one processing so that the storage node load doesn't exceed its ability to schedule.
 
Does it make no difference if you set the lowest priority for the process that uses the CPU?
idprio(1) idprio 31 ...
Unfortunately not. I tried using "nice" to set the priority rather than idprio. The long running calculations are at nice value 10, while the backup was at nice value 15. The kernel is real time and should be able to preempt everything else on the system. However the problems persisted.
 
My experience is that there was a change a long time ago (over 20 years ago) that made nice less effective.
I have been using idprio ever since.
I don't think nice(1) is very useful in your situation.
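For reference, the idprio(1) invocation suggested above takes a priority from 0 to 31, with 31 being the lowest idle priority. The Python sketch below only builds such a command line; "./crunch" is a hypothetical stand-in for the long-running calculation, and the helper name is invented.

```python
def idle_priority_command(cmd, priority=31):
    """Prefix `cmd` with FreeBSD's idprio(1); 31 is the lowest idle priority."""
    if not 0 <= priority <= 31:
        raise ValueError("idprio priority must be between 0 and 31")
    return ["idprio", str(priority)] + list(cmd)

# "./crunch" is a hypothetical stand-in for the long-running calculation.
print(" ".join(idle_priority_command(["./crunch"])))
```

Unlike a nice value, an idle-class priority is meant to let the job run only when nothing else wants the CPU, which is the behaviour being recommended here.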
 
My experience is that there was a change a long time ago (over 20 years ago) that made nice less effective.
I have been using idprio ever since.
I don't think nice(1) is very useful in your situation.
I guess I am behind the times :-) I will try idprio but that will be after the current run finishes (or the system crashes again).
 
I don't think any of those errors have anything to do with CPU load. The same system with the same drives but UFS/gmirror works absolutely fine.
Another symptom I noticed: if the system runs for a long time, say more than a month, the frequency of disconnects increases dramatically. Once restarted, the issue may go away completely for a couple of days.
I'm leaning towards some kind of memory leak in the FreeBSD disk driver implementation when used by ZFS, or issues with some commands, mostly SATA, used by ZFS.

I'm also receiving errors similar to the ones Unixnut shows on some more sophisticated RAID controllers with more monitoring capabilities.
 