So you're seeing this issue on both the built in(?) controller and the LSI HBA?
Correct
That would point me to either motherboard/memory/cpu and/or PSU issues.
I don't think so. For one, the motherboard, memory and CPU have been working fine throughout, even with all CPUs pegged at 100% for days on end when doing some computational work. Secondly, there are 10 other drives attached to this machine (2 on the built-in controller, 12 on the HBA), and all of them work flawlessly.
The problem is only with these 4 drives, which is why I thought it had something to do with the drives, cables or the slots themselves. Power issues were next in line: as these are 3.5" disk drives, their higher power draw might cause problems. But then I would expect all 12 drives to randomly drop off if there was a reduction in power quality.
At this point I will next try to run the 4 drives on a separate ATX PSU, to take power out of the equation (or find out that it is the root cause). After that I am not sure what to try next.
One thing I did notice: in the last couple of months (while waiting for delivery of the new HBA), I didn't bother re-adding the drives and left the array be. It has run with 2 drives this entire time without either dropping off. I added the third and it was fine, but when I added the fourth I started getting detachments.
If the separate PSU does not solve the issue, and I can't think of anything else, I may well destroy the array and rebuild it with three drives, either raidz1 so I keep the space at the loss of redundancy, or raidz2 and 10TB less available.
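As a quick sanity check of the space trade-off above, here is a minimal Python sketch. It assumes 10 TB drives (inferred from the "10TB less" figure, not stated outright) and ignores ZFS metadata, padding and reserved space, so real numbers would be somewhat lower. `raidz_usable_tb` is a hypothetical helper for illustration, not a ZFS tool.

```python
def raidz_usable_tb(drives: int, parity: int, drive_tb: float) -> float:
    """Nominal usable capacity of one raidz vdev: data drives x drive size.

    Ignores ZFS metadata, padding and reserved space.
    """
    if parity not in (1, 2, 3):
        raise ValueError("raidz parity level must be 1, 2 or 3")
    if drives <= parity:
        raise ValueError("need more drives than the parity level")
    return (drives - parity) * drive_tb


DRIVE_TB = 10.0  # assumption: inferred from the "10TB less" remark

layouts = {
    "3-drive raidz1": raidz_usable_tb(3, 1, DRIVE_TB),
    "3-drive raidz2": raidz_usable_tb(3, 2, DRIVE_TB),
}
for name, tb in layouts.items():
    print(f"{name}: {tb:.0f} TB nominal")
```

Under that assumption, raidz1 gives two drives' worth of space and survives one failure, while raidz2 gives one drive's worth and survives two, which is where the 10 TB difference comes from.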
What does smartctl -a /dev/da8 say about the drive (smartmontools)?
I didn't see anything untoward on da8, but on da11 I did see this:
Code:
Error 63 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 e0 00 00 00 40 Error: ICRC, ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 01 e0 00 00 00 40 00 00:01:08.239 READ FPDMA QUEUED
ef 02 00 00 00 00 40 00 00:01:08.239 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 00 00:01:08.239 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 00 00:01:08.239 SET MULTIPLE MODE
ef 10 02 00 00 00 40 00 00:01:08.239 SET FEATURES [Enable SATA feature]
Error 62 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 60 00 00 00 40 Error: ICRC, ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 01 60 00 00 00 40 00 00:01:07.909 READ FPDMA QUEUED
60 01 58 40 00 00 40 00 00:01:07.897 READ FPDMA QUEUED
ef 02 00 00 00 00 40 00 00:01:07.897 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 00 00:01:07.897 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 00 00:01:07.897 SET MULTIPLE MODE
Error 61 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 20 40 00 00 40 Error: ICRC, ABRT at LBA = 0x00000040 = 64
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 01 20 40 00 00 40 00 00:01:07.564 READ FPDMA QUEUED
ef 02 00 00 00 00 40 00 00:01:07.564 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 00 00:01:07.564 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 00 00:01:07.564 SET MULTIPLE MODE
ef 10 02 00 00 00 40 00 00:01:07.564 SET FEATURES [Enable SATA feature]
Error 60 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 0f 02 00 40 Error: ICRC, ABRT at LBA = 0x0000020f = 527
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 10 00 02 00 40 00 00:01:07.284 READ DMA
ef 02 00 00 00 00 40 00 00:01:07.284 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 00 00:01:07.284 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 00 00:01:07.284 SET MULTIPLE MODE
ef 10 02 00 00 00 40 00 00:01:07.284 SET FEATURES [Enable SATA feature]
Error 59 occurred at disk power-on lifetime: 600 hours (25 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 0f 02 00 40 Error: ICRC, ABRT at LBA = 0x0000020f = 527
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 10 00 02 00 40 00 00:01:06.956 READ DMA
ef 02 00 00 00 00 40 00 00:01:06.955 SET FEATURES [Enable write cache]
ef aa 00 00 00 00 40 00 00:01:06.955 SET FEATURES [Enable read look-ahead]
c6 00 10 00 00 00 40 00 00:01:06.955 SET MULTIPLE MODE
ef 10 02 00 00 00 40 00 00:01:06.955 SET FEATURES [Enable SATA feature]
There are 63 errors in total; the visible ones are posted above. I'm not sure whether these are a root cause or a symptom of something else. I checked the other drives and da10 has errors as well. Does anyone know what could be the cause of this?
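For anyone comparing logs across several drives, the register lines above can be decoded mechanically. The following is a rough Python sketch, not part of smartmontools: it reads the "ER ST SC SN CL CH DH" line of each entry, decodes the error register (0x84 = 0x80 ICRC, the interface CRC flag, plus 0x04 ABRT, command aborted), and rebuilds the 28-bit LBA from SN, CL, CH and the low nibble of DH. The function names and the small bit table are my own choices, and the sample text is copied from the output above.

```python
import re
from collections import Counter

# A few ATA error-register bits for read commands (not an exhaustive table).
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # requested address not found
    0x04: "ABRT",  # command aborted
}

# Seven hex bytes followed by "Error:", e.g. "84 51 00 0f 02 00 40 Error: ..."
REG_LINE = re.compile(r"^((?:[0-9a-f]{2}\s+){7})Error:", re.IGNORECASE)


def decode_register_line(line):
    """Return (flags, lba) for one register line, or None if it is another line."""
    match = REG_LINE.match(line.strip())
    if match is None:
        return None
    er, _st, _sc, sn, cl, ch, dh = (int(b, 16) for b in match.group(1).split())
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    # 28-bit LBA: low nibble of DH holds bits 24-27, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


def tally_flags(log_text):
    """Count how often each error flag appears across all register lines."""
    counts = Counter()
    for line in log_text.splitlines():
        decoded = decode_register_line(line)
        if decoded is not None:
            counts.update(decoded[0])
    return counts


SAMPLE = """\
84 41 e0 00 00 00 40 Error: ICRC, ABRT at LBA = 0x00000000 = 0
60 01 e0 00 00 00 40 00 00:01:08.239 READ FPDMA QUEUED
84 41 20 40 00 00 40 Error: ICRC, ABRT at LBA = 0x00000040 = 64
84 51 00 0f 02 00 40 Error: ICRC, ABRT at LBA = 0x0000020f = 527
"""

for line in SAMPLE.splitlines():
    decoded = decode_register_line(line)
    if decoded:
        flags, lba = decoded
        print(f"flags={'+'.join(flags)} lba={lba}")
print(dict(tally_flags(SAMPLE)))
```

Feeding it the saved `smartctl -a` text from each drive (da8, da10, da11 and so on) would make it easy to see whether every logged entry carries the same ICRC flag or whether other error types are mixed in.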
I would also keep an eye on temperatures overall. LSI HBAs run hot and need active cooling. That wouldn't explain the issues you're seeing on the built in(?) controller, though those are different drives, so something else might be going on there.
If the HBA was overheating, would I not see random issues across all 12 drives, not just these 4? Also, originally this problem was only on the internal SATA controller. I replaced the 8-port HBA with a 16-port one so I could hook these drives up to it as well, thinking the problem was with the internal SATA controller. Looks like that wasn't the root cause in the end.