Hi all,
I have been grappling with a problem with my ZFS array for the last year or so, and it has reached a point now where I have run out of ideas of what may be wrong, so posting in the hope someone has some advice.
I have a machine with 14 drives, configured like so:
4x 10TB HDD in ZFS pool (storage)
1x 1TB HDD in ZFS pool (root fs)
8x 240GB SSD in ZFS pool (fast storage)
1x 6TB HDD as UFS filesystem
The 8x SSDs are attached to an HBA, and the other six drives are on the motherboard's internal SATA controller.
My issue is with the 4x 10TB array. When first built it worked fine, but after a year or so I found that drives would be marked "REMOVED" in the zpool. Looking at the kernel messages I would see the following:
Code:
May 15 05:05:52 Mnemosyne kernel: ada1 at ahcich4 bus 0 scbus2 target 0 lun 0
May 15 05:05:52 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH2P detached
May 15 05:06:04 Mnemosyne kernel: (ada1:ahcich4:0:0:0): Periph destroyed
May 15 05:06:04 Mnemosyne kernel: ada1 at ahcich4 bus 0 scbus2 target 0 lun 0
May 15 05:06:04 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> ACS-2 ATA SATA 3.x device
May 15 05:06:04 Mnemosyne kernel: ada1: Serial Number VCPTJH2P
May 15 05:06:04 Mnemosyne kernel: ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 15 05:06:04 Mnemosyne kernel: ada1: Command Queueing enabled
May 15 05:06:04 Mnemosyne kernel: ada1: 9537536MB (19532873728 512 byte sectors)
It looks like the drive simply detached, was destroyed, and was re-added a few seconds later. There are no other errors. Originally it would not happen often, perhaps once every few weeks, but it became more and more frequent. By the end a drive would drop off a few seconds after being re-added, and I would end up with multiple drives detaching before the array managed to resilver, resulting in data corruption.
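To see how the detachments were distributed across drives, I found it useful to tally the "detached" lines per device. A rough sketch (the log excerpt below is embedded with made-up timestamps so the snippet is self-contained; in practice you would point LOG at /var/log/messages):

```shell
# Tally "detached" events per adaN device from the kernel log, to check
# whether the drop-offs cluster on particular drives/ports.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
May 15 05:05:52 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH2P detached
May 16 02:11:03 Mnemosyne kernel: ada2: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH3Q detached
May 17 11:40:19 Mnemosyne kernel: ada1: <WDC WD101EFBX-68B0AN0 85.00A85> s/n VCPTJH2P detached
EOF

# count of detach lines per device, most frequent first
counts=$(grep ' detached$' "$LOG" \
  | sed -E 's/.*kernel: (ada[0-9]+):.*/\1/' \
  | sort | uniq -c | sort -rn)
echo "$counts"

rm -f "$LOG"
```

With the sample excerpt this prints ada1 twice and ada2 once; run against the real log it shows at a glance whether one port/drive dominates.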
In an effort to resolve the problem, I tried the following:
- I checked all the drives' SMART status with smartctl; all came back healthy
- I pulled the drives that were failing and ran a badblocks test on them to see if they would drop off under sustained I/O or log any errors (no issues found)
- I cleaned and re-seated the SATA and power connectors
- I replaced the SATA cables
- I swapped the SATA cables around between the six drives to see if the detachments followed a particular cable or port
- I bought two new sets of 10TB drives from different manufacturers, thinking I had a faulty batch despite the SMART status and testing
- I bypassed the drive caddy and connected the drives directly, in case there was a fault with the backplane
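One more thing that might help pin down the timing: a devd(8) rule can log (or mail) the instant a device node disappears, rather than waiting to notice a REMOVED vdev. A sketch, with the match strings based on the DEVFS/CDEV examples in devd.conf(5) (adjust the pattern to your devices, and restart devd afterwards):

```
# Hypothetical rule for /etc/devd.conf (or a file under /usr/local/etc/devd/):
# log a warning the moment an adaN device node is destroyed.
notify 100 {
    match "system"    "DEVFS";
    match "subsystem" "CDEV";
    match "type"      "DESTROY";
    match "cdev"      "ada[0-9]+";
    action "logger -p kern.warn 'zfs-watch: /dev/$cdev destroyed'";
};
```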
The only things I did not try are:
- Trying a different SATA controller: For one, it is onboard and I have no expansion slots for another controller. For another, there are six drives attached to that controller, and the other two have never detached on me; they have been rock solid. Likewise, moving the cables around did not make the drop-offs move with them, which is what I would have expected if specific ports were at fault.
- Trying a different PSU: There are 14 drives connected to this PSU. If the PSU were failing, I would expect random drop-offs across all the drives (not to mention general system instability); however, I only see it with the four mentioned.
At this point I have run out of ideas; there is not much else I can think of to do to work out what the actual problem is. Has anyone seen something like this before?