Hello all,
For some time now, I've been struggling with hard drives in my raidz pool becoming unavailable under heavy usage. The first time it happened, I assumed a faulty disk and began replacing the drives with higher-capacity ones, as I'd been meaning to do that anyway.
Replacing and resilvering the first disk went well, but while I was replacing the second disk, the first disk suddenly went offline and the job failed, as the pool no longer had enough redundancy to stay available. I searched various forums and saw folks with somewhat similar issues, but ultimately found no definite answers.
Rebooting brought the failed device back with all data intact, so I started a scrub in an attempt to finish replacing the second drive. A couple of failed attempts later, for no discernible reason, the job completed successfully and the pool was back to full health.
At this point, I updated FreeBSD, noticed that ZFS had gone from version 13 to 14, and upgraded the pool to version 14 before moving on.
Replacing drive #3 went without a hitch; perhaps the issue had been addressed in the last FreeBSD update.
It had not. I'm now stuck on replacing drive #4, a process that has failed numerous times at this point. I've noticed that, so far, it's always been ad4 or ad6 that goes offline.
# atacontrol list
Code:
ATA channel 0:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <WDC WD20EARS-00MVWB0/50.0AB50> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <WDC WD20EARS-00MVWB0/50.0AB50> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <WDC WD20EARS-00MVWB0/51.0AB51> SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 <WDC WD20EARS-00MVWB0/51.0AB51> SATA revision 2.x
    Slave:       no device present
ATA channel 6:
    Master: ad12 <ST9320421AS/SD13> SATA revision 2.x
    Slave:       no device present
ATA channel 7:
    Master:      no device present
    Slave:       no device present
I find it suspicious that it's always a drive with the older firmware that acts up. I've contacted WD customer support in the hope that they'll provide a firmware upgrade, so that I can at least eliminate it as a possible cause. Nothing so far, but I don't expect a response over the weekend.
At the time of this post, the zpool status looks like this. Ugly and scary, but the errors are due to the drives spontaneously vanishing. At least it's no longer 8000 errors.
# zpool status
Code:
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 6h13m, 22.97% done, 20h53m to go
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     5     0     0
          raidz1        DEGRADED    12     3     0
            ad4         UNAVAIL     76  184K 2.64K  experienced I/O failures
            ad6         ONLINE      11     3     0  162M resilvered
            ad8         ONLINE       0     0     0  159M resilvered
            replacing   DEGRADED     0     0     0
              ad10/old  UNAVAIL      0  218K     0  cannot open
              ad10      ONLINE       0     0     0  171G resilvered

errors: 2 data errors, use '-v' for a list
I've issued a stop command to the scrub process, but for the time being it's ignoring me.
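For reference, the stop request used the standard syntax (pool name tank, as shown in the status above):

Code:
# zpool scrub -s tank

As I understand it, zpool scrub -s cancels a scrub, but a resilver normally runs to completion regardless, which might explain why the command appears to be ignored.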
Once a drive starts misbehaving, it fills up /var/log/messages with entries like these.
Code:
ad4: FAILURE - READ_DMA48 timed out LBA=380864299
ata2: SIGNATURE: ffffffff
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES SET TRANSFER MODE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE RCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE WCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SET_MULTI command
ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=380864342
ata2: SIGNATURE: ffffffff
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES SET TRANSFER MODE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE RCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE WCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SET_MULTI command
ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=380864470
The LBA is never the same; it keeps decreasing and eventually hits 0.
I've tried replacing the SATA controller, but with the same make and model (a Promise PDC40718 SATA300 controller), so I can't rule out a problem with that particular chipset.
A post on the FreeNAS forums suggested setting ATA_REQUEST_TIMEOUT in the kernel, which I did, trying different values (5, 15, 30). Ultimately, it just made the drives go offline faster, while the scrub ran much, much longer before giving up.
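For completeness, the option goes into the custom kernel configuration file like this (value in seconds; 5 shown as an example), followed by a kernel rebuild:

Code:
options ATA_REQUEST_TIMEOUT=5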
I realize there are many threads about issues with the sector size of the WD20EARS, but they all seem to be related to performance, not outright failures.
So, before I forget why I started this post: has anyone here had similar issues, suggestions, or simply an idea about what could be happening?