Something horrible has happened. I have 5 drives as follows:
ad0 and ad1 are on gm0 -- mirrored root/OS devices UFS
ad6 -- a single drive on ZFS (loner)
ad4 and ad8 -- a mirrored ZFS pool (share)
After I moved across town, the first boot showed me that ad0 had failed. Easy enough, I just unplugged it and made ad1 the new ad0. I should note that ad1 and ad0 are on IDE controller on the MB while all other drives are on the SATA controller on the MB. A couple weeks later the ZFS drives finally started to get some real IO again. A couple days later and share zpool had permanent errors (checksum). Anytime ANY access was made to the filesystems at that point the user session would hang. So I reboot and started a scrub. Scrub stalled at ~4% and it's been like this for 8 hours. Also, now the loner pool is now dead with IO errors. So thus far all my ZFS drives have failed and I can't seem to do anything. I've included some output with my tears. Tonight I hope to have a PCI SATA controller and good news. If the controller doesn't fix it I don't know what to do I really need that data. I find it very unlikely that ALL my drives would fail from a move. I can't do much debugging remotely because even things like "reboot" and "sudo" cause the user session to hang. I did a 'zpool clear' on mirror so that's why it shows no errors I guess.
The messages started last night when I noticed the failure and haven't stopped since. I checked the log since I moved and a month before and I don't see errors like this anywhere else. Any insight would be much appreciated.
ad0 and ad1 are on gm0 -- mirrored root/OS devices UFS
ad6 -- a single drive on ZFS (loner)
ad4 and ad8 -- a mirrored ZFS pool (share)
After I moved across town, the first boot showed me that ad0 had failed. Easy enough, I just unplugged it and made ad1 the new ad0. I should note that ad1 and ad0 are on IDE controller on the MB while all other drives are on the SATA controller on the MB. A couple weeks later the ZFS drives finally started to get some real IO again. A couple days later and share zpool had permanent errors (checksum). Anytime ANY access was made to the filesystems at that point the user session would hang. So I reboot and started a scrub. Scrub stalled at ~4% and it's been like this for 8 hours. Also, now the loner pool is now dead with IO errors. So thus far all my ZFS drives have failed and I can't seem to do anything. I've included some output with my tears. Tonight I hope to have a PCI SATA controller and good news. If the controller doesn't fix it I don't know what to do I really need that data. I find it very unlikely that ALL my drives would fail from a move. I can't do much debugging remotely because even things like "reboot" and "sudo" cause the user session to hang. I did a 'zpool clear' on mirror so that's why it shows no errors I guess.
Code:
[root@hermes ~]# zpool status
pool: loner
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://www.sun.com/msg/ZFS-8000-HC
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
loner UNAVAIL 1 7 0 insufficient replicas
ad6 UNAVAIL 8 0 0 experienced I/O failures
errors: 4 data errors, use '-v' for a list
pool: share
state: ONLINE
scrub: scrub in progress for 8h27m, 4.02% done, 201h51m to go
config:
NAME STATE READ WRITE CKSUM
share ONLINE 0 0 0
mirror ONLINE 0 0 0
ad4 ONLINE 0 0 0
ad8 ONLINE 0 0 0
errors: No known data errors
[root@hermes ~]# tail -n25 /var/log/messages
Aug 5 08:25:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:25:26 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:25:42 hermes su: otheruser to root on /dev/pts/3
Aug 5 08:25:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug 5 08:26:06 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug 5 08:26:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug 5 08:26:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069120
Aug 5 08:26:46 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:27:06 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:27:26 hermes kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug 5 08:27:46 hermes kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug 5 08:28:06 hermes kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug 5 08:28:06 hermes kernel: ad8: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728068864
Aug 5 08:28:26 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:28:46 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:29:06 hermes kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug 5 08:29:26 hermes kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug 5 08:29:46 hermes kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug 5 08:29:46 hermes kernel: ad6: FAILURE - READ_DMA timed out LBA=0
Aug 5 08:30:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:30:27 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug 5 08:30:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug 5 08:31:07 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug 5 08:31:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug 5 08:31:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069376
[root@hermes /var/log]# zpool get all share
NAME PROPERTY VALUE SOURCE
share size 928G -
share used 832G -
share available 96.1G -
share capacity 89% -
share altroot - default
share health ONLINE -
share guid 10098610899381386190 -
share version 13 default
share bootfs - default
share delegation on default
share autoreplace off default
share cachefile - default
share failmode wait default
share listsnapshots off default
The messages started last night when I noticed the failure and haven't stopped since. I checked the log since I moved and a month before and I don't see errors like this anywhere else. Any insight would be much appreciated.