Major ZFS Problems

the_sadman · Aug 5, 2010

Something horrible has happened. I have 5 drives as follows:
ad0 and ad1 are on gm0 -- mirrored root/OS devices UFS
ad6 -- a single drive on ZFS (loner)
ad4 and ad8 -- a mirrored ZFS pool (share)

After I moved across town, the first boot showed that ad0 had failed. Easy enough: I unplugged it and made ad1 the new ad0. I should note that ad0 and ad1 are on the motherboard's IDE controller, while all the other drives are on the motherboard's SATA controller. A couple of weeks later the ZFS drives finally started to see some real I/O again, and a couple of days after that the share zpool had permanent (checksum) errors. From that point on, ANY access to its filesystems would hang the user session. So I rebooted and started a scrub. The scrub stalled at ~4% and has been stuck there for 8 hours. On top of that, the loner pool is now dead with I/O errors.

So far all my ZFS drives have failed and I can't seem to do anything. I find it very unlikely that ALL my drives would fail from a move. I can't do much debugging remotely because even things like "reboot" and "sudo" cause the user session to hang. Tonight I hope to have a PCI SATA controller and good news. If the controller doesn't fix it I don't know what to do; I really need that data.

I've included some output with my tears. I did a 'zpool clear' on the mirror, so I guess that's why it shows no errors.

Code:

[root@hermes ~]# zpool status
  pool: loner
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	loner       UNAVAIL      1     7     0  insufficient replicas
	  ad6       UNAVAIL      8     0     0  experienced I/O failures

errors: 4 data errors, use '-v' for a list

  pool: share
 state: ONLINE
 scrub: scrub in progress for 8h27m, 4.02% done, 201h51m to go
config:

	NAME        STATE     READ WRITE CKSUM
	share       ONLINE       0     0     0
	  mirror    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad8     ONLINE       0     0     0

errors: No known data errors


[root@hermes ~]# tail -n25 /var/log/messages
Aug  5 08:25:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:25:26 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:25:42 hermes su: otheruser to root on /dev/pts/3
Aug  5 08:25:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:26:06 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:26:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:26:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069120
Aug  5 08:26:46 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:27:06 hermes kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:27:26 hermes kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:27:46 hermes kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:28:06 hermes kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:28:06 hermes kernel: ad8: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728068864
Aug  5 08:28:26 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:28:46 hermes kernel: ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:29:06 hermes kernel: ad6: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:29:26 hermes kernel: ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:29:46 hermes kernel: ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:29:46 hermes kernel: ad6: FAILURE - READ_DMA timed out LBA=0
Aug  5 08:30:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:30:27 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Aug  5 08:30:46 hermes kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Aug  5 08:31:07 hermes kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Aug  5 08:31:26 hermes kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Aug  5 08:31:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069376

[root@hermes /var/log]# zpool get all share 
NAME   PROPERTY       VALUE       SOURCE
share  size           928G        -
share  used           832G        -
share  available      96.1G       -
share  capacity       89%         -
share  altroot        -           default
share  health         ONLINE      -
share  guid           10098610899381386190  -
share  version        13          default
share  bootfs         -           default
share  delegation     on          default
share  autoreplace    off         default
share  cachefile      -           default
share  failmode       wait        default
share  listsnapshots  off         default

The messages started last night when I noticed the failure and haven't stopped since. I checked the logs from the move onward and for a month before it, and I don't see errors like this anywhere else. Any insight would be much appreciated.
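For a sense of scale, the "to go" figure in the zpool status output above is roughly what a plain linear extrapolation gives. The Python sketch below assumes the scrub keeps its average rate so far; that is an assumption for illustration, not necessarily how zpool computes its estimate, and the function name `scrub_eta_minutes` is made up here.

```python
def scrub_eta_minutes(elapsed_minutes, percent_done):
    """Remaining minutes, assuming the scrub continues at its average rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    # Projected total runtime at the average rate, minus time already spent.
    total_minutes = elapsed_minutes * 100.0 / percent_done
    return total_minutes - elapsed_minutes


# Figures taken from the status line above: 8h27m elapsed, 4.02% done.
elapsed = 8 * 60 + 27
remaining = scrub_eta_minutes(elapsed, 4.02)
hours, minutes = divmod(round(remaining), 60)
print(f"about {hours}h{minutes:02d}m remaining")
```

The result lands between 201 and 202 hours, in line with the 201h51m that zpool reported. An estimate that large on a 928G pool says the drives are barely moving data, which fits the timeouts in the log.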

jem · Aug 6, 2010

Was your PC knocked around during your move? If so, maybe you really do have multiple disk failures.

When I transport my computer, I make sure it's strapped into a car seat so that it has some cushioning against bumps and jolts. I don't put it on the floor or in the boot/trunk.

Alternatively, you may be lucky and it's just that something has shaken loose in transit. Try reseating and reconnecting all your disks and controllers inside the PC and see if that helps.

the_sadman · Aug 6, 2010

Last night I did more debugging. I plugged the drives in one by one, rebooting in between, and moved them all to different SATA ports on the MB. ALL the SATA drives worked fine this time (odd). So I decided to run a quick scrub, and somewhere around 60% I got the same lockups as before, with the same messages in /var/log/messages. Since I never get messages for the drive on IDE (never mind that one of my IDE drives went boom a couple of weeks back), I am thinking that this is a controller problem (maybe after a while it heats up too much and starts bugging out) or possibly a software bug (not likely, I guess). Tonight I plan to use a PCI SATA card to coerce the data off a degraded mirror as best I can; maybe I'll be lucky and can grab two cards from work.

Typically when unsure, do you replace everything (HDs, Controller, etc) ... ?

gkontos · Aug 6, 2010

the_sadman said:
Typically when unsure, do you replace everything (HDs, Controller, etc) ... ?

It appears that the problems with your drives are related to the specific controller. From what I noticed all of them showed errors. So, I would start by replacing the controller first.

George
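
The reasoning above, that errors on every SATA drive point at the shared controller rather than at the disks, can be checked by tallying the kernel's ATA messages per device. Here is a minimal Python sketch, assuming the log lines look like the /var/log/messages excerpt earlier in the thread. The function name `count_by_device` and the `SAMPLE` list are illustrative only and not part of any tool.

```python
import re
from collections import Counter

# Matches kernel ATA messages in the form seen in the log excerpt, e.g.
# "... kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) ..."
LINE_RE = re.compile(r"kernel: (ad\d+): (WARNING|TIMEOUT|FAILURE) - ")


def count_by_device(lines):
    """Count messages per (device, severity) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the excerpt; the 'su' line is ignored by the pattern.
SAMPLE = [
    "Aug  5 08:25:06 hermes kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
    "Aug  5 08:25:42 hermes su: otheruser to root on /dev/pts/3",
    "Aug  5 08:26:26 hermes kernel: ad4: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728069120",
    "Aug  5 08:28:06 hermes kernel: ad8: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1728068864",
    "Aug  5 08:29:46 hermes kernel: ad6: FAILURE - READ_DMA timed out LBA=0",
]

for (device, severity), n in sorted(count_by_device(SAMPLE).items()):
    print(device, severity, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_by_device(open("/var/log/messages"))`. If only one device showed up, a single failing disk would be the likelier suspect; counts spread across ad4, ad6 and ad8 fit the controller theory.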

the_sadman · Aug 8, 2010

I replaced the motherboard (and thus the SATA controller) and haven't had a single error since. Scrubs completed with success too. Hurray!
