zpool hangs/stalls

Hi guys,

I have a SUN FIRE X4540 running
Code:
uname -a
FreeBSD <hostname> 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

The problem: the zpool "data" stops responding after a while and stalls completely. The system itself keeps working, but any command accessing the pool never finishes:
Code:
ps ax
4644  0  D+      0:00.00 ls -GF /data
The second pool "rpool" works just fine.
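
For the record, a quick way to check where such a hung process is sleeping in the kernel (PID taken from the ps output above) is procstat:
Code:
# dump the kernel stack of the hung ls to see which ZFS/CAM routine it is blocked in
procstat -kk 4644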

zpool iostat also shows no I/O activity on the pool after the first iteration:
Code:
zpool iostat 5
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T  1.15K     33  79.9M   220K
rpool       7.38G  7.50G      5      0   143K  4.30K
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      9      0  36.4K
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
data        71.9T  8.75T      0      0      0      0
rpool       7.38G  7.50G      0      0      0      0
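
In hindsight, a per-device view might have pointed at the misbehaving disk sooner; something along these lines with the -v flag:
Code:
# per-vdev/per-disk statistics every 5 seconds, to spot a device that stops serving I/O
zpool iostat -v data 5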

I attached zpool_status.txt, where you can see that it even hangs while doing a scrub.

While examining the problem further, I discovered that disk da73 is in state "CORRUPT" when running gpart list. Note that in the zpool status output da73 still has status "ONLINE"; it has not been taken offline, nor is the affected raidz vdev listed as "DEGRADED", as usually happens when a disk fails. Executing camcontrol inquiry da73 also never completes:
Code:
ps ax
4868  1  DL+     0:00.00 camcontrol inquiry da73
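
For completeness, a few standard commands to look at the suspect disk from the GEOM and CAM side:
Code:
# partition table / provider state as reported by GEOM -- this is where "CORRUPT" shows up
gpart list da73
# check whether CAM still sees the device at all
camcontrol devlist
# look for controller or disk errors logged against da73
dmesg | grep -i da73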

Any ideas on how to deal with this problem?
 

Update:

I decided to replace the corrupt disk with a spare. Initially this resulted in a stall as well, so I had to reboot the whole system.
After that, replacing the disk went without problems:
[CMD="zpool offline data label/disk73"]        17946570595209808127    OFFLINE      0     0     0  was /dev/label/disk73[/CMD]
[CMD="zpool replace data label/disk73 label/disk19"]    raidz2-8                    DEGRADED     0     0     0
      label/disk72              ONLINE       0     0     0
      spare-1                   OFFLINE      0     0     0
        17946570595209808127    OFFLINE      0     0     0  was /dev/label/disk73
        label/disk19            ONLINE       0     0     0  (resilvering)
      label/disk74              ONLINE       0     0     0
      label/disk75              ONLINE       0     0     0
      label/disk77              ONLINE       0     0     0
      label/disk78              ONLINE       0     0     0
      label/disk79              ONLINE       0     0     0
      label/disk81              ONLINE       0     0     0
      label/disk82              ONLINE       0     0     0
      label/disk83              ONLINE       0     0     0
      label/disk84              ONLINE       0     0     0
[/CMD]
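
Once the resilver completes, the plan is to detach the old device (still listed by its GUID above) so the spare-1 vdev is dissolved and label/disk19 stays in place permanently; sketched here, not run yet:
Code:
# after resilvering: drop the failed member so label/disk19 becomes a permanent part of raidz2-8
zpool detach data 17946570595209808127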

Resilvering seems fine now, with no stalls at the moment. The resilver speed is also much higher than it ever was; before, it never went above 270M/s.
Now:
[CMD="zpool status"]  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 20 10:16:32 2013
        6.87T scanned out of 71.9T at 668M/s, 28h21m to go
        292M resilvered, 9.55% done
[/CMD]

If the resilvering finishes successfully, I'll start a scrub on the pool. I'll keep you posted.
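
Nothing fancy there, just the usual commands:
Code:
# kick off a full scrub of the pool and keep an eye on progress and any checksum errors
zpool scrub data
zpool status -v data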

Still, the question remains why ZFS did not offline the corrupt disk. Bug?
 