Strange ZFS deadlock today

I believe it is caused by ZFS, since the lock the processes end up waiting on looks like it comes from the ARC code discussed on this page: http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/

What triggered (or at least revealed) the problem: first, I tried to run something with OpenJDK 6.

In one window:
Code:
# ./CopyFromHd.sh
load: 0.00  cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k
load: 0.00  cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k
^C
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 58.35r 0.18u 0.06s 0% 19872k
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 61.73r 0.18u 0.06s 0% 19872k
^C
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 89.67r 0.18u 0.06s 0% 19872k
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 141.07r 0.18u 0.06s 0% 19872k
^C

Since it was hung, I thought it might be ZFS's fault rather than OpenJDK's, so I tried du.

Second window:
Code:
# du -shx /data/archive2/2011/09/11/x
3.1G    /data/archive2/2011/09/11/x

# du -shx /data/archive2/2011/09/11
 24G    /data/archive2/2011/09/11

# du -shx /data/archive2/2011/
load: 0.00  cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 0.00u 0.00s 0% 1012k
^C^C^C^Z^C
load: 0.00  cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r 0.00u 0.00s 0% 1012k
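
For anyone comparing the transcripts: as far as I know, the bracketed field in each "load:" status line (the kind FreeBSD prints on ^T / SIGINFO) is the wait channel or state the process is sitting in, and both java and du show the same one here. Below is a rough sketch, assuming the status lines have been pasted into a file or pipe, that pulls out the PID, wait channel and elapsed time. The function name wchan_summary is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: read "load: ..." status lines on stdin and print
# "pid wchan elapsed_seconds" for each one.
# The match on the bracketed field is greedy, so wait channels that
# contain brackets themselves (buf_hash_table.ht_locks[i].ht_lock)
# are kept whole.
wchan_summary() {
    sed -n 's/^load: .* cmd: [^ ]* \([0-9][0-9]*\) \[\(.*\)\] \([0-9.][0-9.]*\)r .*$/\1 \2 \3/p'
}

# Two lines copied from the transcripts above.
printf '%s\n' \
  'load: 0.00  cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k' \
  'load: 0.00  cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 0.00u 0.00s 0% 1012k' \
  | wchan_summary
# prints:
# 51239 ucond 44.56
# 72503 buf_hash_table.ht_locks[i].ht_lock 13.75
```

Lines that are not status lines (prompts, ^C, du output) are simply skipped, so a whole terminal log can be fed in as-is.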

Then I thought I could just run the simplest version of the program (which does pretty much no I/O).

Third window:
Code:
# ./CopyFromHd.sh --help
^C^C^C^C^C^C
load: 0.00  cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k
load: 0.00  cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k
^C
load: 0.00  cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k
^Z
[1]+  Stopped                 ./CopyFromHd.sh
# jobs -p
51988
# kill -9 51988
# jobs -p
51988
[1]+  Killed: 9               ./CopyFromHd.sh
# jobs -p

Back to the first window:
Code:
load: 0.00  cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 459.38r 0.18u 0.06s 0% 19872k
^C^Z
[1]+  Stopped                 ./CopyFromHd.sh ...
# jobs -p
51128
# kill -9 51128
# jobs -p
51128
[1]+  Interrupt: 2            ./CopyFromHd.sh ...
# jobs -p
# ps axl | grep java
    0 51239     1   0  44  0 1264940 19904 -      T      0    0:00.25 [java]
    0 76933 77797   0  44  0  9124  1180 piperd S+     0    0:00.00 grep java
    0 52339     1   0  44  0 1266988 25676 -      T      1    0:00.20 /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ...


The java with the full command line above (PID 52339) is the "--help" one from the third window. So I guess the first one ended.

But the "dd" commands I ran are all stuck, and so is the second java; "zdb", "zdb zroot", "zdb tank" and "zdb data" are all stuck as well. Running "find" on either tank or data also hangs.

# zpool iostat, # zpool status [-v] and # zfs list all run without hanging.

Then I thought maybe I would disable primarycache to see what happens.

Code:
# zfs set primarycache=none data
# zfs set primarycache=none tank
load: 0.00  cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% 1636k
(hang)
^Z^C
load: 0.00  cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s 0% 1636k

Another window:
Code:
# zfs get primarycache data
NAME  PROPERTY      VALUE         SOURCE
data  primarycache  none          local
# zfs get primarycache tank
NAME  PROPERTY      VALUE         SOURCE
tank  primarycache  none          local


On a Linux server that NFS-mounts /data from this FreeBSD server, # df hangs at the point where the NFS mount should be listed. (So I have to reboot now rather than later.)

I have not run memtest on this machine.

I have no l2arc.

Uptime is 22 days.

The machine is normally perfectly stable. I have run CopyFromHd to process over 20 TB of data so far (2 weeks ago) without this happening, so this is not normal.

The last thing I did before this was running # zdb -S tank yesterday. (tank is not the same pool as data.)

FreeBSD version is 8.2-STABLE csupped on 2012-02-04.
 