I believe this is caused by ZFS, since the wait state shown below looks like the code on this page about the ARC: http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
To trigger (or reveal) this problem, I first ran a program with OpenJDK 6.
In one window:
Code:
# ./CopyFromHd.sh
load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k
load: 0.00 cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k
load: 0.00 cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k
load: 0.00 cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k
^C
load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 58.35r 0.18u 0.06s 0% 19872k
load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 61.73r 0.18u 0.06s 0% 19872k
^C
load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 89.67r 0.18u 0.06s 0% 19872k
load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 141.07r 0.18u 0.06s 0% 19872k
^C
Since it was hung, I thought it might be ZFS's fault rather than OpenJDK's, so I tried du.
Second window:
Code:
# du -shx /data/archive2/2011/09/11/x
3.1G /data/archive2/2011/09/11/x
# du -shx /data/archive2/2011/09/11
24G /data/archive2/2011/09/11
# du -shx /data/archive2/2011/
load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 0.00u 0.00s 0% 1012k
^C^C^C^Z^C
load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r 0.00u 0.00s 0% 1012k
Then I thought I could just run the simplest invocation of the program, which does almost no I/O.
Third window:
Code:
# ./CopyFromHd.sh --help
^C^C^C^C^C^C
load: 0.00 cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k
load: 0.00 cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k
^C
load: 0.00 cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k
^Z
[1]+ Stopped ./CopyFromHd.sh
# jobs -p
51988
# kill -9 51988
# jobs -p
51988
[1]+ Killed: 9 ./CopyFromHd.sh
# jobs -p
Back to the first window:
Code:
load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 459.38r 0.18u 0.06s 0% 19872k
^C^Z
[1]+ Stopped ./CopyFromHd.sh ...
# jobs -p
51128
# kill -9 51128
# jobs -p
51128
[1]+ Interrupt: 2 ./CopyFromHd.sh ...
# jobs -p
# ps axl | grep java
0 51239 1 0 44 0 1264940 19904 - T 0 0:00.25 [java]
0 76933 77797 0 44 0 9124 1180 piperd S+ 0 0:00.00 grep java
0 52339 1 0 44 0 1266988 25676 - T 1 0:00.20 /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ...
The java with the full command line above (PID 52339) is the "--help" one from the third window, so I guess the first one ended.
But the "dd" commands I ran are all stuck, and so are the second java and "zdb", "zdb zroot", "zdb tank" and "zdb data". Running "find" on either tank or data also hangs.
By contrast, these all run without hanging:
# zpool iostat
# zpool status [-v]
# zfs list
Then I thought I would disable primarycache to see what happens.
Code:
# zfs set primarycache=none data
# zfs set primarycache=none tank
load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% 1636k
(hang)
^Z^C
load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s 0% 1636k
Another window:
Code:
# zfs get primarycache data
NAME PROPERTY VALUE SOURCE
data primarycache none local
# zfs get primarycache tank
NAME PROPERTY VALUE SOURCE
tank primarycache none local
On a Linux server that NFS-mounts /data from this FreeBSD server,
# df
hangs at the point where the NFS mount should be listed. (So I have to reboot now rather than later.)
I have not run memtest on this machine.
I have no l2arc.
Uptime is 22 days.
It is normally perfectly stable. Two weeks ago I used CopyFromHd to process over 20 TB of data, so it is not normal for this to happen.
The last thing I did before this was running "zdb -S tank" yesterday. (tank is not the same pool as data.)
FreeBSD version is 8.2-STABLE csupped on 2012-02-04.
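The "load: ... cmd: ..." status lines pasted above (what the terminal prints on Ctrl-T) could be summarized with a small script, to show which wait states each process went through and whether it used any CPU time in between. This is only a rough sketch in Python. The names STATUS_RE, parse_status and summarize are made up for illustration, and the line format is assumed purely from the output shown in this post, not from any FreeBSD documentation.

```python
import re

# Pattern for one "load: ... cmd: ..." status line, assumed from the pasted output.
# The wait state sits between the outermost brackets and may itself contain
# brackets, e.g. buf_hash_table.ht_locks[i].ht_lock.
STATUS_RE = re.compile(
    r"^load: (?P<load>[\d.]+)\s+cmd: (?P<cmd>\S+) (?P<pid>\d+) "
    r"\[(?P<state>.*)\] (?P<real>[\d.]+)r (?P<user>[\d.]+)u "
    r"(?P<sys>[\d.]+)s (?P<cpu>\d+)% (?P<rss>\d+)k$"
)


def parse_status(line):
    """Return a dict for one status line, or None if the line does not match."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        return None
    return {
        "cmd": m.group("cmd"),
        "pid": int(m.group("pid")),
        "state": m.group("state"),
        "real": float(m.group("real")),
        # user + system CPU seconds consumed so far
        "cpu": float(m.group("user")) + float(m.group("sys")),
    }


def summarize(lines):
    """Per PID: wait states in order seen, real-time span, and whether CPU time grew."""
    summary = {}
    for line in lines:
        rec = parse_status(line)
        if rec is None:
            continue  # skip ^C, shell prompts, and other non-status lines
        s = summary.setdefault(rec["pid"], {
            "cmd": rec["cmd"], "states": [],
            "first_real": rec["real"], "last_real": rec["real"],
            "first_cpu": rec["cpu"], "last_cpu": rec["cpu"],
        })
        if rec["state"] not in s["states"]:
            s["states"].append(rec["state"])
        s["last_real"] = rec["real"]
        s["last_cpu"] = rec["cpu"]
    for s in summary.values():
        s["real_span"] = s["last_real"] - s["first_real"]
        s["cpu_advanced"] = s["last_cpu"] > s["first_cpu"]
    return summary


if __name__ == "__main__":
    sample = [
        "load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k",
        "^C",
        "load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 459.38r 0.18u 0.06s 0% 19872k",
    ]
    for pid, s in summarize(sample).items():
        print(pid, s["cmd"], s["states"], round(s["real_span"], 2), s["cpu_advanced"])
```

Fed the lines from the first window, this would show java 51239 moving from ucond to the hash lock state while its CPU time stays at 0.18u 0.06s, which is what the raw output already suggests: the process is blocked, not busy.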