Upgraded from 11.2 to 11.3-RELEASE about a week ago. Since then, I've had the system "hang" (explained below) a few times. Just noticed something disturbing: It seems nearly all kernel processes are in D wait:
Is this normal? Or does it indicate an IO problem?
This system is a small home server: 32-bit Atom CPU, 3 GB of memory. It has 5 disks connected, four of them SATA: one SSD for boot/root plus a spare, with a total of about 5 TB of ZFS, without compression, dedup, or snapshots; the largest file system is mirrored. Yes, I know this isn't a lot of memory for ZFS, but it has worked excellently for years. The root file systems are on UFS, but /home for normal users and the web server is on ZFS.
After I upgraded to 11.3, the hangs started, three times so far. It only happens at night, when ZFS is running a scrub on the large mirrored pool. When it happens, the basic OS (kernel and root file system) is fine, but ZFS is completely hung. While "zpool status" claims that the scrub is in progress, no disk IO is happening. Any attempt to access a file on ZFS causes the process to go into D wait. Reboots don't succeed, since ZFS can't be unmounted, so a reset is necessary (and in some cases this caused fsck problems with the UFS file systems, which were easy to clear). Obviously, there are no dmesg or syslog messages at all.
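For reference, while the pool is still responsive, an in-progress scrub can be cancelled cleanly from the command line (pool name `tank` is a placeholder; it obviously won't help once ZFS is already wedged):

```shell
# Check whether a scrub is running and how far it has gotten
zpool status tank

# Stop the in-progress scrub; it can be restarted later with "zpool scrub tank"
zpool scrub -s tank
```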
I have tried to increase the kernel stack pages for ZFS from 2 to 4, as discussed in this thread, but it isn't clear whether that helped. The system had been stable and reliable for years (on FreeBSD 9.x and 11.x); the problems only started with the upgrade to 11.3. For now, my workaround is going to be: don't scrub. Clearly, that can't go on for very long.
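As a sketch of that workaround and the tunable change (the exact values here reflect my setup and may need adjusting):

```shell
# /etc/periodic.conf -- stop periodic(8) from kicking off the nightly ZFS scrub
daily_scrub_zfs_enable="NO"

# /boot/loader.conf -- raise the kernel thread stack size from the
# i386 default of 2 pages to 4; takes effect after a reboot
kern.kstack_pages=4

# After rebooting, confirm the tunable actually took:
sysctl kern.kstack_pages
```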
My question is this: Is it normal for kernel processes to be in D state? Or might this be another symptom of my IO system being seriously troubled?
Code:
# ps aux | egrep "D|\["
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11 398.2 0.0 0 64 - RNL 08:34 209:04.44 [idle]
root 4 0.6 0.0 0 32 - DL 08:34 3:08.52 [cam]
root 42 0.5 0.0 0 368 - DL 08:34 11:44.71 [zfskern]
root 0 0.2 0.4 0 12160 - DLs 08:34 58:36.28 [kernel]
root 2 0.0 0.0 0 16 - DL 08:34 0:00.00 [crypto]
root 3 0.0 0.0 0 16 - DL 08:34 0:00.00 [crypto returns]
root 5 0.0 0.0 0 16 - DL 08:34 0:00.00 [soaiod1]
root 6 0.0 0.0 0 16 - DL 08:34 0:00.00 [soaiod2]
root 7 0.0 0.0 0 16 - DL 08:34 0:00.00 [soaiod3]
root 8 0.0 0.0 0 16 - DL 08:34 0:00.00 [soaiod4]
root 9 0.0 0.0 0 16 - DL 08:34 0:00.00 [sctp_iterator]
root 10 0.0 0.0 0 16 - DL 08:34 0:00.00 [audit]
root 12 0.0 0.0 0 528 - WL 08:34 1:56.03 [intr]
root 13 0.0 0.0 0 48 - DL 08:34 0:00.18 [geom]
root 14 0.0 0.0 0 16 - DL 08:34 0:00.00 [sequencer 00]
root 15 0.0 0.0 0 736 - DL 08:34 0:01.86 [usb]
root 16 0.0 0.0 0 16 - DL 08:34 0:03.91 [rand_harvestq]
root 17 0.0 0.0 0 48 - DL 08:34 0:00.21 [pagedaemon]
root 18 0.0 0.0 0 16 - DL 08:34 0:00.00 [vmdaemon]
root 19 0.0 0.0 0 16 - DNL 08:34 0:00.00 [pagezero]
root 20 0.0 0.0 0 16 - DL 08:34 0:00.05 [bufdaemon]
root 21 0.0 0.0 0 16 - DL 08:34 0:00.04 [bufspacedaemon]
root 22 0.0 0.0 0 16 - DL 08:34 0:00.31 [syncer]
root 23 0.0 0.0 0 16 - DL 08:34 0:00.04 [vnlru]
root 389 0.0 0.0 0 16 - DL 08:34 0:01.11 [pf purge]
...
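In case it helps with diagnosis, here is how I've been looking at what the D-state threads are actually blocked on (PID 42 for [zfskern] is taken from the listing above):

```shell
# Show the wait channel (wchan) each process is sleeping on;
# D-state processes show which kernel event they are waiting for
ps -axo pid,state,wchan,comm

# Dump kernel stack traces for the zfskern threads
procstat -kk 42
```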