Solved ZFS access stall when accessing device on 10.2-RELEASE amd64

Hi All.

After the kernel panic problem was fixed in 10.2-RELEASE amd64 (https://forums.freebsd.org/threads/kernel-panics-under-high-zfs-load-locate-database-building-etc.51470/),
I can use my computer again (4GB RAM, SSD over 90% full), but now an access stall problem occurs...
It is a new problem, so I am posting it in a new thread.

Before 10.1-RELEASE amd64: worked fine.
On 10.1-RELEASE amd64: got kernel panics.
Now on 10.2-RELEASE amd64: stalls in the situations below...
  1. Reboot/shutdown stalls at "All buffers synced"
    1. If I shut down or reboot immediately after booting (before the daily periodic task starts),
      it works fine.
    2. Once the daily periodic task had started, I watched the STATE field of find (rebuilding the locate database) in top:
      1. At first: zio->io / RUN, with disk activity.
      2. Some time later, STATE became empty, like this:
        Code:
          PID USERNAME  THR PRI NICE  SIZE  RES STATE  C  TIME  WCPU COMMAND
        48209 nanko  5  20  0  426M 57592K uwait  1  0:22  0.00% nautilus
        51158 nobody  1  52  5 12376K  1644K       0  0:17  0.00% find
        and the SSD was idle.
      3. Reboot/shutdown then stalls at "All buffers synced", even when left overnight.
      4. Only way out: the power button.
  2. devel/ccache.
    I use it to build ports (cache directory on ZFS).
    A build runs for about 1 minute, then stalls.
  3. rm of a huge directory (ccache's cache directory, 5.XGB, 8-level hash on ZFS)
    It runs for approximately 1 minute, then SSD access goes idle and rm never returns to the prompt.
I tried
Code:
vfs.zfs.arc_max="100M"
vfs.zfs.prefetch_disable="1"
but that did not fix the problem.

I checked the ZFS status with sysutils/zfs-stats; the output of zfs-stats -AL is below.
Code:
------------------------------------------------------------------------
ZFS Subsystem Report         Mon Aug 31 01:09:48 2015
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
   Memory Throttle Count:       0

ARC Misc:
   Deleted:         201.40k
   Recycle Misses:         228.69k
   Mutex Misses:         3
   Evict Skips:         12.33m

ARC Size:         75.17%   192.43   MiB
   Target Size: (Adaptive)     87.70%   224.50   MiB
   Min Size (Hard Limit):     12.50%   32.00   MiB
   Max Size (High Water):     8:1   256.00   MiB

ARC Size Breakdown:
   Recently Used Cache Size:   77.17%   173.26   MiB
   Frequently Used Cache Size:   22.83%   51.24   MiB

ARC Hash Breakdown:
   Elements Max:         47.53k
   Elements Current:     37.12%   17.64k
   Collisions:         14.48k
   Chain Max:         3
   Chains:           585

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------
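
As a quick sanity check on the report above, the percentages in the "ARC Size" section can be recomputed from the MiB values. This is only a minimal sketch: it assumes each percentage is relative to the 256.00 MiB "Max Size" line, and the helper name percent_of_max is made up for illustration (it is not part of zfs-stats).

```python
# Values copied by hand from the "ARC Size" section of the report above (MiB).
ARC_MAX_MIB = 256.00
arc_sizes_mib = {
    "ARC Size": 192.43,
    "Target Size (Adaptive)": 224.50,
    "Min Size (Hard Limit)": 32.00,
}


def percent_of_max(value_mib, max_mib=ARC_MAX_MIB):
    """Return value_mib as a percentage of max_mib (assumed reference value)."""
    return 100.0 * value_mib / max_mib


for name, mib in arc_sizes_mib.items():
    # Prints each size next to its share of the maximum ARC size.
    print(f"{name}: {mib:.2f} MiB = {percent_of_max(mib):.2f}% of max")
```

Under that assumption the recomputed values agree with the 75.17%, 87.70% and 12.50% printed in the report.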

Does anyone have a suggestion for this problem?
Thanks a lot, all!
 
The only way to fix this problem is to add more available disk space. You can't use over 90% of the available space on the pool without things like this happening. There is just no way around it: ZFS needs free space to function correctly, which is the reason for the 80% rule. Either create a second zpool and move some of the data over, or create a mirror with a second, larger disk and then expand the pool space.
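
To put rough numbers on the 80% rule, here is a minimal sketch of the arithmetic. It assumes the usable pool size can be approximated as USED + AVAIL, and it uses 187G used and 8.89G available purely as example figures (the values shown for nekozpool in the dataset list elsewhere in this thread). The function names are made up for illustration.

```python
def capacity_percent(used_gib, avail_gib):
    """Percentage of the pool in use, approximating total as used + available."""
    total = used_gib + avail_gib
    return 100.0 * used_gib / total


def must_free_gib(used_gib, avail_gib, target_percent=80.0):
    """How much data would have to be moved or deleted to get down to target_percent."""
    total = used_gib + avail_gib
    allowed = total * target_percent / 100.0
    return max(0.0, used_gib - allowed)


# Example figures only (assumption: USED and AVAIL of the top-level pool dataset).
used, avail = 187.0, 8.89
print(f"capacity: {capacity_percent(used, avail):.1f}%")
print(f"to reach 80%: free about {must_free_gib(used, avail):.1f} GiB")
```

With these example figures the pool is roughly 95% full, and about 30 GiB would have to go elsewhere to get back under 80%.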
 
Hi All

The problem is solved.
Cause: an unknown, unrepairable defect in ZFS and/or the storage (SSD).

I checked the status of the stalled find process (rebuilding the locate database) with sysutils/lsof.
Result: find was stuck in one specific directory.
That directory is inside the devel/ccache cache directory (8th level) and lives in a ZFS dataset, as in the list below (note: the list was taken from the system after the fix).
Code:
nekozpool  187G  8.89G  112K  /nekozpool
nekozpool/ccache  1.24G  2.76G  1.24G  /var/cache/ccache      <<-- HERE, 8 layer hashed directory
nekozpool/home  181G  8.89G  181G  /home

Next, I made a bootable USB flash drive from FreeBSD-10.2-RELEASE-amd64-memstick.img and tried the following:
  1. Entering the stuck directory with cd
    1. Booted from the flash drive: kernel panic.
    2. Booted the laptop in single-user mode: hung.
  2. Checking ZFS file system integrity with zpool scrub nekozpool
    1. Booted from the flash drive: no known problems.
    2. Booted the laptop in single-user mode: no known problems.
I am confused: why does it hang in one specific directory when the check result is GOOD :eek:?
Luckily for me, the problem was in a separate dataset, so I removed the dataset with zfs destroy nekozpool/ccache
  1. Booted from the flash drive: success.
  2. Booted the laptop in single-user mode: got an error message and it failed.
Finally, I recreated the same dataset with zfs create nekozpool/ccache, as in the dataset list above.
Shutdown, reboot and everything else now work fine; the problem has disappeared.

Anyway, I was very lucky that the problem could be solved.
But it would be terrible if the same defect occurred in the root dataset, or in a dataset/pool that cannot be destroyed.
In that case I could not fix it with zfs destroy, and zpool scrub would not fix it either.

Thank you all very much.
 