ZFS lockup during zfs destroy

I am trying to destroy a dense, large filesystem and it's not going well.

Details:
- zpool is made up of 3 x 12-drive raidz3 vdevs.
- target filesystem to be destroyed is ~2 TB with ~63M inodes.
- OS: FreeBSD 10.3 amd64 with 192 GB of RAM.
- 120 GB of swap (90 GB of it recently added as swap-on-disk)

What happened initially is that the system locked up after a few hours and I had to reboot. After rebooting and starting ZFS, I see sustained disk activity in gstat, and that activity is usually confined to just 6 disks reading. Two raidz3 vdevs are involved in the filesystem I am deleting, so there are 6 parity disks ... not sure if that is correlated or not.

At about the 1h40m mark of uptime things start to happen in top: a sudden spike in load, and a drop in the amount of "Free" memory:
Code:
Mem: 23M Active, 32M Inact, 28G Wired, 24M Buf, 159G Free

Free memory drops to under a GB, fluctuates up and down, and eventually settles at some small amount (41 MB). As this drop starts, gstat activity on the zpool drives ceases and there is some light activity on the swap devices, but not much. The amount of swap reported as used is tiny, somewhere between less than a MB and 24 MB, and swapinfo shows nothing used. After the memory usage settles, the system eventually ends up in a locked state where:

- nothing is going on in gstat; the only non-zero number is the queue length for the swap device, which is stuck at 4
- load drops to nothing, and occasionally I see the zfskern and zpool processes stuck in the vmwait state
- the shell is unresponsive, but carriage returns register
- there are NO kernel messages of any kind on the console indicating a problem or resource exhaustion
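For reference, this is roughly what I've been watching from a second shell while the destroy runs (plus the stock ARC size sysctl for good measure):
Code:
# ARC size vs. its configured maximum
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
# memory breakdown, including kernel threads
top -S
# per-device I/O activity
gstat
# whether any swap is actually in use
swapinfo -k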

Finally, I cannot do this:
Code:
# zdb -dddd pool/filesystem  | grep DELETE_QUEUE
zdb: can't open 'pool/filesystem': Device busy
(presumably because it is pending destroy ...)
I had set:
Code:
vm.kmem_size="384G"
(and nothing else in loader.conf)

but even removing that and setting more realistic figures like:
Code:
vm.kmem_size=200862670848
vm.kmem_size_max=200862670848
vfs.zfs.arc_max=187904819200
has not resulted in a different outcome, though I no longer see the processes in vmwait; their state is just "-".

I've just lowered these to:
Code:
vm.kmem_size=198642237440
vm.kmem_size_max=198642237440
vfs.zfs.arc_max=190052302848
to see if that will make a difference.
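(These all go in /boot/loader.conf; whether they actually took effect after a reboot can be confirmed with the matching sysctls, something like:)
Code:
# confirm the loader tunables were applied at boot
sysctl vm.kmem_size vm.kmem_size_max vfs.zfs.arc_max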

No matter how many times I reboot (about 6 so far), I never make it past the 1h40m mark and this memory dip. I don't know if I'm making any progress or just running into the same wall every time.

My questions:

- is this what it appears to be, memory exhaustion?
- if so, why isn't swap being utilized?
- how would I configure my way past this hurdle?
- a filesystem has a DELETE_QUEUE ... does the zpool itself have a destroy queue of some kind? I am trying to see whether the zpool is making progress and how far along it is, but I do not know what to query with zdb

Thanks!
 
This sounds an awful lot like having dedup enabled and not having enough memory for it. Is/Was dedup enabled?
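If you're not sure, something along these lines should show it (replace "pool" with your pool name); note that even if dedup was later turned off, blocks written while it was on still keep dedup table (DDT) entries around:
Code:
# anything other than 1.00x means the dedup table is in play
zpool list -o name,dedupratio pool
# per-dataset dedup setting and where it was inherited from
zfs get -r dedup pool
# DDT histogram and in-core size, if one exists
zpool status -D pool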
 
It definitely sounds like memory exhaustion. How much space are you looking to reclaim/delete with this zfs destroy? Are (were) these cloned datasets with maybe lots of dependent datasets and/or snapshots?

Have a look at the memory statistics and "memory throttle counts" from zfs-stats -MA for hints on whether it is indeed memory exhaustion.
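If it isn't installed yet, it should just be a package away:
Code:
pkg install zfs-stats
# memory and ARC statistics, including the memory throttle count
zfs-stats -MA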

Regarding destroy performance, there have been improvements, namely asynchronous destroy [1], which was merged into FreeBSD in 2012, so it should already be included in 10.3 - check your pool's feature flags to see whether the "async_destroy" feature is enabled.
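Checking the flag is a one-liner, and if it is active the pool should also report a "freeing" property showing how much space a pending background destroy still has to release, which may answer your progress question (again, replace "pool" with your pool name):
Code:
# feature flag state: disabled / enabled / active
zpool get feature@async_destroy pool
# space still to be reclaimed by a pending async destroy
zpool get freeing pool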
And for further reading: Matt Ahrens of Delphix wrote a blog post [2] about the technical background of the async_destroy feature and ZFS destroy performance in general.

You might also take a look at which functions ZFS is hitting before and around that "tipping point" by probing it with DTrace:
Code:
dtrace -n 'fbt:zfs::* { @[probefunc] = count(); } tick-10s { exit(0); }'
Some function(s) might stand out, e.g. with extremely high counts compared to the earlier "baseline". Adjust the probing window by varying the "tick-N" interval (seconds).
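If matching every fbt probe in the zfs module turns out to be too noisy, a slightly narrower variant that only counts function entries works the same way (assuming the module name matches as in the one-liner above):
Code:
dtrace -n 'fbt:zfs::entry { @[probefunc] = count(); } tick-30s { exit(0); }'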


[1] http://www.open-zfs.org/wiki/Features#Asynchronous_Filesystem_and_Volume_Destruction
[2] https://www.delphix.com/blog/delphix-engineering/performance-zfs-destroy
 