ZFS Deleting a directory with an unknown number of files

Hi all -

I have a directory containing an unknown number of files - at least 35K at one point. I would like to delete the directory, but every method I've tried runs indefinitely (> ~12 hours), hangs, or runs out of memory. The directory is on a ZFS filesystem, but it isn't a separate ZFS dataset (lesson learned for next time).


So far I have tried the following:
1. # rm -rf ocd-data/ - this hangs, showing [zio->io_cv] in the Ctrl+T output.
2. Different rsync invocations with --delete and --delete-after; # rsync -a --delete-after empty_dir/ ocd-data/ just hangs... and never returns.
3. A suggested Perl one-liner, # perl -e 'for(<*>){((stat)[9]<(unlink))}' - but I don't really know Perl... and this, too, never returns.

Does anyone have any suggestions for dropping this directory? I would be very appreciative - thank you in advance for your time and help!

Mods, if it would be better to post this in a different forum please let me know and I'll repost it.
 
What does a plain find on that directory output? The speed of find should be at least many hundreds, if not thousands, of file names per second.

How big is the file system? And how big is the directory itself: if you stat it (for example, look at the size field in the output of ls -ld on it), how many bytes is it? That should be the number of entries in the directory.

Are there processes that are continuously changing the directory? Is there other activity (such as resilvering) that keeps the file system, the disks, or the machine busy?

For a directory with "just" 35K files, taking 12 hours for any operation seems way out of line. A few minutes seems believable; 12 hours probably means something is broken.

Anecdote: I actually used to ask job candidates the following interview question, for design of (parallel and clustered) file systems: A customer has complained that deleting all files in a directory takes too long, about 3 days. The directory contains a billion files. They attempt the deletion by using 1000 nodes in parallel, each node deleting 1/1000th of the files (taking care that each file is deleted exactly once). Change either how the delete is performed, or the internal design of the file system to make the delete operation faster.

The interview problem is due to a real-world customer complaint, and describes the actual situation that happened at a customer.
 
I once also had the problem that a directory contained too many files for them all to be removed with one simple rm.
(I'm not quite sure anymore - it happened six or seven years ago and I've forgotten almost all the details, including how I ended up with that directory [something I messed up, of course]. As far as I remember, when I looked into why rm refused to handle that number of files, some maximum was defined somewhere [edit: I think it wasn't in rm's code, but the limit on wildcard expansion]; when that internal capacity is exceeded, rm refuses to work or doesn't work properly. But don't nail me on that. As I said, it was over six years ago, I didn't document it, and this particular case may have had a different cause.)
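For what it's worth, that limit is most likely the kernel's argument-length limit (ARG_MAX): when the shell expands a wildcard into one huge command line, running the command fails with "Argument list too long" - it isn't a limit inside rm itself. The value can be checked with:
Code:
$ getconf ARG_MAX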

However,
My solution was to remove the files step by step, in batches:
rm a*
rm b*
rm c*
...
This may not suit your current situation exactly, but you get the idea.
Depending on how the files are named, it may be useful to put that into a small shell script.
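For example, a minimal sketch assuming single-character alphanumeric prefixes (the directory path is a placeholder):
sh:
#!/bin/sh
# Remove files in per-prefix batches, so no single wildcard expands
# into an impossibly long argument list.
cd /path/to/dir || exit 1
for p in a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9; do
        echo "removing ${p}*"
        rm -f -- "${p}"*
done
# If one prefix still fails with "Argument list too long", split it
# further (aa*, ab*, ...).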

Another idea you could try is to get the directory's contents (even partially) into a text file:
ls > mydirscontent.txt
then write a small script:
sh:
#!/bin/sh

while IFS= read -r filename; do
        rm -f -- "${filename}"
done < mydirscontent.txt

exit 0
 
find .... | xargs rm is typically a lot more efficient than trying to delete individual files one by one.

Also note that if the pool is full, or almost full, performance is horrendous, even for deleting files.
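For example, something along these lines (-print0 and -0 keep odd file names from breaking the pipeline; the directory name is taken from the first post):
Code:
# find ocd-data/ -type f -print0 | xargs -0 rm -f
That removes the regular files; the remaining (empty) directory tree can then be cleaned up with a plain rm -rf.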
 
find | xargs is usually better, but in some cases it won't work (when the path - absolute, or relative to the starting point - is too long).
I tried this:
sh:
i=0;while mkdir dirname$i;do cd dirname$i;touch filename$i;i=$(($i+1));echo -n $i.;done
DO NOT TRY THAT ON YOUR FS (I did it on an md device specially crafted for this).
It created a path of about 3k components, something like 30k characters long.
rm -rf worked (it uses chdir), but find | xargs rm won't work for the leaves (file name too long).
 
This may not be related to this issue, but as a general rule, deleting a large number of files on a copy-on-write filesystem when there is little free space can be very slow.
This is because metadata has to be rewritten (copied) when deleting, and with little free space the filesystem cannot find contiguous room for those copies.
I haven't actually experienced it though.
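It is cheap to rule that out, though, by checking how full and fragmented the pool is (<pool> is a placeholder for the pool name):
Code:
# zpool list -o name,size,allocated,free,fragmentation,capacity <pool>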
 
Code:
$ find /<dir>/ -delete
This should be the best one. Consider "find /dir/ -print -delete" so you can watch the output as it goes. It needs only one pass over the file metadata and doesn't have to launch extra processes for every file.

find piped to xargs rm means passing thousands of file names around, and it may require a second metadata access per file (i.e. rm looks each name up again after find already did). xargs also has to spawn many rm processes, which adds further overhead.

"rm a*" is a poor choice, because shell globbing is fragile in case of overwhelming numbers of files. It will likely also require multiple file metadata accesses (shell globbing then rm). It may also iterate the whole directory metadata in order to filter out the files with "A". Every rm will start from scratch.

Typically when deleting/copying/listing directories with huge numbers of files, the bulk of the time is spent in iterating over the filenames and metadata for each file. File allocation tables are often slow and single threaded, depending on the filesystem type. Anything you can do to minimize the number of times you iterate over the file list should help.
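If you would like some feedback without scrolling every file name past, a variation on the above (just a sketch) prints a running count instead:
Code:
# find ocd-data/ -print -delete | awk 'NR % 10000 == 0 { print NR " entries processed" }'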
 
What does a plain find on that directory output? The speed of find should be at least many hundreds, if not thousands, of file names per second.

How big is the file system? And how big is the directory itself: if you stat it (for example, look at the size field in the output of ls -ld on it), how many bytes is it? That should be the number of entries in the directory.

Are there processes that are continuously changing the directory? Is there other activity (such as resilvering) that keeps the file system, the disks, or the machine busy?

For a directory with "just" 35K files, taking 12 hours for any operation seems way out of line. A few minutes seems believable; 12 hours probably means something is broken.

Anecdote: I actually used to ask job candidates the following interview question, for design of (parallel and clustered) file systems: A customer has complained that deleting all files in a directory takes too long, about 3 days. The directory contains a billion files. They attempt the deletion by using 1000 nodes in parallel, each node deleting 1/1000th of the files (taking care that each file is deleted exactly once). Change either how the delete is performed, or the internal design of the file system to make the delete operation faster.

The interview problem is due to a real-world customer complaint, and describes the actual situation that happened at a customer.
hi ralphbsz - thanks for your response!

Code:
) ls -ld ocd-data
drwxr-xr-x  2 bridger bridger 31408165 Feb  2  2025 ocd-data/


Are there processes that are continuously changing the directory? Is there other activity (such as resilvering) that keeps the file system, the disks, or the machine busy?

For a directory with "just" 35K files, taking 12 hours for any operation seems way out of line. A few minutes seems believable; 12 hours probably means something is broken.

No processes are changing the directory, and, to the best of my knowledge, no resilvering. It definitely seems like something is broken - hopefully someone can help me figure out the "what"!
 
I once also had the problem that a directory contained too many files for them all to be removed with one simple rm.
(I'm not quite sure anymore - it happened six or seven years ago and I've forgotten almost all the details, including how I ended up with that directory [something I messed up, of course]. As far as I remember, when I looked into why rm refused to handle that number of files, some maximum was defined somewhere [edit: I think it wasn't in rm's code, but the limit on wildcard expansion]; when that internal capacity is exceeded, rm refuses to work or doesn't work properly. But don't nail me on that. As I said, it was over six years ago, I didn't document it, and this particular case may have had a different cause.)

However,
My solution was to remove the files step by step, in batches:
rm a*
rm b*
rm c*
...
This may not suit your current situation exactly, but you get the idea.
Depending on how the files are named, it may be useful to put that into a small shell script.

Another idea you could try is to get the directory's contents (even partially) into a text file:
ls > mydirscontent.txt
then write a small script:
sh:
#!/bin/sh

while IFS= read -r filename; do
        rm -f -- "${filename}"
done < mydirscontent.txt

exit 0
Hi Maturin - I think that may be part of what happened here: too many args (or files) for the rm command. I don't remember the pattern of the file names in the directory, which gets in the way of your helpful suggestion, and ls > dir_contents.txt won't ever complete.

Thanks for the response!
 
find .... | xargs rm is typically a lot more efficient than trying to delete individual files one by one.

Also note that if the pool is full, or almost full, performance is horrendous, even for deleting files.
hi SirDice - thanks for the response.
Code:
$ find /<dir>/ -delete
thanks, mro - I appreciate the response.

Edit: RussellASC, thank you, too, for the suggestion and the thought behind the command choice. I'll post back with an update shortly.

Globbing the find... suggestions together: I'll give find /<dir>/ -delete a try, but since a basic find ocd-data/ never returns, I don't have high hopes for this approach (but will soon see!).

Code:
root@dustbin:/astral/errata # find ocd-data/
ocd-data/
load: 0.19  cmd: find 73678 [zio->io_cv] 9037.96r 1.09u 7.40s 0% 871892k
load: 0.14  cmd: find 73678 [zio->io_cv] 10543.53r 1.34u 8.47s 0% 1038768k
load: 0.17  cmd: find 73678 [zio->io_cv] 11129.11r 1.39u 8.80s 0% 1102472k

🫣
 
ls > dir_contents.txt won't ever complete.
f4##! You really have a problem.
drwxr-xr-x 2 bridger bridger 31408165 Feb 2 2025 ocd-data/
that's not 35k files, that's more like 31M - that is indeed a lot more than I had ever dealt with.

Since this dir was created on Feb 2nd, a quick calculation says that's (on average) 138974 files per day, or about 96 new files every minute - are you sure the "production" of new files into that directory has stopped?

Can you do a df -h on the disk/partition the filesystem in question is on?
And maybe a zfs list, too?

Anyway, I would at least try something like rm /path/to/dir/111*, then rm /path/to/dir/112*, and so on - working in batches, killing what can be killed, just to reduce the number of files and get back to a state where actual work can be done again.

Do you have any idea what the filenames look like?
At this magnitude I would guess they are somehow numbered.
 
Use the -f option with ls to prevent sorting.

You can try "rm -rvx ocd-data" to see what it prints -- you should see some progress or error messages. The -x option to avoid crossing mounts. You can ^C rm any time so this is just to see what it does. Given 31M files, you may want to point its stderr and stdout to a file.
 
f4##! You really have a problem.

that's not 35k files, that's more like 31M - that is indeed a lot more than I had ever dealt with.
I probably should have said "at least 35K". 🤦‍♂️
Since this dir was created on Feb 2nd, a quick calculation says that's (on average) 138974 files per day, or about 96 new files every minute - are you sure the "production" of new files into that directory has stopped?
Yes - no new files are being written to the directory.
Can you do a df -h on the disk/partition the filesystem in question is on?
And maybe a zfs list, too?

Code:
root@dustbin:/astral # df -h astral
Filesystem    Size    Used   Avail Capacity  Mounted on
astral        1.7T    239G    1.5T     14%   /astral


Anyway, I would at least try something like rm /path/to/dir/111*, then rm /path/to/dir/112*, and so on - working in batches, killing what can be killed, just to reduce the number of files and get back to a state where actual work can be done again.

Do you have any idea what the filenames look like?
At this magnitude I would guess they are somehow numbered.
Unfortunately, the files were named with random strings - that's making small-batch deletion tricky.
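Maybe I'll try chipping away at it in unsorted batches - a rough sketch, assuming the random names contain no whitespace or shell-special characters (paths taken from earlier in the thread; the temp file location is arbitrary):
sh:
#!/bin/sh
# Grab up to 100000 unsorted directory entries at a time and remove them,
# repeating until the directory is empty.
cd /astral/errata/ocd-data || exit 1
while :; do
        # ls -f skips sorting (and includes . and .., which grep filters out)
        ls -f | grep -Fvx -e . -e .. | head -n 100000 > /tmp/batch.txt
        [ -s /tmp/batch.txt ] || break
        xargs rm -f < /tmp/batch.txt
done
# afterwards: cd .. && rmdir ocd-data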
 