ZFS Deleting a directory with an unknown number of files

I once had the same problem: a directory contained so many files that they could no longer all be removed with one simple rm.
(I'm not quite sure anymore - it happened six or seven years ago, and I've forgotten almost all the details.)
Typically the problem with "rm *" from the shell is the following: the shell first expands the glob, meaning it turns the command line internally from "rm *" to "rm aaa aab aac ...". But the kernel limits the total size of the argument list a command can receive (ARG_MAX; historically around 10K bytes), and the expanded command simply fails with "Argument list too long". That's why the trick with "rm a*", "rm b*" ... may work, if each rm command individually fits under the limit.
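A quick way to see that limit on your own system (a minimal sketch; the exact value varies a lot by OS and release):
Code:
# Show the kernel's limit on the total size of arguments passed to an exec'd command.
getconf ARG_MAX

# When "rm *" expands past it, the shell typically reports something like:
#   rm: Argument list too long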

From this viewpoint, it is better to do "find ... -print | xargs rm", or "ls | while read f ; do rm $f ; done", or some variation. There are several problems with these approaches. First, in the while-read variant, the rm program is started once for each file, which is a huge overhead. The way to fix that is to batch the invocations, e.g. "find ... -print | xargs -n 100 rm", which starts one rm for every group of up to 100 files. (Side remark, not related to the size: if there are any file names that need to be quoted, use -print0 on the find command and the -0 option on xargs.)
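For instance, a sketch of the batched variant (using the directory name from this thread; adjust the find predicates to taste):
Code:
# Delete in batches of 100 files per rm invocation; -print0/-0 keeps odd
# characters in file names from being mangled along the way.
find ocd-data -type f -print0 | xargs -0 -n 100 rm -f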

Second, the find operation (or the ls) iterates over the directory while it is being changed. That can cause performance havoc due to caching, depending on how the file system is implemented. The system may need to keep two copies of the directory in cache (one for the readdir() operation in the find, one that is being modified), or the find may have to keep re-reading the directory from the start. If there is enough space to create one complete copy of the listing first, this is more efficient: "ls ... > /tmp/file.list; cat /tmp/file.list | xargs -n 100 rm".

Third problem: many file systems store directories linearly on disk, as a kind of file (a big array of bytes) containing file names and internal pointers (such as inode numbers). Deleting files from the front may mean the directory file on disk has to be repacked regularly, i.e. all of its content shifted. Deleting recent files is usually more efficient, because they are at the end of the directory and the directory content can just be truncated. So to do the deletion, use a program that gets all the file names without sorting them (find will do fine), and then reverse the list: "find ... -type f > /tmp/file.list" followed by "tac /tmp/file.list | xargs -n 100 rm", where tac is a version of cat that reverses the order of the lines in the file.
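Putting those pieces together, a sketch of the reverse-order variant; FreeBSD has no tac in base, but tail -r does the same job (this assumes file names without embedded newlines or other characters that need quoting):
Code:
# Capture the listing once, unsorted, then delete from the back of the directory forwards.
find ocd-data -type f > /tmp/file.list
tail -r /tmp/file.list | xargs -n 100 rm -f
# On systems with GNU coreutils: tac /tmp/file.list | xargs -n 100 rm -f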


Code:
) ls -ld ocd-data
drwxr-xr-x  2 bridger bridger 31408165 Feb  2  2025 ocd-data/
As said above, that's 31 million files. Your file system has been busy. I hope it is not currently as busy. To make sure your deletion has a chance, I would put the system into single user mode, and carefully check that no other processes are running. You said that nothing is being written right now ... but given that you have been creating about two files per second for the last ~8 months, I'm not sure we can trust that statement.

hi covacat - no, it doesn't.
In that case, I strongly suspect that your ZFS file system has become corrupted. Even with 31 million files, just reading the directory once (which is what find does) should finish, and it should make progress. Unfortunately, that leads to the following solution:

Maybe do it the other way round. Move all the stuff you want to keep elsewhere and then format the filesystem.
Since you are not interested in debugging how ZFS got broken, this is likely the easiest answer. Requires an extra set of disks though. Once you have the new file system formatted, it's easy to do: rsync with an --exclude option.
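Something along these lines, as a sketch (the mount points are placeholders; the --exclude pattern is relative to the source directory):
Code:
# Copy everything onto the freshly created file system, skipping the runaway directory.
rsync -aHx --exclude='ocd-data/' /old-fs/ /new-fs/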

In the future, may I ask you to reconsider the architecture of your system? You're creating ~2 files per second, sustained. Do you actually need all of them? If the files are just a write-only cache of recent results (commonly used for checkpointing processes), they are mostly unneeded once the process finishes. Could you implement a cleanup mechanism that limits the number of files? Or maybe have a small number of files (for example just two), and overwrite them in place with a ping-pong mechanism? If your files are created and deleted at random times, and they are individually small, perhaps a database is a better mechanism than a file system.
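As an illustration of the cleanup idea, a minimal sketch (the path and the 7-day cutoff are made up; run it periodically from cron):
Code:
# Remove checkpoint files older than 7 days so the directory can never grow without bound.
find /path/to/checkpoints -type f -mtime +7 -delete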

Another question is the following: how a directory is stored on disk depends on the file system implementation. I do not know how ZFS does it (never having studied its internals), but I know that some file systems really struggle with large directories, because they internally rely on linear searches of the directory for all operations. You might be much better off structuring your workload to create many smaller subdirectories. For example, if you are storing recently acquired data, you could use ocd-data/September/Week3/Tuesday/09am/40:37.123.data for a file created today at 09:40:37.123, and the first few layers of directories would only have a handful of entries (the last one, however, would still have about 7200). This might overall run much faster, and make deletion and cleanup much easier.
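A sketch of how the writing side could bucket files by time instead of dumping everything into one directory (a numeric variation on the layout above; the file names here are placeholders):
Code:
# Build a per-hour subdirectory from the current time, then move the finished file there.
dir="ocd-data/$(date +%Y/%m/%d/%H)"
mkdir -p "$dir"
mv result.tmp "$dir/$(date +%M.%S).data"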
 
Unfortunately, the files were generated with a random string - that's making small batch deletion tricky.
Okay. That clears up a lot of things:
- Your partition is not even near its capacity limit
- No new files are being produced
- The files seem to be small, a few kB each
- The random names make it hard to try my suggestion of deleting in batches by prefix

Sorry, but I guess someone else will have to come up with a solution.
Good Luck! (I will continue reading here, but I'm out of ideas.)
 
hi cracauer@ - it's hard to say: rm -rf never gets far enough to hit swap. I just started a # rm -rvx ocd-data/; I'll keep an eye on it.

Thanks for your help!
 
The shell first expands the glob, meaning it turns the command line internally from "rm *" to "rm aaa aab aac ...". But the kernel limits the total size of the argument list a command can receive (ARG_MAX; historically around 10K bytes), and the expanded command simply fails with "Argument list too long".
Yeah, right! That was it. I remember now.
Thanks for clearing that up.

Very interesting insights into some technical details!
Thanks even more for those!
 
Why is this better than just doing "rm -rf", unless you only want to remove a specific subset? If "rm -rv" shows progress, eventually it will finish.
Good point, "rm -rf ocd-data" is better than "find ... | xargs ... rm".

Here is an interesting experiment CanOfBees could do: quiesce the system completely, to make sure no other process is creating or deleting files. Measure the number of files in the offending directory by simply doing an "ls -ld ocd-data". Start "rm -rf ocd-data", watch it using CPU time and doing IO by hitting control-T (the terminal's SIGINFO status character) every second, and stop it (with control-C) after, for example, 10 or 100 seconds. Then measure the number of files in the directory again. That gives you an initial guess at the speed at which deletes are progressing, which should be on the order of hundreds or thousands of files per second. Then lather, rinse and repeat that procedure, and check whether the speed is increasing, decreasing, or coming to a full stop.

I have a question that may sound unrelated, but isn't: Is there a way to configure a zpool or zfs file system so all disk IOs have to be synchronous? If yes, the speed of deleting files might be limited to roughly 100 per second, probably less than that if a file deletion requires multiple disk IOs. If someone knows whether such an option exists, CanOfBees should check that it isn't set. I just spent 15 seconds looking at the man page and didn't see one.
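For what it's worth, OpenZFS does have a per-dataset sync property (standard, always, or disabled), and it is quick to check; "astral" here is the pool name that comes up later in the thread, so substitute whatever dataset actually holds ocd-data:
Code:
# "always" forces every write out synchronously; "standard" is the default.
zfs get sync astral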
 
As said above, that's 31 million files. Your file system has been busy. I hope it is not currently as busy. To make sure your deletion has a chance, I would put the system into single user mode, and carefully check that no other processes are running. You said that nothing is being written right now ... but given that you have been creating about two files per second for the last ~8 months, I'm not sure we can trust that statement.


In that case, I strongly suspect that your ZFS file system has become corrupted. Even with 31 million files, just reading the directory once (which is what find does) should finish, and it should make progress. Unfortunately, that leads to the following solution:
No new files have been written to this directory for a long while.
Since you are not interested in debugging how ZFS got broken, this is likely the easiest answer. Requires an extra set of disks though. Once you have the new file system formatted, it's easy to do: rsync with an --exclude option.
I am interested in how/why ZFS is broken, but I'll readily confess that I may lack the skills to debug completely.
In the future, may I ask you to reconsider the architecture of your system? You're creating ~2 files per second, sustained. Do you actually need all of them? If the files are just a write-only cache of recent results (commonly used for checkpointing processes), they are mostly unneeded once the process finishes. Could you implement a cleanup mechanism that limits the number of files? Or maybe have a small number of files (for example just two), and overwrite them in place with a ping-pong mechanism? If your files are created and deleted at random times, and they are individually small, perhaps a database is a better mechanism than a file system.

Another question is the following: how a directory is stored on disk depends on the file system implementation. I do not know how ZFS does it (never having studied its internals), but I know that some file systems really struggle with large directories, because they internally rely on linear searches of the directory for all operations. You might be much better off structuring your workload to create many smaller subdirectories. For example, if you are storing recently acquired data, you could use ocd-data/September/Week3/Tuesday/09am/40:37.123.data for a file created today at 09:40:37.123, and the first few layers of directories would only have a handful of entries (the last one, however, would still have about 7200). This might overall run much faster, and make deletion and cleanup much easier.
This was a data-processing project - I would definitely think twice before serializing files like this again; certainly not without using a dedicated ZFS dataset, and I would strongly consider using a pairtree layout for on-disk file storage.
 
I have a question that may sound unrelated, but isn't: Is there a way to configure a zpool or zfs file system so all disk IOs have to be synchronous?
ZFS behavior is far too complex for that to work! Also recall that ZFS is copy-on-write, so even changing one byte will result in multiple writes all the way up to the ZFS superblock (the uberblock, in ZFS terms). It may be that deletes are indeed write-through (in some higher-level sense), which is why they are slow. But if you want to see what is going on at the disk level, maybe use ktrace on bhyve (for a VM that is using ZFS).

None of the above should be necessary. First we have to see if rm makes progress or is stuck. If it is stuck, then we need to understand why. If it is making progress, we need to see how many files/dirs are deleted per second. For 31M files at 100 deletes per second, it will take about 86 hours.
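The 86-hour figure checks out as a quick back-of-the-envelope calculation:
Code:
# 31 million files at 100 deletes per second, converted to hours
echo "31000000 / 100 / 3600" | bc -l    # ~86.1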
 
Do you have an alias for the rm command? Sometimes people do that.
The command "alias" will show you.
If so, use /bin/rm instead.

You can also use shell glob patterns with rm:
rm -f [a-c]* will force-delete anything starting with a, b or c.
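If one wanted to go down that road, here is a sketch of looping over single-character prefixes; it assumes names starting with lowercase letters or digits, and it only helps if each per-prefix batch still fits under the argument-size limit, which with 31M random names it very likely does not:
Code:
# Delete in per-prefix batches so no single glob expansion gets too large.
for p in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    rm -f ocd-data/"$p"*
done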
 
Rebooted to single-user mode
I can start a rm -rf ocd-data/, but I'm not quite sure how to proceed with ktrace.

Does this look right: # ktrace -i rm -rf ocd-data/? Based on the
ktrace man page that seems like a good start, but if there's a better/more appropriate invocation I'd like to know it.

Thanks all for your help!
 
if ls -f ocd-data gets stuck and does not list any files you can do/try this
get the dir inode by ls -lid ocd-data (inode is first column)
zdb -ddddd <poolname>/<dataset> inode
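A concrete (and partly hypothetical) version of those two steps, using the pool name that appears later in this thread; the object number 12345 just stands in for whatever ls -lid prints:
Code:
ls -lid ocd-data          # first column is the inode, i.e. the ZFS object number
zdb -ddddd astral 12345   # dump that object's metadata; use astral/<dataset> if ocd-data lives on a child dataset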
 
You can also attach ktrace to a running process.
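A sketch of attaching to an already-running rm, in case that is ever needed (the trace file name and pgrep pattern are arbitrary):
Code:
ktrace -f rm.ktrace -p $(pgrep -n rm)   # start tracing the most recently started rm
sleep 10                                # let it collect a few seconds of syscalls
ktrace -C                               # turn tracing off again
kdump -f rm.ktrace | tail               # look at the last syscalls it recorded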

But you don't need that if you do "rm -rvx ocd-data". What does it show? It should show file names as it deletes them. Does it continue showing more names, or does it stop?

PS: an interesting choice of directory name :-)
 
PS: an interesting choice of directory name :-)
Indeed. Two old jokes: I don't suffer from OCD, I enjoy it. And my neighbors call it "obsessive cycling disease"; they spend several hours per day bicycling around the hills.

Seriously: I like the suggestion of rm -rvx; just by looking at the output on the console, one can see roughly how fast it is going.
 
You can also attach ktrace to a running process.

But you don't need that if you do "rm -rvx ocd-data". What does it show? It should show file names as it deletes them. Does it continue showing more names, or does it stop?

PS: an interesting choice of directory name :-)
bakul - yes, :D the directory name was unintentional but... surprisingly apt.
Indeed. Two old jokes: I don't suffer from OCD, I enjoy it. And my neighbors call it "obsessive cycling disease"; they spend several hours per day bicycling around the hills.

Seriously: I like the suggestion of rm -rvx; just by looking at the output on the console, one can see roughly how fast it is going.
# truss -o ocd-data-delete-2.txt rm -rvx ocd-data/ is rolling right along, it seems. While the output file has grown too long to tail -f ..., lc shows the file is growing.

) lc -l ocd-data-delete-2.txt
-rw-r--r-- 1 root bridger 271M Sep 16 16:23 ocd-data-delete-2.txt


I'll let it run overnight and see where it winds up.
Thank you all very much for your helpful ideas - I'm very grateful!
 
Just checking:
Have you changed any ZFS parameters/sysctls, especially anything related to ZFS' ARC? If so, which ones, and how?
Are you perhaps using dedup? If so, is it still enabled?
What version of FreeBSD are you running? When, approximately, was the pool created, and mainly under what OS version?
 
Just checking:
Have you changed any ZFS parameters/sysctls, especially anything related to ZFS' ARC? If so, which ones, and how?
Are you perhaps using dedup? If so, is it still enabled?
What version of FreeBSD are you running? When, approximately, was the pool created, and mainly under what OS version?
The pool was created in 2017 (
Code:
2017-08-02.20:49:11 zpool create astral /dev/ada2
), probably on FreeBSD 10.something?

I don't recall modifying any properties or enabling dedup, though.
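For anyone who wants to double-check rather than rely on memory, these are quick to run (pool name taken from the zpool history output above):
Code:
zfs get dedup,compression,sync astral   # per-dataset properties; dedup defaults to "off"
zpool get dedupratio astral             # 1.00x means no deduplicated data on the pool
zpool history astral | grep -i dedup    # any "zfs set dedup=..." since creation shows up here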
 
All,

thanks again for your patient help! I've possibly learned some new things, and had some old lessons reinforced, with your help.
I think I've approached this problem incorrectly - I have been letting the following command run for long periods of time:
# truss -o ocd-delete.txt rm -rvx ocd-data/

but I've just noticed (after too many hours) that this process doesn't delete files as it appends them to the `ocd-delete.txt` file - it waits until all files are accumulated(?)/touched by the `rm` process, and then they are `unlinked`. My mistake.

So, I'm instead moving the important work off of this filesystem and I'll just nuke it from orbit and re-create the filesystem.
 
So, I'm instead moving the important work off of this filesystem and I'll just nuke it from orbit and re-create the filesystem.

Noooo

Don't give in. For all we know, the delete would work if you executed it on a server with, say, 256 GB RAM.

If you don't have a spare computer with that much RAM, there is an easy way to do it: rent an AWS EC2 machine with 256 GB, export the disks the pool is on via iSCSI, mount that on the EC2 server, and there you go. Easy peasy.
 
Noooo

Don't give in. For all we know, the delete would work if you executed it on a server with, say, 256 GB RAM.

If you don't have a spare computer with that much RAM, there is an easy way to do it: rent an AWS EC2 machine with 256 GB, export the disks the pool is on via iSCSI, mount that on the EC2 server, and there you go. Easy peasy.
🤣
 
# truss -o ocd-delete.txt rm -rvx ocd-data/

but I've just noticed (after too many hours) that this process doesn't delete files as it appends them to the `ocd-delete.txt` file - it waits until all files are accumulated(?)/touched by the `rm` process, and then they are `unlinked`. My mistake.
It’s truss that is appending to your ocd-delete.txt file as it traces what syscalls rm makes. You should read truss and rm manpages carefully. Note that rm -v will output names as it deletes them. See my last reply. Sigh.
 
It’s truss that is appending to your ocd-delete.txt file as it traces what syscalls rm makes. You should read truss and rm manpages carefully. Note that rm -v will output names as it deletes them. See my last reply. Sigh.
Hi bakul - don't sigh :). rm -rvx ocd-data/ would sit and sit and sit and never print a thing: I don't know if the problem was buffering the list, or something else. rm itself never showed excessive RAM usage in top/htop, but it also never gave any sign that it was actually deleting files. Sadly, I think in this particular case of "mistake", the right approach was to move the important files off the drive, delete the dataset, and recreate it. The original delete-all-the-files approach would have taken me another week - the ZFS fix took about 30 minutes, counting the time it took to move files around with rsync.

In any case, I am really appreciative of your time and thoughtful help! Have a good one!
 
sit and sit and sit and never print a thing
It's impossible to tell, just from reading forum posts, how long you waited before deciding a process will "never end".
In all my experience with FreeBSD, in most cases you either get an error message or the process ends. It may take a long time, maybe even days, but eventually it ends. Then it's either done, or you get a (late) error message. A process simply getting stuck forever, without any sign of life whatsoever, is a very rare exception under FreeBSD. To be clear: I'm not talking about software you add from ports/packages; I'm talking about FreeBSD itself.
This ain't Windows 😉
But I also know there can be circumstances where this is not the case, and errors are not detected and handled correctly in time, depending on many factors. And I don't know your directory, how it came to be this way, or what else might be involved...

I thought about attaching an extra SSD to my machine and doing some timing experiments on directories with 5k, 10k, and 15k files, just to get some values one could extrapolate from, for at least a rough idea of what times to expect for certain operations on >31M files. Going by gut feeling alone, I would say you have to wait several hours until something like ls can even finish on such a large number of files - you are way beyond the "normal default" 10K that ralphbsz mentioned. But that doesn't mean the OS can't handle it at all.
Anyway, one cannot expect the OS to react as quickly with 31M files as with directories containing <=10k files, simply because there is a lot more to handle. Even the fastest hardware working at light speed needs time.

However,
I'm also thinking practically, which means:
if there is no valuable data that needs to be rescued, what is the easiest, quickest solution?
Of course, you already got there on your own: copy the valuable data, wipe the rest clean, and start all over on a clean drive. I would do it exactly the same way.

The point is that there are lessons to be learned from this (I say this because this is an open forum anybody on the internet can read, so don't take it personally):

Maybe you kind of "inherited" this directory. But if you produced it yourself, something went wrong during testing. If, for example, someone wants to log data to files, after a while they look into the directory just to check that things work as intended. In that case it should have attracted attention: "Almost 400 files within 3 minutes! Crap. Something must have gone wrong." Then one would either have checked whether the number of files produced per minute was really what was intended, or thought of a routine that limits the number of files and automatically deletes the old ones.
On the other hand, there are cases where such an amount of data really does need to be saved, e.g. measurements from a technical device, a physics experiment, a data-collecting buoy in the ocean... whatever.
Then one has to think about how to organize that data, since 31M files is nothing any human will ever analyse by hand; it will be processed by computers.
And even if, for whatever reason, there is no way around putting it all into individual files, then those files should certainly not get random names; they should be named by some kind of comprehensible scheme, and preferably be distributed into sensibly named directories... because random file names on 31M files are garbage, no matter what they contain.
🧐🤓🥸😎:beer:😂
 