ZFS rsync - Bad file descriptor

FreeBSD 13.3, though I am not sure whether it is related to the FreeBSD / ZFS version.

Running rsync between two directories on the same ZFS dataset:

rsync --delete -a /home/www/user/example.com/www/ /home/www/user/example.com/dev/

Output:

Code:
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2019/11/.riddle-3-goddesses-723x394.png.UH9aeU": Bad file descriptor (9)
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2024/02/.shantanu-kumar-_CquNNr1744-unsplash.jpg.58uTiq": Bad file descriptor (9)

And the permissions are not the same, for example:

ls -la /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png

Output:

Code:
-rw-------  1 user  user  399841 Mar 16 10:29 /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
-rw-r--r--  1 user  user  399841 Nov 15  2019 /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
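
For background, my simplified understanding of rsync's receiver (a sketch, not rsync's actual source): it writes incoming data to a dot-prefixed temporary file (which matches the `.riddle-3-goddesses-723x394.png.UH9aeU` names in the errors above), then sets times and permissions on that temp file, then renames it into place. If the metadata step fails, the file can be left carrying the temp file's restrictive 0600 mode, which would explain the `-rw-------` above:

```python
import os
import tempfile
import time

def receive_file(dest, data, mtime):
    """Simplified sketch of the write-temp / set-metadata / rename pattern.
    This is an illustration of the strategy, not libpkg or rsync code."""
    d = os.path.dirname(dest) or "."
    # mkstemp creates the temp file with mode 0600, like rsync's
    # dot-prefixed ".name.XXXXXX" temp files.
    fd, tmp = tempfile.mkstemp(prefix="." + os.path.basename(dest) + ".", dir=d)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    # If either metadata step below fails (e.g. "failed to set times"),
    # the destination can end up with the temp file's 0600 mode.
    os.utime(tmp, (mtime, mtime))
    os.chmod(tmp, 0o644)
    os.rename(tmp, dest)

receive_file("/tmp/demo.txt", b"hello", time.time())
```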

zpool status shows no issues.

Any idea what is happening and why it reports a bad file descriptor?
 
I hope we get an errata notice soon. I use rsync to clone websites but more importantly I use it between ZFS datasets to copy my data and then snapshot it.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?
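
To make that concrete, here are the classic user-space ways to earn EBADF, in a minimal Python sketch (illustration only, nothing rsync-specific):

```python
import errno
import os

# Case 1: using a descriptor after it has been closed.
fd = os.open("/tmp", os.O_RDONLY)
os.close(fd)
try:
    os.fstat(fd)                      # stale descriptor
    raise AssertionError("expected EBADF")
except OSError as e:
    assert e.errno == errno.EBADF

# Case 2: writing to a descriptor that is not open for writing
# (POSIX specifies EBADF for write() in that case).
r, w = os.pipe()
try:
    os.write(r, b"x")                 # r is the read end of the pipe
    raise AssertionError("expected EBADF")
except OSError as e:
    assert e.errno == errno.EBADF
os.close(r)
os.close(w)

# Case 3: a number that was never a descriptor at all.
try:
    os.fstat(10**6)
    raise AssertionError("expected EBADF")
except OSError as e:
    assert e.errno == errno.EBADF
```

All three are bugs in the caller, which is exactly why EBADF from a well-tested tool like rsync points at something below it in the stack.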
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue?
I guess the guy currently offering these patches might be able to explain this. Fact is, I've seen them (inside poudriere builds) after upgrading only the kernel to 13.3.

My naive guess would be that there's some kind of "give up" code path in the kernel which then somehow invalidates the fd without the userspace application ever getting a chance to know about it.

edit -- I've also seen sudden EBADF when testing some 9pfs client implementation on -CURRENT, so sure, bugs in the kernel (vfs or concrete fs) can trigger that. The question remains how it can be related to a performance issue, well, see my guess above ;-)
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?

No, I was wondering the same thing.

It might be a bug that is still lurking, now covered up by a performance fix.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because
"usually". Now we would need to stroll through the kernel code and see for which kinds of exceptions this errno is returned...
The activity around PR 275594 concerns the inode cache, which is not all that far away from (bad) file descriptors.
What happens (simplified, and as far as I currently understand it) is that the kernel decides the ZFS vnodes take up too much kernel memory relative to other things and then triggers arc_prune jobs. For some strange reason it triggers lots and lots of them, and they cannot accomplish anything because the vnodes are locked, while these arc_prune jobs themselves now lock lots of other things.

Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore was never noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.
 
Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore was never noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.

Exactly. I myself am not aware of many situations where the kernel "gives up" on a non-networked file descriptor without a specific error such as a disk hardware error. Even on NFS it should just hang indefinitely.
 
I know this is a relatively old thread, but fwiw I'm still seeing the same behavior in a shiny new 14.1-RELEASE-p5 system. A poudriere bulk build died almost right out of the gate while building dependency librsvg2-rust:

Code:
[r141amd64-default-job-02] Extracting rust-1.79.0_1: ...
pkg-static: Fail to chown /html/core/arch/x86_64/fn._mm256_set_m128.html:Bad file descriptor

PR 281749
Haven't had a chance to see how repeatable this is yet. In my experience, it was pretty random in 13.x.
 
Per your comment in the PR, yes obviously the problem is triggered by pkg-static rather than poudriere itself. It seems to me that either
  1. pkg-static is multi-threaded (is it?) and some race is causing a descriptor to be closed before fchownat() in libpkg/pkg_add.c tries to act on it, or
  2. There's an issue down in the kernel code invoked by fchownat()
I know which way I'd bet, but that doesn't mean anything.
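
A deterministic stand-in for hypothesis 1 (hypothetical, not libpkg's actual code): if the directory descriptor that fchownat() operates relative to gets closed before the call, the call fails with exactly this errno. Python's os.chown with dir_fd maps to fchownat(2):

```python
import errno
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "file.html")
open(path, "w").close()

dirfd = os.open(d, os.O_RDONLY)
os.close(dirfd)   # stand-in for the hypothesized race: fd closed too early

try:
    # os.chown(..., dir_fd=...) calls fchownat(2) under the hood.
    os.chown("file.html", os.getuid(), os.getgid(), dir_fd=dirfd)
    raise AssertionError("expected EBADF")
except OSError as e:
    assert e.errno == errno.EBADF
```

In a real race the close would happen on another thread between open and fchownat, which would make the failure as intermittent as what you're seeing; this sketch just forces the ordering.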

I have repeated the experiment and am now 2 for 2 on failed extracts of the same package but different files within the package. I will add this to the PR.

It might be worth noting that this is occurring on a virtual build machine running under KVM (DigitalOcean). Maybe there's something about this environment that exacerbates the problem. Otherwise, I'm at a loss to explain how everybody else isn't tripping over this. If anyone has theories or explanations, I'm all ears.
 
Thanks. pkg-static is from a port, not base; you can ask for comments to be moved to a different sub-forum.
 