ZFS rsync - Bad file descriptor

FreeBSD 13.3, though I am not sure whether it's related to the FreeBSD / ZFS version.

Doing an rsync between two directories on the same ZFS dataset:

rsync --delete -a /home/www/user/example.com/www/ /home/www/user/example.com/dev/

Output:

Code:
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2019/11/.riddle-3-goddesses-723x394.png.UH9aeU": Bad file descriptor (9)
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2024/02/.shantanu-kumar-_CquNNr1744-unsplash.jpg.58uTiq": Bad file descriptor (9)
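For context, the hidden dotted names in those errors come from rsync's receiver strategy: it writes the transferred data into a temporary file next to the destination, copies attributes such as the modification time onto it, and then renames it into place. A simplified sketch of that sequence (my own illustration with made-up paths under /tmp, not rsync's actual code):

```shell
# Simplified sketch of the rsync receiver: write to a hidden temp file,
# copy attributes while it still exists under the temp name, then rename.
mkdir -p /tmp/rsync-demo && cd /tmp/rsync-demo
printf 'payload' > source.png          # stand-in for the source file
tmp=$(mktemp .source.png.XXXXXX)       # rsync uses a random suffix like .UH9aeU
cp source.png "$tmp"                   # receiver writes the data here first
touch -r source.png "$tmp"             # "failed to set times" is this step failing
mv "$tmp" dest.png                     # rename into place
```

The "failed to set times" message above corresponds to the attribute-copying step failing on the still-dotted temp file.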

And the permissions are not the same, for example:

ls -la /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png

Output:

Code:
-rw-------  1 user  user  399841 Mar 16 10:29 /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
-rw-r--r--  1 user  user  399841 Nov 15  2019 /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
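To re-check the whole tree rather than single files, a hypothetical helper (not from the thread; uses only ls and find, shown here on made-up /tmp paths) that lists files whose permission bits differ between the source and the copy:

```shell
# Hypothetical check (illustrative /tmp paths): list files whose permission
# bits differ between a source tree and its rsync'ed copy.
mkdir -p /tmp/perm-src /tmp/perm-dst
printf 'x' > /tmp/perm-src/a
printf 'x' > /tmp/perm-dst/a
chmod 644 /tmp/perm-src/a      # source: -rw-r--r--
chmod 600 /tmp/perm-dst/a      # copy:   -rw------- (like the mismatch above)
cd /tmp/perm-src
find . -type f | while read -r f; do
  src_mode=$(ls -l "$f" | cut -c1-10)
  dst_mode=$(ls -l "/tmp/perm-dst/$f" | cut -c1-10)
  [ "$src_mode" = "$dst_mode" ] || echo "differs: $f ($src_mode vs $dst_mode)"
done
```

With -a, re-running the rsync should normally bring the modes back in sync, so any files this still flags afterwards point at the same failure.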

zpool status shows no issues.

Any idea what is happening here and why it reports a bad file descriptor?
 
I hope we get an errata notice soon. I use rsync to clone websites, but more importantly I use it between ZFS datasets to copy my data and then snapshot it.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?
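To illustrate the "userspace bug" case described above, here is a minimal shell sketch (made-up /tmp paths) of the classic way to get EBADF: writing to a file descriptor that has already been closed.

```shell
# Classic userspace EBADF: write to a descriptor that was already closed.
exec 3>/tmp/ebadf-demo.txt           # open fd 3 for writing
exec 3>&-                            # close it again
if echo test 2>/tmp/ebadf-err.txt >&3; then
  status=ok
else
  status=ebadf                       # shell reports "Bad file descriptor"
fi
echo "$status"
```

The puzzle in this thread is that rsync is not doing anything like this; the descriptor is invalidated underneath it.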
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue?
I guess the guy currently offering these patches might be able to explain it. Fact is, I've seen these errors (inside poudriere builds) after upgrading only the kernel to 13.3.

My naive guess would be that there's some kind of "give up" code path in the kernel which then somehow invalidates the fd without the userspace application ever getting a chance to know about it.

edit -- I've also seen sudden EBADF when testing some 9pfs client implementation on -CURRENT, so sure, bugs in the kernel (vfs or concrete fs) can trigger that. The question remains how it can be related to a performance issue, well, see my guess above ;-)
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?

No, I was wondering the same thing.

It might be a bug that is still lurking, now covered up by a performance fix.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because
"Usually". Now we would need to stroll through the kernel code and see in which kinds of error paths this errno is returned...
The party around PR 275594 concerns the inode cache, which is not all that far away from (bad) file descriptors.
What happens (simplified, and as far as I currently understand it) is that the kernel decides the ZFS vnodes take up too much kernel memory in relation to other things and then triggers arc_prune jobs, for some strange reason lots and lots of them. These cannot accomplish anything because the vnodes are locked, but the arc_prune jobs themselves now lock lots of other things.

Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore never got noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.
 
Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore never got noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.

Exactly. I myself am not aware of many situations where the kernel "gives up" on a non-networked file descriptor without a specific error such as a disk hardware error. Even NFS should just hang indefinitely.
 