ZFS rsync - Bad file descriptor

FreeBSD 13.3, though I am not sure whether it's related to the FreeBSD / ZFS version.

Doing an rsync between two directories on the same ZFS dataset:

rsync --delete -a /home/www/user/example.com/www/ /home/www/user/example.com/dev/

Output:

Code:
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2019/11/.riddle-3-goddesses-723x394.png.UH9aeU": Bad file descriptor (9)
rsync: [receiver] failed to set times on "/home/www/user/example.com/dev/wp-content/uploads/2024/02/.shantanu-kumar-_CquNNr1744-unsplash.jpg.58uTiq": Bad file descriptor (9)
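For context, the hidden dotted names in those errors come from rsync's receiver strategy: it writes the transferred data into a temporary file next to the destination, copies attributes such as the modification time onto it, and then renames it into place. A simplified sketch of that sequence (my own illustration with made-up paths under /tmp, not rsync's actual code):

```shell
# Simplified sketch of the rsync receiver: write to a hidden temp file,
# copy attributes while it still exists under the temp name, then rename.
mkdir -p /tmp/rsync-demo && cd /tmp/rsync-demo
printf 'payload' > source.png          # stand-in for the source file
tmp=$(mktemp .source.png.XXXXXX)       # rsync uses a random suffix like .UH9aeU
cp source.png "$tmp"                   # receiver writes the data here first
touch -r source.png "$tmp"             # "failed to set times" is this step failing
mv "$tmp" dest.png                     # rename into place
```

The "failed to set times" message above corresponds to the attribute-copying step failing on the still-dotted temp file.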

And the permissions are not the same, for example:

ls -la /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png

Output:

Code:
-rw-------  1 user  user  399841 Mar 16 10:29 /home/www/user/example.com/dev/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
-rw-r--r--  1 user  user  399841 Nov 15  2019 /home/www/user/example.com/www/wp-content/uploads/2019/11/riddle-3-goddesses-723x394.png
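To re-check the whole tree rather than single files, a hypothetical helper (not from the thread; uses only ls and find, shown here on made-up /tmp paths) that lists files whose permission bits differ between the source and the copy:

```shell
# Hypothetical check (illustrative /tmp paths): list files whose permission
# bits differ between a source tree and its rsync'ed copy.
mkdir -p /tmp/perm-src /tmp/perm-dst
printf 'x' > /tmp/perm-src/a
printf 'x' > /tmp/perm-dst/a
chmod 644 /tmp/perm-src/a      # source: -rw-r--r--
chmod 600 /tmp/perm-dst/a      # copy:   -rw------- (like the mismatch above)
cd /tmp/perm-src
find . -type f | while read -r f; do
  src_mode=$(ls -l "$f" | cut -c1-10)
  dst_mode=$(ls -l "/tmp/perm-dst/$f" | cut -c1-10)
  [ "$src_mode" = "$dst_mode" ] || echo "differs: $f ($src_mode vs $dst_mode)"
done
```

With -a, re-running the rsync should normally bring the modes back in sync, so any files this still flags afterwards point at the same failure.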

zpool status shows no issues.

Any idea what is happening here and why it reports a bad file descriptor?
 
I hope we get an errata notice soon. I use rsync to clone websites, but more importantly I use it between ZFS datasets to copy my data and then snapshot it.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?
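To illustrate the "userspace bug" case described above, here is a minimal shell sketch (made-up /tmp paths) of the classic way to get EBADF: writing to a file descriptor that has already been closed.

```shell
# Classic userspace EBADF: write to a descriptor that was already closed.
exec 3>/tmp/ebadf-demo.txt           # open fd 3 for writing
exec 3>&-                            # close it again
if echo test 2>/tmp/ebadf-err.txt >&3; then
  status=ok
else
  status=ebadf                       # shell reports "Bad file descriptor"
fi
echo "$status"
```

The puzzle in this thread is that rsync is not doing anything like this; the descriptor is invalidated underneath it.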
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue?
I guess the guy currently offering these patches might be able to explain it. Fact is, I've seen these errors (inside poudriere builds) after upgrading only the kernel to 13.3.

My naive guess would be that there's some kind of "give up" code path in the kernel which then somehow invalidates the fd without the userspace application ever getting a chance to know about it.

edit -- I've also seen sudden EBADF when testing some 9pfs client implementation on -CURRENT, so sure, bugs in the kernel (vfs or concrete fs) can trigger that. The question remains how it can be related to a performance issue, well, see my guess above ;-)
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because of a bug in the user-space program (in this example rsync), for example using a random integer as a file descriptor number, or reading from a file after it is closed, or trying to write to a file that is opened readonly. How does this relate to performance issues?

No, I was wondering the same thing.

It might be a bug that is still lurking, now covered up by a performance fix.
 
Can someone explain to me how a set of performance issues turns into "bad file descriptor", which is a correctness issue? In my understanding, EBADF usually happens because
"Usually". Now we would need to stroll through the kernel code and see in which kinds of error paths this errno is returned...
The party around PR 275594 concerns the inode cache, which is not all that far away from (bad) file descriptors.
What happens (simplified, and as far as I currently understand it) is that the kernel decides the ZFS vnodes take up too much kernel memory in relation to other things and then triggers arc_prune jobs, for some strange reason lots and lots of them. These cannot accomplish anything because the vnodes are locked, but the arc_prune jobs themselves now lock lots of other things.

Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore never got noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.
 
Probably this EBADF is a secondary bug which only gets triggered in this situation, and therefore never got noticed before. We all know there are lots of such flaws lurking, as nobody can do full logical verification.

Exactly. I myself am not aware of many situations where the kernel "gives up" on a non-networked file descriptor without a specific error such as a disk hardware error. Even NFS should just hang indefinitely.
 