Solved "mount -u -o ro" gives "device busy", but nothing is open for write

1: To have a file open, you need a process. Let's assume (see the comments below) that the process holding a file open for write is a new one started by the pkg command. So run "ps aux" and save the output, run the pkg command, then run "ps aux" again and compare the results. This will be a bit tedious, as there will be lots of changes; you may want to write a small script to do the comparison. Is there a new process left over from the pkg run?

Did that long ago; there are no such processes. And, as I said at the start, fstat shows no write-open files on /usr.

Strip your X setup down as much as possible, then enable things one at a time.

Oh I went much further. I replaced my .xinitrc contents with "sleep 3600", started X, switched back to ttyv0, and ran my test. Which failed as before. I could only go further by starting X manually, but that would only eliminate xinit and a shell. And both of those would have been doing something like waitpid() when I ran the test so it's unlikely that they're the culprit.

Anyway, I seriously doubt that trying to match open/close pairs will do any good because there are no files open for write when the remount returns "device busy", according to fstat.

My working hypothesis is that when X starts it does something dodgy which lies dormant until "pkg add" does something to make /usr seem busy. It does not have to be a file open for write. I took a look at the remount code, and there are a whole bunch of conditions that can return EBUSY. So it would seem that my first order of business is to track down the actual condition.

Is there any reason to think that sticking a bunch of prints into the remount code will cause any problems? That seems like the obvious way to track it down, unless I want to start learning how to use a kernel debugger.
 
OK, I have a clue what's going on. The relevant code is in vfs_subr.c:vflush(). First:

C:
error = VOP_GETATTR(vp, &vattr, td->td_ucred);  /* fetch va_nlink */
VI_LOCK(vp);

/* Skip (release) vnodes that are typeless or still linked,
 * unless they are regular files open for write. */
if ((vp->v_type == VNON ||
    (error == 0 && vattr.va_nlink > 0)) &&
    (vp->v_writecount <= 0 || vp->v_type != VREG)) {
        VOP_UNLOCK(vp);
        vdropl(vp);
        continue;
}

Per my debugging, the type is VREG, v_writecount is 0, and presumably error is 0, so that "if" effectively becomes:

C:
if (vattr.va_nlink > 0) {

and thus doesn't trigger for a deleted file. Second:

C:
if (vp->v_usecount == 0 || (flags & FORCECLOSE)) {
        vgonel(vp);     /* reclaim the vnode */
} else {
        busy++;         /* still referenced: the flush will fail */
}
and v_usecount is, per debugging, not zero. Thus busy is set, preventing the remount.

The culprits are two vnodes. They are for shared libraries memory mapped by the X server that were replaced (and hence deleted) by my "pkg add". It appears, then, that a file system cannot be downgraded if it has any files that have a zero link count and are also memory mapped, regardless of whether the memory mapping allows writing.

This is not documented behavior AFAIK. I can't think why it should be valid behavior. Any thoughts?
 
Your conclusion seems correct. A file that is memory mapped but not for writing should not prevent the downgrade, independent of link count. Suggestion: Kernel mailing list, or PR, or do something like "git blame" and send an e-mail to the last person to work in this area. This could be an (uncommented) workaround to some bizarre limitation of memory mapped files that is not obvious.
EDIT: See below; if the file has a zero link count but is still open (memory mapped, perhaps), the inode has to be deleted upon close, and that can't be done on a read-only mount.
 
This is not documented behavior AFAIK. I can't think why it should be valid behavior. Any thoughts?
That logic (it's actually about open and unlinked files in general, not necessarily memory mapped) is quite old and there is an explanation in the commit that added it: https://svnweb.freebsd.org/base?view=revision&revision=89384
The rationale seems to be UFS centered but it makes sense in general: it's not possible to remove a file on a read-only filesystem, so how to deal with a file that is (1) open, (2) unlinked (when the fs was still r/w), and (3) now getting closed.
This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Great sleuthing, by the way!
 
That logic (it's actually about open and unlinked files in general, not necessarily memory mapped) is quite old and there is an explanation in the commit that added it: https://svnweb.freebsd.org/base?view=revision&revision=89384
The rationale seems to be UFS centered but it makes sense in general: it's not possible to remove a file on a read-only filesystem, so how to deal with a file that is (1) open, (2) unlinked (when the fs was still r/w), and (3) now getting closed.

Well, this is mildly embarrassing. When I saw that commit, I was like, "Oh duh! I vaguely remember a discussion about that!" But we're talking >20 years ago, so I think I'll forgive myself the lapse. :)

Great sleuthing, by the way!

It was fun, and I learned a couple of things along the way. My /usr/src and /usr/ports are compressed /dev/md* files (saves space and greatly speeds up bulk access to those trees). It would have been a pain to undo all that. But I discovered that null mounts work with files as well as directories, so I could instead null mount any files that needed changing. The other was -DKERNFAST on "make kernel" to avoid doing an entire kernel recompile.

So I copied the source files I was debugging and edited the copies, and I had a script that null mounted the copies over the source tree files, did the "make kernel", and unmounted the source files. An entire edit/install/boot/test cycle, minus the actual code reading and insertion of printf()s, was less than 2 minutes. Once I had a reproducible test case, tracking the thing down was inevitable, and took only a few hours.
 
Your conclusion seems correct. A file that is memory mapped but not for writing should not prevent the downgrade, independent of link count. Suggestion: Kernel mailing list, or PR, or do something like "git blame" and send an e-mail to the last person to work in this area. This could be an (uncommented) workaround to some bizarre limitation of memory mapped files that is not obvious.

As it turns out, what I'm seeing is intended behavior. An unlinked but opened file still has an inode assigned to it and when the file is closed, the file system needs to be updated to free up the inode. So an unlinked but opened file is, effectively, opened for write, regardless of what the open mode is. Using -f on the remount has the effect of forcing the on-disk delete of the inode, and then it's legitimate to downgrade to read-only.

I see three reasonable approaches. One would be to add a mount(1) option like -f and a corresponding flag bit to mount(2) that is like FORCECLOSE but which only force-closes files that would otherwise be kept open solely because of the zero link count. The second is to always delete the inode when there is a zero link count. The last is simply to document the current behavior.

In any case, I think we can mark this "solved"; the next step is a PR. Is there a standard way to mark a problem solved, or should I just edit the thread title to reflect what we discovered?
 
… Is there a standard way to mark a problem solved, or should I just edit the thread title to reflect what we discovered?

 
As it turns out, what I'm seeing is intended behavior. An unlinked but opened file still has an inode assigned to it and when the file is closed, the file system needs to be updated to free up the inode. So an unlinked but opened file is, effectively, opened for write, regardless of what the open mode is. Using -f on the remount has the effect of forcing the on-disk delete of the inode, and then it's legitimate to downgrade to read-only.

I see three reasonable approaches. One would be to add a mount(1) option like -f and a corresponding flag bit to mount(2) that is like FORCECLOSE but which only force-closes files that would otherwise be kept open solely because of the zero link count. The second is to always delete the inode when there is a zero link count. The last is simply to document the current behavior.
Given that this is not a common thing to do, I'd say leave the current behavior alone, so as to reduce the potential for footshooting. From mount(2):
Code:
     MNT_FORCE        Force a read-write mount even if the file system appears
                      to be unclean.  Dangerous.  Together with MNT_UPDATE and
                      MNT_RDONLY, specify that the file system is to be
                      forcibly downgraded to a read-only mount even if some
                      files are open for writing.
...
     The flag MNT_UPDATE indicates that the mount command is being applied to
     an already mounted file system.  This allows the mount flags to be
     changed without requiring that the file system be unmounted and
     remounted.  Some file systems may not allow all flags to be changed.  For
     example, many file systems will not allow a change from read-write to
     read-only.
 
Hilariously, I had pulled up just that menu, looked straight at it...and did not see the "Solved" item! Anyway, I've made the change.
 
As it turns out, what I'm seeing is intended behavior. An unlinked but opened file still has an inode assigned to it and when the file is closed, the file system needs to be updated to free up the inode. So an unlinked but opened file is, effectively, opened for write, regardless of what the open mode is.
Thank you for figuring that out, and I updated my post above. This is really a bizarre corner case: The file itself is not open for write. The file itself will never be re-opened for write (because it has no directory entry and can't be found). The problem is that the file's inode needs to be cleared and data blocks de-allocated; logically (from a user-space interface viewpoint) that is a read-only operation, but internally it has to be a write operation.

How to implement a fix? I'm so glad it's not me!

Here is a bit of advice from my (unfortunately recently passed away) colleague, who had decades of experience in FS and DB research: Don't even bother implementing clean shutdown of a system. Just crash it. Because in real life it will crash occasionally anyway, at the latest when power to the server fails. So build your system such that restart from a crash is fast, efficient, and correct. And if you always restart from a crash, that code will be well exercised, optimized, and tested.

In this particular case, this leads to the following implementation choice: If it is inconvenient to clear an inode and the allocated data right now (for example because the file system has become read-only), then just don't do it. Instead, upon restart quickly locate all inodes that have zero link count and clear them at that time (perhaps starting a background task, if slow deletion is fine). The problem with this choice is: you end up implementing part of fsck within the mount operation; and if the metadata isn't organized to find these inodes efficiently, mount might become pretty slow.
 
How to implement a fix? I'm so glad it's not me!
Imagine this implementation: Allow in the root directory of a file system (or some other designated place) links that are not visible to the user. Whenever the last link of an open file is to be removed, create an invisible link to that file before removing what was the last link. By this trick, you guarantee that every open file always has at least one link.

When a file is finally closed, if its inode indicates that it has an invisible link and the file system is mounted r/w, remove the invisible link. If the file system is downgraded, there is no need for special treatment for zero-link open files because they can't exist, and no need to worry about the invisible links because it's harmless to leave them around. Next time the file system is (re)mounted r/w, scan the root directory (or wherever) for these invisible links, removing any for files that are not open. Finally, modify fsck to delete the invisible links.

Aside from implementing these invisible links, this has the advantage of being built with file system primitives that already exist, so it's unlikely that it would break things in strange ways.
 