UFS Why does the kernel crash with dirty filesystems?

MassimoM · Jan 1, 2022

I have a small server with FreeBSD 13.0 installed, and when I try to mount an external hard disk, FreeBSD crashes.
I thought that the / disk could be damaged (maybe the mount directory is on a bad sector, although I ran fsck and smartctl), so I removed the disk, installed a new one and reinstalled FreeBSD, but it happens again.
I ran fsck on the external disk and it is full of errors, but I think this is very bad behaviour from FreeBSD: it should not crash because of an EXTERNAL disk (UFS, too!). It could be "acceptable" with the root partition or swap partition, but not with an external disk.
I attach the picture of the problem.

covacat · Jan 1, 2022

to protect data
mounting readonly won't cause a panic

MassimoM · Jan 1, 2022

covacat said:
to protect data
mounting readonly won't cause a panic

Aborting the mount could "save" the situation.
With a crash, other data could be damaged.

mark_j · Jan 1, 2022

I've always thought it inelegant for a non-root filesystem to crash the kernel, but it seems to have always been that way since UFS1.

Having said that, remember that in "single-user" mode or with a read-only mount, you can use fsdb to try to analyse the problem and find the corrupted directory entry.
You could also use recoverdisk to get data off it, if you need to.
Remember if fsck can't fix it, you'll just crash again... and again... and ...

MassimoM · Jan 1, 2022

mark_j said:
Having said that, remember that in "single-user" mode or with a read-only mount, you can use fsdb to try to analyse the problem and find the corrupted directory entry.
You could also use recoverdisk to get data off it, if you need to.
Remember if fsck can't fix it, you'll just crash again... and again... and ...

Yes, I can repair the problem, but I'm thinking about a main server in a company, running 24/7: the sysadmin plugs a corrupted key into a port (I have a collection of damaged keys), and the server goes down...

kpedersen · Jan 2, 2022

This is labelled as a "BUG" in the mount manpage.

Just make sure you run an fsck (a userland program) on it first if you are worried (the fstab entries by default already do this). The fsck is very fast so it is worth doing.

I suppose it is a fairly difficult problem to solve. The amount of input checking would also slow down usage of the filesystem considerably. My guess is that most operating systems would crash if an invalid filesystem is mounted. I recall Windows 7 freezing on me a few times when trying to read faulty memory sticks.

mark_j · Jan 2, 2022

Perhaps they should change it to "Feature" because the bug has been there a long, long time?

ralphbsz · Jan 2, 2022

In theory, that shouldn't happen. One of the general rules of software construction is: software should never crash when it gets incorrect inputs. It may give error messages and continue, it might give a fatal error message and quit, it should never crash. In particular, the kernel should never crash (although if it can't make progress, for example because the root disk is unavailable, it may give a message and halt). To a file system, the disk is a form of "input". Therefore, your logic is correct: the kernel should not crash because of something bad on disk.

That's the theory. In practice, implementing this is really hard, and a lot of extra work. Bugs of this sort may be tolerable to the authors and maintainers of the code, because they occur very rarely. If you want, you could volunteer to fix these kinds of bugs. But that would be quite difficult.

grahamperrin · Jan 2, 2022

MassimoM said:
/

tunefs -p /

What is reported?

Also, please take the essential hint from what was on screen, near the top of your screenshot.

freebsd-version -kru
uname -aKU

What's reported?

Then, with the other disk at /dev/da0:

geom disk list da0
geom part show da0
fstyp /dev/da0a

Step 3 exactly as written, please: without a number after the letter a

Technical

MassimoM said:
UFS

Around a year ago, I knew how to produce a disk with a detectably mangled entry in a partition that reportedly comprised a UFS file system. This production is not a file system issue.

Mangling of this type did reliably cause the kernel to stop, or panic, at mount time, which is not necessarily a bad thing.

The then version: FreeBSD 13.0-CURRENT #75 main-c572-g82397d791: Sun Jan 3 20:00:09 GMT 2021, <https://cgit.freebsd.org/src/commit/?id=82397d791966b09d344251bc709cd9db2b3a1902> (2021-01-03).

Touch wood, the same volume does not crash FreeBSD 14.0-CURRENT 4bae154fe8c (2021-12-24). More than five thousand commits between those two, <https://github.com/freebsd/freebsd-...02...4bae154fe8c7a84e3062bb0569ea190fc2e83182>

It's possible that this particular bug is fixed, at least in main/head, although I'm almost certain that there remain other edge cases (around ten) in Bugzilla that have not explicitly received attention in recent months.

MassimoM · Jan 2, 2022

kpedersen said:
This is labelled as a "BUG" in the mount manpage.

Just make sure you run an fsck (a userland program) on it first if you are worried (the fstab entries by default already do this). The fsck is very fast so it is worth doing.

I suppose it is a fairly difficult problem to solve. The amount of input checking would also slow down usage of the filesystem considerably. My guess is that most operating systems would crash if an invalid filesystem is mounted. I recall Windows 7 freezing on me a few times when trying to read faulty memory sticks.

It's not a matter of time; as ralphbsz said, wrong input must not crash a program.
I have never heard of similar behaviour in Linux, because bad-input=crash is never accepted in a production-grade environment, especially for the kernel.
I have had similar behaviour in Windows, but never in Linux, Solaris and so on.
I think it is not a mount bug. If a userland application is buggy, the kernel should be smart enough to kill the faulty application without crashing.

MassimoM · Jan 2, 2022

@grahamperrin
Sorry, yesterday after I posted the picture of the problem, I formatted the disk; I needed it for a backup.
I only know that it was made with a previous version of FreeBSD, maybe two years old.

grahamperrin · Jan 2, 2022

Thanks,

MassimoM said:
… a userland application …

I guess, the effect of a bug at that level will be quite different from:

a bug that involves the power to mount something at a point within a file system that more broadly contains the operating system

– with great power comes great responsibility (see below) …

grahamperrin said:
… edge cases (around ten) in Bugzilla …

135690 – [panic] [ata] ufs_dirbad: /backuphd: bad dir ino 22259126 at offset 0: mangled entry
228555 – panic: ufs_dirbad: /home: bad dir ino 449428 at offset 1024: mangled entry
244342 – [1] Kernel panic observed while plugging the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121
244344 – [2] Kernel panic observed while plugging the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121
244346 – [3] [Kernel panic: vm_fault_lookup: fault on nofault entry, addr: 0xfffffe0032000000] observed while plugging the UFS USB drive on FreeBSD13-CURRENT
244348 – [4] Kernel panic observed while plugging the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121
244349 – [5] [Kernel panic: wrong length 34560 for sectorsize 512] observed while plugging the UFS USB drive on FreeBSD 13-CURRENT
244350 – [6] [Kernel panic: getblk: size(75776) > maxbcachebuf(65536)] observed while mouting the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121
244351 – [7] Kernel panic observed while plugging the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121
244352 – [8] [Kernel panic: ufs_dirbad: /mnt/test: bad dir ino 2 at offset 154: mangled entry] observed while mouting the UFS USB drive on FreeBSD13-CURRENT, FreeBSD 12.1-RELEASE r354233 and FreeBSD 12.1-STABLE r358121

grahamperrin said:
… a disk with a detectably mangled entry in a partition that reportedly comprised a UFS file system. …

With great power comes great responsibility!

Around a year ago: to the best of my recollection, I irresponsibly allowed an automated mount routine to coincide with creation of a file system, on an 8 G USB flash drive that previously held a different type of file system. A kernel panic occurred.

Today: for giggles, I allowed the file system checker to perform repairs. More than ten million lines of output from fsck_ffs(8), so much that it was impossible to save a log of the session in its entirety. The part of the log that was saved is around 240 MiB. Pretty cool, for a repair to a file system to which no file was ever written.

The background to the panic – I simply should not have allowed auto-mounting at the time – is ridiculous enough for me to not add a separate bug to Bugzilla. And I stretch the truth with my description of the file system as one to which no file was ever written, because (I guess) creation of the file system was interrupted by a kernel panic.

grahamperrin · Jan 2, 2022

MassimoM

Your server

grahamperrin said:
tunefs -p /

What is reported?

– depending on your answer, there might be important advice. Something simple, not technobabble

UFS generally

If you subscribe to this one bug, you'll be notified whenever a change occurs to any of the eight or more linked bugs that involve kernel panics:

244384 – UFS fuzz metabug

There's no lack of attention to development of fsck_ffs and UFS. See for example:

<https://cgit.freebsd.org/src/log/?qt=grep&q=fsck_ffs>
<https://cgit.freebsd.org/src/log/?qt=grep&q=UFS> with two commits within the past fifteen minutes

kpedersen · Jan 2, 2022

MassimoM said:
It's not a matter of time; as ralphbsz said, wrong input must not crash a program.
I have never heard of similar behaviour in Linux, because bad-input=crash is never accepted in a production-grade environment, especially for the kernel.
I have had similar behaviour in Windows, but never in Linux, Solaris and so on.

Yes of course but like I said, it is a difficult problem to solve (evidently, or you would not be experiencing issues!). Kernel panics happen (including on Linux and commercial UNIX) where there is faulty code and due to the complexity of filesystems and the almost infinite combination of input, this is a prime candidate. And again, like I said, most operating systems have flaws and edge cases in their filesystem drivers. Programmers are human.

So your best option would be to check the filesystem with fsck and fix it *before* you attempt to mount it. Then you can sidestep the issue entirely.

If you want to be purposely mounting broken filesystems in a production environment(?) then in theory FUSE as a userspace filesystem driver (e.g. ntfs-3g) should be much safer against crashes, but ironically I have had more lockups using that (probably because NTFS is even more complex and was undocumented when ntfs-3g was written).

(Back in the day, FreeBSD (5.x?) used to crash when a USB stick was pulled out without unmounting first. So it is ultimately improving.)

MassimoM said:
I think it is not a mount bug. If a userland application is buggy, the kernel should be smart enough to kill the faulty application without crashing.

Just to clarify; it isn't a bug with mount as a program. This simply tells the kernel what to do. The lockup is due to issues much deeper in the kernel rather than the userland mount util.

mark_j · Jan 2, 2022

MassimoM said:
It's not a matter of time; as ralphbsz said, wrong input must not crash a program.
I have never heard of similar behaviour in Linux, because bad-input=crash is never accepted in a production-grade environment, especially for the kernel.
I have had similar behaviour in Windows, but never in Linux, Solaris and so on.
I think it is not a mount bug. If a userland application is buggy, the kernel should be smart enough to kill the faulty application without crashing.

Perhaps it's time to log this bug into the system?
The bug manifests in ufs_lookup.c (ufs_lookup_ino() in the crash) and seems to just continue to search for a "good" directory offset when it encounters a "mangled" one.
Why it then crashes the OS is up to someone like Kirk who obviously knows this inside-out.
It's a pity you don't have the dubious USB and its file system to image and provide with a bug report.

covacat · Jan 2, 2022

the panic is not a bug per se, it's rather a design decision (possibly not the best one)
ufs_dirbad() just bombs if the fs is mounted rw

mark_j · Jan 2, 2022

It's, on first glance, just a lazy thing to try to find the offset for the directory and eventually just give up in a panic.
I sort of understand this on an active disk, but on a mount? It's just lazy and a poor "design decision".
Again, as I said, I don't understand UFS to that degree, but surely a flag with "mounting" as an attribute would be coded as (in pseudo-code):

if mp->mount_in_progress then
"Disk is screwed, mount as r/o and run fsck"
exit
end

I still think it's a bug, that has morphed into a "Feature". Perhaps because ZFS is now the go-to FS for business, they don't care; or care even less than before when the "design decision" was made to panic.
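The pseudo-code above could be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: struct demo_mount, enum demo_action and demo_on_mangled_entry() are hypothetical names invented for the example, not FreeBSD kernel structures or functions, and the real UFS code may not be able to make this decision so simply.

```c
#include <stdbool.h>

/* Hypothetical, simplified mount state; NOT FreeBSD's struct mount. */
struct demo_mount {
    bool mount_in_progress;  /* true while the mount request is still running */
    bool read_only;          /* true when mounted (or downgraded to) r/o */
};

/* What the caller should do after finding a mangled directory entry. */
enum demo_action {
    DEMO_LOG_ONLY,    /* already r/o: report the problem and carry on */
    DEMO_DOWNGRADE,   /* still mounting: fall back to r/o and ask for fsck */
    DEMO_PANIC        /* active r/w file system: the behaviour discussed here */
};

/* Decide how to react; downgrades the mount to r/o when it is still in progress. */
enum demo_action
demo_on_mangled_entry(struct demo_mount *mp)
{
    if (mp->read_only)
        return DEMO_LOG_ONLY;
    if (mp->mount_in_progress) {
        mp->read_only = true;   /* "Disk is screwed, mount as r/o and run fsck" */
        return DEMO_DOWNGRADE;
    }
    return DEMO_PANIC;
}
```

A caller would check the returned action and only stop the system for DEMO_PANIC; the other two cases leave the machine running with the suspect file system read-only.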

The.Silicon.Projects · Jan 3, 2022

I use UFS exclusively, and have for as many years as I can remember. I encountered many crashes and complete losses of the OS with internal UFS drives, long before switching to "gjournal", which is far more effective and secure than regular UFS journaling (but slower, and not required with a high-grade server-class controller; the problem is the very cheap integrated controllers found on consumer-range workstations). But I have never crashed a system by connecting a USB UFS drive, even a dirty one.

Same thing with Windows...

Except in some situations.

This most commonly happens ON EVERY KIND OF OS (including Linux) when an external drive has a power issue.
If the power delivered to the external drive is not sufficient, this can make any filesystem dirty, and can lead to a system crash.
The crash is not directly caused by the dirty filesystem, but by a hardware issue.

Typically, connecting a USB drive is the critical point: the hardware requires 2 to 3 times the nominal power to start the drive.
Power consumption then drops. There can be further power peaks (lower than at start-up) when a rotating drive is seeking, or when an SSD is writing a big "sequence" of data.

Also note that if you use an SSD in an external USB enclosure, this kind of drive requires far more power than a USB key, and in many cases more power than a classical rotating drive.
Compared to a USB key, an SSD reaches high write speeds by "heating" the cells to allow "on the fly" changes of state. So "heating" requires high power.
In a read-only workload, an SSD should theoretically consume less than a mechanical drive, but even this is not guaranteed, as most modern 5400 rpm drives have relatively low consumption (this is not the case for 7200 / 10000 / 15000 rpm drives).

Similarly, if the external drive has some "severe hardware issues", most likely an issue on the electronic board, this can also create some power losses or very weird things that confuse the OS driver.

This is my experience. A good USB drive with enough power should never crash any recent OS, even if the filesystem is dirty and even if the OS is Windows.

I have been encountering a power issue since I recently installed an NVMe drive in an old notebook and transferred the OS to it.
The system crashes from time to time, or locks up at start, under Windows AND Linux, even though I use a rolling-release openSUSE, so with all the latest hardware fixes.

Using the highest level of journaling (data=journal instead of the distribution default data=ordered) on ext4 saves my data when a crash occurs. Before, I lost a full Linux OS, and was not happy at all with this fucking default value.

No issue with Windows, NTFS is setup by default with a high journaling level.

The issue is a bullshit laptop with a notorious power issue on the NVMe port. Some NVMe drives may work better if the power required is a little lower. Mine is a Kingston A1000, known to cause many power management issues on Linux and Windows. Next time I will buy Samsung exclusively.

This is more or less stabilized on my system by setting some kernel options under Linux, but it is not fully stable.

grahamperrin · Jan 3, 2022

Pause for thought … ▼

covacat said:
… ufs_dirbad() just bombs if the fs is mounted rw

Is this true for FreeBSD 14.0-CURRENT?

Rewind around one year, to 5th January 2021:

grahamperrin said:
The then version: FreeBSD 13.0-CURRENT #75 main-c572-g82397d791: Sun Jan 3 20:00:09 GMT 2021,

Then: <https://pastebin.com/raw/ErBDidYG> with Panic String: ufs_dirbad: /media/da1p1: bad dir ino 2 at offset 0: mangled entry in four places on the page.

Now, with added emphasis:

grahamperrin said:
Touch wood, the same volume does not crash FreeBSD 14.0-CURRENT

▲ this.

covacat · Jan 3, 2022

the code looks the same to me in -CURRENT.
if it reaches the "mangled entry" it will panic unless fs is r/o

ralphbsz · Jan 3, 2022

MassimoM said:
I have never heard of similar behaviour in Linux, ...

Happens all the time: OS crashes due to Linux file systems not handling corrupted on-disk data structures, or IO errors. It happens somewhat frequently in the IO stack; disk errors will regularly cause kernel crashes. It is pretty rare in ext2/3/4 (but not unheard of). It is quite common in other file systems, in particular the ones that have less professional (paid) developer support. I'm not going to talk about commercial (non-free) file systems that run on Linux, because I'd either have to boast or incriminate myself.

if a userland application is buggy, the kernel should be smart enough to kill the faulty application without crashing.

This problem is not a userland problem ... the crash happened inside the kernel. But it shouldn't happen there either. And to be clear: A mount error on the root file system is a different story; that has every right to (cleanly) panic the kernel.

grahamperrin · Jan 3, 2022

Thanks covacat

Afterthought (sorry for previously omitting this detail):

% uname -aKU
FreeBSD mowa219-gjp4-8570p-freebsd 14.0-CURRENT FreeBSD 14.0-CURRENT #118 main-n251923-4bae154fe8c: Sat Dec 25 08:03:37 GMT 2021     root@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG  amd64 1400045 1400045
%

– is my GENERIC-NODEBUG kernel less likely to panic with CURRENT?

(A year ago, with CURRENT, GENERIC-NODEBUG was not my habit.)

covacat · Jan 3, 2022

in theory and probably in practice a -current kernel is more likely to panic than a -release one (not for the same reasons though)
as for ufs/ffs i'm by no means an expert
also i don't have UFS on any busy fs

covacat · Jan 3, 2022

seems that netbsd/openbsd/dragonflybsd all panic on ufs_dirbad unless fs is r/o
netbsd's code looks the coolest

Code:

void
ufs_dirbad(struct inode *ip, doff_t offset, const char *how)
{
    struct mount *mp = ITOV(ip)->v_mount;
    void (*p)(const char  *, ...) __printflike(1, 2) =
        (mp->mnt_flag & MNT_RDONLY) == 0 ? panic : printf;

    (*p)("%s: bad dir ino %ju at offset %d: %s\n",
        mp->mnt_stat.f_mntonname, (uintmax_t)ip->i_number,
        offset, how);
}

plata o plomo ("silver or lead")
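For anyone who wants to play with that function-pointer trick outside a kernel, here is a minimal userland sketch of the same pattern. Everything in it is an assumption made for illustration: DEMO_RDONLY stands in for MNT_RDONLY, and demo_panic() and demo_warn() are hypothetical stand-ins that only record the message, so nothing actually halts.

```c
#include <stdarg.h>
#include <stdio.h>

#define DEMO_RDONLY 0x1u      /* hypothetical stand-in for MNT_RDONLY */

char demo_log[256];           /* last message that was reported */
int  demo_fatal;              /* 1 if the "panic" path was taken, else 0 */

/* Stand-in for the kernel's printf(): record the message and carry on. */
void
demo_warn(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(demo_log, sizeof(demo_log), fmt, ap);
    va_end(ap);
    demo_fatal = 0;
}

/* Stand-in for panic(): record the message and flag it as fatal. */
void
demo_panic(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(demo_log, sizeof(demo_log), fmt, ap);
    va_end(ap);
    demo_fatal = 1;           /* a real kernel would stop here */
}

/* Pick the reporter from the mount flags, then call it through the pointer. */
void
demo_dirbad(unsigned flags, const char *mntpoint, unsigned long ino,
    int offset, const char *how)
{
    void (*report)(const char *, ...) =
        (flags & DEMO_RDONLY) == 0 ? demo_panic : demo_warn;

    report("%s: bad dir ino %lu at offset %d: %s",
        mntpoint, ino, offset, how);
}
```

Calling demo_dirbad(0, ...) takes the "panic" path and sets demo_fatal, while demo_dirbad(DEMO_RDONLY, ...) only records the message, which mirrors the r/w versus r/o split described in this thread.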

grahamperrin · Jan 3, 2022

grahamperrin said:
mount something at a point within a file system that more broadly contains the operating system

In retrospect, that smells like me talking out of my arse.

Maybe I meant something like this:

mount a file system at a point within a directory structure that more broadly contains the operating system

I dunno. It's past midnight, which is rarely an excuse for me to stop writing, but I'll take that excuse in a few minutes, and stop embarrassing myself