Solved panic: ufs_dirbad

aragats · Apr 15, 2020

This happens when I run pkg upgrade. I've never seen such an error before on any of my FreeBSD boxes:

Code:

panic: ufs_dirbad: /: bad dir ino 18220049 at offset 0: mangled entry

The corresponding screenshot is attached below.
This is a brand-new Dell Precision 7540. The disk is a Micron NVMe that worked in another box of the same model for several months without any issues, and smartctl(8) reports that everything is okay. Rebooting into single-user mode and running fsck(8) doesn't help (it finds only some minor issues).

Thanks for any advice!

ralphbsz · Apr 15, 2020

Reboot, run fsck, hope for the best. You managed to mangle some UFS metadata. How? I have no idea.

Minbari · Apr 15, 2020

You need to clear the inode; it's the only way. (Boot in single-user mode and keep your device mounted read-only.)

 fsdb /dev/adXsYa

fsdb (inum: X)> inode 18220049

fsdb (inum: 18220049)> clri 18220049

fsdb (inum: 18220049)> quit

After that, run fsck as the debugger tells you to, e.g.:
fsck -y -t ufs /dev/ad1s1a
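
For anyone who wants to pull the inode number out of the panic line before starting fsdb, here is a minimal Python sketch. The regular expression only matches the wording of the panic message quoted in this thread, and the helper names `parse_ufs_dirbad` and `fsdb_steps` are made up for illustration. It only prints the fsdb commands listed above; it does not run them.

```python
import re

# Pattern for the panic line quoted in this thread; other kernels or
# versions may word the message differently (assumption).
PANIC_RE = re.compile(
    r"ufs_dirbad: (?P<mount>\S+): bad dir ino (?P<ino>\d+) "
    r"at offset (?P<offset>\d+): (?P<reason>.+)"
)


def parse_ufs_dirbad(line):
    """Return mount point, inode number, offset and reason, or None."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return {
        "mount": m.group("mount"),
        "ino": int(m.group("ino")),
        "offset": int(m.group("offset")),
        "reason": m.group("reason").strip(),
    }


def fsdb_steps(ino):
    """List the fsdb commands from the post above for one inode."""
    return [f"inode {ino}", f"clri {ino}", "quit"]


if __name__ == "__main__":
    msg = "panic: ufs_dirbad: /: bad dir ino 18220049 at offset 0: mangled entry"
    info = parse_ufs_dirbad(msg)
    print(info)
    for step in fsdb_steps(info["ino"]):
        print(step)
```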

aragats · Apr 15, 2020

Thanks, Minbari!
This helped. However, after fsdb I had to run fsck several times with the "force" flag until it fixed the filesystem.
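
A rough sketch of that repeat-until-clean loop in Python, in case it is useful to someone else. It assumes that a zero exit status from fsck means nothing is left to fix, which should be checked against fsck_ffs(8). The `max_passes` limit and the function names are hypothetical, and the device is whatever you pass in.

```python
import subprocess


def run_fsck(device):
    """Run one forced fsck pass and return its exit status."""
    # -f forces a check even if the filesystem is marked clean,
    # -y answers yes to every repair prompt.
    return subprocess.run(["fsck", "-f", "-y", "-t", "ufs", device]).returncode


def fsck_until_clean(device, max_passes=5, runner=run_fsck):
    """Repeat fsck until it exits 0; return the number of passes used.

    Raises RuntimeError if the filesystem is still not clean after
    max_passes (a hypothetical limit, not something from the thread).
    """
    for attempt in range(1, max_passes + 1):
        if runner(device) == 0:
            return attempt
    raise RuntimeError(f"{device} still not clean after {max_passes} passes")
```

Run it from single-user mode as root, e.g. `fsck_until_clean("/dev/ad1s1a")`. The `runner` argument exists only so the loop can be tried out with a stand-in function instead of the real fsck.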

Minbari · Apr 15, 2020

aragats said:
Thanks, Minbari!
This helped. However, after fsdb I had to run fsck several times with the "force" flag until it fixed the filesystem.

I'm glad my tips helped, but keep in mind: one corruption may happen once in a while and means nothing; two are still possible, and you don't have to worry too much; three corruptions are a bad sign, and there are usually more to come.

aragats · Apr 15, 2020

Minbari said:
one corruption may happen once in a while and means nothing, two are still possible but you don't have to worry too much, three corruptions are a bad sign

That's true, but with SSDs it's not clear what to do besides backups, since smartctl(8) reports that everything is fine.
With spinning HDDs you could check every block and estimate how far the read time deviates from the nominal.
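
Something like the following Python sketch is what I mean by checking read times. It reads a device or file sequentially in fixed-size blocks, times each read, and flags blocks that take much longer than the median. The 1 MiB block size and the factor of 10 are arbitrary assumptions, and the OS cache will hide the real device latency on repeated runs.

```python
import statistics
import time


def time_block_reads(path, block_size=1024 * 1024):
    """Read path sequentially and return the seconds spent on each block."""
    timings = []
    # buffering=0 avoids Python's own read-ahead; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            chunk = f.read(block_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            timings.append(elapsed)
    return timings


def slow_blocks(timings, factor=10.0):
    """Return indices of blocks slower than factor times the median."""
    if not timings:
        return []
    limit = statistics.median(timings) * factor
    return [i for i, t in enumerate(timings) if t > limit]
```

Usage would be `slow_blocks(time_block_reads("/dev/adXsYa"))`, run as root against an unmounted or read-only device.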

ralphbsz · Apr 15, 2020

You seem to be assuming that this is an SSD error. I would not assume that at all. More likely, it is operator error, a software bug, memory corruption, a hardware memory error, or a hardware communications error (in roughly descending order of likelihood). The probability of it being silent corruption on the SSD (or disk drive) is very, very low in comparison.

aragats · Apr 15, 2020

ralphbsz said:
You seem to be assuming that this is an SSD error. I would not assume that at all. More likely, it is operator error, a software bug, memory corruption, a hardware memory error, or a hardware communications error (in roughly descending order of likelihood). The probability of it being silent corruption on the SSD (or disk drive) is very, very low in comparison.

Well, I mostly agree with you, but for whatever reason I never saw serious or unrecoverable errors in the many years when I mostly used spinning HDDs with FreeBSD. This is the second time I've seen such a bad error with an SSD. The first time wasn't as bad, but it was still destructive; it happened last year in another box used as both a desktop and a server.

Could it somehow be related to the SSD's much better timing?

ralphbsz · Apr 16, 2020

Interesting hypothesis: the much lower latency of SSDs might be exposing bugs in UFS that are either extremely rare or actually impossible on spinning disks? Possible, but unlikely, because UFS does get used on disk arrays, which tend to have large RAM caches (usually battery-backed) and can therefore have nearly zero latency for writes.

Here is another hypothesis, which is actually backed by (unpublished) data: Did you have power outages? It turns out that the FTL (the Flash Translation Layer, the firmware in the SSD that maps its internal data structures and flash chips to a pretend SATA or SCSI disk) in consumer disks is buggy, and often doesn't handle power outages well. SSDs tend to hold recently written sectors in RAM, and consumer SSDs sometimes don't do a good job of writing those back to flash when a power outage happens. Or they forget to update internal metadata as they go down. So data corruption on SSDs during power outages is a known phenomenon. I don't know whether measured rates have ever been published in a trustworthy place (like a peer-reviewed conference or journal), but I can assure you that the result was pretty ugly about 7 or 8 years ago when I was trying it.

aragats · Apr 16, 2020

ralphbsz said:
Did you have power outages? It turns out that the FTL (the Flash Translation Layer, the firmware in the SSD that maps its internal data structures and flash chips to a pretend SATA or SCSI disk) in consumer disks is buggy, and often doesn't handle power outages well

A day before, Firefox froze with 20-30 tabs open (and the whole X server froze with it). I had to forcefully turn the laptop off. In retrospect, I was interpreting cause and effect the opposite way, i.e. that Firefox died because of UFS defects. Your hypothesis may explain it better. Anyway, after recovery, Firefox forgot many things.