UFS fsck on big filesystems (>100TB)

Hi, everyone.

I have several servers with large UFS partitions (>100TB).
Physically it is RAID-6 on MegaRaid 9361.

If I try to run fsck on them, I get the same error on all servers:
Code:
** Phase 5 - Check Cyl groups
fsck_ffs: inoinfo: inumber 18446744071562087424 out of range

The inumber is exactly the same on all servers.
Tested on clean and dirty filesystems with the same result.

And without a successful run of fsck I can't remove the dirty flag from the filesystem.

Are there any limits with UFS and fsck?
Maybe I'm doing something wrong?

Code:
# uname -a
FreeBSD *** 12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC  amd64

# gpart show mfid3
=> 40  250031243184  mfid3  GPT  (116T)
   40  250031243184      1  freebsd-ufs  (116T)

# df -i /mnt/data2
Filesystem      Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/mfid3p1    116T     15T     91T    14%    300k  3.9G    0%   /mnt/data2

# tunefs -p /dev/mfid3p1
tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       enabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            8336
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)
 
That inode number is 0xFFFF FFFF 8000 4C00. That's negative in 64 bits. I don't think inode numbers should ever go negative. And in my source tree, ino_t is defined as uint32, so the largest inode number is ~4 billion, and they can't go negative. I suspect you have found a bug. It's not clear that the bug is in fsck, though; a walk of the on-disk data structures might make this clearer. My suspicion is that someone is taking an inode number, treating it as an int, sign-extending it to a native int type, and ending up with this ridiculous number.
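
To make the sign-extension theory concrete, here is a toy example (just an illustration of the bug class, not the actual fsck_ffs code path): take an inode number above 2^31, accidentally push it through a signed 32-bit type, and you get exactly the number from the error message.
Code:
/* Toy demonstration of the suspected bug class; not the actual fsck_ffs code. */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t ino32 = 0x80004C00;		/* an inode number above 2^31 */
	int64_t  bad   = (int32_t)ino32;	/* accidental signed conversion: sign-extends */
	uint64_t good  = ino32;			/* correct zero extension */

	printf("sign-extended: %ju\n", (uintmax_t)(uint64_t)bad);	/* 18446744071562087424 */
	printf("zero-extended: %ju\n", (uintmax_t)good);		/* 2147503104 */
	return (0);
}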

I like Olli's advice of contacting developers via the mailing list.
 
That inode number is 0xFFFF FFFF 8000 4C00. That's negative in 64 bits. I don't think inode numbers should ever go negative. And in my source tree, ino_t is defined as uint32, so the largest inode number is ~4 billion, and they can't go negative.
Not sure how current your source tree is, but inode numbers were extended to 64-bit in 2017. In particular, see this diff. This change is in FreeBSD 12. (I don’t think it was MFC’ed to 11, though.)

However, the above inode number is certainly invalid, even for huge file systems in the range of hundreds of TB. Assuming that some bug accidentally treated it as a signed number and sign-extended a 32-bit value to 64 bits, the original 32-bit value must still have been a large number in the 2 billion range. While it seems unlikely, it is theoretically possible to reach such a number, because the file system in question was created with 3.9G inodes (see the df(1) output in the first post).

I am quite sure that the problem is caused by a bug that is very difficult to trigger, because the OP’s file system is quite special and uncommon: it is rather huge (> 100 TB), it has an inode count that is just slightly below the limit of an unsigned 32-bit value (3.9G), and my guess is that it was created with FreeBSD 11 or earlier, i.e. before the extension of ino_t. (ramirez, can you confirm this?)

That’s why I recommend posting the issue to the freebsd-fs mailing list. I think that mckusick@ (who is on that list) is able to track it down more quickly than anybody else.
 
Sorry about that, my machine is still running 11.3, that's why I saw 32-bit inode numbers in my source code.

Seeing a file system with a few billion files is not uncommon. Biochemistry and genetics projects regularly run with file sizes of a few KiB, so in a 100 TB file system you'd easily have several billion files. But getting to nearly 4 billion files on a single-node file system is definitely uncommon. And getting to "negative" 64-bit inode numbers (highest bit of the unsigned 64-bit number is set) is still impossible, even on the world's largest file systems. So the inode number the OP saw was definitely a bug.
 
My recommendation is to post your issue to the freebsd-fs mailing list.

I will post to the fs mailing list very soon.

my guess is that it was created with FreeBSD 11 or earlier, i.e. before the extension of ino_t. (@ramirez, can you confirm this?)

No, this is a fresh install of FreeBSD 12.1, and the filesystem was created in 12.1 from scratch.

Can you please show the output of dumpfs /dev/mfid3p1 | head -30

Code:
# dumpfs /dev/mfid3p1 | head -30
magic   19540119 (UFS2) time    Sat Oct 10 04:46:19 2020
superblock location     65536   id      [ 5eab6d97 34710b41 ]
ncg     149965  size    31253905398     blocks  31006762468
bsize   32768   shift   15      mask    0xffff8000
fsize   4096    shift   12      mask    0xfffff000
frag    8       shift   3       fsbtodb 3
minfree 8%      optim   time    symlinklen 120
maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
nbfree  3369882144      ndir    48351   nifree  3915586242      nffree  36903
bpg     26051   fpg     208408  ipg     26112   unrefs  0
nindir  4096    inopb   128     maxfilesize     2252349704110079
sbsize  4096    cgsize  32768   csaddr  1672    cssize  2400256
sblkno  24      cblkno  32      iblkno  40      dblkno  1672
cgrotor 80650   fmod    0       ronly   0       clean   0
metaspace 8336  avgfpdir 64     avgfilesize 16384
flags   soft-updates+journal
check hashes    cylinder-groups
fsmnt   /mnt/data2
volname         swuid   0       providersize    31253905398

cs[].cs_(nbfree,ndir,nifree,nffree):
        (12392,3,26098,5) (11514,2,26104,4) (1034,4,26097,3) (5089,2,25977,7)
        (17633,1,26110,5) (992,7,26085,5) (14608,3,26098,5) (4047,3,26099,5)
        (17634,1,26110,5) (17636,1,26110,5) (13027,3,26099,0) (1089,3,26031,3)
        (13310,1,26106,1) (17632,1,26110,5) (8562,3,26098,5) (8185,1,26102,2)
        (14592,2,26107,3) (13173,2,26086,5) (13138,2,26020,2) (17687,1,26065,0)
        (13328,1,26016,7) (14754,1,26023,6) (13742,2,25995,1) (17825,1,26073,0)
        (6824,2,25958,4) (16919,2,26054,4) (1663,3,26098,5) (14690,3,26070,4)
        (2495,3,26099,4) (17636,3,26108,1) (10462,3,26099,3) (17631,1,26110,5)

From the RAID controller side, all volumes also look OK:
Code:
# mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
 mfid0 (  223G) RAID-0     256K OPTIMAL Disabled <SYSTEM1>
 mfid1 (  223G) RAID-0     256K OPTIMAL Disabled <SYSTEM2>
 mfid2 (  116T) RAID-6     256K OPTIMAL Disabled <DATA1>
 mfid3 (  116T) RAID-6     256K OPTIMAL Disabled <DATA2>
 
Yes, I have a backup on the second volume. But it seems they both have this problem XD
This is file storage for video production, so there are not too many files and they are relatively big.

Code:
# find . -type f -print | wc -l
  251485
 
# find . -type d -print | wc -l
   48351
 
This is file storage for video production, so there are not too many files and they are relatively big.
In that case you should have created the file system with a much lower inode density, i.e. a much higher bytes-per-inode number (for example newfs(8) with -i 262144 or even higher). That will reduce the number of inodes considerably, so you don’t run into the multi billion range, and I guess that would have prevented the bug from being triggered. It also makes a forced fsck(8) run much faster.
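
To give a rough idea of the numbers (back-of-the-envelope arithmetic based on the dumpfs output above; newfs(8) rounds per cylinder group, so the real values will differ slightly):
Code:
/* Rough inode-count estimate for a few newfs -i densities on this volume. */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* size (in fragments) * fsize from the dumpfs output, roughly 128 TB */
	uint64_t fs_bytes = 31253905398ULL * 4096;
	/* current ~32 KiB density, then -i 262144, 1 MiB, and 100 MB */
	uint64_t density[] = { 32768, 262144, 1048576, 100000000 };

	for (int i = 0; i < 4; i++)
		printf("-i %9ju -> ~%ju inodes\n",
		    (uintmax_t)density[i], (uintmax_t)(fs_bytes / density[i]));
	return (0);
}
That works out to roughly 3.9 billion inodes at the current ~32 KiB density, ~488 million for -i 262144, ~122 million for 1 MiB, and ~1.3 million for 100 MB; everything except the current density stays far below the 32-bit limit.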
 
Why do you think it's a bug? His max number of inodes is 3 915 886 080 (ncg*ipg)
Of course I can’t say for sure, I can only guess. But that number looks suspicious because it is very close to the limit of an unsigned 32-bit number, and it is beyond the range of a signed 32-bit number, so the problem could be explained by a type-conversion bug. Having more than 2 billion inodes is rather uncommon, so the bug might have gone undetected so far, especially since the ino_t change is relatively new (FreeBSD 12+). Furthermore, the huge inode number reported by the OP looks suspicious, too: 0xFFFF FFFF 8000 4C00 looks very much like a number that was accidentally sign-extended from 32-bit to 64-bit. I wouldn’t be surprised if there is a small piece of code that was missed during the ino_t conversion.

That’s why I urge the OP to approach the freebsd-fs mailing list.

Why not some data corruption or bad firmware of the controller? A consistency check will help if there's a wrong parity.
Of course that might be possible, too. On the other hand, the drive should report I/O errors if there was something wrong with the on-disk data that could not be corrected. And even if that failed (because of a firmware bug or whatever), mckusick@ recently introduced check-hashes to UFS meta data (cylinder groups, inodes, superblocks), so if there was corruption, the kernel should have noticed that.
 
Why do you think it's a bug? His max number of inodes is 3 915 886 080 (ncg*ipg)
As olli said, the number 0xFFFF FFFF 8000 4C00 can not possibly be an inode number. It's far too large. I think if you add up all the inodes in the world, you would get around 0x0003 xxxx xxxx xxxx (educated guess: 800 trillion, which includes several file systems that are many exabytes in size); this inode number is several tens of thousands of times larger, far too large to be realistic. Also, the bit pattern is way too suspicious: the fact that all the upper bits are 1 makes it very likely that someone took a 32-bit number and sign-extended it to 64 bits, which they shouldn't do. And even the lower 32 bits are a bit smelly: out of those 32 bits, only 4 are even set; that's unlikely to be a random large number, and more likely to be some sort of flag word misinterpreted as an inode number.

Why not some data corruption or bad firmware of the controller? A consistency check will help if there's a wrong parity.
Possible, but a well-run and well-built RAID system would have reported any data corruption already. Note that he's running RAID-6, so a single or double fault could not possibly do this, and a triple fault is unlikely. That leaves firmware bugs ... which exist in droves. But when debugging software, saying "hardware fault" is a bit cheesy.

There was another question I was thinking of asking the OP, but then I decided that now would not be a good time: for a file system of this size, why aren't they using ZFS? It has the RAID layer built right in, it can do two-fault-tolerant RAID, and it has checksums everywhere. Quite possibly the OP has good reasons not to use ZFS, and even if they should or could use it in the future, this is not a good time to think about what file system to use; it is a good time to get their current data back online.
 
bad firmware of the controller?
I agree that the firmware version is worth investigating.
What I have noticed is that these LSI controllers work best when the firmware on the card is close to the same version as the firmware the FreeBSD driver uses.
For example from dmesg:
mps0: Firmware: 21.00.01.00, Driver: 21.02.00.00-fbsd
Notice how they are both in the 21.xx.xx range. That is what is ideal.
 
I have the latest firmware on the MegaRAID.

From the freebsd-fs list:

The UFS filesystem uses 32-bit inode numbers,
so if you build a big enough filesystem, the default number of inodes is
too big and you run into the problem that you discovered. In the short
term, you need to build a filesystem with a larger number of bytes per
inode as Stefan has suggested.

The correct fix is to check for inode overflow in newfs and
scale up the bytes per inode so that the total number of inodes
does not exceed what fits in a 32-bit inode field.

I have not run into this bug before as most people switch to ZFS when
their filesystems grow much past 10Tb since ZFS is much better able
to operate on such huge filesystems.

Kirk McKusick
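
For illustration, a minimal sketch of the kind of check Kirk describes (hypothetical code with made-up names, not the actual newfs source):
Code:
#include <stdint.h>

/*
 * Hypothetical sketch of the fix described above: if the requested
 * bytes-per-inode density would yield more inodes than fit in a 32-bit
 * inode number, scale the density up. Not the real newfs(8) code.
 */
#define	UFS_INO32_MAX	UINT32_MAX	/* on-disk inode numbers are 32-bit */

static uint64_t
clamp_bytes_per_inode(uint64_t fs_bytes, uint64_t bytes_per_inode)
{
	while (fs_bytes / bytes_per_inode > UFS_INO32_MAX)
		bytes_per_inode *= 2;
	return (bytes_per_inode);
}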

You may need to recreate the file system with an appropriate
value of "-i", the number of bytes per inode.

If your files are generally in the order of 1 GB, then you
should try -i 100000000 (i.e. 100 million to allow for up to
1 million files to have some slack).

Regards, STefan

You can fetch my script fs_summarize.awk to figure out the newfs(8) parameters if you want to recreate your filesystem as suggested by @olli@ (leave some headroom, i.e. choose a bytes/inode value slightly lower than what you actually have on your system).

yes, thank you.
my results:

Code:
type             avg/[kB]  total/[MB] 512B-blocks     count
-----------------------------------------------------------
regular files:    64330.8 15799058.63  2147483648    251485

Now I'm thinking about newfs -i <value>.
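
Rough arithmetic from the fs_summarize output: the average file is about 64 MB, so even if the whole ~128 TB volume filled up with files of that size there would only be about 2 million of them. A bytes-per-inode value a few times smaller than the average file size, e.g. -i 16777216 (16 MiB), would give about 7.6 million inodes, several times that worst case and a tiny fraction of the 3.9 billion I have now.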
 
The fixed version of fsck_ffs works perfectly.
In the end, I recreated the filesystems with 256k bytes per inode. The system is now much more responsive.
 