FreeBSD-provided disk images: problems with UFS, fsck_ufs (fsck_ffs(8)); partition sizes …

mer said:
There is a performance penalty for doing all writes sync.

For a disk with platters that's correct, due to movement of the heads and spinning. Does it hold true for an SSD as well?
I switched off both SU and SU+J on my root FS on an SSD after a powerloss. Can't say I noticed much of a difference in performance, though it's hard to compare performance before and after of course.
 
Technically, yes, there is still a penalty, but it's much smaller than with spinning platters. If you look at today's devices versus those of 10 years ago, there may not even be a huge penalty for spinning platters. Just as CPUs and RAM got faster and faster, so have physical devices.
 
… "Storage Essentials" book calls sync writes "stupidly safe". …

Good.

In case of interruption (e.g. power loss or kernel panic) will losses be less than thirty seconds?

For simplicity / least risk of chaos / minimal losses (less than thirty seconds) with UFS, is it reasonable to work with the combination below (sketched after the list)?
  1. disable soft updates
  2. restart the operating system
  3. mount with option sync ("all I/O to the file system should be done synchronously")
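For anyone wanting to try that combination, here is a minimal sketch; the device name is hypothetical, and tunefs(8) wants the file system unmounted or mounted read-only (e.g. single-user mode), so adjust to your own layout:

Code:
# Step 1 - disable soft updates (and SU+J, if enabled):
tunefs -j disable /dev/gpt/rootfs
tunefs -n disable /dev/gpt/rootfs

# Step 3 - request synchronous I/O via the options field in /etc/fstab:
# /dev/gpt/rootfs   /   ufs   rw,sync   1   1

# Step 2 - restart so the new mount options take effect:
shutdown -r now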
 
And in case of interruption (e.g. power loss or kernel panic) will losses be less than thirty seconds?
I would say "most likely, way less than 30 secs".
At one point in time, "sync" was probably the default. When work was done via decks of punch cards submitted as a batch run overnight, disk and I/O performance wasn't noticeable to your average person.
As work became more and more interactive, people began complaining about it taking minutes to save a file. So people started looking harder at performance (there was always a small group chasing performance), which led to async, noasync, soft updates and journaling.
The physical hardware has gotten faster so the performance penalties of sync mounts may not even be noticeable on modern hardware.

As I was trying to point out, ZFS also has something similar: transaction groups. There are also various levels of "log" devices for ZFS (the ZIL or intent log, SLOGs, etc.) that act somewhat like journaling, so the concept of UFS with "noasync" and SU+J is not a bad thing; it just may need tuning for specific use cases. I don't know if the process that does the SU is tunable, but something that would let you adjust the interval would be nice. Let's say that at the moment it runs every 30 secs; then you need to ask what the worst-case backlog is that it can build up, and how long it will take to flush out that backlog. If every 30 secs you can't stay ahead of the backlog, then maybe it should run every 5 secs.
 
And: If the user (or the person who wrote the application) really cares, they can use the sync command (which is not guaranteed to do anything, but on all existing implementations hardens the outstanding writes to disk), or they can call fsync() on an individual file, or open the file in an appropriate sync mode. The fact that the user/implementor didn't do that simply means that they don't care that ~30 seconds of work can simply vanish (for various values of 30).
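To make those three options concrete, here is a minimal sketch using the stock FreeBSD tools (the file name is hypothetical); a program would instead call fsync(2) directly, or open(2) the file with O_SYNC:

Code:
# Ask the kernel to begin flushing all dirty buffers; this may
# return before the flush has completed:
sync

# Harden one specific file via the fsync(1) utility, which calls
# fsync(2) on it:
fsync /var/db/important.dat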

Now, the worrisome aspects are: Somewhere above, someone said that they had files vanish in spite of saying sync. That would be bad. And if updates within a file were partially applied out of order (like writes 1, 3 and 5 were applied, but writes 2, 4 and 6 were not), that would be a bug. But note that between files, there are technically no ordering constraints (it would be OK if files A, C and E had their updates applied while files B, D and F did not); in reality, though, most implementations guarantee that within a file system, all updates are applied in order. It's not clear to me from all the posts above whether any bugs of this style have been seen in this thread.

But: All this has to assume that the underlying storage system (typically called the disk or the storage layer) is honest about when data has been written: before a write() call to the "hardware" returns. In the case of complicated storage hardware (more complicated than a single disk), there is often doubt whether the writes have actually happened. In this particular case, I strongly suspect that the real root cause is an interaction between the UFS file system on FreeBSD and the underlying simulated block layer, which runs in a VM on the same physical hardware. If one powers off the hardware, I suspect that the simulated block layer actually loses writes, and that's what may be causing much of the problems seen in this thread. (Although some tests were also run on bare-metal hardware, with known-good disks.)
 
… the underlying simulated block layer, which runs in a VM on the same physical hardware. If one powers off the hardware, I suspect that the simulated block layer actually loses writes, …

If I understand correctly: no evidence of loss with VirtualBox when I performed a reset whilst diskchecker.pl ran – <https://forums.FreeBSD.org/threads/80655/post-515871>

Postscript

To clarify:
  • the virtual machine reset for the diskchecker.pl test was without an interruption to the physical machine
  • I'm not specifically testing virtual machine behaviours in cases of trouble with physical machines … still, the comment about physical power off is appreciated (and the screenshot below is partly to show that I'm watchful for trouble at the physical level).
 
… The physical hardware has gotten faster so the performance penalties of sync mounts may not even be noticeable on modern hardware. …

I installed gdb and its dependencies three times in a test virtual machine:
  1. with sync, 69 seconds
  2. without sync, 68 seconds
  3. with sync, 65 seconds
– probably meaningless, in that I wasn't stress-testing.

… complicated storage hardware (more complicated than a single disk) …

On rare occasions, I find a VirtualBox guest with non-complicated virtualised hardware behaving as if a show-stopping I/O error has occurred. Its screen is there, but (loosely speaking) the machine is non-responsive. Not all such rarities involve an I/O issue at the host level, but in the example below there was an issue.



I sensed something wrong when there was no progress beyond 28% for extraction of a package. After confirming that the guest OS was non-responsive I took a look at /var/log/messages on the host and found two retries of a READ command, which probably coincided with the problem with the virtual machine. No problem with the third try.

Not enough of an issue for the host (OpenZFS pool with an L2ARC device) to encounter an error, but (I suspect) enough of an issue for the guest OS to behave as if it lost access to its boot disk.

(screenshot attachment: 2021-06-08 01:20:01.png)


Code:
Jun  8 01:12:56 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 2e d3 01 d8 00 00 20 00
Jun  8 01:12:57 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Jun  8 01:12:57 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
Jun  8 01:12:57 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 2e d3 01 d8 00 00 20 00
Jun  8 01:12:57 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Jun  8 01:12:57 mowa219-gjp4-8570p kernel: (da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain

Note that I don't treat this as a bug. When there's imperfection at the host level, I don't expect perfection in a VirtualBox guest.
 
… worrisome … someone said that they had files vanish in spite of saying sync. That would be bad. …

For sync(8) (not the sync option of mount(8)):
  • three quotes below
  • files vanished in the second and third cases
  • since beginning to understand the nature of UFS soft updates, I'm no longer surprised by the losses.
(Should I be surprised?)

stable/13 in VirtualBox

I restored the snapshot, ran sync (and waited for the run to complete), then reset the machine. Result:

Code:
…
/dev/gpt/rootfs: LINK COUNT INCREASING
/dev/gpt/rootfs: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.
Automatic file system check failed; help!
ERROR: ABORTING BOOT (sending SIGTERM to parent)!
…

Again I restored the snapshot, ran sync (and waited for the run to complete), then reset the machine. Result:
  • the gdb binary is found
  • gdb-related files are missing
Screen recording: <https://photos.app.goo.gl/3yKzv35Zeimf3Pjn7> …

Physical machine

A fresh installation of 13.0-RELEASE, updated to 13.0-RELEASE-p2. Hardware, not virtualised.

I installed then ran nano, ran sync, pressed and held the power button, booted; nano was not found.
 
Now, the worrisome aspects are: Somewhere above, someone said that they had files vanish in spite of saying sync.
Agreed. It could boil down to: is the sync command absolute, or is it merely advisory ("hurry up and tell me you wrote the stuff to disk, and the disk told you OK, I got it")? I was under the impression that it was absolute (similar to the syncing of buffers at shutdown), but perhaps not.

grahamperrin started out with seeing the behavior in a VM, but has been able to reproduce on physical hardware.

I guess I'm coming down to "is this a bug or is it expected behavior?". I'm leaning towards expected behavior, but then I reread the parts that sound like fsck couldn't recover the device; that is more concerning and could be something.
 
For the sync option of mount(8), I performed a few tests after booting from a snapshot of the FreeBSD-provided disk image for 13.0-RELEASE. I took this snapshot on 2021-05-30 after enlarging the disk to 128 GB, before a first boot of the machine.

UFS soft updates disabled, latest instead of quarterly.
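For reproducibility, a hedged sketch of that setup; the device name is hypothetical, and the repository override uses the standard pkg(8) mechanism:

Code:
# Confirm soft updates are off; tunefs -p prints current parameters:
tunefs -p /dev/gpt/rootfs

# Point pkg(8) at the latest branch instead of quarterly:
mkdir -p /usr/local/etc/pkg/repos
printf 'FreeBSD: { url: "pkg+https://pkg.FreeBSD.org/${ABI}/latest" }\n' \
    > /usr/local/etc/pkg/repos/FreeBSD.conf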

Tests involved installation of gdb (twelve packages), with interruptions (resets of the virtual machine) towards the end of the routine.

In all cases, the subsequent run of pkg install gdb behaved as if most things were installed. For example:
  • reset during extraction (4%) of gdb – then gdb alone was installed
  • reset during extraction (51%) of python38 – then python38, libiconv and gdb were installed.
In most cases, automated file system checks were followed by multi-user mode with (at a glance) nothing missing; I judged this from the output of pkg autoremove.

In one case (the first):
  • automated file system checks were insufficient
  • the operating system remained in single user mode
  • fsck -y led to marking of file systems as clean
  • exit to multi-user mode succeeded
  • missing files were observed in response to pkg autoremove.
 
Agreed. It could boil down to: is the sync command absolute, or is it merely advisory ("hurry up and tell me you wrote the stuff to disk, and the disk told you OK, I got it")? I was under the impression that it was absolute (similar to the syncing of buffers at shutdown), but perhaps not.

The sync command (and the corresponding sync system call) is a weird beast. It is sort of optional, but sort of not. Its meaning is: when the command is issued, the kernel has to *begin* writing dirty buffers to disk, but the command can return before all of them are written. The assumption is that it will finish nearly instantaneously (sub-second), since kernels are usually tuned not to hold that much data in RAM. But while sync is running, new buffers may be getting dirtied (new data is written that has not been sync'ed yet).

The traditional Unix way of syncing disks is to do "sync;sync;sync". This just adds enough delay to make it highly likely that the data has been written. Another good way is to look at the disk activity light: When it goes out after saying sync, all data that was dirty when sync started should have been written.

The only correct way to guarantee this is to unmount the file system.
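A sketch of that guarantee, with a hypothetical mount point; downgrading the mount to read-only also flushes outstanding writes:

Code:
# Unmount outright:
umount /mnt/data

# ...or update the mount to read-only, which fails if files are
# still open for writing:
mount -u -o ro /mnt/data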
 
syncing disks is to do "sync;sync;sync".
You sound like you've been around as long as I have :)

"...begin writing dirty buffers..." which could mean "if softupdates are active, tell the softupdater process to start dumping but don't actually wait for it to say All Done Boss".

Hence I have a UPS on my PCs.
 
… sync system call) …

For the system call (but not for the command), sync(2) does acknowledge a bug.

From the earliest record in cgit (1994-05-27):

may return before the buffers are completely flushed.

From syncer(4):

… a sync(2) occurring simultaneously with a crash may cause file system damage. …

With mount(8) option sync:

missing files were observed in response to pkg autoremove.

Should we treat those missing files as symptomatic of disorder – files written out of order?

If there was disorder, should I accept it as (probably) the file system damage bug in syncer?

<https://man.freebsd.org/syncer(4)#BUGS>
 
For the system call (but not for the command), sync(2) does acknowledge a bug.
While the man page calls it a bug (quite generous of it), in theory it isn't one. I used to do file system implementation, and read the POSIX documents in gory detail, and it says very clearly there: the file system has to have begun the process of writing dirty buffers when sync() returns, but it does not need to finish it. There is actually a very good reason: due to parallelism, it is very hard (and performance-killing) to guarantee that there are no dirty buffers when sync() returns. The POSIX committee doesn't want to require something that would be hard and/or performance-killing to implement. And even if one could create a system call that guarantees no dirty buffers, the moment sync() returns, new buffers can get dirtied. If someone wants to have no dirty buffers: remount the file system read-only, then call sync. Or just unmount the file system.

Should we treat those missing files as symptomatic of disorder – files written out of order?
Again, technically this is not a bug. If you write file A and do not use fsync() on it, and then write file B and do not use fsync(), and after a crash only file B survives and not A, that is not a bug. If the file system crashes, whether a file survives or not could technically be a lottery. Now, all implementations I've seen from the inside try to do some ordering, although usually not strict ordering. Instead, the ordering tends to be more complex, for example if a file is in a directory, make sure the directory is updated before the file. In particular, if all that was done before the crash was a sequence of non-overlapping "creat; write; close" cycles on many files, I think most implementations would get those hardened in order.
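As an illustration of that "creat; write; close" pattern with explicit per-file hardening, a hedged sketch (file names hypothetical):

Code:
# Harden each file before moving to the next, so a crash can only
# cost the file currently being written:
for f in A B C; do
    printf 'payload\n' > "$f"
    fsync "$f"      # fsync(1) wraps the fsync(2) system call
done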

I think the big message is this: If you crash the system, and all your files that have been recently messed with are "fubar" (gone, damaged, ...), that is not actually a bug in the strict sense. If you care about these particular files, mount the file system in sync mode, or open the files in some sync mode, or use fsync commands afterwards, or unmount the file system before cutting power. On the other hand, file system implementors put a lot of effort into minimizing the impact of unclean shutdowns, and giving examples of where they could do better might help them.
 
Excellent.

… file system implementors put a lot of effort into minimizing the impact of unclean shutdowns, …

I never doubted this 👍

The default scope of potential loss with UFS – up to sixty seconds, if I understand correctly – was a surprise.

I'm probably unlucky only in that:
  • my first encounters with losses were, repeatedly, whilst trying to gather information about kernel panics
– and it's pure coincidence that these lossy encounters began a couple of days after this topic began (with a different problem, affecting UFS in a FreeBSD-provided image for 14.0-CURRENT).

… examples of where they could do better might help them.

I could do better by putting more manual page content in context for myself.

These sysctl(8) tunable variables, for example:

Variable          Default  Description
kern.filedelay    30       time to delay syncing files
kern.dirdelay     29       time to delay syncing directories
kern.metadelay    28       time to delay syncing metadata

Maybe I should – in addition to mount(8) option sync for UFS, and UFS soft updates disabled – lower those three values (and maintain a difference between each value).
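A minimal sketch of lowering them to the 10/9/8 values mentioned later in this thread; the variables can be set at runtime and persisted in /etc/sysctl.conf:

Code:
# At runtime:
sysctl kern.filedelay=10
sysctl kern.dirdelay=9
sysctl kern.metadelay=8

# Persist across reboots:
printf 'kern.filedelay=10\nkern.dirdelay=9\nkern.metadelay=8\n' >> /etc/sysctl.conf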



The RELEASE README for disk images could do better by forewarning that images have:
  • defaults that are tuned more for performance than for protection against loss of data.
(I did pay attention to the README, long ago.)

… write file A and do not use fsync() on it, and then write file B and do not use fsync(), and after a crash only file B survives and not A, that is not a bug. If the file system crashes, whether a file survives or not could technically be a lottery. …

So, in an unlucky situation (loss of power, for example) the recently filed results of pkg-install(8) might effectively lose a lottery … true?

… If you crash the system, and all your files that have been recently messed with are "fubar" (gone, damaged, ...), that is not actually a bug …

ZFS

I do sense that ZFS and OpenZFS are less likely to present a mess after unlucky situations.

Free from mess, without the need for me to refer to the tuning section of a book.
 
Honestly, I think the delay on writes being guaranteed defaulting to 30 seconds is ancient. I think it was already that value when I started using Unixes, which was in the late 80s. With today's hardware, I bet you would get quite acceptable performance if you lowered this on UFS to just a few seconds. On a desktop machine, the performance impact might not be noticeable.
 
The best way would be to check in code to see why that large chunk is allocated.
mmap() does not allocate anything. It just makes sure that accessing that range will not fault, and records where the data can be read from, in case you do touch it.
 
mmap() does not allocate anything. It just makes sure that accessing that range will not fault directly and where the data can be read from to be read, in case you do touch it.
If you read my other posts you'll see I did mention that something has to be writing to it, as the system does run out of space. Most likely calloc() or a similar call. This was fixed in the commit I mention here too.
The size passed to mmap was a bogus number.
While it's just a word game, the man page of mmap(2) says: allocate memory.
 
That's what happens when you read a thread and comment before finishing.
 
… One can have softupdates without journaling; that causes metadata writes to be ordered for on device consistency.
SU+J (softupdates plus journaling) journals all the metadata updates; that helps on system boot for fsck or integrity checks to run faster.

… Soft updates and journaling aren't really designed to prevent absolute data loss, what they are trying to do is maintain on device consistency. …

… UFS is VERY VERY good about writing things to disk relatively quickly (seconds), and consistently (soft updates, journals). …

The FreeBSD Handbook uses the word guarantee. From <https://docs.freebsd.org/en/books/handbook/config/#soft-updates>, with added emphasis:

Soft Updates guarantee file system consistency in the case of a crash …

With soft updates enabled, soft updates journaling enabled, (?) and mount(8) option sync:


Code:
root@mowa219-gjp4-ev631-freebsd-13:~ # date ; uptime
Sun Jun 20 18:01:22 BST 2021
 6:01PM  up  2:17, 5 users, load averages: 0.13, 0.21, 0.16
root@mowa219-gjp4-ev631-freebsd-13:~ # kgdb /boot/kernel/kernel /var/crash/vmcore.0
GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
…
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
dev = ada0s1a, block = 69640254, fs = /
panic: ffs_blkfree_cg: freeing free frag
cpuid = 1
time = 1624088886
KDB: stack backtrace:
#0 0xffffffff80c57515 at kdb_backtrace+0x65
#1 0xffffffff80c09ef1 at vpanic+0x181
#2 0xffffffff80c09d63 at panic+0x43
#3 0xffffffff80ecf3f6 at ffs_blkfree_cg+0x5f6
#4 0xffffffff80ecc104 at ffs_blkfree+0xa4
#5 0xffffffff80ee6a58 at handle_workitem_freefrag+0xe8
#6 0xffffffff80ee2a0e at process_worklist_item+0x22e
#7 0xffffffff80edcf66 at softdep_process_worklist+0xd6
#8 0xffffffff80ee0eff at softdep_flush+0x11f
#9 0xffffffff80bc7e2e at fork_exit+0x7e
#10 0xffffffff810629fe at fork_trampoline+0xe
Uptime: 1m34s
Dumping 320 out of 4028 MB:..5%..15%..25%..35%..45%..55%..65%..75%..85%..95%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c09ae6 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c09f60 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c09d63 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80ecf3f6 in ffs_blkfree_cg (ump=<optimized out>, ump@entry=0xfffff80003078200,
    fs=<optimized out>, devvp=devvp@entry=0xfffff8001a8ea7a0, bno=<optimized out>, bno@entry=69640254,
    size=<optimized out>, inum=<optimized out>, dephd=0xfffffe00639c9b88)
    at /usr/src/sys/ufs/ffs/ffs_alloc.c:2325
#6  0xffffffff80ecc104 in ffs_blkfree (ump=ump@entry=0xfffff80003078200, fs=<unavailable>,
    devvp=0xfffff8001a8ea7a0, bno=69640254, size=<optimized out>, inum=<optimized out>, vtype=VREG,
    dephd=0xfffffe00639c9b88, key=2) at /usr/src/sys/ufs/ffs/ffs_alloc.c:2656
#7  0xffffffff80ee6a58 in handle_workitem_freefrag (freefrag=freefrag@entry=0xfffff800423cc900)
    at /usr/src/sys/ufs/ffs/ffs_softdep.c:5996
#8  0xffffffff80ee2a0e in process_worklist_item (mp=mp@entry=0xfffffe006babb580, target=target@entry=10,
    flags=flags@entry=512) at /usr/src/sys/ufs/ffs/ffs_softdep.c:2016
#9  0xffffffff80edcf66 in softdep_process_worklist (mp=mp@entry=0xfffffe006babb580, full=full@entry=0)
    at /usr/src/sys/ufs/ffs/ffs_softdep.c:1804
#10 0xffffffff80ee0eff in softdep_flush (addr=addr@entry=0xfffffe006babb580)
    at /usr/src/sys/ufs/ffs/ffs_softdep.c:1589
#11 0xffffffff80bc7e2e in fork_exit (callout=0xffffffff80ee0de0 <softdep_flush>, arg=0xfffffe006babb580,
    frame=0xfffffe00639c9d00) at /usr/src/sys/kern/kern_fork.c:1069
#12 <signal handler called>
(kgdb) q
root@mowa219-gjp4-ev631-freebsd-13:~ #
 

Thanks, I did view that bug 193364 – and a number of others for freeing free block – before reporting. You'll find my name in cc lists at <https://bugs.freebsd.org/bugzilla/show_activity.cgi?id=193364> and elsewhere.

The four that I chose for See Also – 6203, 88555, 132960 and 195544 – are for the slightly different string that's in my bug report:

freeing free frag

Still, it's useful to have background on comparable bugs. Thanks.

In addition to what I linked from 256712, for myself I bookmarked:
 
Apologies in advance if this gets long. …

It's fine. Thanks. Juggling the order a little:

… Mount sync with no SU or SU+J should be the overall safest (minimal data loss) …

As far as I can tell, from recent test results, this combination is good:
  • mount(8) sync in /etc/fstab
  • soft updates disabled
  • kern.filedelay, kern.dirdelay and kern.metadelay reduced to 10, 9 and 8 (seconds) respectively.

… soft updates (SU) and soft updates with journaling (SU+J):

SU is basically "organize and arrange disk writes so that filesystem metadata remains consistent at all times". …

With reference to a January commit, it seems that background file system checking for soft updates without journaling has been broken for a few months.

…SU+J is "record metadata updates outside of the filesystem before updating the filesystem". The journal comes into play on system boot/fsck: comparison of whats in the journal against what's on the physical device and "replay transactions" as needed.…
 