UFS Random UFS fsck errors in zvol-backed bhyve VM

Marco Styner · Jan 3, 2017

Hi all

This is the first time I could not resolve an issue by searching the forums or the Handbook, so I really hope you can help with this:

I have a hardware machine that runs release 11p6. The filesystem is ZFS on two SSDs in mirror mode.

On this machine I have 4 virtual machines running. They use zvols with volmode=dev and volblocksize=32K as disk devices in bhyve, with chyves as a frontend.
Inside the VMs I again run release 11p6, this time on UFS with swap (mostly the default configuration).

Everything works fine. But if I run fsck in the guest, it reports random errors (see bottom). I say random because the list changes every time I run fsck. Switching to single-user mode, actually fixing the errors, and going back to multi-user mode changes nothing.

I am mildly worried that something terrible is happening over time and I am just not aware of it yet...

What am I missing or what am I doing wrong...?

Many thanks for your help,
Marco

Code:

#####
** /dev/ada0p2 (NO WRITE)

USE JOURNAL? no

** Skipping journal, falling through to full fsck

SETTING DIRTY FLAG IN READ_ONLY MODE

UNEXPECTED SOFT UPDATE INCONSISTENCY
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE  I=810042  OWNER=root MODE=100644
SIZE=0 MTIME=Jan  3 15:17 2017
RECONNECT? no


CLEAR? no

UNREF FILE  I=810045  OWNER=root MODE=100644
SIZE=0 MTIME=Jan  3 15:17 2017
RECONNECT? no


CLEAR? no

UNREF FILE  I=810049  OWNER=root MODE=100644
SIZE=0 MTIME=Jan  3 15:17 2017
RECONNECT? no


CLEAR? no

** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

262128 files, 1120140 used, 806227 free (5811 frags, 100052 blocks, 0.3% fragmentation)

#####

ASX · Jan 3, 2017

Marco Styner said:
Everything works fine. But if I run fsck in the guest, it reports random errors (see bottom). I say random because the list changes every time I run fsck. Switching to single-user mode, actually fixing the errors, and going back to multi-user mode changes nothing.

I'm under the impression you are trying to run fsck when running in multiuser mode, which is something you should not do. Please clarify if that is the case.

tingo · Jan 4, 2017

ASX said:
I'm under the impression you are trying to run fsck when running in multiuser mode, which is something you should not do. Please clarify if that is the case.

Still, fsck shouldn't report errors as long as the machine is "idle" (in other words not doing anything else) unless there really *are* errors.

ASX · Jan 4, 2017

tingo said:
Still, fsck shouldn't report errors as long as the machine is "idle" (in other words not doing anything else) unless there really *are* errors.

I disagree.

Simply deleting a file while some process still has it open would produce an UNREF FILE error like the one posted above. The inode and any associated blocks are only released once the process closes the file or terminates.

Code:

UNREF FILE  I=810042  OWNER=root MODE=100644
SIZE=0 MTIME=Jan  3 15:17 2017
RECONNECT? no

The only "idle" state I can think of is single-user mode; otherwise there will always be some daemon running ...
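
To illustrate the point, here is a minimal sketch, assuming a POSIX shell with mktemp available. It does not run fsck; it only shows that a file deleted while still open loses its name but keeps its inode until the descriptor is closed. The variable names are made up for the example.

```shell
#!/bin/sh
# Sketch: a file that is deleted while still open keeps its inode allocated.
tmp=$(mktemp)          # scratch file (name chosen by mktemp)
exec 3>"$tmp"          # keep it open for writing on file descriptor 3
rm "$tmp"              # remove the only directory entry

# The name is gone ...
if [ -e "$tmp" ]; then name_gone=no; else name_gone=yes; fi
# ... but the descriptor still works, so the inode is still allocated.
if echo "still writable" >&3; then fd_alive=yes; else fd_alive=no; fi

echo "name_gone=$name_gone fd_alive=$fd_alive"
exec 3>&-              # closing the descriptor lets the kernel free the inode
```

While descriptor 3 is open, the inode is allocated but has no directory entry, which is the state a check of a live, read-write filesystem would describe as an unreferenced file.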

Marco Styner · Jan 12, 2017

Thank you for your responses.
I admit that I ran fsck in both single-user and multi-user mode, and the exact report above was indeed taken while running multi-user.

But I also ran fsck repeatedly in single-user mode and got errors each time (i.e. fsck -> fix -> reboot -> fsck -> fix again...).
So after a while I suspected a hardware issue, but if it is one, it seems to be hard to track down. The bare-metal machine is a Xeon D-1518 with 32 GB of ECC RAM, and neither the ZFS mirror nor the SMART status of the drives shows any anomalies.
Again, everything is running just fine, with no strange behaviour at all. The only reason I started checking was an abrupt power outage behind the UPS, after which I suspected that the UFS inside the VM might not have liked that...

Regarding tingo's and ASX's comments: one of the VMs that behaves like this runs Asterisk and nothing else, so I would not expect heavy activity all the time, and especially no deletes, but that is just an unconfirmed guess...

ASX · Jan 12, 2017

Marco Styner said:
But I also ran fsck repeatedly in single-user mode and got errors each time (i.e. fsck -> fix -> reboot -> fsck -> fix again...).

Hmm ... fsck can fix errors only when a filesystem is unmounted or mounted read-only.

If you enter single-user mode from the boot menu, the root filesystem will already be mounted read-only.
If you switch from multi-user to single-user mode, the root filesystem will remain mounted read-write.
In that case you need to remount it read-only before running fsck.

Code:

mount -u -r /
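
A rough sketch of the whole sequence, as a dry run. The assumptions here: `mount -p` prints fstab-style lines (device, mount point, type, options), and `fsck -fy /` is the check you want to run. `root_mode`, `run`, and the sample line are hypothetical names and values made up for the example. Nothing is executed unless RUN=1 is set.

```shell
#!/bin/sh
# root_mode reads fstab-style lines on stdin (the format `mount -p` is
# assumed to print) and reports whether / is mounted "ro" or "rw".
root_mode() {
    awk '$2 == "/" {
        n = split($4, opts, ",")
        mode = "rw"
        for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        print mode
        exit
    }'
}

# run echoes a command and only executes it when RUN=1 (dry run by default).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# Sample line standing in for real `mount -p` output (hypothetical values).
sample='/dev/ada0p2 / ufs rw 1 1'
mode=$(printf '%s\n' "$sample" | root_mode)
echo "root is mounted $mode"

if [ "$mode" = rw ]; then
    run mount -u -r / || exit 1   # remount root read-only first
fi
run fsck -fy /                    # -f: check even if marked clean, -y: answer yes
```

On a live system you would feed it `mount -p | root_mode` instead of the sample line, and reboot after fsck has modified the root filesystem.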

The fsck report you posted before shows a single error, about an empty file, most likely a temporary file.
That is to say, it does not take heavy disk activity to get an fsck error ...
