Other Root drive read-only after boot

Is the situation like this from the boot or did it happen on the fly?

My brainstorming questions:

Any chance a firmware upgrade process is running and has locked the whole disk? Are any other processes running on the system that might indicate something similar?
Is this disk attached to a RAID controller? How is this disk connected to the box? (id1,enc@n306***/type@0/slot@1/elmdesc@Slot_00 suggests a path to some sort of disk bay?)

But given this is a prod server that you don't have some sort of remote management access to, it's hard to guess and give further advice.
 
Upgrading consists of 1) dd-ing a filesystem image on top of the root filesystem partition that is not currently mounted or in use, 2) telling the bootloader to bootonce from there, and 3) rebooting the server. I've tested the same root filesystem image on a VM and the problem does not appear there (we have also used this method for 7 or 8 years and this is a first). Before the upgrade the system was running FreeBSD 13.2-RELEASE-p9 (a roughly 2-year-old image). I noticed the issue (the drive seemingly being read-only) a couple of minutes after the OS came up following the reboot; that is, about 5 minutes after the last write to the very same disk drive. If the issue had been present while 13.2 was running, I would not have been able to write the new root filesystem image to the SSD.
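For readers following along, the three steps could be sketched roughly as below. This is only an illustration, not the actual tooling: the image path, the disk name `ada0`, the partition index `3`, and the use of the gptboot(8) `bootonce` attribute are all assumptions. It only prints the commands unless `DRY_RUN=0` is set.

```sh
#!/bin/sh
# Hypothetical values; none of these come from the real setup.
IMAGE="${IMAGE:-/tmp/rootfs.img}"   # new root filesystem image
DISK="${DISK:-ada0}"                # boot disk
PART="${PART:-3}"                   # index of the root partition NOT in use
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

upgrade_inactive_root() {
    # 1) write the image over the inactive root partition
    run dd if="$IMAGE" of="/dev/${DISK}p${PART}" bs=1m || return 1
    # 2) ask the boot code to try that partition once on the next boot
    #    (assumes a GPT disk booted via gptboot)
    run gpart set -a bootonce -i "$PART" "$DISK" || return 1
    # 3) reboot into the new image
    run shutdown -r now
}
```

Calling `upgrade_inactive_root` with the defaults just lists the three commands, which makes it easy to double-check the target partition before anything is written.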

There are 20+ NVMe drives attached and I don't see a RAID controller in the pciconf listing.

The platform is likely SuperMicro SuperServer 2029U-TN24R4T.

Edit: I had never heard about sesutil(8) until now...
Code:
# sesutil show
ses0: <AHCI SGPIO Enclosure 2.00>; ID: 3061686369656d30
Desc            Dev     Model                     Ident                Size/Status
Slot 00         -       -                         -                    Not Installed
Slot 01         -       -                         -                    Not Installed
Slot 02         -       -                         -                    Not Installed
Slot 03         -       -                         -                    Not Installed
Slot 04         -       -                         -                    Not Installed
Slot 05         -       -                         -                    Not Installed

ses1: <AHCI SGPIO Enclosure 2.00>; ID: 3061686369656d31
Desc            Dev     Model                     Ident                Size/Status
Slot 00         ada0    KINGSTON SV300S37A120G    50026***             120G
Slot 01         -       -                         -                    Not Installed
Slot 02         -       -                         -                    Not Installed
Slot 03         -       -                         -                    Not Installed
Slot 04         -       -                         -                    Not Installed
Slot 05         -       -                         -                    Not Installed
Slot 06         -       -                         -                    Not Installed
Slot 07         -       -                         -                    Not Installed
 
That seems to be a directly attached disk then.

Your upgrade procedure is .. unusual; but it's not the subject of the thread.

The test program you showed doesn't test much: it doesn't work on a disk locked by GEOM (or an otherwise busy disk). That's the same reason you can't just start dd-ing onto a disk.
Now .. again; production box without remote mgmt (even though the server you pasted does have IPMI) .. I understand you need to tread lightly.

Any chance there's some sort of misconfiguration where the system thinks this disk is part of something else? (e.g. ZFS; check zpool list, zpool status .. )

You could do a dd test like this (be extra careful with debugflags; I'm handing you a loaded gun):
Code:
sysctl kern.geom.debugflags=0x10
dd if=/dev/ada0 of=./lba0 bs=512 count=1
# verify that what you read and what is on disk are the same
hexdump -C -n 512 /dev/ada0 | sha256
hexdump -C ./lba0 | sha256
# if ok; write the same sector back
dd if=./lba0 of=/dev/ada0 bs=512 count=1
# restore the default when done
sysctl kern.geom.debugflags=0
 
I have ssh access. I don't need another kind of access. I take care of, write, and support the software side.

Upgrades are very stable -- atomic, with easy rollback. There are tens of other systems set up the same way. We have systems with uptimes of 600+ days.

Back to this case. Don't think the dd test will be possible, because the OS denies opening of /dev/ada0 for write access. Non-root UFS filesystems can be mounted and unmounted (read-only). fsck succeeds in read-only mode. dd bs=32768 if=/dev/ada0 | wc -c completes successfully.

ZFS is on top of the NVMes only; zpool status does not list this drive.

Everything points to "the SSD has emergency-locked itself to read-only" tbh
Could be. But neither the smartmontools general health check nor the various counters suggest this. Also, there are no I/O errors, only permission errors.

Could be. Personally I'd expect I/O error floods in syslog then.
Me too. There is nothing in the kernel ring buffer. I have past experience with failing disks, and this one doesn't behave like them.

Can someone suggest a "safe" dtrace probe I can use to peek into the kernel, to try to see where permission errors originate from?
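One low-risk starting point might be an fbt return probe on `g_access()`, the GEOM routine that grants or refuses read/write/exclusive access to a provider. This is a sketch under the assumption that the denial comes from GEOM and that `g_access` is visible to the fbt provider on this kernel; it only observes return values and changes nothing. The names `D_PROG` and `trace_g_access` are made up for illustration.

```sh
#!/bin/sh
# D program: report every failing g_access() call with its kernel stack.
# For fbt return probes, arg1 holds the function's return value.
D_PROG='
fbt::g_access:return
/arg1 != 0/
{
    printf("%s: g_access failed, errno %d", execname, arg1);
    stack();
}'

# Run as root; stop with Ctrl-C.
trace_g_access() {
    dtrace -n "$D_PROG"
}
```

Run `trace_g_access` in one terminal and trigger the failing open from another. On FreeBSD, errno 1 is EPERM, 13 is EACCES and 30 is EROFS; the printed stack would show which caller received the error.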
 
Don't think the dd test will be possible, because the OS denies opening of /dev/ada0 for write access.
That's why I mentioned the "loaded gun": kern.geom.debugflags will allow you to do so. This way you can test whether the SSD's firmware is doing this or whether the kernel is the one refusing.
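To narrow this down without writing a single byte, one option is to test only the open-for-write step, once with kern.geom.debugflags at its default and once with 0x10. A minimal sketch; the helper name `probe_write_open` is made up for illustration:

```sh
#!/bin/sh
# Try to open a path for writing without writing any data.
# '>>' opens in append mode, so nothing is truncated.
probe_write_open() {
    # Refuse missing paths so '>>' never creates a stray file.
    [ -e "$1" ] || { echo "no such path: $1"; return 2; }
    if err=$( { true >> "$1"; } 2>&1 ); then
        echo "open for write OK: $1"
    else
        echo "open for write FAILED: $err"
        return 1
    fi
}
```

Usage would be `probe_write_open /dev/ada0`, then `sysctl kern.geom.debugflags=0x10`, the same call again, and finally `sysctl kern.geom.debugflags=0`. Since the open is handled entirely in the kernel, a failure that disappears with 0x10 would point at GEOM refusing access (for example because a partition on the disk is in use) rather than at the SSD firmware; a firmware lock would only be expected to show up on an actual write.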
 