Minor upgrade went really wrong. Desperate for help.

I have a few servers running FreeBSD at a datacenter far from where I live. They've been running (with the occasional hardware upgrade, and lots of software upgrades) for years, with minimal trouble.

Today, I went to upgrade ports, and realized two of the servers were running 13.2, which reached EOL a couple of weeks ago. I upgraded the first of them to 13.3, with zero problems - freebsd-update upgrade -r 13.3-RELEASE, reboot for the kernel, cleanup, no problem.
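
For completeness, the sequence I used was the usual one from the handbook, more or less:

Code:
freebsd-update -r 13.3-RELEASE upgrade
freebsd-update install        # first pass: the new kernel
shutdown -r now               # reboot onto the new kernel
freebsd-update install        # second pass: the new userland
freebsd-update install        # only if it asks again, to clean out old libraries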

On the second one, the early steps went fine: fetched the new files, installed, upgraded to 13.3, merged config files. Rebooted. The machine never came back up.

I logged into the KVM console supplied by the datacenter, so I could watch the bootup. I let it run through... it got to

Dual Console: Serial Primary, Video Secondary

and just hung.

Rebooted, turned on verbose in Boot Options... I get one line farther:

start_init: trying /sbin/init

I've tried booting into Single user mode. I've tried turning on safe mode. I've tried booting into default/kernel_old. Nothing gets me farther than one of these two hang spots.

The data center officially stopped supporting FreeBSD earlier this year. I was okay with that, because I've never really needed them for anything in the decade and a half I've been with them... but it means I have no way of asking them to put together an emergency disk I can boot from on a USB stick or whatever. (All of the googling I've done suggests that this might be fixable if I can edit /boot/loader.conf... but I have no way of doing that, that I know of.)

Do I have ANY options to get this machine running again?

One thing that's super-weird to me: the load screen shows 'Cons: Dual (Serial Primary)' when it's rebooted, but regardless of whether I change it to Video Primary or leave it alone, the hang line reads 'Serial Primary'.

Am I totally screwed?
 
(All of the googling I've done suggests that this might be fixable if I can edit /boot/loader.conf... but I have no way of doing that, that I know of.)
You can get to the boot loader from the menu? Are you loading anything special in loader.conf (DRM driver for example)? And is this ZFS or UFS?
 
One thing that's super-weird to me: the load screen shows 'Cons: Dual (Serial Primary)' when it's rebooted, but regardless of whether I change it to Video Primary or leave it alone, the hang line reads 'Serial Primary'.
Are the serial ports enabled in the BIOS? You might want to try disabling those.

Assuming you can get to the loader prompt:
Code:
unload
load /boot/kernel/kernel
load /boot/kernel/zfs.ko          # If this is ZFS, don't need it if it's UFS
boot -s
 
If the KVM console is connected via a serial port (including one emulated over USB) and the servers are actually headless, it would be natural for the serial port to be primary and for there to be no video.
 
You can get to the boot loader from the menu? Are you loading anything special in loader.conf (DRM driver for example)? And is this ZFS or UFS?
I can look at the loader.conf file from the menu. It's nearly empty - just two lines:

Code:
if_lagg_load=YES
hw.ixgbe.num_queues="4"

I don't have any way to edit it, though.

I THINK it's ZFS, but I'm not sure, and don't actually know how I'd check. :(
 
Are the serial ports enabled in the BIOS? You might want to try disabling those.

Assuming you can get to the loader prompt:
Code:
unload
load /boot/kernel/kernel
load /boot/kernel/zfs.ko          # If this is ZFS, don't need it if it's UFS
boot -s

Tried this - same result (hangs after the Dual Console line). Or the /sbin/init line, maybe, if that's just not showing unless verbose is on.

T-Aoki's point that it might be this way on purpose is a good one; I hadn't actually thought of that. And again - this server has been running for years (I think it was running 10.x when it was spun up, I've upgraded it to everything along the way, with zero problems, until now). I've never actually watched the boot process before, though.
 
I suspect it doesn't actually "hang"; it's just not showing the kernel messages (probably because it's sending them to a non-existent serial console). If you wait some time and then try to access it over ssh(1), does that work?

How do you get remote KVM? Is that through IPMI? Or do they have networked KVM switches attached to the keyboard/video of the server? Can you access the BIOS via the KVM?
 
You may try to set init_path to /rescue/init at the loader prompt if you suspect /sbin/init is broken.
Tried this - didn't get an error when setting the variable, but nothing changed.
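
For the record, what I typed at the loader prompt was roughly:

Code:
set init_path="/rescue/init"
boot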

I can't even really tell what's going on in the load process; the scrolling happens a little too fast. The KVM interface allows me to save it, but only in picture/movie format, and the movie format is .webm, and unreadable (on my mac) by most things that can read .webm (VLC, Chrome, Firefox).
 
I can't even really tell what's going on in the load process; the scrolling happens a little too fast.
You can record your monitor with a simple digicam and watch that frame by frame.
 
I suspect it doesn't actually "hang"; it's just not showing the kernel messages (probably because it's sending them to a non-existent serial console). If you wait some time and then try to access it over ssh(1), does that work?

How do you get remote KVM? Is that through IPMI? Or do they have networked KVM switches attached to the keyboard/video of the server? Can you access the BIOS via the KVM?
No, ssh still doesn't work; it's actually hung. (I watched its twin boot up yesterday, just for reference, and I actually get to a regular # prompt eventually.) I can't even ping it.

I access the KVM through IPMI, yes. I THINK I can probably access the BIOS; there are a few 'Press X to enter Setup' type messages before we get to the FreeBSD boot screen. I have absolutely no idea what I'm doing there, though.

One thing that might be useful: I can actually see all the things that are in /boot on another machine. I may not have remembered to say so, but there were multiple machines spun up at the same time, and they were all the same. There are actually a whole bunch of loader files:

Code:
-r-xr-xr-x  3 root  wheel  495616 Jul 11 12:00 loader
-r--r--r--  1 root  wheel    7800 Jul 11 12:00 loader.4th
-rw-r--r--  1 root  wheel      41 Dec  8  2021 loader.conf
-r-xr-xr-x  2 root  wheel  906752 Jul 11 12:00 loader.efi
-r--r--r--  1 root  wheel   13653 Jan 23 18:57 loader.help.bios
-r--r--r--  1 root  wheel   13653 Jan 23 18:57 loader.help.efi
-r--r--r--  1 root  wheel   13653 Jan 23 18:57 loader.help.userboot
-r--r--r--  1 root  wheel     382 Jul 11 12:00 loader.rc
-r-xr-xr-x  1 root  wheel  434176 Jul 11 12:00 loader_4th
-r-xr-xr-x  1 root  wheel  818688 Jul 11 12:00 loader_4th.efi
-r-xr-xr-x  3 root  wheel  495616 Jul 11 12:00 loader_lua
-r-xr-xr-x  2 root  wheel  906752 Jul 11 12:00 loader_lua.efi
-r-xr-xr-x  1 root  wheel  372736 Jul 11 12:00 loader_simp
-r-xr-xr-x  1 root  wheel  759808 Jul 11 12:00 loader_simp.efi

Loader.rc includes a bunch of them (loader.4th, efi.4th, beastie.4th), but none of those really gave me any clues as to what might be happening, or what I can do to fix it. :(

I'd like to say I'm really, really appreciative of the suggestions. I feel utterly stuck, poking around in parts of the guts of this thing that I've never had to poke around in before... and my usual ways of solving problems of this type (i.e., Google) are coming up short.
 
Okay, I have no real idea what I'm looking for, but I turned on Verbose in boot options, then filmed the entire process. I actually (for the first time since this started) got a bit farther in the process! (That is: it still hung, but it recognized a few peripherals - or at least ports - after trying /sbin/init.) I'm attaching a pic of the end of the process (at the hang spot) - might be useless, I really have no clue.

[edit] Bah - the 'got farther in the process' was a mirage. I just rebooted, to see if maybe something had gotten better, and those lines showed up just BEFORE the Dual Console line - so it was just a matter of how fast the messages were being printed, I guess. Still stuck at (roughly) the same place.
 

Attachment: end-of-boot-process.jpg (93.6 KB)
I THINK it's ZFS, but I'm not sure, and don't actually know how I'd check.
kldstat | grep zfs
(If you see zfs.ko, then you can check the available datasets by running zfs list.)


I can't even really tell what's going on in the load process; the scrolling happens a little too fast. The KVM interface allows me to save it, but only in picture/movie format, and the movie format is .webm, and unreadable (on my mac) by most things that can read .webm (VLC, Chrome, Firefox).
Maybe it's time to update VLC/Chrome/Firefox, or maybe check the .webm file elsewhere; it could have been corrupted by accident. Also, IIRC, Macs offer a screen recording utility that is independent of the KVM interface.
 
kldstat | grep zfs
(If you see zfs.ko, then you can check the available datasets by running zfs list.)

None of these commands are available at the loader prompt, but I can run them on one of the other boxes.

Code:
root@xm3:~ # kldstat |grep zfs
18    1 0xffffffff82800000   3c6a88 zfs.ko
root@xm3:~ # zfs list
no datasets available

These two machines were configured identically at launch, so this information should be valid for the problem box, too.
 
kldstat | grep zfs
That isn't available from the boot loader prompt.

I can look at the loader.conf file from the menu. It's nearly empty - just two lines:

Code:
if_lagg_load=YES
hw.ixgbe.num_queues="4"
I don't have any way to edit it, though.

I THINK it's ZFS, but I'm not sure, and don't actually know how I'd check. :(
If, after typing more /boot/loader.conf at the boot loader prompt, all you see is the two lines you mentioned, then I'd say it's not a ZFS-on-root system. A ZFS-on-root system, or a UFS-on-root system with additional ZFS pool(s), should normally have zfs_load=YES in /boot/loader.conf. The absence of zfs_enable=YES in /etc/rc.conf also indicates that it's not a ZFS-on-root system.

Also, if you type show at the boot loader prompt, at the end of the output you will probably notice the absence of zfs_load=YES. If that line isn't there either, I really cannot see how you'd have a ZFS-on-root system.
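
If you want a quick sanity check on the working twin (from a normal shell, not the loader), mount(8) will tell you what the root file system is. The device/pool names below are just examples:

Code:
mount | grep ' on / '
# a UFS root looks something like:  /dev/ada0p2 on / (ufs, local, soft-updates)
# a ZFS root looks something like:  zroot/ROOT/default on / (zfs, local, noatime, nfsv4acls)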
 
(I watched its twin boot up yesterday, just for reference, and I actually get to a regular # prompt eventually.)
That would be single-user mode; sshd(8) isn't running in single-user mode, nor is the network configured.
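
If you do end up at that # prompt on the broken box and want to reach it over the network, the usual sequence is roughly this (assuming the default rc scripts and an intact /etc/fstab):

Code:
fsck -p                  # check the file systems first
mount -u /               # remount root read/write
mount -a                 # mount the rest from /etc/fstab
service netif start      # configure the network interfaces
service sshd onestart    # start sshd for this session only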
 
The datacenter has decided to try to boot me into their standard Linux rescue drive; they're not promising it will work (it hasn't been tested with FreeBSD recently, because, again, FreeBSD hasn't been supported for a while), but if it does... what should I be testing/trying? (Getting a jump on the question, even though I don't have access yet, because of the lag time between posting and getting approved.)
 
Okay - problem is solved - I'm back up and running. In the end, it was disk corruption.

Data center techs booted off a Linux rescue disk - worked fine. I ran fsck on the old / and the old /usr partitions (I probably should have done EVERYTHING, but I was impatient after the 18-hour downtime). /usr had a bunch of problems (not even worth listing them all; I let fsck fix them), / was clean. I rebooted... and everything came back up properly.
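
For anyone who finds this later: the commands were actually run from the data center's Linux rescue image, so what I typed there won't match FreeBSD exactly, but the FreeBSD-side equivalent would look roughly like this (partition names are made up for illustration):

Code:
fsck -y /dev/ada0p2      # the old /    - came back clean
fsck -y /dev/ada0p5      # the old /usr - a pile of errors, all fixed by -y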

I'm running a full backup right now (and I'll write a 'clear out old cruft' script to make sure I don't run out of space again on the backup machine), and I guess I'll keep an eye on the server integrity. (Is there anything I should set up so I get warned if this might happen again?)
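
The 'clear out old cruft' script will probably be nothing fancier than something like this (the path and retention period are placeholders, not what I'll actually use):

Code:
#!/bin/sh
# prune-backups.sh - delete backup archives older than 30 days
BACKUP_DIR="/backups"                          # placeholder path
find "$BACKUP_DIR" -type f -name '*.tar.gz' -mtime +30 -delete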

Once again - I thank everyone for your suggestions. I was feeling really lost, and even the stuff that didn't work gave me information about what was going on. I'm a little more familiar with the whole boot process now (though still woefully underprepared for emergencies), and at the VERY least, the next time I have a problem, I'll have met the new user criteria here, so my posts won't need to be approved by moderators.

Clients can whine about the downtime, but in the end, their data still exists, and is (once again) being backed up regularly.
 
Okay - problem is solved - I'm back up and running.
That's good to hear.
In the end, it was disk corruption.
That's not good of course.
(Is there anything I should set up so I get warned if this might happen again?)
One thing, sysutils/smartmontools. It has smartctl(8) that lets you run tests and show interesting drive statistics. Run the smartd(8) daemon and have it regularly check and signal warnings if something's off. It's not perfect but it'll give you some indication the drive might die some time soon or develop bad sectors.
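
A minimal sketch of that setup; the device names are examples, adjust them to whatever drives the box actually has:

Code:
# /etc/rc.conf
smartd_enable="YES"

# /usr/local/etc/smartd.conf
/dev/ada0 -a -m root -M test    # -a: monitor everything, -m: mail warnings to root, -M test: send a test mail at startup
/dev/ada1 -a -m root

Then service smartd start and wait for the warnings (hopefully none).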

and even the stuff that didn't work gave me information about what was going on.
That's generally how troubleshooting works. You go step by step hoping for clues along the way. I get the stressful situation, sweaty palms, thoughts racing everywhere. It often prevents you from thinking clearly and methodically. Been there, done that. Experience is what helps you through.

I'm a little more familiar with the whole boot process now
Something good came from it.
(though still woefully underprepared for emergencies)
Read my tag line ;)

their data still exists
That's the single most important thing. Downtime can happen, it sucks but it's going to happen, the law of inevitability.
 