Upgrade messed up. Can I overwrite with new install, but preserve data?

I somehow messed up an upgrade on a remote server using freebsd-update, possibly because I jumped straight from 11.1-R to 12.2-R. The server no longer boots, and via a rescue image I can see nothing has been written to /var/log/messages since I restarted to boot the new kernel. (The tech said it shows a "login prompt", which seems to contradict this behaviour, but he didn't offer any further detail.)

The server has minimal customisation, but the data is important. Is it possible to do an install that overwrites the existing kernel, /etc/, utils and so on, but otherwise leaves everything else alone?

Could I do it manually by copying the contents of relevant directories from a working server, restoring this server's rc.conf/passwd/master.passwd, and writing out the ZFS bootcode? (gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0)

I tried building from source, but make buildworld bombed out with an error after compiling for more than 24 hours. :( The rescue image is only 11.2-R, so again I may be trying to leap too many versions.

Basically just looking to get it to a stage where it can boot and permit me remote access, without wiping out nearly 2TB worth of backup data. Is anything I've suggested here plausible?
 
possibly because I jumped straight from 11.1-R to 12.2-R.
That shouldn't be an issue.

The server no longer boots,
There are different solutions for different issues. "Not booting" probably has a different meaning for me than it has for you. What exactly happens?

I tried building from source, but make buildworld bombed out with an error after compiling for more than 24 hours
Is this a Raspberry Pi or something like that? Building world shouldn't take that long on a reasonably modern PC.
 
I suspect it's crashing at boot. That might be because it's trying to load some (external) kernel module that was built for 11.1 on the new 12.2 kernel. The trick is to drop to the loader prompt, unload the kernel (this will also unload those 'extra' modules), load the kernel (and only the kernel), and then continue booting the system. If you can get it to boot that way, check your /boot/loader.conf and remove or disable everything you don't explicitly need during the upgrade. You can enable those again after you have finished the upgrade (and updated all your ports/packages).
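As a rough sketch of that advice (not something posted in the thread): the loader-prompt sequence is shown as comments, because it is typed at the loader's prompt rather than in a shell. The shell function below it, whose name `list_loader_modules` is made up for this example, lists which modules a loader.conf asks the loader to pull in, which may help when deciding what to disable.

```shell
#!/bin/sh
# Loader-prompt sequence described above (typed at the loader's "OK" prompt,
# not in a shell):
#   unload
#   load /boot/kernel/kernel
#   boot

# list_loader_modules FILE
# Hypothetical helper: print the name of every module that FILE (a
# loader.conf) asks the loader to load, i.e. lines of the form foo_load="YES".
# Commented-out lines and other tunables are ignored.
list_loader_modules() {
    conf="$1"
    sed -n 's/^[[:space:]]*\([A-Za-z0-9_]*\)_load="[Yy][Ee][Ss]".*$/\1/p' "$conf"
}
```

From a rescue shell this could be called as `list_loader_modules /tmp/mnt/boot/loader.conf`, assuming the broken system's root is mounted at /tmp/mnt.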

Not booting means not booting, in other words not even loading the loader(8) or even the boot code. That's a problem, but if it's the above you can easily recover if you know what to do.
 
If the strategy of trying to restore individual components of the install becomes difficult to follow, it's possible to use the rescue image to fish out the backup data to another location. Just make sure the rescue image can 'see' both your files and the remote target location, i.e. that you can use the rescue image to mount a remote NFS/SMB share or a USB stick and copy files there.

As I see it, rebuilding from source is pointless here. Even if there are no build errors, the kernel is unlikely to boot due to boot config mistakes that were made very early on. I remember my early days of trying to undo the mistakes [that I made] by backtracking: make one mistake early on, try to deal with it by changing something down the road, and arrive at an unbootable system yet again. Backtracking on something like that was more than I had the patience for, especially with the stress of knowing that valuable data was buried in that mess. I learned my lessons the hard way - that's how I know the value of ZFS, BEs (Boot Environments) and rollbacks.
 
The CPU is an ageing Atom N2800, so it's standard Intel PC hardware, just slow. It takes several hours just to bootstrap the compiler. I don't have any access to the console, only the tech telling me he saw a login prompt before closing the ticket.

"Not booting" I guess could be more accurately described as not being reachable via network even several minutes after restart. :)

/boot/loader.conf is fairly unremarkable:

Code:
geom_mirror_load="YES"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
zfs_load="YES"
ipfw_load="YES"
net.inet.ip.fw.default_to_accept="1"

Note that until this point I had been using a custom kernel, but I decided to change to GENERIC+freebsd-update to simplify things.

/boot/kernel/kernel contains the strings GENERIC and 11.2-RELEASE-p14, so I'm confused - it's certainly not one I've compiled, and it's not the original install kernel, either. Even if I did something silly like accidentally specify 11.2-RELEASE instead of 12.2-RELEASE, surely it should have still been able to boot. I wonder if I've somehow managed to install an 11.2 kernel onto a system previously upgraded to 12.2. I'm going to try a source compile of the latter version's kernel, and see what happens.
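For anyone wanting to repeat that check from a rescue shell, here is a minimal sketch. The function name `kernel_release` is invented for this example, and it simply greps the kernel image for the first version-looking string, so treat the result as a hint rather than proof of what is installed.

```shell
#!/bin/sh
# kernel_release FILE
# Hypothetical helper: print the first "<major>.<minor>-RELEASE[-pN]" string
# embedded in a kernel image. Uses grep -a so the binary is treated as text.
kernel_release() {
    LC_ALL=C grep -a -o -E '[0-9]+\.[0-9]+-RELEASE(-p[0-9]+)?' "$1" | head -n 1
}
```

Usage, assuming the broken system's root is mounted at /tmp/mnt: `kernel_release /tmp/mnt/boot/kernel/kernel`. Running it against /boot/kernel.old/kernel as well shows which kernel each directory really holds.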
 
If the strategy of trying to restore individual components of the install becomes difficult to follow, it's possible to use the rescue image to fish out the backup data to another location. Just make sure the rescue image can 'see' both your files and the remote target location, i.e. that you can use the rescue image to mount a remote NFS/SMB share or a USB stick and copy files there.

I am currently using a rescue image, although OVH only supply 11.2-RELEASE #0 from more than 3 years ago. I can mount zroot just fine, so everything is there, just not in a bootable format.

The data is backups from other servers, so if it comes to the worst, I can do a 100% fresh install overwrite and recopy that data, but I'd rather not have to transfer 2TB of data via the internet :)

If that's the case, how were you able to use the rescue image, much less start a compilation process?

OVH let you netboot to one of their rescue images via their website, but the images are ageing, and the file systems are very strange:

Code:
root@rescue-bsd:/tmp/mnt/usr/src # uname -a
FreeBSD rescue-bsd.ovh.net 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r335510: Fri Jun 22 04:32:14 UTC 2018     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
root@rescue-bsd:/tmp/mnt/usr/src # df
Filesystem                                       1K-blocks       Used      Avail Capacity  Mounted on
91.121.126.137:/home/pub/freebsd11-amd64-rescue 1848410796  493864480 1260629388    28%    /
devfs                                                    1          1          0   100%    /dev
/dev/md0                                             29596       2996      24236    11%    /etc
/dev/md1                                              7132          8       6556     0%    /mnt
/dev/md2                                            239516       1384     218972     1%    /opt
/dev/md3                                              7132        128       6436     2%    /root
procfs                                                   4          4          0   100%    /proc
<above>:/opt/local                              1848650312 1848412180     218972   100%    /usr/local
<above>:/opt/var                                1848650312 1848412180     218972   100%    /var
/dev/md4                                             63004         36      57928     0%    /tmp
/var/empty                                      1848650312 1848412180     218972   100%    /opt/ovh
 
I don't have any access to the console, only the tech telling me he saw a login prompt before closing the ticket.
Ok, did you set console to insecure in /etc/ttys? If not, then the system does boot, even all the way to multi-user mode, or else you would never see a login prompt.

"Not booting" I guess could be more accurately described as not being reachable via network even several minutes after restart.
If it's stuck in single user mode you won't be able to access it remotely either because no services are started (including sshd(8)) yet. That all happens later when the system switches to multi-user mode.

I am currently using a rescue image, although OVH only supply 11.2-RELEASE #0 from more than 3 years ago.
I don't like OVH simply because they don't provide you with a way to access the console. With real iron (as opposed to a VPS) access to IPMI is a must-have for me. Especially when doing things remotely. Things are slightly different if I can just walk into a server room, grab a console cart and plug that in.
 
Ok, did you set console to insecure in /etc/ttys? If not, then the system does boot, even all the way to multi-user mode, or else you would never see a login prompt.

No, typically all I do with /etc/ttys is set ttyv1 and higher to 'off.'

Code:
# If console is marked "insecure", then init will ask for the root password
# when going to single-user mode.
console none                            unknown off secure
#
ttyv0   "/usr/libexec/getty Pc"         xterm   on  secure
ttyv1   "/usr/libexec/getty Pc"         xterm   off  secure

I wonder if perhaps it showed "Enter full pathname of shell or RETURN for /bin/sh:" or similar, and he mistook that for a login prompt. I'm assuming he was pressed for time, because he helpfully booted a Linux rescue console for me (what?) and promptly closed the ticket.
 
Wouldn't a remotely accessible IPMI make more sense to have in a VPS package?
With a VPS you often get a control panel with console access. But yes, I do pick providers that have that. I hate things like AWS or Azure because of the lack of console access. But when we're talking cloud systems you have to deal with those in an entirely different manner (throw the whole thing away and just spawn a new one, preferably set up automatically using Ansible, Puppet or some other tool).
with real iron, you can just walk into a server room.
Not if that server room (or datacenter) is in another town or the other side of the country.
 
Just like Facebook was doing about 24 hours ago? ;)

I don't have a Facebook account, so I was vaguely aware of that when it was a blip on the TV news last night, among COVID-related coverage. :P Facebook should just switch to FreeBSD, then that kind of outage would be less likely :P
 
I don't have a Facebook account, so I was vaguely aware of that when it was a blip on the TV news last night, among COVID-related coverage. :p Facebook should just switch to FreeBSD, then that kind of outage would be less likely :p
Word is that it was a BGP misconfig, which began a series of cascading failures, resulting in Facebook withdrawing all routes from the global internet; all of their IP ranges were suddenly unreachable. There was also mention that high level sysops were having trouble accessing essential routers remotely because of the failures, and had to coach staff at the D/C over the phone (that was my joke above).
 
Tried installing 11.2-R GENERIC kernel. No change in behaviour.

Trying a compile of my previously working 11.1-R custom kernel now, but for 11.2-R system. Maybe there's some forgotten feature I've previously enabled or disabled that breaks with GENERIC.

If that doesn't work, I might resort to installing the contents of the 11.2-R *.txz files from the FreeBSD FTP server. Or maybe just ordering a new server in the same D/C.
 
Tried installing 11.2-R GENERIC kernel. No change in behaviour.
You mentioned the tech saw a login prompt. That at least indicates that things are booting to multi-user mode (and ttyv0 is available). When you boot from that old rescue CD they have and import your pools can you read the /var/log/messages? The system does appear to boot so I'm thinking it's not getting/setting an IP address correctly and/or sshd(8) isn't starting for some reason. By looking at the system's /var/log/messages (especially the entries after the kernel booting) you might find some clues.

I might resort to installing the contents of the 11.2-R *.txz files from the FreeBSD FTP server.
Use that as a last resort. If you just extract kernel.txz and base.txz you basically have a "complete" system, complete enough to boot. I'm not entirely sure whether base.txz contains an /etc/rc.conf; if it does, yours might get overwritten with a blank or default one. Make sure at least your interfaces are correctly configured and sshd(8) is set to start.
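A small sanity check along those lines, as a sketch only: the function name `check_rc_conf` is hypothetical, and it only looks for an `sshd_enable` line set to YES and at least one `ifconfig_*` line. It cannot tell whether the interface settings themselves are right.

```shell
#!/bin/sh
# check_rc_conf FILE
# Hypothetical helper: report whether FILE (an rc.conf) enables sshd and
# configures at least one network interface. Returns 0 only if both are found.
check_rc_conf() {
    rc="$1"
    status=0
    if grep -Eq '^[[:space:]]*sshd_enable="?[Yy][Ee][Ss]"?' "$rc"; then
        echo "sshd: enabled"
    else
        echo "sshd: NOT enabled"
        status=1
    fi
    if grep -Eq '^[[:space:]]*ifconfig_[A-Za-z0-9_]+=' "$rc"; then
        echo "network: interface configured"
    else
        echo "network: no ifconfig_* line found"
        status=1
    fi
    return "$status"
}
```

It could be run before and after extracting the distribution sets, e.g. `check_rc_conf /tmp/mnt/etc/rc.conf` (again assuming the root is mounted at /tmp/mnt), with a copy of the original rc.conf kept somewhere safe first.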
 
You mentioned the tech saw a login prompt. That at least indicates that things are booting to multi-user mode (and ttyv0 is available). When you boot from that old rescue CD they have and import your pools can you read the /var/log/messages? The system does appear to boot so I'm thinking it's not getting/setting an IP address correctly and/or sshd(8) isn't starting for some reason. By looking at the system's /var/log/messages (especially the entries after the kernel booting) you might find some clues.
I mentioned that in the OP, unfortunately /var/log/messages has not been updated for about 4 days. The last entry was the shutdown after running freebsd-update install. That's why the mention of the login prompt being visible doesn't make sense. (I suspect it's a form letter.)

I wonder if there's some metadata somewhere that would show the kernel had successfully mounted zroot, before I subsequently mount it via rescue? I had a quick look at zdb, but that seems to operate only on mounted file systems.
 
That's why the mention of the login prompt being visible doesn't make sense.
Yeah, combined with the other information (or lack thereof) that's starting to make little sense too. It's possible the tech either didn't look closely enough or was simply looking at the wrong machine (I've done this myself a number of times, simply hooked up the console to the wrong machine).

I wonder if there's some metadata somewhere that would show the kernel had successfully mounted zroot, before I subsequently mount it via rescue?
Nothing I'm aware of.


Let's circle back to the beginning of the upgrade. I need to get a better idea of the things that happened. When you did the upgrade, after the first freebsd-update install it tells you to reboot. Did you reboot at this point? And did it immediately fail to boot (or rather, to come online)? Or did you run freebsd-update install multiple times and then reboot?

With the first run of freebsd-update install only the kernel gets updated. If you haven't done anything else to the system then the userland is still on the old version, only the kernel is new. In that case you can simply copy /boot/kernel.old over /boot/kernel to restore the old kernel. As nothing else has been touched at this point the system would simply be restored in its original state.
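The kernel swap described above could look roughly like this from a rescue shell. This is a sketch under assumptions: `restore_old_kernel` and the `kernel.broken` directory name are invented for the example, and the suspect kernel is moved aside rather than overwritten so nothing is lost.

```shell
#!/bin/sh
# restore_old_kernel ROOT
# Hypothetical helper: under ROOT (the mount point of the broken system),
# move the current /boot/kernel aside to /boot/kernel.broken and copy
# /boot/kernel.old into its place.
restore_old_kernel() {
    root="$1"
    if [ ! -d "$root/boot/kernel.old" ]; then
        echo "no kernel.old under $root/boot" >&2
        return 1
    fi
    if [ -e "$root/boot/kernel.broken" ]; then
        echo "$root/boot/kernel.broken already exists, refusing" >&2
        return 1
    fi
    # Keep the suspect kernel instead of deleting it.
    if [ -d "$root/boot/kernel" ]; then
        mv "$root/boot/kernel" "$root/boot/kernel.broken" || return 1
    fi
    # -R copies the directory, -p preserves modes and timestamps.
    cp -Rp "$root/boot/kernel.old" "$root/boot/kernel"
}
```

Usage would be something like `restore_old_kernel /tmp/mnt`, assuming the pool's root dataset is mounted there. This only helps if kernel.old still holds the kernel that used to work.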

On the second run of freebsd-update install the userland gets updated. That means binaries, libraries, etc. from the base OS. So tools like ifconfig(8) and important libraries like /usr/lib/libc.so are all in sync with each other. You should not have any problems with the base OS tools and libraries at this point. Tools from ports/packages shouldn't have a problem either as the old versions of the system libraries are still available. This is the point where you should reinstall all your ports/packages to get all of them properly linked to the new version of the system libraries.

The third and final freebsd-update install will remove all the old files and libraries. Any old port or package that hasn't been updated yet will start to fail now because it can't find the old version of libraries anymore.
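To make the three stages easier to keep straight, here is a sketch that only prints the sequence as a checklist and runs nothing. The function name `print_upgrade_plan` is made up, and `pkg upgrade -f` is just one way to reinstall packages; a ports-based system would rebuild its ports at that step instead.

```shell
#!/bin/sh
# print_upgrade_plan TARGET
# Hypothetical helper: print (not run) the command sequence for a major
# upgrade to TARGET, annotated with what each install run does.
print_upgrade_plan() {
    target="$1"
    cat <<EOF
freebsd-update -r $target upgrade
freebsd-update install    # 1st run: new kernel only
shutdown -r now           # boot the new kernel
freebsd-update install    # 2nd run: new userland
pkg upgrade -f            # reinstall packages against the new libraries
freebsd-update install    # 3rd run: remove old files and libraries
EOF
}
```

For example, `print_upgrade_plan 12.2-RELEASE` prints the checklist for the upgrade attempted in this thread.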
 
Ahh... freebsd-update(8) really requires paying attention... Just blindly running it without correctly specifying the options can really mess up the system. I'd be scared shitless to run something like that without zfs rollback -r and bectl(8) to test things. It's easier to make sure the data is easy to separate from the main OS.
 
I'd be scared shitless to run something like that without zfs rollback -r and bectl(8) to test things.
You can still run into problems: you have to have something to run the rollback or bectl(8) on. If your system doesn't boot and you don't have access to a console or some other bootable media, you're just as screwed. It certainly helps when recovering from a botched upgrade, that's for sure. But you need to be able to boot something before you can even attempt to recover. If you know a bit about how freebsd-update(8) does a major version upgrade you can also fix things on a UFS system. ZFS would make things easier, sure, but it's not required if you know what you're doing. I've literally done hundreds, if not thousands, of upgrades in the past 20+ years, quite a number of them long before we had ZFS and freebsd-update(8). I still managed to botch some of them, but almost always found a way to recover without having to resort to a fresh install (that happened exactly once, with FreeBSD 5.0 and the UFS to UFS2 change).
 
Let's circle back to the beginning of the upgrade. I need to get a better idea of the things that happened. When you did the upgrade, after the first freebsd-update install it tells you to reboot. Did you reboot at this point? And did it immediately fail to boot (or rather, to come online)? Or did you run freebsd-update install multiple times and then reboot?

With the first run of freebsd-update install only the kernel gets updated. If you haven't done anything else to the system then the userland is still on the old version, only the kernel is new. In that case you can simply copy /boot/kernel.old over /boot/kernel to restore the old kernel. As nothing else has been touched at this point the system would simply be restored in its original state.
Damn, I wish I'd thought of this earlier. I do have kernel.old, but it contains GENERIC as I subsequently did freebsd-update (through rescue) again, to try to reinstall the kernel. I tried copying it back anyway, but it does not boot.

Based on the contents of /var/log/messages and my memory (it's been a few days) this is what I did:

1. zpool scrub zroot
2. pkg upgrade
3. freebsd-update -r 11.2-RELEASE upgrade
...follow prompts...
4. shutdown -r now

And now, compiling the 11.2-RELEASE kernel - even via a rescue image that is 11.2-RELEASE itself - bombs out. [EDIT: Looks like I need to build the kernel toolchain separately because the source uses a new directive unsupported by CLANG4]

I'm at my wits' end here. Without console access I'm pretty much stuck.

Code:
--- support.o ---
cc -c -x assembler-with-cpp -DLOCORE -O2 -pipe -fno-strict-aliasing  -g -nostdinc  -I. -I/tmp/mnt/usr/src/sys -I/tmp/mnt/usr/src/sys/contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h  -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD  -MF.depend.support.o -MTsupport.o -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value  -mno-aes -mno-avx  -std=iso9899:1999  -Werror /tmp/mnt/usr/src/sys/amd64/amd64/support.S
/tmp/mnt/usr/src/sys/amd64/amd64/support.S:834:2: error: unknown directive
 .altmacro
 ^
<instantiation>:1:13: error: invalid register name
handle_ibrs_%(ll):
            ^~
<instantiation>:3:2: note: while in macro instantiation
 ibrs_seq_label %(ll)
 ^
...
...
<<<...lots more errors snipped...>>>
 
And now, compiling the 11.2-RELEASE kernel - even via a rescue image that is 11.2-RELEASE itself - bombs out.
Does that rescue media have a kernel.txz? Just extract that to overwrite the botched kernel. It's probably going to be an 11.2 kernel, but that's less likely to cause problems on an 11.1 userland than a 12.2 kernel might.
 