Solved Networking failures not long after every boot (after restoring a root ZFS pool to new hardware)

EDIT: For anyone looking for the solution, it was to install the net/realtek-re-kmod package. Though in my case, since my system was not up-to-date, the critical first step was to update to the latest RELEASE version of FreeBSD.

I'm replacing a NUC that I use as a server with several services running in jails via Bastille. The old NUC is a NUC7CJYH, and the new one is a NUC11ATKC4. Importantly, they have the same NIC, a Realtek 8111H-CG, which uses the re driver.

The old server has a SATA SSD with a root ZFS pool, which I backed up using zfs send/recv to a pool on an external HDD. The new server has an NVMe SSD. To get the basic partitions in place, similar to other threads I've seen on the subject, I used the FreeBSD memstick installer to do a basic install with the default settings, then destroyed the resulting zroot pool and recreated it empty. I then did another zfs send/recv from the Live CD to restore the backup pool. There was a minor hiccup here, as I had to change the fstab to reflect the change from /dev/ada0 to /dev/nvd0.
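
For anyone doing the same restore, here is a minimal sketch of that fstab change. The function name and the sample path in the usage comment are just illustrative, and it assumes the partition numbers are the same on both disks (e.g. ada0p3 becomes nvd0p3), which may not hold for every layout. It only prints the rewritten file, so you can review it before replacing anything.

```shell
# Print a copy of an fstab with one disk device prefix swapped for another.
# Hypothetical helper; usage: rewrite_fstab_device /dev/ada0 /dev/nvd0 /mnt/etc/fstab
rewrite_fstab_device() {
    old_dev=$1
    new_dev=$2
    fstab=${3:-/etc/fstab}
    # Only rewrite the device at the start of a line (the first fstab field),
    # so comment lines and mount points are left alone.
    sed "s|^${old_dev}|${new_dev}|" "$fstab"
}
```

Redirect the output to a new file and compare it with the original before copying it into place.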

Things initially went smoothly. There were no errors and the jails spun up without any issues. But after a short time, I started to get errors about resolving domain names and about connections to other services, and I could no longer ssh in to the machine. I didn't see any errors other than failures to connect to outside systems (ntpd, VPN, etc.). If I reboot the system, things come back working fine again ... for a little while. Usually less than 30 minutes.

I have the server configured to use a static IP address outside of the DHCP range of my router, and the old server is no longer connected to the network, but I thought in case there was still something cached with the MAC address causing an issue, I would reboot the router (which is running pfSense). But that didn't help.

So I'm thinking that perhaps there are some other settings that are specific to the hardware, or some kind of configuration that doesn't play nicely being transported to slightly different hardware. I've looked through the usual places, such as /etc/rc.conf, but I'm not seeing anything.

It's very frustrating as the old server was like a rock.
 
Have you looked with dmesg -a or in /var/log/messages for something related to the problem, possibly concerning your network card? Anything that mentions "re0: watchdog timeout", by chance?
 
Yes, I do see re0: watchdog timeout a lot in the logs. It also appears at other times in the past, but it has been particularly frequent these last couple of days. I'm not entirely sure what that means, I'm afraid.
 
It means that your re0 interface loses connectivity. It has been a recurrent problem with Realtek network cards for ages. On my server, this happens rarely and only for a few seconds, so I leave it as it is because there is no real impact. But, depending on the card, it can be just hell.

Therefore, the cards in your two NUCs aren't exactly the same (different firmware, perhaps).

Try the realtek-re-kmod port as Charlie_ advised you.
 
Interesting. I suppose even with the same part number, the network cards could still be slightly different. I do remember hearing bad things about Realtek cards, though I had pretty good luck so far with the other NUC. Odd that Intel would use a Realtek chipset for wired networking in their own machines. They use their own chips for wireless, though I have that disabled.

Anywho, I will give net/realtek-re-kmod a try and report back. Thanks!
 
Unfortunately, even after installing that new driver, I'm still experiencing issues along with plenty of this in /var/log/messages:
Code:
Sep  1 22:44:41 miniserver kernel: re0: watchdog timeout
Sep  1 22:44:41 miniserver kernel: re0: link state changed to DOWN
Sep  1 22:44:44 miniserver kernel: re0: link state changed to UP
Sep  1 22:44:52 miniserver kernel: re0: watchdog timeout
Sep  1 22:44:52 miniserver kernel: re0: link state changed to DOWN
Sep  1 22:44:55 miniserver kernel: re0: link state changed to UP
Sep  1 22:45:02 miniserver kernel: re0: watchdog timeout
Sep  1 22:45:02 miniserver kernel: re0: link state changed to DOWN

It might be worth noting that it does seem to correlate with usage of net-p2p/transmission. I shut down the jail that runs that along with an OpenVPN client for a short while before trying this new Realtek driver and there didn't seem to be any issues.
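
To check that correlation against the logs, a rough sketch like the following could count the timeouts and tally them by hour. The function names are made up for illustration, and it assumes the standard syslog line format shown above (month, day, time, host, then the kernel message).

```shell
# Count "<iface>: watchdog timeout" kernel lines in a syslog-style file.
# Hypothetical helper; usage: count_watchdog_timeouts /var/log/messages [iface]
count_watchdog_timeouts() {
    logfile=$1
    iface=${2:-re0}
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status clean.
    grep -cF -- "kernel: ${iface}: watchdog timeout" "$logfile" || true
}

# Tally the same lines per hour, e.g. "2 Sep 1 22:00".
watchdog_timeouts_per_hour() {
    logfile=$1
    iface=${2:-re0}
    awk -v pat="kernel: ${iface}: watchdog timeout" \
        'index($0, pat) { print $1, $2, substr($3, 1, 2) ":00" }' "$logfile" |
        sort | uniq -c
}
```

Comparing the per-hour tally with the times the transmission jail was running should show whether the two really line up.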
 
The only thing that comes to mind now for solving your problem is to set up a Linux VM with bhyve, pass the card through to it, and use this VM as a connectivity provider. That way, the card will use the Linux driver. But if you're not familiar with bhyve, the road may be long.
 
Oof. I think at that point, I'd rather just get other hardware. I can still revert to the previous NUC.

The reason I'm migrating servers is actually because of another NUC that crashes a lot and seems to be hardware-related. That one (a NUC6CAYH) is my HTPC (running LibreELEC) and needs the least performance. So the plan was to use the NUC7CJYH as the new HTPC, and the NUC11ATKC4 as the new FreeBSD server. But I suppose I could use the NUC11ATKC4 as the new HTPC, even though it's way overkill, since I think Realtek NICs work better with Linux, then go back to the NUC7CJYH as the server until I find something else suitable with an Intel NIC.

So are we sure it's down to the Realtek NIC being garbage? Are those logs a smoking gun, or is there anything else that might be causing it?
 
The messages themselves say that the interface goes down, and thus has no connectivity until it comes back up and stays up without another "watchdog timeout" message a few seconds later. So, this is the symptom of your problem.

That said, I don't know whether something other than a faulty driver can cause this. The network card in this machine could be defective, or (why not) it could even be a software issue related to the transfer of the original pool, as you first suspected.
 
While the message might be 're0 watchdog', you might want to investigate BIOS settings.
Can you disable the EFI network stack?
I would recommend you start there.
 
Looks like a June 2023 BIOS is newest.
 
Thinking about it, a BIOS upgrade might be a waste. NUCs use Visual BIOS, and that offers limited troubleshooting.
Intel axed NUC boxes and these designers are gone already. Probably a small crew left to wrap it up.

Realtek NICs are so hit and miss across all open source OSes.
When they work they are OK, but when they don't, the device is garbage.

Today I am working on an UpSquared and LAGG0 with Realtek. Working great. Realtek is really hit or miss for me.
 
Thanks for the extra ideas, Phishfry. I was going to look at BIOS updates next. I didn't see anything about the EFI Network Stack last time I was in there. There are just some basic switches to enable/disable the NIC or wifi.

But I was curious to know if I could confirm that the new driver from net/realtek-re-kmod was really being used. I then remembered that when I installed it, there was a message about incompatible OS versions. I ignored it at the time, but then did a quick search and from a Reddit thread on this driver, it seems that incompatibility is an issue. I was on 13.1-RELEASE, and I'm betting that kernel module was built for 13.2.

So I had a look at the output of kldstat, and sure enough, if_re.ko was not in the list. So I then:
  • uninstalled the net/realtek-re-kmod package
  • removed the relevant lines from /boot/loader.conf
  • upgraded to 13.2-RELEASE-p3
  • upgraded existing packages
  • re-installed the net/realtek-re-kmod package
  • replaced the relevant lines in /boot/loader.conf
  • rebooted and confirmed that if_re.ko is in the output of kldstat
  • rebooted with all the services running
It's only been 30 minutes or so, but so far, I haven't seen any "watchdog timeout" messages. So, fingers crossed, this might be solved. If not, I'll look into BIOS updates.
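
For anyone repeating the last two checks, here is a minimal sketch of how they could be scripted. The function names are made up, and the loader.conf check assumes the driver is enabled with an if_re_load="YES" line (that is what I recall the package message suggesting, so check your own file).

```shell
# Succeeds if kldstat output on stdin lists if_re.ko.
# Hypothetical helper; usage: kldstat | re_kmod_loaded && echo "port driver loaded"
re_kmod_loaded() {
    grep -q 'if_re\.ko'
}

# Succeeds if a loader.conf-style file asks the loader to load if_re.
# Assumption: the enabling line has the form if_re_load="YES".
loader_conf_enables_re_kmod() {
    conf=${1:-/boot/loader.conf}
    grep -Eiq '^[[:space:]]*if_re_load="?YES"?' "$conf"
}
```

If the loader.conf check passes but if_re.ko is missing from kldstat, the module was requested but not loaded, which is the state I was in on 13.1-RELEASE.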
 
Is the light at the end of the tunnel? :)
Careful, sometimes the light at the end of the tunnel is a train. ;)

But it's been a day now, and I haven't seen any "watchdog timeout" messages. So I think we're in the clear. I think I'll give it a week before I recommission the older NUC to its new purpose, just in case.
 
So, this is a happy end. But... You have to be aware of two things:

- It's possible that the effective speed of the interface is lower than with the built-in FreeBSD driver. You should measure the download/upload speed of your NUC, just to know.
- Each time you do a minor or major update of FreeBSD, you'd better deactivate this kmod before rebooting. Then upgrade packages and check that if_re.ko loads properly before reactivating net/realtek-re-kmod.
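
As a rough sketch of the second point, something like this could switch the module off and on around an upgrade. The function name is made up, and it assumes the kmod is enabled through an existing if_re_load line in /boot/loader.conf. If there is no such line, the file is left unchanged.

```shell
# Rewrite the if_re_load line in a loader.conf-style file.
# Hypothetical helper; usage:
#   set_re_kmod_load NO  /boot/loader.conf   (before the OS upgrade and reboot)
#   set_re_kmod_load YES /boot/loader.conf   (after packages are upgraded)
set_re_kmod_load() {
    value=$1
    conf=${2:-/boot/loader.conf}
    _tmp=$(mktemp) || return 1
    # Replace any existing if_re_load assignment; other lines pass through
    # untouched. Writing back with cat keeps the file's owner and mode.
    sed "s/^if_re_load=.*/if_re_load=\"${value}\"/" "$conf" > "$_tmp" &&
        cat "$_tmp" > "$conf"
    _status=$?
    rm -f "$_tmp"
    return $_status
}
```

Keep a copy of loader.conf before editing it this way, since a broken loader.conf on a headless box is no fun.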
 
Thanks. I'll definitely keep these points in mind.
 