Shell Script for rebooting if network down for a certain amount of time?

This morning I discovered that my server running 13.0-RELEASE-p6 wasn't reachable anymore, which left me no other choice than to perform a hardware reset, after which everything was working again. It started during the night with re0 losing its connection and toggling the link state between DOWN and UP until the hardware reset.

/var/log/messages only showed the following:
Code:
Jan 24 23:24:43 server kernel: re0: watchdog timeout
Jan 24 23:24:43 server kernel: re0: link state changed to DOWN
Jan 24 23:24:47 server kernel: re0: link state changed to UP
Jan 24 23:24:52 server kernel: re0: watchdog timeout
Jan 24 23:24:52 server kernel: re0: link state changed to DOWN
Jan 24 23:24:56 server kernel: re0: link state changed to UP
Jan 24 23:25:02 server kernel: re0: watchdog timeout
Jan 24 23:25:02 server kernel: re0: link state changed to DOWN
Jan 24 23:25:05 server kernel: re0: link state changed to UP
...

No idea what caused the problem and no idea how to analyze this any further.

What would be the most elegant solution to prevent such a problem in the future?

So likely a script that checks connectivity and reboots the server if the link cannot be re-established?
 
No idea what caused the problem and no idea how to analyze this any further.
Could be anything. Broken network card, bad cable, bad switch port. The cable is the simplest to start with. Also check the switch (if it's managed) and see if you have port errors there; those might cause the switch to disable that port.

What would be the most elegant solution to prevent such a problem in the future?
Fix the cause of the issue.

So likely a script that checks connectivity and reboots the server if the link cannot be re-established?
This is just combating symptoms, not a real solution.
 
Could be anything. Broken network card, bad cable, bad switch port. The cable is the simplest to start with. Also check the switch (if it's managed) and see if you have port errors there; those might cause the switch to disable that port.

I already had that checked; no hardware issue could be determined.

Fix the cause of the issue. This is just combating symptoms, not a real solution.

Can't fix what can't even be analyzed any further. That server was running just fine for over a month and this is the first time that the shown problem occurred.

Combating the symptoms may not be the solution, but it will at least prevent a prolonged downtime (like last night, when I didn't discover the issue until the morning) and the need for a hardware reset.
 
That server was running just fine for over a month and this is the first time that the shown problem occurred.
Did you do any updates before the issues started? If not then it's possible the hardware itself just started showing issues. Have you tried replacing the network card?

Combating the symptoms may not be the solution, but it will at least prevent a prolonged downtime (like last night, when I didn't discover the issue until the morning) and the need for a hardware reset.
Most of us have monitoring in place (Zabbix, Munin, Nagios, Monit, etc.) that will alert us when a system goes down (or gets disconnected from the network).
 
Did you do any updates before the issues started? If not then it's possible the hardware itself just started showing issues. Have you tried replacing the network card?

No updates before the issue happened. A hardware problem may be possible but is quite unlikely, as it's a new dedicated server in a data center that went online only a little more than a month ago.

Most of us have monitoring in place (Zabbix, Munin, Nagios, Monit, etc.) that will alert us when a system goes down (or gets disconnected from the network).

Thought so, and I will have to start playing around with monitoring at some point, but I was hoping for a quick solution.

Though monitoring doesn't really help in such a case. The ISP's dashboard showed that the server was still running, and with no access anymore due to the network problem, the only possible solution would have been a hardware reset anyway.

Even a notification would not have helped as I was asleep when it happened, and since I checked and discovered the problem after waking up, it would not have made any difference.

So considering all that, an automated reboot under these circumstances would have been the best solution.
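
For what it's worth, here's a minimal sketch of such a watchdog, assuming it is run from cron every minute, that the default gateway (192.0.2.1 below is just a placeholder) is a sensible ping target, and that five consecutive failed checks are an acceptable grace period; all of these are assumptions to adjust, not a tested recipe:

Code:
#!/bin/sh
# Hypothetical network watchdog: reboot if connectivity stays down.
# TARGET and FAIL_LIMIT are placeholders to adjust for your setup.

TARGET="192.0.2.1"               # e.g. the default gateway
FAIL_LIMIT=5                     # consecutive failed runs before rebooting
STATE="/var/run/netwatch.fails"  # failure counter kept across cron runs

fails=0
[ -f "$STATE" ] && fails=$(cat "$STATE")

# One check: three pings, give up after 5 seconds.
if ping -c 3 -t 5 "$TARGET" > /dev/null 2>&1; then
    echo 0 > "$STATE"
    exit 0
fi

fails=$((fails + 1))
echo "$fails" > "$STATE"
logger -t netwatch "ping to $TARGET failed ($fails/$FAIL_LIMIT)"

if [ "$fails" -ge "$FAIL_LIMIT" ]; then
    # Try a clean interface restart first, then fall back to a reboot.
    service netif restart re0
    sleep 15
    if ping -c 3 -t 5 "$TARGET" > /dev/null 2>&1; then
        echo 0 > "$STATE"
        exit 0
    fi
    logger -t netwatch "network still down, rebooting"
    shutdown -r now
fi

It could be run from /etc/crontab with something like: * * * * * root /usr/local/sbin/netwatch.sh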
 
A hardware problem may be possible but is quite unlikely, as it's a new dedicated server in a data center that went online only a little more than a month ago.
I've had DOAs before, not that uncommon unfortunately. Watchdog timeouts are usually not a good sign. I don't like Realtek network cards, never did. They're dirt cheap but the quality and performance aren't very good.

Though monitoring doesn't really help in such a case.
You want to have monitoring in place anyway. You also want to keep an eye on the load, disk usage, etc. There are many factors that can indicate potential problems, and getting alerted means you can often take action before anything bad happens. At the very least you'll get notified immediately when the server goes offline, instead of finding out it's been offline for several hours.

The ISP's dashboard showed that the server was still running, and with no access anymore due to the network problem, the only possible solution would have been a hardware reset anyway.
Ok, so the server itself was still up and running and it just lost its connection. Did you try to restart netif from the console before restarting the server? You could also just log in on that console and restart the server using shutdown(8); that's a much better option than a hard reset (which could potentially cause filesystem issues).
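
For reference, and assuming re0 is the interface from the log above, the clean recovery sequence from a console would look roughly like this:

Code:
# Try to bring just the network back first:
service netif restart re0
service routing restart

# If that doesn't help, reboot cleanly instead of a hard reset:
shutdown -r now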

Even a notification would not have helped as I was asleep when it happened
Notifications can be sent through text messages. That's why we always have people on standby, to fix issues that happen outside of the regular office hours. And that does mean you can get called when you're asleep.
 
I've had DOAs before, not that uncommon unfortunately. Watchdog timeouts are usually not a good sign. I don't like Realtek network cards, never did. They're dirt cheap but the quality and performance aren't very good.

It must be an onboard Realtek network card, and without a permanent failure and no other data to show, they are not going to replace the whole mainboard, so I'm stuck with the present hardware for the time being.

After all, couldn't it be an issue with the re driver as well?

Since it's the first time in over a month that this problem has occurred, there is no point in contacting the ISP's support again, but I will do that if it happens again.

You want to have monitoring in place anyway. You also want to keep an eye on the load, disk usage, etc. There are many factors that can indicate potential problems, and getting alerted means you can often take action before anything bad happens. At the very least you'll get notified immediately when the server goes offline, instead of finding out it's been offline for several hours.

I know, I'll have a look as soon as I find the time. In the meantime I'll just see what happens. This server isn't doing mission-critical things, so a downtime of a few hours doesn't cause me too much of a headache.

Ok, so the server itself was still up and running and it just lost its connection. Did you try to restart netif from the console before restarting the server? You could also just log in on that console and restart the server using shutdown(8); that's a much better option than a hard reset (which could potentially cause filesystem issues). Notifications can be sent through text messages. That's why we always have people on standby, to fix issues that happen outside of the regular office hours.

I don't have access to a console, so the only way was resetting the server anyway.

A notification would not have helped, as I'm the only one handling that server and my phone is off during the night.
 
After all, couldn't it be an issue with the re driver as well?
Possible, but then it would have shown up earlier, not suddenly after the server had been running fine for a while with no changes to the driver in the meantime.

I don't have access to a console, so the only way was resetting the server anyway.
How did you get access to the reset switch? If you have access to IPMI you have console access.
 
Possible, but then it would have shown up earlier, not suddenly after the server had been running fine for a while with no changes to the driver in the meantime.

Let's hope you are wrong, because a real hardware issue would be difficult to handle under these circumstances, since it's not reproducible for now.

How did you get access to the reset switch? If you have access to IPMI you have console access.

It's a reset option in the ISP's dashboard (it's a server at Hetzner, if you know what that looks like).
 
Well, with a Realtek NIC onboard this does not seem to be a server-grade mainboard, so probably no IPMI.
They may have some external power control?
 
Well, with a Realtek NIC onboard this does not seem to be a server-grade mainboard, so probably no IPMI.
They may have some external power control?

It's their cheapest AX server, so no, definitely no server-grade hardware.

This is what the UI for requesting a reset looks like:

[attached image: screenshot.jpg]
 
I personally can't live without IPMI or some other form of console access (an IP KVM switch, for example), especially on remote machines. There's always a risk of a screw-up (mistakes in the IP configuration, SSH not working, getting stuck in single-user mode, etc.) and having access to the console can be a lifesaver.
 
replace the server NIC.
It's an onboard NIC as I understood it, so that would mean replacing the mainboard. It's probably easier to ask them to fit a PCI/PCIe network card in the machine.
 
Well, there's the watchdog timeout before the DOWN/UP; these link state changes are more likely due to the driver than the physical connection. It seems there's an old PR 166724 that still appears to be valid on newer versions of FreeBSD too. It's worth exploring further.
 
Well, there's the watchdog timeout before the DOWN/UP; these link state changes are more likely due to the driver than the physical connection. It seems there's an old PR 166724 that still appears to be valid on newer versions of FreeBSD too. It's worth exploring further.

From Revision 542324; see also: "Supported devices":
Realtek PCIe FE / GBE / 2.5G / Gaming Ethernet Family Controller
kernel driver.

This is the official driver from Realtek and can be loaded instead of
the FreeBSD driver built into the GENERIC kernel if you experience
issues with it (eg. watchdog timeouts), or your card is not supported.

PR 166724 and D33677 note:
If this driver is causing problems then the unmodified driver from
the vendor can be found in ports under net/realtek-re-kmod.
If I'm reading that correctly, this unmodified vendor driver is what net/realtek-re-kmod provides.

If you're not already using this driver from ports, you might want to give it a try.
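
If I remember the port's install notes correctly (please double-check the pkg-message after installing, as the loader.conf lines below are quoted from memory), switching over is roughly:

Code:
# Install the vendor driver:
pkg install realtek-re-kmod

# Load it at boot instead of the in-kernel re(4) driver:
sysrc -f /boot/loader.conf if_re_load="YES"
sysrc -f /boot/loader.conf if_re_name="/boot/modules/if_re.ko"

# Reboot so re0 attaches with the vendor driver:
shutdown -r now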
 
If it were a driver issue I would have expected these issues to happen all the time, not out of the blue after months of correct operation. But it'll be good to rule out any issues that may be caused by the driver.
 
Just rebooted, and re0: version:1.96.04 is now in use. I have a good feeling that this will solve the issue.

For reference, the probe output for the NIC in use:

Code:
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xfc504000-0xfc504fff,0xfc500000-0xfc503fff irq 35 at device 0.0 on pci9
re0: <Realtek PCIe GbE Family Controller> port 0xd000-0xd0ff mem 0xfc504000-0xfc504fff,0xfc500000-0xfc503fff irq 35 at device 0.0 on pci9
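
For anyone following along, which driver actually attached after the reboot can be double-checked with standard tools (nothing specific to this port):

Code:
# Was the module loaded from /boot/modules?
kldstat | grep if_re

# The vendor driver prints its version string when it attaches:
dmesg | grep 're0: version'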
 
If it were a driver issue I would have expected these issues to happen all the time
Not necessarily; it depends on how the driver is being stressed. I still agree with you about checking the cables and the rest. netstat -ss can be used to check errors reported by the OS. Also, the OP didn't mention the history of the server (versions); it could be that the newer setup is rubbing the code the right way, as I'm used to saying. :)

But that's why I also shared the PR; the port driver is mentioned there as an alternative to the base one.
 
I still agree with you about checking the cables and the rest. netstat -ss can be used to check errors reported by the OS. Also, the OP didn't mention the history of the server (versions); it could be that the newer setup is rubbing the code the right way, as I'm used to saying. :) But that's why I also shared the PR; the port driver is mentioned there as an alternative to the base one.

I can't check any hardware as it's a server in a data center, and I don't want to bother the support team again just because of that single failure that could not be reproduced so far. If it happens again, I'll request that the cable be checked (according to them there were no issues at the time of the incident with the switch the server is connected to).

Not sure what you mean by server history. It's new hardware that went online a little more than a month ago; I installed FreeBSD 13.0-RELEASE and performed the upgrades to the current patch level, 13.0-RELEASE-p6.

net/realtek-re-kmod is now in use and I'll report back if the issue should arise again.
 
Compare the total packets vs. the ones that are bad. netstat -i also provides a good summary; compare all packets vs. the bad ones. Bad cabling would definitely yield errors.
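
Concretely, and assuming re0 from the log earlier in the thread, that check is something like:

Code:
# Per-interface packet and error counters (Ierrs/Oerrs vs. totals):
netstat -i -I re0

# Per-protocol statistics as seen by the network stack:
netstat -ss

# Live per-second counters, handy while the link is misbehaving:
netstat -w 1 -I re0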
 