Server Shutting Down With No Logs

Hello, I was hoping someone could offer fresh ideas to troubleshoot this error.

At work a server has been shutting down overnight consistently for the past couple of days. The error logs show nothing about the shutdown, which leads us to believe it is faulty hardware.

We have tried swapping the power supply, resetting the BIOS, plugging into a new UPS, cleaning the CPU and re-applying fresh thermal grease, and cleaning the RAM and moving it to a new slot. We have also visually inspected all the capacitors etc. for any noticeable damage. None of these have resolved the issue.


Any fresh ideas would be greatly appreciated.

Thanks in advance.
 
I don't know much about the specific hardware (manufacturers' names etc.), but the server is running FreeBSD 8.2. It is a mail server, and prior to having issues it had been running without problems for 380+ days.
 
I'm just shooting in the dark here but 8.2 is End-of-Life and with a 380+ days uptime I'm guessing nobody installed any security patches.
 
SirDice said:
I'm just shooting in the dark here but 8.2 is End-of-Life and with a 380+ days uptime I'm guessing nobody installed any security patches.

I do not manage the server, so I wouldn't be able to give you a correct answer. The administrator is thinking hardware issues. Do you think it could be some kind of security issue?

Are there any other hardware issues than the ones listed that you can think of off the top of your head? We are open to all suggestions.
 
user1 said:
The administrator is thinking hardware issues. Do you think it could be some kind of security issue?
The only hardware issue that would cause a sudden shutdown is overheating. Or your power company isn't supplying a 'clean' signal and the power supply simply shuts down.

Oh, and I've had a case where a server inexplicably went down around the same time every day. This turned out to be the cleaning lady that unplugged the machine so she could use the socket for her vacuum. Seriously, this happened.

But besides that, yes, an unmaintained and unpatched machine on the internet? That's just asking for it.
 
SirDice said:
The only hardware issue that would cause a sudden shutdown is overheating. Or your power company isn't supplying a 'clean' signal and the power supply simply shuts down.

Or bad memory causes a panic. Or a sudden increase in usage drives marginal components into the failure zone.
 
wblock@ said:
Does it shut down the same time every day, like when a particular cron(8) job runs?

We are unsure of the exact shutdown time. It happens overnight and the servers are not monitored at night. When we come into the office and check the server, it is shut down. I suggested running a memory test (memtest) and checking the CPU temperatures (sysctl dev.cpu.0.temperature, or sysctl -a | grep tempe). I will ask about the cron jobs; thank you for the suggestion.

The external power was a concern a week or two ago and we mentioned it to the power company. The external power is monitored and doesn't seem abnormal. There are quite a few servers running and none of the others have had issues similar to this one.

One of the other servers did have a bad HDD close to the time this server started having problems. Seems unrelated but wanted to note it.
 
Hard drives often fail in clusters.

To find the time of reset, a cron job could be added that just mails an "I'm alive" message once an hour or more.
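A minimal sketch of such a heartbeat, assuming /bin/sh, a working mail(1) on the box, and a recipient on some other machine. The script name, install path, and admin@example.com are placeholders, not anything from this thread:

```sh
#!/bin/sh
# heartbeat.sh (hypothetical name): builds a one-line "I'm alive" message.

heartbeat_message() {
    # uname -n prints the host name; the timestamp is local time.
    printf "%s: I'm alive at %s\n" "$(uname -n)" "$(date '+%Y-%m-%d %H:%M:%S')"
}

# Hourly entry for root's crontab (assumed install path and recipient):
#   0 * * * * . /usr/local/sbin/heartbeat.sh && heartbeat_message | /usr/bin/mail -s heartbeat admin@example.com
```

The last message that arrives brackets the shutdown to within an hour. A shorter interval narrows it further.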
 
wblock@ said:
Or bad memory causes a panic.
Wouldn't that leave traces in /var/log/messages? At the very least a crash dump in /var/crash/.

User1, also check the BIOS. There's usually a setting for when the power goes out and back on again. Most servers have the option for "off", "on" or "last state". If it's a power fluctuation and it turns off at least it should turn back on again when the power is good.
 
SirDice said:
Wouldn't that leave traces in /var/log/messages? At the very least a crash dump in /var/crash/.

Maybe, depends on the failure mode. Seems like I've also heard of CPU cache going bad.
 
/var/log/messages will give you a lot of information, like:
  • Whether this was a clean shutdown or not.
  • The time that this occurred.

Also, during the night periodic scripts run which can stress faulty hardware.
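As a rough sketch of that log check: the helper below (shutdown_hints is a made-up name) greps a syslog file for lines that commonly accompany a shutdown, a boot, or a panic. The patterns are my assumptions about typical FreeBSD messages and are certainly not exhaustive:

```sh
#!/bin/sh
# shutdown_hints [FILE]: print syslog lines that commonly mark a shutdown,
# a boot, or a panic. Defaults to /var/log/messages.
# The patterns are assumptions about typical FreeBSD output, not a full list.
shutdown_hints() {
    logfile="${1:-/var/log/messages}"
    grep -E -i 'shutdown|reboot|halt|panic|exiting on signal|Copyright \(c\) 1992' "$logfile"
}
```

Run it as `shutdown_hints /var/log/messages`. A kernel boot banner with no "syslogd: exiting on signal 15" line before it usually means the machine lost power or hung rather than shutting down cleanly, and the last timestamp before the banner gives an upper bound on when it went down.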
 
This is still an ongoing issue; the server was off this morning when we came in. The network administrator is going to check through the cron jobs, but he did not seem too concerned about them. I don't think that much is run on that server overnight.

I'll keep everyone posted. Hopefully we can figure out what the problem is soon.
 
user1 said:
This is still an ongoing issue; the server was off this morning when we came in. The network administrator is going to check through the cron jobs, but he did not seem too concerned about them. I don't think that much is run on that server overnight.

Why do you think that a "Network Administrator" will be able to solve this problem for you?

Do you think that this is related to a network issue?

If the Network Administrator is not concerned about the periodic scripts then maybe you need to find a System Administrator.

I am being very honest and blunt because your approach is really a recipe for disaster. Your topic suggests that your server, which is running an unpatched and EOL operating system, is shutting down overnight without any errors in the logs.
You were asked to provide more information about this system but you can't because you obviously don't know how to. So, how can you be so sure that there is nothing in the logs that may give you a clue on where to start looking for the problem?
 
gkontos said:
Why do you think that a "Network Administrator" will be able to solve this problem for you?

Do you think that this is related to a network issue?

If the Network Administrator is not concerned about the periodic scripts then maybe you need to find a System Administrator.

I am being very honest and blunt because your approach is really a recipe for disaster. Your topic suggests that your server, which is running an unpatched and EOL operating system, is shutting down overnight without any errors in the logs.
You were asked to provide more information about this system but you can't because you obviously don't know how to. So, how can you be so sure that there is nothing in the logs that may give you a clue on where to start looking for the problem?

Allow me to clarify the situation to avoid any confusion. Where I work there is a network/system admin who is in charge of the entire network and all the servers. He is troubleshooting the server. I am completely confident he will solve the issue, but I am looking to help him solve it faster.

Also in my post further down I mentioned I do not know the status of patches and would not be able to provide valid information on whether it is patched or not.

As for the logs, I am going off what the administrator told me. I'm sure he is competent enough to search the proper logs for errors.

I am new to networking and working with servers, and I am hoping someone on here (I know there are very experienced administrators on this site) would be able to give me advice to troubleshoot this problem with the limited information that is available to me.
 
user1 said:
I am new to networking and working with servers, and I am hoping someone on here (I know there are very experienced administrators on this site) would be able to give me advice to troubleshoot this problem with the limited information that is available to me.

It is very difficult to find people with psychic abilities in a technical forum.
 
I'd suggest setting up a serial console and capturing that output with another PC. If the system prints something to the console and then reboots, you'll know what the problem is. If nothing is printed and the system reboots, you have a hardware problem.

Neither the built-in VGA console nor a remote viewer for the console (via server hardware management) will help, as these don't record what has scrolled off the screen. You need to capture the console output on another system.

Some system failures intentionally don't log things to the local disk (for example, if the disk drops offline there's no disk to log to), and crash dumps have been problematic for years (the mechanisms involved are not entirely SMP / thread / interrupt safe, so you often get a double panic and no useful crash data).
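For what it's worth, a minimal sketch of that setup. It assumes a null-modem cable between the two machines, the default 9600 baud console speed, and a capturing PC that also runs FreeBSD with its serial port at /dev/cuau0. Adjust the device name and speed to whatever you actually have:

```sh
# On the failing server, in /boot/loader.conf (takes effect at the next boot):
boot_multicons="YES"              # send console output to both video and serial
boot_serial="YES"                 # make the serial port the primary console
console="comconsole,vidconsole"

# On the capturing PC (assumed device name and speed), keep a copy of
# everything that arrives in console.log using script(1) around cu(1):
#   script console.log cu -l /dev/cuau0 -s 9600
```

After the next overnight failure, the tail of console.log on the capturing PC shows whatever the server printed last, if it printed anything at all.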
 
user1 said:
We are unsure of the exact shutdown time. It happens overnight and the servers are not monitored at night.

Then add some monitoring! Seriously. If you don't know when it's shutting down, then you need to add some logging to find out. Even something as simple as the following in root's crontab:
Code:
* * * * * /bin/date >> /var/log/time.log

Then you can open the file after booting, and find out when it shut down.
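If the log holds epoch seconds rather than date's default output (an assumption on my part; in a crontab the % has to be escaped, e.g. /bin/date +\%s >> /var/log/time.log), a short awk filter can point at the gap directly. A rough sketch, with find_gaps being a made-up helper name:

```sh
#!/bin/sh
# find_gaps FILE [MAX_GAP]: FILE holds one epoch timestamp per line (assumed format).
# Prints "last_seen next_seen gap_seconds" for every gap longer than MAX_GAP
# seconds (default 120, i.e. at least one missed one-minute tick).
find_gaps() {
    awk -v max="${2:-120}" '
        NR > 1 && $1 - prev > max { print prev, $1, $1 - prev }
        { prev = $1 }
    ' "$1"
}
```

Run it as `find_gaps /var/log/time.log`. The first column of each output line is the last minute the server was known to be up; on FreeBSD, `date -r` followed by that number converts it back to a readable time.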
 
phoenix said:
...
Code:
* * * * * time >> /var/log/time.log

Then you can open the file after booting, and find out when it shut down.

I guess you meant date(1), as time(1) may not be exactly as useful in the given respect. In addition, it is recommended to use the full path for everything in the crontab, i.e.:

Code:
*       *       *       *       *       root    /bin/date >> /var/log/time.log
 
gkontos said:
Mercy, mercy !!!

A simple look at /var/log/messages will tell you EXACTLY when a server rebooted!!!

Hmm...

This time is more or less known already, i.e. it is whenever the admin presses the power button in the morning, after finding the server off.

user1 said:
This is still an ongoing issue, the server was off this morning when we came in...
 