Server stops responding at 4AM

Defre · Mar 14, 2010

Hi,

I have set up a FreeBSD 8 server for personal use. I updated to 8-STABLE due to instability with the if_vge driver. Everything else seems to run fine. The layout is as follows: the root (host) system and three jails, one service jail running nginx as a proxy in front of the two other jails, which run various servers.

Only problem: every day at 4AM the server becomes inaccessible. The last two lines in /var/log/cron are always the same:

Code:

Mar 14 04:00:00 ygg /usr/sbin/cron[21493]: (root) CMD (newsyslog)
Mar 14 04:00:00 ygg /usr/sbin/cron[21494]: (root) CMD (/usr/libexec/atrun)

I can still ping the server, but ssh, http, bind, etc. are no longer available.
At the beginning, if the first command run at 4AM was from one of the jails (for instance, I had forgotten to disable /usr/libexec/save-entropy), this command ate 100% of the CPU and only the servers running in jails became unavailable (cron on the host didn't launch commands anymore). Since I disabled the useless periodic entries in the jails, it is the host that launches commands first at 4AM and stops responding.

The hardware is the following:
- VIA C7 CPU @ 2 GHz, with padlock built into the kernel
- VGE network interface
- 2 x 150 GB HDD, no RAID.

Any ideas on how to correct this strange behavior would be welcome!

Thank you.

jailed · Mar 14, 2010

Which command ate 100% of the CPU? Is it a heavy application? Or is it something that should not normally use the full CPU?

achix · Mar 14, 2010

Any panics/crashes? How does the system get back to normal? By itself at a later time, or by you issuing an explicit reboot?
What does last -100 give?
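
If the full listing is long, it may help to filter it down to the abnormal session endings. A minimal sketch, assuming last marks them with the words crash, reboot, or shutdown (the function name abnormal_sessions is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of `last` output that mention
# a crash, reboot, or shutdown. Reads the listing on stdin.
abnormal_sessions() {
    grep -E 'crash|reboot|shutdown'
}

# Example use on the server (commented out so that sourcing this file
# has no side effects):
#   last -100 | abnormal_sessions
```

An empty result would suggest the box was never recorded as crashing, which fits a hang better than a panic.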

Defre · Mar 14, 2010

achix: no panics or crashes. If the process runs on the host, the system only gets back to normal after an explicit reboot. If the process that ate 100% is in a jail, I can kill it.

jailed: every time I checked, it was "sh", running standard scripts from a fresh FreeBSD installation, both before and after the upgrade to STABLE (with a full rebuild and installation of world and kernel).

I was not aware of the "last" command. I will execute it asap (maybe just after "manually" rebooting tomorrow).

edwtjo · Mar 16, 2010

Sorry for hijacking your thread. I moved my posts to "X becomes unresponsive..".

Defre · Mar 21, 2010

No problem edwtjo.

I tried many things: stopping cron, sendmail, jails...

Stopping sendmail helped a lot: the server kept running for more than 2 days. After a while, the hungry process started in a jail (it was "sh"; I forgot to check the full command line). I was still able to log in as a user over ssh, but running "su" (with a valid password) did nothing afterwards: the ssh connection became unresponsive. I logged in over ssh many times, and every time I tried to execute something as root, it hung. I had to reboot.

Strange behavior, although I think my installation is clean: I reinstalled BSD many times. As far as I remember, it worked well with 7.2.

jailed · Mar 21, 2010

Defre,

Is your mailbox full?

By default, cron sends the output of each job by e-mail to the user. Local mail is stored in a mail file under /var/mail.

If you don't want to receive e-mail for a job, add

Code:

>/dev/null 2>&1

to the end of the command. You said that there's a hungry job in a jail. It may produce a lot of output, and appending all of it to a single mail file (if that file is already very large) could cause this kind of behavior.
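
A quick way to check whether any mailbox has grown large is sketched below. The function name big_mailboxes, the 10240 KB default threshold, and the crontab line in the comments are illustrative assumptions, not something taken from your setup:

```shell
#!/bin/sh
# Hypothetical helper: list mailbox files under a spool directory that are
# larger than a given size in kilobytes.
#   $1 = spool directory (default /var/mail)
#   $2 = size threshold in KB (default 10240, an arbitrary choice)
big_mailboxes() {
    spool_dir=${1:-/var/mail}
    limit_kb=${2:-10240}
    find "$spool_dir" -type f -size +"${limit_kb}"k -print
}

# A crontab entry with its output discarded would look roughly like this
# (the job path is a placeholder):
#   0 4 * * * root /path/to/job >/dev/null 2>&1
```

Running big_mailboxes with no arguments on the host, and again with each jail's own /var/mail path, should show quickly whether mail size is a factor at all.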

Defre · Mar 22, 2010

Hi,

No, the mailbox is almost empty: it's just a fresh install with some jails (though I updated to 8-STABLE due to instability with if_vge, the strange behavior "at 4AM" was already present before that).

The server seems to work if I stop sendmail_clientmqueue, both on the host and in the "full system" jails.
Otherwise, the same things happen:
- precisely at 4AM, an sh process appears, consuming 100% CPU.
- if it is in a jail, I can still run simple commands on the host (but not "su")
- if it is on the host, the server gives no sign of life other than answering "ping" (and the outbound network activity graph shows a small peak at approximately 4AM, then nothing at all).

I am quite disappointed.
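
Since the full command line of the runaway sh has not been captured yet, one option is to log the busiest processes shortly after 4AM while a shell is still usable. A rough sketch, assuming the ps on the box accepts the POSIX-style options used here; the function names and the 50% threshold are made up for illustration:

```shell
#!/bin/sh
# Hypothetical filter: read "pid ppid pcpu etime args..." lines on stdin
# (header line first) and print the ones at or above a %CPU threshold.
#   $1 = minimum %CPU (default 50, an arbitrary choice)
hot_procs() {
    awk -v min="${1:-50}" 'NR > 1 && $3 + 0 >= min + 0'
}

# Hypothetical wrapper: take one snapshot of all processes with full
# command lines and keep only the busy ones.
snapshot_hot_procs() {
    ps -A -ww -o pid,ppid,pcpu,etime,args | hot_procs "$1"
}

# Example use (commented out; the log path is a placeholder):
#   snapshot_hot_procs 50 >> /var/tmp/hot-procs.log
```

The ppid column is there so the sh process can be traced back to whatever started it (cron, periodic, or something else). On FreeBSD, adding jid to the -o list should also show which jail the process belongs to, though I have not checked that on 8-STABLE.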

posix · Mar 24, 2010

I have the same problem now after upgrading to 8-STABLE: the server stops responding to pings, and even to a directly connected keyboard, and only comes back to life after a reset. Some information:

Code:

[root@gw /]# last | grep crash
posix            pts/1    10.20.5.2        Tue Mar 23 01:49 - crash (1+00:39)

[root@gw ~]# cat /var/crash/info.13 
Dump header from device /dev/ad0s1b
  Architecture: i386
  Architecture Version: 2
  Dump Length: 136187904B (129 MB)
  Blocksize: 512
  Dumptime: Sun Mar 21 01:40:07 2010
  Hostname: gw.mydomain.ru
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 8.0-STABLE #0: Thu Mar  4 13:26:39 MSK 2010
    root@gw:/usr/obj/usr/src/sys/gw
  Panic String: privileged instruction fault
  Dump Parity: 3613957471
  Bounds: 13
  Dump Status: good

[root@gw ~]# cat /var/crash/vmcore.13 | tail
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xc0857285
stack pointer           = 0x28:0xc2e64ba0
frame pointer           = 0x28:0xc2e64bc0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi1: netisr 0)
trap number             = 1
panic: privileged instruction fault
cpuid = 0
Uptime: 6d22h4m8s
Physical memory: 499 MB
Dumping 129 MB: 114 98 82 66 50 34 18 20

This kernel worked perfectly with 7.2, so I have no idea why this happens now.
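
To see whether earlier dumps show the same panic, the info files can be summarized in one pass. A small sketch, assuming they all follow the info.13 layout shown above; the function name panic_strings is made up:

```shell
#!/bin/sh
# Hypothetical helper: print "info.N: <panic string>" for every info.N
# file in a crash directory.
#   $1 = crash directory (default /var/crash)
panic_strings() {
    crash_dir=${1:-/var/crash}
    for f in "$crash_dir"/info.[0-9]*; do
        [ -f "$f" ] || continue
        awk -v name="${f##*/}" '/^ *Panic String:/ {
            sub(/^ *Panic String: */, "")
            print name ": " $0
        }' "$f"
    done
}

# For a backtrace, the usual FreeBSD tool is kgdb, along the lines of
# (paths assumed from the output above and a default install):
#   kgdb /boot/kernel/kernel /var/crash/vmcore.13
```

If every dump reports the same privileged instruction fault, that at least points to one reproducible problem rather than random hardware trouble.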