7.1 prerelease gateway: panic and reboot

kisscool-fr · Feb 5, 2010

Hi,

I have a problem with a server acting as a gateway and firewall.
The problems started approximately 2 months ago.

I tried some troubleshooting: I disabled some services and hardware, thinking it might help, but it did not.

The gateway panics and reboots randomly. The reboots are becoming more and more frequent. Just today it rebooted 2 times.

Here is some information that I hope will be helpful.

Code:

uname -v
FreeBSD 7.1-PRERELEASE #7: Sat Sep 26 14:48:18 CEST 2009     root@gateway.cifacom.lan:/usr/obj/usr/src/sys/KERNv1

The content of /var/run/dmesg.boot http://pastebin.com/m737f7577

I followed the kernel debugging page in the handbook and the results are:

Code:

kgdb kernel.debug /var/crash/vmcore.14
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:
spin lock 0xc07886c0 (ÀxÀd) held by 0xc0788660 (tid 0) too long
panic: spin lock held too long
Uptime: 17h43m59s
Physical memory: 503 MB
Dumping 118 MB: 103 87 71 55 39 23 7

Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/acpi.ko
Reading symbols from /boot/kernel/if_vlan.ko...Reading symbols from /boot/kernel/if_vlan.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/if_vlan.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /boot/kernel/pf.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/pf.ko
Reading symbols from /boot/kernel/if_tun.ko...Reading symbols from /boot/kernel/if_tun.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/if_tun.ko
#0  doadump () at pcpu.h:196
196             __asm __volatile("movl %%fs:0,%0" : "=r" (td));

Code:

kgdb kernel.debug /var/crash/vmcore.16
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xd627fa70
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc0558ea1
stack pointer           = 0x28:0xd607fa54
frame pointer           = 0x28:0xd627fa80
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 1255 (lldpd)
trap number             = 12
panic: page fault
Uptime: 13m8s
Physical memory: 503 MB
Dumping 71 MB: 56 40 24 8

Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/acpi.ko
Reading symbols from /boot/kernel/if_vlan.ko...Reading symbols from /boot/kernel/if_vlan.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/if_vlan.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /boot/kernel/pf.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/pf.ko
#0  doadump () at pcpu.h:196
196             __asm __volatile("movl %%fs:0,%0" : "=r" (td));

If other information is needed, ask me and I will answer quickly.

Thanks

PS: excuse my poor English.

SirDice · Feb 5, 2010

If it had been working fine and suddenly starts acting up it is usually a hardware problem. Check memory and disks.

DutchDaemon · Feb 5, 2010

And dust, temperature, loose contacts/wiring, short circuits, cards touching bare metal, etc.

kisscool-fr · Feb 5, 2010

Reading the logs and daily/weekly reports, I saw that for 2 of the 3 available disks, the system is changing the optimization mode from time to space. I disabled those disks and the problem persists.

I ran memory tests again one hour ago, and it seems that one module has error(s). The first test didn't show that.
I'm trying to identify that module and replace it to see if it helps.

Is smartmontools a good way to test the hard drives?

jalla · Feb 5, 2010

kisscool-fr said:
Reading the logs and daily/weekly reports, I saw that for 2 of the 3 available disks, the system is changing the optimization mode from time to space. I disabled those disks and the problem persists.

This is irrelevant to your problem, it's caused by your disks filling up. See tunefs(8) for some explanation (-o option).

kisscool-fr · Feb 5, 2010

jalla said:
This is irrelevant to your problem, it's caused by your disks filling up. See tunefs(8) for some explanation (-o option).

Yes, I know the option exists and how to change it.
But the disks weren't full. They always had 35% free space available.

But that's not the problem. Seeing those logs, I disabled the disks because I thought it might be related to the panics/reboots.

kisscool-fr · Feb 5, 2010

I'm still running my tests.

The server rebooted again 15 minutes ago.
I was in the server room and had time to take some notes.

Code:

Fatal trap 12      : page fault while in kernel mode
fault virtual address    = 0xcefd7e34
fault code               = supervisor read, page not present
instruction pointer      = 0x20:0xc055dece.....
stack pointer            = 0x28:0xccfd7c24
frame pointer            = 0x28:0xcefd7c44
code segment             = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres1, def32 1, gran 1
processors eflags        = interrupt enabled, resume, IOPL=0
current process          = 14420(less)
trap number              = 12
panic: page fault
Uptime = 1h7m18s
...

Do these lines confirm anything?

DutchDaemon · Feb 6, 2010

To me it confirms it's sufficiently random to warrant looking into the hardware. 'Less' causing panics .. not likely.
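
One possible way to read the two trap reports quoted in this thread, offered as a sketch rather than a diagnosis: on i386 the frame pointer normally sits a short distance above the stack pointer, inside the same small kernel stack, yet in both reports the two values are roughly 2 MB and 32 MB apart. The snippet below XORs the stack and frame pointers and lists the bits that differ above the 4 KB page offset. The dictionary labels, the function name flipped_page_bits and the decision to ignore the low 12 bits are assumptions made for illustration; the addresses themselves are copied from the posts.

```python
# Stack pointer (sp) and frame pointer (fp) copied from the two
# "Fatal trap 12" reports in this thread. The labels are illustrative.
TRAPS = {
    "vmcore.16 (lldpd)": {"sp": 0xD607FA54, "fp": 0xD627FA80},
    "console notes (less)": {"sp": 0xCCFD7C24, "fp": 0xCEFD7C44},
}

PAGE_SHIFT = 12  # i386 uses 4 KB pages, so the low 12 bits are the page offset


def flipped_page_bits(sp, fp):
    """Return the bit positions (above the page offset) where sp and fp differ."""
    diff = (sp ^ fp) >> PAGE_SHIFT
    return [PAGE_SHIFT + i for i in range(diff.bit_length()) if (diff >> i) & 1]


for name, regs in TRAPS.items():
    distance = regs["fp"] - regs["sp"]
    bits = flipped_page_bits(regs["sp"], regs["fp"])
    print(f"{name}: fp - sp = {distance:#x}, differing bits above page offset: {bits}")
```

In both reports exactly one bit differs above the page offset (bit 21 in the first, bit 25 in the second). That would be consistent with a single flipped bit in a saved register, and so with the hardware suspicion, although two samples prove nothing on their own.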

kisscool-fr · Feb 6, 2010

OK, just before the panic I had started less /boot/loader.conf.

I ran the memory tests again on a healthy computer and it seems the 2 modules are OK.
I checked the hard drive with smartctl (a short and a long test), without any errors.

I will continue my investigations.

kisscool-fr · Feb 16, 2010

Hi,

After a few days, it seems to be stable.

I checked memory (memtest), disks (smartmontools), temperatures (mbmon), contacts (manually), and dust (it is clean). I removed one memory module from the server and tested it on another computer, and it is OK.

I will check it again and, if it's OK, plug it back in to see what happens, because I'm not sure the problem came from it.

Thanks guys for your responses.

kisscool-fr · Mar 1, 2010

Hi,

I'm back to close this thread.

After further tests, I can say the problem comes from the motherboard, specifically from the memory slots.

When the memory modules are moved between slots, the server reacts differently during memory tests.

slot 1: tests will not start
slot 2: tests run fine
slots 3 and 4: server will not boot

In my opinion, something was damaged during a power failure.

Thread can be tagged solved.

Thanks
