kernel trap 12 with interrupts disabled

Hi everybody!

Please, help a newbie.
There's a server running FreeBSD 5.2.1-RELEASE. I'm kindly asking you not to kick me for an out-of-date OS as it was not me who installed it, I only try to support it and at the moment don't have enough experience either to reinstall or to upgrade it.
The server works as a router with NAT and ipfw. It runs named, apache, sendmail (with some milters), samba (the server is the domain controller for a small office network) and openvpn. The Internet connection comes from a DSL modem working in bridge mode; PPPoE authentication is done on the server. There are 2 HDDs installed on two different IDE channels. One partition is mirrored with vinum. The others are backed up from the first (main) HDD to the second one by a script that cron runs once a month.
Some time ago the server started hanging. This message appears on the console:

Code:
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
fault code = supervisor read, page not present
panic: page fault
(some lines with codes and addresses are omitted).

Every time this happens I have to reboot it manually with the reset button.

I noticed that after every boot (not only after a kernel panic but literally every time) there's an error in /var/log/messages:
'kernel: session in wrong state'
It appears right after the root filesystem is mounted.
For example, here is a fragment of the log:
Code:
Dec 21 10:38:34 aqua kernel: GEOM: create disk ad0 dp=0xc2509d60
Dec 21 10:38:34 aqua kernel: ad0: 238475MB <WDC WD2500AAJB-00J3A0> [484521/16/63] at ata0-master UDMA100
Dec 21 10:38:34 aqua kernel: GEOM: create disk ad2 dp=0xc2509a60
Dec 21 10:38:34 aqua kernel: ad2: 114473MB <WDC WD1200JB-00FUA0> [232581/16/63] at ata1-master UDMA100
Dec 21 10:38:34 aqua kernel: Mounting root from ufs:/dev/ad0s1a
Dec 21 10:38:34 aqua kernel: session in wrong state
Dec 21 10:38:34 aqua last message repeated 2 times
Could you please help me find the source of this trouble? I'm also wondering whether the two things mentioned above (the kernel panic and the 'session in wrong state' message) are connected.
In general, I just can't figure out which component emits the 'kernel: session in wrong state' message.
I've learned from some Unix forums that it is often connected with PPPoE, but I guess that isn't my case.
I'm a bit suspicious about the hardware, as two HDDs have failed on the server in the last six months and both were connected to the first IDE channel.

Thanks in advance for your useful suggestions! :)
 
SirDice said:
Make sure the memory is still good.

Thanks for your prompt reply!
That's what I plan to do next. Tonight I'm going to connect a CD-drive or FDD to the server and run memtest86+. But taking into consideration the fact that the current RAM module has been working in this server for some three years I don't think it could fail. Another matter is the motherboard. We'll see.
 
SergeyMas said:
But taking into consideration the fact that the current RAM module has been working in this server for some three years I don't think it could fail. Another matter is the motherboard. We'll see.

Taking into consideration that this server has been working and accumulating dust for some three years, I would suggest a major hardware cleaning :P

Chances are, your problem is gone afterwards.

You can clean the contacts on the memory modules using some Q-tips and isopropanol. That will get rid of dust, grease and oxide.
 
mickey said:
Taking into consideration that this server has been working and accumulating dust for some three years, I would suggest a major hardware cleaning :P

Chances are, your problem is gone afterwards.

You can clean the contacts on the memory modules using some Q-tips and isopropanol. That will get rid of dust, grease and oxide.
That's a nice idea, and I always keep regular cleaning in mind. When I replaced the broken HDD this summer I did a thorough cleaning. Still, I will take your advice and do one more cleaning tonight. But something tells me that the problem isn't there...
BTW, usually I use a simple rubber eraser to clean the contacts of a RAM module. ;)
 
At least another cleaning will eliminate faulty contacts as a potential source of the problem, so you can focus on the remaining possibilities.

If it's not the source, you can only run memtest and try to isolate the faulty module.
 
Tonight I opened the server case, took all the cards out of their slots (two network adapters and a video adapter) along with the memory module, cleaned the contacts with an eraser and put them back. I disconnected and reconnected all the interface and power cables and removed the dust from the heatsinks. I also checked the 12V and 5V DC voltages with a digital multimeter: they were 12.05V and 5.16V respectively.
Then I connected a floppy drive and ran memtest86+ from a floppy. Three full loops (9 tests each) passed with no errors at all. Then, just in case, I tested the HDDs (both WD) with WD Data Lifeguard Tool. Again no errors. Then I disconnected the floppy drive, started FreeBSD in single-user mode and ran fsck -f -y. The filesystems on both HDDs were reported clean.
Finally, I started the server in normal mode. Everything was just perfect (except that I didn't check whether that 'wrong state' message was in the log). So I went home completely happy and hopeful. Three hours later I tried to connect to the server from home... but it was down again. :((( I expect to see the same 'kernel trap 12' message on the console tomorrow.
Any other ideas as to what direction I should dig in?
 
Just some thoughts:
  • The CPU also has a lot of contacts that might need some cleaning
  • Just because the power supply shows correct voltages on its outputs doesn't mean it's OK; the voltages may still drop under load.
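
For reference, the idle readings reported earlier in the thread (12.05V and 5.16V) can be compared against the ±5% tolerance that the ATX specification allows on the +12V and +5V rails. The sketch below is only an illustration: the helper name is made up, and the check says nothing about how the rails behave under load, which is the real concern here.

```python
# Assumption: +/-5% tolerance on the +12V and +5V rails, as in the ATX spec.
ATX_TOLERANCE = 0.05


def rail_within_tolerance(nominal, measured, tolerance=ATX_TOLERANCE):
    """Return True if `measured` is within +/- `tolerance` of `nominal`.

    Hypothetical helper written for this example only.
    """
    return abs(measured - nominal) <= nominal * tolerance


# Multimeter readings quoted in the thread (taken at idle, not under load).
readings = {12.0: 12.05, 5.0: 5.16}

for nominal, measured in readings.items():
    deviation = (measured - nominal) / nominal * 100
    status = "ok" if rail_within_tolerance(nominal, measured) else "out of range"
    print(f"{nominal:4.1f}V rail: {measured:.2f}V ({deviation:+.1f}%) {status}")
```

Both readings fall inside the window, so a static measurement like this cannot rule the power supply in or out; swapping in a known-good unit remains the more telling test.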
 
Today I came to the office early and managed to catch the panic screen before the server was restarted by somebody.
Here are some details. In fact, it is the whole panic message:
Code:
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x24
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc065ae9e
stack pointer = 0x10:0xcce03c2c
frame pointer = 0x10:0xcce03c50
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, def 32 1, gran 1
processor eflags = resume, IOPL=0
current process = 36 (swi8: tty:sio clock)
trap number = 12
panic: page fault
cpuid = 0;
As far as I can see, process 36 (swi8: tty:sio clock) is the one to blame, or at least one of the suspects.

Unfortunately, as I've recently learned, the kernel was compiled without debugging symbols (the option is commented out in the config file: #makeoptions DEBUG=-g). So I cannot trace the error any deeper.
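
Even without a debug kernel, the instruction pointer from the panic (0xc065ae9e) can usually be matched to a function name by comparing it against the kernel's symbol table, for example the output of `nm -n /boot/kernel/kernel`. Here is a minimal Python sketch of that lookup. The function names and the sample symbol table are made up for illustration and are not taken from this server's kernel.

```python
import bisect


def parse_nm(nm_output):
    """Parse `nm -n` style lines ("c065ae80 T name") into sorted (address, name) pairs."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which have no address column
        addr, _kind, name = parts
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue  # not a hexadecimal address, skip the line
    symbols.sort()
    return symbols


def enclosing_symbol(symbols, address):
    """Return (name, offset) of the last symbol at or below `address`, or None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, address - base


# Made-up symbol table; on the real machine this text would come from nm.
SAMPLE_NM_OUTPUT = """\
c065ad00 T example_func_a
c065ae80 T example_func_b
c065b000 T example_func_c
"""

symbols = parse_nm(SAMPLE_NM_OUTPUT)
name, offset = enclosing_symbol(symbols, 0xC065AE9E)
print(f"{name}+{offset:#x}")  # prints "example_func_b+0x1e"
```

If two panics with different "current process" values resolve to the same function, that points at one piece of code; if they land in unrelated places, flaky hardware becomes the more likely explanation.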

One of my friends suggested saving all the data from the mirrored volume and temporarily disabling vinum. He suspects that vinum is the cause of the problem. The thing is, replacing the broken HDD was the last change I made, and after that I had to rebuild the mirror. Perhaps something went wrong then. However, 'vinum list' says that everything is OK:
Code:
aqua# vinum list
2 drives:
D vinumdrive0           State: up       /dev/ad2s1f     A: 0/40960 MB (0%)
D vinumdrive1           State: up       /dev/ad0s1f     A: 0/40960 MB (0%)

1 volumes:
V raid0                 State: up       Plexes:       2 Size:         39 GB

2 plexes:
P raid0.p0            C State: up       Subdisks:     1 Size:         39 GB
P raid0.p1            C State: up       Subdisks:     1 Size:         39 GB

2 subdisks:
S raid0.p0.s0           State: up       D: vinumdrive0  Size:         39 GB
S raid0.p1.s0           State: up       D: vinumdrive1  Size:         39 GB
 
Today the system crashed on another process:
current process = 32 (irq22: ed0)
So, as far as I can see, the current process is a consequence rather than the cause.
 
mickey said:
Exactly what I was thinking... Likely hardware failure.

That is what I'm inclined to think too. However, Memtest86+ did not reveal any problems, as I've reported.
What if I try to replace the motherboard with another one with a different chipset, CPU and memory and connect the current HDDs to it? Will the system start up?
 
SergeyMas said:
That is what I'm inclined to think too. However, Memtest86+ did not reveal any problems, as I've reported.
What if I try to replace the motherboard with another one with a different chipset, CPU and memory and connect the current HDDs to it? Will the system start up?

I would say that depends upon the kernel being in use and with what options/devices it is configured.

But I wouldn't go that far for now. Like I said, another possibility is the CPU itself, which is subject to constant vibration, heat and airflow. I would take it out of its socket and clean the contacts on the CPU using a toothbrush and isopropanol. Also clean the socket itself, as far as that is possible. Let it all dry completely, then reseat the CPU firmly in its socket. Be sure to discharge yourself of any electrostatic charge by touching some grounded metal object before handling the CPU.

Another possibility is the power supply unit. If you have a spare one, you could swap it in, and see whether this makes a difference.

Also look out for any electrolytic capacitors on the mainboard, especially near the power regulation circuits, that look swollen. These could be defective and also be a cause of such problems. These are also to be found in the power supply unit, but you would probably have to open the enclosure in order to take a look inside.
 