Incurable, Inconsistent Kernel Panics?

Hi all. I've recently switched to FreeBSD after a long hiatus spent with Windows and Ubuntu. I've always felt right at home on FreeBSD, so it feels good to be back :)

Unfortunately, I'm having huge intermittent problems on boot.

The problem is, the system boots, I see the copyright lines, CPU identification, real memory, avail memory, "ACPI APIC Table: <IntelR AWRDACPI>" -- there's a long pause, and then page after page of what I think are kernel panics (they go by too quickly to read and don't end up logged anywhere, since the system doesn't boot).

Sometimes the system boots fine. Sometimes booting with ACPI disabled (which doesn't actually finish booting; for some reason it hangs on timecounters) and then rebooting normally fixes it. Sometimes it can be mitigated by dropping to the loader prompt and typing "disable-module linux". And sometimes it happens even without the linux module being loaded (i.e. it happened, irrecoverably, on a *fresh* 8.1 install).

I'm at my wits' end here. It's tough dealing with a system that only has a 50% chance of booting without manual intervention, and a 10-20% chance of not being able to boot *at all*.

Do you have any suggestions as to what could be causing this? I really don't want to go back to Linux.
 
Try to verify that there isn't something wrong with your hardware first (no, running another operating system isn't necessarily a good verification). I suggest running memtest86+ for many hours (preferably over a day or night, if possible) - this will help you identify if there is something wrong with the memory in your machine.

A while ago I had to replace the PSU of a relatively new machine (a bit over a year old), because it started to randomly reboot. Before I did that, I had tried several operating systems (Xubuntu, OpenBSD, and several versions of FreeBSD), changed the memory modules in the machine and a few other things.

Also, I see that you mention FreeBSD 8.1 - well, 8.2 is out, try that one and see if it works better.
 
The brand and model of system just possibly could be important. Server hardware sometimes has long delays in booting. Toshiba notebook ACPI hates FreeBSD, or some do. Amount of memory, age of machine, which version of FreeBSD (i386, amd64), whether other operating systems run on the same hardware, any of those details could help. Bits are cheap, use lots.
 
I am running 8.2, amd64, on an Abit IP35 'Pro' motherboard with a Core 2 Duo E6800 chip and 8 GB of RAM. I guess the machine is maybe 2 years old or so? No other operating systems are installed at the moment, though until this month I ran Windows 7 and Ubuntu 10.10 on this system.

I thought I'd isolated the problem to a downloaded Nvidia module (as opposed to the one from ports) -- but that's not the case, using the ports module the problem returned (plus there was the fresh install of 8.1 that had the same problem without loading any specific modules).

This is how it happened yesterday evening. I reboot (I'm only rebooting so much to make sure this problem is fixed, basically), with no changes to modules since the last boot. The boot process pauses for a second on the "ACPI APIC Table: <IntelR AWRDACPI>" line, then spams the error.

I reboot again, same thing. Reboot again, drop to loader prompt, lsmod tells me: kernel, nvidia, snd_ich and sound (*no* linux). Just striking out wildly I type 'disable-module linux' and 'boot' and the system boots normally. Right, kind of weird. I think the 'disable-module linux' is probably a red herring.

Is there some way I can trap what this error message actually is? It's frustrating not being able to read what the problem is since it scrolls by so fast in a blur and doesn't end up logged anywhere. If only I had a turbo button!! Thanks for all your suggestions.
 
Press Scroll Lock and see if you can scroll back with the arrow keys and Page Up/Page Down.
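
If Scroll Lock doesn't help, a serial console is another way to capture the messages, since the output can be logged on a second machine. This is only a minimal sketch. It assumes the board has a serial port (or header) connected to another computer with a null-modem cable, and the speed is just an example that must match on both ends:

/boot/loader.conf:
Code:
# assumption: a serial port is present and wired to a second machine
boot_multicons="YES"
boot_serial="YES"
# example speed; the terminal program on the other end must use the same value
comconsole_speed="115200"
console="comconsole,vidconsole"

On the second machine, a terminal program that logs to a file should then record whatever scrolls past. On another FreeBSD box that could be something like cu -l /dev/cuau0 -s 115200, where the device name is an assumption about that machine.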

Check the inside of the machine, making sure all the fans are spinning, nothing is blocking the vents, and the motherboard power connectors are firmly connected (including that 8-pin CPU power connector back by the keyboard connector). Make sure you have any overclocking disabled.

Besides running memtest as tingo suggests, you could remove 4G or 6G of RAM. If it runs reliably, then you know either there's bad RAM or a power supply problem. Swap the RAM to test each batch. If they all work, put it all back in. If the problem comes back, probably the power supply is inadequate or failing.
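
Before pulling modules, a rough software-only version of this test is to cap the memory the kernel uses with the hw.physmem loader tunable. Treat this as a sketch: the 4G value is only an example, the modules all stay powered, and it doesn't tell you which module is actually being used, so it can't replace physically swapping the RAM.

/boot/loader.conf:
Code:
# example cap, not a recommendation; remove this line after testing
hw.physmem="4G"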
 
I did try scroll lock, with no luck. I haven't tried memtest yet.

Everything checks out visually inside, and no other OS, LiveCD or installed, has had trouble booting this whole time -- not that I'm ruling out bad hardware, but I'm skeptical that's it.
 
Well, so far, still no luck. I built a custom kernel, but that didn't fix the issue. It does appear that dropping to the loader prompt (whether or not I do anything other than boot straight away) helps a bit -- it doesn't 100% fix the problem, but it seems to raise my chances of a successful boot.

I did have the bright idea to snap a shot of my screen with my digital camera set to a high shutter speed -- the pictures are pretty bad, but at least a bit readable:

http://i.imgur.com/JsS1s.jpg

Code:
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x18
stack pointer = 0x28:0xffffffff80ee6770
frame pointer = 0x28:0xffffffff80ee67b0
= DPL 0, pres 1, long 1, def32 0, gran 1
trap number = 12

cpuid = 0; apic id = 00
fault virtual address = 0x18

http://i.imgur.com/LWdEG.jpg

cpuid = 0; apic id = 00
fault code = supervisor read data, page not present
stack pointer = 0x28:0xffffffff8100cc50
code segment = base 0x0, limit 0xfffff, type 0x1b
processor eflags = resume, IOPL = 0
trap number = 12
cpuid = 0

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x18
stack pointer = 0x28:0xffffffff805b3918
frame pointer = 0x28:0xffffffff8100c8f0
= DPL 0, pres 1, long 1, def32 0, gran 1
trap number = 12

http://i.imgur.com/HfgsL.jpg

(more of the same)

http://i.imgur.com/wMvg9.jpg

(more of the same)

It *sort of* sounds like the same problem described here: http://www.freebsd.org/cgi/query-pr.cgi?pr=140979 -- but who knows, maybe it's just an unrelated page fault. I'm going to try booting with:

/boot/loader.conf:
Code:
debug.acpi.disabled="ec"
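
For a one-off test, the same tunable can presumably be set from the loader prompt instead, which avoids editing a file on a system that may not boot. This is a sketch, assuming the loader's set command accepts the tunable in the same form as loader.conf:

Code:
set debug.acpi.disabled="ec"
boot

A value set this way only applies to that one boot, so it has to go into /boot/loader.conf if it turns out to help.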
 
I believe this is solved.

I updated my BIOS which made things worse (unbootably worse). Switched BIOS settings to 'failsafe defaults' which booted past the problem area (but didn't work, since it had everything including disk controllers turned off at a BIOS level). Started switching features back on in the BIOS, and randomly found a point where all the hardware I needed was re-enabled, and the system boots with no problems regularly.

At some point, when I'm less giddy that I don't have to go back to Ubuntu, I will try to determine which specific BIOS setting trips the error.

Thanks for the suggestions and help, all!
 