CPU: Local APIC error 0x80

tcn · Dec 26, 2010

Hi,

I've been wondering what this log message means:

Code:

CPU: Local APIC error 0x80

I keep getting this in the logs but don't know the cause. This system runs on a VIA Nano CPU.

Any thoughts?

Thanks,

tcn

ndotn · Jan 15, 2011

I am also seeing this error, on a just-purchased EPIA M840 (1.2 GHz VIA Nano).

I can reproduce the errors by loading the CPU, RAM, and I/O, e.g. with a buildkernel or buildworld with -j3.

I've now tried the board under 8.1 and 8.2-RC, both amd64 and i386. The errors seem to occur more frequently under amd64 and less frequently under i386.

I have also tried varying most BIOS options that seem conceivably related, but these EPIA machines don't have many options.

Naturally I can suppress the messages by disabling APIC on i386. Amd64 failed to boot without APIC. However, disabling APIC on such a recent machine would seem to have significant performance/efficiency consequences, so I hope the problem can be resolved otherwise.

I have checked, and VIA offers no newer BIOS for this machine.

I'm unsure if it's related, but this machine has also exhibited an unusual network problem. It has two vge PCIe interfaces. When the machine is acting as a PPPoE client with pf and NAT, client TCP sessions to certain servers will fail to establish, and time out, much like the behavior I would expect with path MTU discovery problems. With the affected services, the problem is 100% repeatable.

However, when I move the same disk to an Opteron laptop, changing *only* the interface names in rc.conf, pf.conf, and ppp.conf, the gateway setup works flawlessly.

I would be happy to run instrumented debug kernels, obtain vmcore dumps, etc. I recently gained a fair bit of experience collecting such data working on an unrelated issue.

ndotn · Jan 15, 2011

If the error is being reported correctly, then according to Intel's System Programming Guide error 0x80 would appear to indicate:

Illegal Register Address
Set when the local APIC is in xAPIC mode and software attempts to access a register that is reserved in the processor's local-APIC register-address space; see Table 10-1. (The local-APIC register-address space comprises the 4 KBytes at the physical address specified in the IA32_APIC_BASE MSR.) Used only on Intel Core, Intel Atom™, Pentium 4, Intel Xeon, and P6 family processors.

I'm not sure if this suggests the problem is a FreeBSD bug, a BIOS bug, or a matter of faulty hardware.

tcn · Jan 16, 2011

Hi ndotn,

I feel bad, I used to have the reflex to dig deeper; I guess I became lazy.

You gave me an idea as to why it could be happening. Most chipsets support SMP, but VIA does not have SMP-ready processors, so I think the VX800 does not support it (or does not support it well).

I had already thought about this and already disabled the SMP in the kernel but you made me think that there might be something else. The scheduler uses a hell of a lot of the APIC.

I just compiled a new kernel and am currently testing the 4BSD scheduler (the prior scheduler). I would also like to know if you experience weird stalls once in a while, mostly related to I/O. The system pauses for a few seconds and then resumes.

tcn
(Thanks for showing me how lazy I became, I have to work on this.)

tcn · Jan 16, 2011

I still got a few errors with the 4BSD scheduler; switching schedulers is not a solution, but it was worth investigating.

The next thing to find out is who accesses the APIC in this manner. This would require debugging the kernel...

tcn

ndotn · Jan 17, 2011

Regarding the occasional stalls, I haven't noticed that explicitly, but as yet my use of the machine has not been interactive. So I think a few seconds worth of I/O stall could easily have gone unnoticed.

What board are you using, out of curiosity?

We do need deeper insight into the kernel. The actual printf happens in lapic_handle_error() in sys/i386/i386/local_apic.c (on i386). The only place under sys where I can find a call to lapic_handle_error() is in sys/i386/i386/apic_vector.s (on i386).

I suppose one brute force possibility to get more information would be to set lapic_handle_error to panic so that a vmcore and backtrace could be obtained. I'm certainly willing to do that, this machine is currently not serving in a critical role. I'll post those results once I get them.

We'd probably still need help from a FreeBSD developer to interpret the results though.

ndotn · Jan 17, 2011

Well, that was disappointing. The backtrace didn't go very far back. Not sure if this is useful in diagnosing the problem:

Code:

CPU0: local APIC error 0x80
panic: local APIC error
cpuid = 0
KDB: stack backtrace:
#0 0xc067db07 at kdb_backtrace+0x47
#1 0xc064ebc7 at panic+0x117
#2 0xc083ba39 at lapic_handle_error+0x49
#3 0xc083432f at Xerrorint+0x1f
Uptime: 7m48s
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor write, page not present
instruction pointer     = 0x20:0xc084a3fe
stack pointer           = 0x28:0xe8a1bc6c
frame pointer           = 0x28:0xe8a1bc90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = resume, IOPL = 0
current process         = 23297 (cc1)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 7m48s

tcn · Jan 17, 2011

Hi ndotn,

I'm using the M840 as well. I think the problem is with the VX800. If it's not with SMP and task switch, then my other guess would be power saving.

Try to set a breakpoint with the debugger instead of generating a kernel panic. This way, you should be able to step through and finally get back into the calling function. It should not be very far, as the IRQ routine is short.

Once we get the register number, we can check what it is used for and verify that the VX800 implements it.

I don't know if I can do all this with the serial console.....

tcn

ndotn · Jan 17, 2011

I've learned to do some kernel data collection, but I have no experience with online kernel debugging, so please bear with me through my ignorance of this process.

We need to find out what APIC register was written before the APIC asserted error 0x80, do we not? If I understand correctly, we could only step forward through execution, and a backtrace would go no farther backward than it did in the panic. If we set a breakpoint in lapic_handle_error, we still need some way to recover the contents of the most recent register write attempt.

I was thinking that we could perhaps record some or all local APIC register writes using KTR. Then when the lapic_handle_error breakpoint fired, I could read out from KTR and hopefully obtain the most recent sequence of register writes.

Only thing with that is, I'm having difficulty finding where in the source I would insert the CTR macros to log the register writes.

tcn · Jan 18, 2011

The trick is to know what function caused this. The exception will return to the instruction right after the faulty one, so that is how we would learn which function is responsible. Causing an exception within an exception will lead to nothing unless we have multiple levels of backtrace.

Logging activity is not a bad idea either but because we don't know which function is called, we would have to log each and every function. Although, I think power management functions should be our initial targets.

My system is in use and I can't just take it down right now. I did compile debugging info back into the kernel. I am still unsure as to whether I can debug through the serial console; I think it is possible... I'll see if I can try this out tonight.

ndotn · Jan 18, 2011

It is possible to use the debugger through the serial console. That much I have done.

I think at minimum you'll need KDB, DDB, and BREAK_TO_DEBUGGER. You may also want DDB. In /etc/ddb.conf you can set a script to run when the breakpoint fires, and enable that with

Code:

ddb_enable="YES"

in /etc/rc.conf.

tcn · Jan 19, 2011

I've set up my machine and tried a bit of debugging. I'll have to wait for the weekend, as the error does not occur very often.

I intend to set a breakpoint in

Code:

lapic_handle_error

and then wait. Once the debugger traps the breakpoint, I'll trace out of the exception routine and see where we are.

ndotn · Jan 20, 2011

For what it's worth, I've found that

make -j3 buildkernel

reproduces the issue consistently on my M840.

tcn · Jan 21, 2011

Initial tracing shows this:

Code:

Tracing pid 41600 tid 100187 td 0xffffff003be5e3e0
kvprintf() at kvprintf+0xb58
vprintf() at vprintf+0x85
printf() at printf+0x67
lapic_handle_error() at lapic_handle_error+0x3f
Xerrorint() at Xerrorint+0x8a

Traced a bit more and obtained:

Code:

Tracing pid 41600 tid 100187 td 0xffffff003be5e3e0
msglogchar() at msglogchar+0x10c
vprintf() at vprintf+0x85
printf() at printf+0x67
lapic_handle_error() at lapic_handle_error+0x3f
Xerrorint() at Xerrorint+0x8a

Traced a bit, got back into the error routine and lost control of the system. The serial console became unresponsive; I had to reset.

I don't think I got out of the interrupt routine; I think I got stuck somehow. I'll have to try to clear the first breakpoint and set one at the routine's exit. What I don't like is that the interrupt routine seems to be encapsulated in another one (Xerrorint).

tcn