A8N-SLI motherboard, 9 years of socket 939

I've had an old socket 939 server running for years: ASUS A8N-SLI Premium MB, Opteron 185, ECC RAM. It's been a tank until just this past week. It looks like the CPU started throwing correctable errors, which then turned into uncorrectable errors (panic), and finally the system panics when trying to boot the kernel. I've never seen anything go bad like this, so I'm recording it here for posterity.

Timeline:
1/28 - reboot onto new 9.3p9 kernel
1/29 - first sign of trouble
Jan 29 06:39:37 host kernel: MCA: Bank 1, Status 0xd400000000000151
Jan 29 06:39:37 host kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Jan 29 06:39:37 host kernel: MCA: Vendor "AuthenticAMD", ID 0x20f32, APIC ID 0
Jan 29 06:39:37 host kernel: MCA: CPU 0 COR OVER ICACHE L1 IRD error
Jan 29 06:39:37 host kernel: MCA: Address 0xffff80cc0bb0
Jan 29 06:39:37 host kernel: MCA: Bank 1, Status 0x9400000000000151
Jan 29 06:39:37 host kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Jan 29 06:39:37 host kernel: MCA: Vendor "AuthenticAMD", ID 0x20f32, APIC ID 1
Jan 29 06:39:37 host kernel: MCA: CPU 1 COR ICACHE L1 IRD error
Jan 29 06:39:37 host kernel: MCA: Address 0xffff808f2d80
1/30 - the problem repeats: starting at 2:39, and repeating every hour until 13:39
1/30 - 15:19, the panics begin and the correctable errors cease
1/31 - 00:36:04, the last panic before the long darkness
This sure looks like a failed CPU. The boot loader usually spins, but the kernel panics at the end of device probes. Windows also fails to boot. Some simple DOS things might work for a few minutes but then lock up. I swapped the RAM out to no effect. I don't have another 939 to pop in at the moment, but I'll get a cheap one from eBay to test.
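
For anyone who wants to check the status words in the log by hand, here is a minimal Python sketch of a decoder. It assumes the standard x86 machine-check status layout (flag bits 57 to 63, and the "cache hierarchy" error code format 0000 0001 RRRR TTLL in the low 16 bits). The function name `decode_status` and the output wording are my own, chosen to resemble the kernel's message; this is not the kernel's actual code.

```python
# Sketch: decode an x86 machine-check bank status word (MCi_STATUS).
# decode_status() is a hypothetical helper, not a FreeBSD API.

# Architectural flag bits in the upper part of the status word.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # a previous error was overwritten before it was read
    61: "UC",     # error was not corrected
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Fields of the cache-hierarchy error code: 0000 0001 RRRR TTLL
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
CACHE_TYPE = {0: "ICACHE", 1: "DCACHE", 2: "GCACHE"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status):
    """Return (list of set flag names, one-line summary)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]

    summary = "UNCOR" if (status >> 61) & 1 else "COR"
    if (status >> 62) & 1:
        summary += " OVER"

    code = status & 0xFFFF
    if (code & 0xFF00) == 0x0100:
        # Cache hierarchy error: split out request, cache type, level.
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        summary += " {} {} {} error".format(
            CACHE_TYPE.get(tt, "?"), LEVEL[ll], REQUEST.get(rrrr, "?"))
    else:
        summary += " (error code 0x{:04x} not handled by this sketch)".format(code)
    return flags, summary


if __name__ == "__main__":
    for word in (0xd400000000000151, 0x9400000000000151):
        print(hex(word), *decode_status(word))
```

Both status words from the log decode to a correctable L1 instruction-cache read error; the only difference is that the CPU 0 entry has the overflow bit set, meaning more than one error landed in that bank before it was read.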

Anyone ever seen this sort of thing before? Why is there specifically an hour between errors!?

Things I wish I had:
  • longer baseline of /var/log/messages to check for older errors
  • baseline of cpu temperature readings
  • my old backup 939 CPU that I sold on eBay JUST LAST MONTH grrrr
 
Is it possible to disable the Lx caches in the BIOS nowadays? Most of the transistors go to the cache implementation, so disabling it could work around the issue. Or you can try reducing the clock speed.
 
I had an Opteron 285 system until last year, and I had those same errors from a bad memory slot on the mobo. Had that system 7, almost 8 years... two different mobos.
 
Good call on the caps. At least 6 are bulging and one has an obvious leak. This makes much more sense than a CPU suddenly going belly up.

Maybe I'll bug my buddy with a decent soldering rig and at least pull the caps to see how bad they got before things stopped working. Uncertain if I really need a new soldering hobby though.
 
Check for degraded components, especially capacitors.

Look for grease between connector pins and pads; you can wipe it off with some alcohol.

Check memory slots.

And PLEASE remember: those Nvidia chipsets are prone to desoldering because of bad thermal design, and some of them also have serious signal problems. Try to "reflow" or reball that chipset (or just get a spare motherboard, preferably a VIA or Radeon Xpress/AMD one).
 
The one hour is just the default MCA polling interval. You can change it with the sysctl
hw.mca.interval (FreeBSD 8.4 and later).
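
To show why polling produces that pattern, here is a small Python sketch. It assumes the interval is 3600 seconds (the one-hour default mentioned above) and that correctable errors are only logged when the poll runs, so the timestamps mark poll times rather than the moments the errors happened. The helper `poll_times` and the year in the dates are placeholders of mine.

```python
from datetime import datetime, timedelta


def poll_times(first, last, interval_seconds=3600):
    """List the times a periodic poll fires between first and last, inclusive."""
    step = timedelta(seconds=interval_seconds)
    times = []
    current = first
    while current <= last:
        times.append(current)
        current += step
    return times


if __name__ == "__main__":
    # Year is a placeholder; only the day and clock times come from the timeline.
    first = datetime(2015, 1, 30, 2, 39)
    last = datetime(2015, 1, 30, 13, 39)
    for t in poll_times(first, last):
        print(t.strftime("%m/%d %H:%M"))
```

With a one-hour interval this gives twelve log entries between 2:39 and 13:39, no matter how many errors actually occurred in between. A shorter interval would only report them sooner and more often.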

If changing the capacitors does not help, then my advice is:
1. Check the CPU core temperature. Sometimes the thermally conductive compound inside the CPU package breaks down and the CPU overheats, regardless of how good the cooler is.

2. These CPUs also have an integrated memory controller, so any memory problem (even one that only shows up in certain memory slots) may really be a CPU problem.
 
Back when that board was made, there was a major debacle with electrolytic caps. We had several devices that failed, all with the telltale bulging. According to the "official" Internet rumor mill, one of the Japanese capacitor makers had its formula ripped off by a Chinese maker who was selling their product in mass quantities to the various contract manufacturers. Problem was, the formula as obtained by the Chinese company contained an error of some kind that resulted in early failures. I don't remember if the error was deliberate or what, but it certainly hurt a lot of businesses and individuals in the long run.
 
Replacing motherboard capacitors can be fairly difficult. The boards have several layers that remove heat, and it's easy to damage the pads with too much heat. Then, the replacement capacitors must be low ESR, and often the tall, skinny form factor is hard to obtain. Don't buy them from eBay, there are many fakes. Generally, motherboards that old are not worth repairing. If you really insist and are willing to spend the money and time, practice on old, dead motherboards first.

Usually it is better to spend the money on a new motherboard and processor. It will go faster and use less power.
 
SOYO, FIC and some Gigabyte motherboards from around the Pentium III/4 era suffered from those bad capacitors.
 
Agreed. There are only 2 general cases where I'll replace caps:
  • On Cisco equipment*. Catalyst 3750s have this problem a lot (older hardware revs; the newer ones went to a completely redesigned voltage regulator system).
  • If someone comes to me and says "I need to have this exact system repaired because it uses software X which is tied to this hardware, and I'll pay you $$$ to do it".
* Of course, I'll do lots of crazy stuff with Cisco equipment anyway - like replacing soldered-on RAM on Catalyst 4948-10GE switches.
 