I've had an old Socket 939 server running for years: ASUS A8N-SLI Premium motherboard, Opteron 185, ECC RAM. It's been a tank until just this past week. It looks like the CPU started throwing correctable errors, those turned into uncorrectable errors (panics), and now the system panics whenever it tries to boot the kernel. I've never seen anything go bad like this, so I'm recording it here for posterity.
This sure looks like a failed CPU. The boot loader usually spins, but the kernel panics at the end of device probes. Windows also fails to boot. Simple DOS programs may run for a few minutes, but then they lock up. I swapped the RAM out to no effect. I don't have another 939 CPU to pop in at the moment, but I'll get a cheap one from eBay to test.
Timeline:
	1/28 - reboot onto new 9.3p9 kernel
1/29 - first sign of trouble
Jan 29 06:39:37 host kernel: MCA: Bank 1, Status 0xd400000000000151
Jan 29 06:39:37 host kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Jan 29 06:39:37 host kernel: MCA: Vendor "AuthenticAMD", ID 0x20f32, APIC ID 0
Jan 29 06:39:37 host kernel: MCA: CPU 0 COR OVER ICACHE L1 IRD error
Jan 29 06:39:37 host kernel: MCA: Address 0xffff80cc0bb0
Jan 29 06:39:37 host kernel: MCA: Bank 1, Status 0x9400000000000151
Jan 29 06:39:37 host kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Jan 29 06:39:37 host kernel: MCA: Vendor "AuthenticAMD", ID 0x20f32, APIC ID 1
Jan 29 06:39:37 host kernel: MCA: CPU 1 COR ICACHE L1 IRD error
Jan 29 06:39:37 host kernel: MCA: Address 0xffff808f2d80
1/30 - the problem repeats, starting at 02:39 and recurring every hour until 13:39
1/30 - 15:19, the panics begin and the correctable errors cease
1/31 - 00:36:04, the last panic before the long darkness
Anyone ever seen this sort of thing before? Why is there specifically an hour between errors!?
Things I wish I had:
- longer baseline of /var/log/messages to check for older errors
- baseline of cpu temperature readings
- my old backup 939 CPU that I sold on eBay JUST LAST MONTH grrrr
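For anyone trying to read the MCA lines in the timeline: the Status words can be picked apart by hand. Below is a rough Python sketch of how I understand the decoding, assuming the usual machine-check status layout (flag bits 63 down to 57, and a "memory hierarchy" error code of the form 0000 0001 RRRR TTLL in the low 16 bits). The function name and tables are mine, not anything from the kernel, and I have only checked it against the two status values logged above.

```python
# Hypothetical helper: decodes an MCi_STATUS value such as the ones in the log.
# Assumption: standard flag positions, which should also hold on this K8 part.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings for a memory hierarchy error code (0000 0001 RRRR TTLL).
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
KIND = ["I", "D", "G"]            # TT: instruction, data, generic
LEVEL = ["L0", "L1", "L2", "LG"]  # LL: cache level


def decode_status(status):
    """Return (list of set flag names, error description or None)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]

    code = status & 0xFFFF        # low 16 bits hold the MCA error code
    desc = None
    if (code >> 8) == 0x01:       # memory hierarchy error
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        if rrrr < len(REQUEST) and tt < len(KIND):
            desc = "%sCACHE %s %s" % (KIND[tt], LEVEL[ll], REQUEST[rrrr])
    return flags, desc


if __name__ == "__main__":
    for value in (0xd400000000000151, 0x9400000000000151):
        flags, desc = decode_status(value)
        # UC clear means the error was corrected ("COR" in the kernel log).
        print("0x%016x: %s, %s" % (value, " ".join(flags), desc))
```

Both values come out with UC clear (corrected) and an L1 instruction-cache fetch error, and the first one additionally has OVER set, which lines up with the "COR OVER ICACHE L1 IRD" and "COR ICACHE L1 IRD" lines the kernel printed.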