Hi all,
I've got a server that has been rock-solid for several years and started crashing 3 days ago. It has crashed 3 times in those 3 days. Two of the crashes happened overnight, when very little would have been running; the third was in the early morning, shortly after a reboot, also at a time of little activity.
Twice the system was completely frozen: no error messages on the screen, nothing apparent in the logs, and no response to ping, keyboard, etc.
Once (the second crash), the screen was filled with errors like:
Code:
MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 7
MCA: CPU 7 UNCOR PCC AC channel 0 memory error
MCA: Misc 0x0
MCA: Bank 8, Status 0xba0000000008000b0
MCA: Global Cap 0x00000000000001c09, Status 0x0000000000000004
This message seems to be repeated for each of the CPUs (cpu0-cpu15).
Am I correct that this is an ECC error and that the DIMM in bank 8 is throwing errors? The computer has 8 x 2 GB memory sticks.
Other than removing what I believe is the offending DIMM, is there anything else I should do?
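For what it's worth, here is a rough Python sketch of how I understand the Status word to decode, going by the IA32_MCi_STATUS bit layout in the Intel manuals. The function name and the sample value are my own assumptions: the Status as I copied it above has 17 hex digits, so it can't be a 64-bit register verbatim, and the sample below is built only from its leading 0xba byte and trailing 0x00b0 code.

```python
# Memory-controller error codes have the form 0000 0000 1MMM CCCC,
# where MMM is the operation and CCCC is the channel (1111 = unspecified).
MEM_OPS = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mci_status(status: int) -> dict:
    """Decode the flag bits and MCA error code of a 64-bit MCi_STATUS value."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # bits 15:0 hold the MCA error code
    result = {
        "valid": bit(63),        # VAL
        "overflow": bit(62),     # OVER
        "uncorrected": bit(61),  # UC
        "enabled": bit(60),      # EN
        "miscv": bit(59),        # MISCV
        "addrv": bit(58),        # ADDRV
        "pcc": bit(57),          # PCC, processor context corrupt
        "mca_code": code,
    }
    if (code & 0xFF80) == 0x0080:  # matches the memory-controller pattern
        result["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


# Hypothetical sample: top byte and low 16 bits taken from the posted Status.
sample = (0xBA << 56) | 0x00B0
decoded = decode_mci_status(sample)
print(decoded)
```

If I have this right, the sample decodes as an uncorrected (UC) error with PCC set, operation AC (address/command) on channel 0, which matches the "UNCOR PCC AC channel 0 memory error" line. As far as I can tell, "Bank 8" is the number of the machine-check register bank that reported the error, not necessarily a DIMM slot, so I'm not sure it points at one particular stick.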
Here's the output of dmidecode:
Code:
# dmidecode 2.11
SMBIOS 2.6 present.
Handle 0x002B, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 96 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x002D, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM0
Bank Locator: BANK0
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer00
Serial Number: SerNum00
Asset Tag: AssetTagNum0
Part Number: ModulePartNumber00
Rank: Unknown
Handle 0x002F, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: DIMM1
Bank Locator: BANK1
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer01
Serial Number: SerNum01
Asset Tag: AssetTagNum1
Part Number: ModulePartNumber01
Rank: Unknown
Handle 0x0031, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM2
Bank Locator: BANK2
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer02
Serial Number: SerNum02
Asset Tag: AssetTagNum2
Part Number: ModulePartNumber02
Rank: Unknown
Handle 0x0033, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x002B
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM3
Bank Locator: BANK3
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer03
Serial Number: SerNum03
Asset Tag: AssetTagNum3
Part Number: ModulePartNumber03
Rank: Unknown
Handle 0x0035, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 96 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0037, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM0
Bank Locator: BANK0
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer00
Serial Number: SerNum00
Asset Tag: AssetTagNum0
Part Number: ModulePartNumber00
Rank: Unknown
Handle 0x0039, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: DIMM1
Bank Locator: BANK1
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer01
Serial Number: SerNum01
Asset Tag: AssetTagNum1
Part Number: ModulePartNumber01
Rank: Unknown
Handle 0x003B, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM2
Bank Locator: BANK2
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer02
Serial Number: SerNum02
Asset Tag: AssetTagNum2
Part Number: ModulePartNumber02
Rank: Unknown
Handle 0x003D, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM3
Bank Locator: BANK3
Type: Other
Type Detail: Other
Speed: 1333 MHz
Manufacturer: Manufacturer03
Serial Number: SerNum03
Asset Tag: AssetTagNum3
Part Number: ModulePartNumber03
Rank: Unknown
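To make the listing above easier to check, here is a small Python sketch that pulls the "Memory Device" (DMI type 17) records out of dmidecode-style text and counts the populated slots. The function name and the embedded sample text are my own, and it assumes the "Key: value" layout shown above; it is not a general dmidecode parser.

```python
def parse_memory_devices(text: str) -> list:
    """Return one dict of 'Key: value' fields per DMI type 17 record."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            # A new record starts; only track it if it is a Memory Device.
            if "DMI type 17" in line:
                current = {}
                devices.append(current)
            else:
                current = None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


# Shortened sample in the same layout as the output above.
SAMPLE = """\
Handle 0x002B, DMI type 16, 15 bytes
Physical Memory Array
Number Of Devices: 4
Handle 0x002D, DMI type 17, 28 bytes
Memory Device
Size: 2048 MB
Locator: DIMM0
Handle 0x002F, DMI type 17, 28 bytes
Memory Device
Size: No Module Installed
Locator: DIMM1
"""

devices = parse_memory_devices(SAMPLE)
# Populated slots report a size in MB; empty ones say "No Module Installed".
populated = [d for d in devices if d.get("Size", "").endswith(" MB")]
total_mb = sum(int(d["Size"].split()[0]) for d in populated)
print(len(devices), "slots,", len(populated), "populated,", total_mb, "MB")
```

The text can come from saving the output of `dmidecode -t memory` (run as root) to a file. For the listing as pasted above, that would be 6 populated entries out of 8.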