Hi,
I recently added a ZFS disk array to my old HP 585 G1 server.
Kernel panics started immediately, and I have spent quite a bit of time
figuring out what was really wrong.
The system has 4 CPU cards with dual-core Opteron processors. Each
card has 4 x 2 gigabytes of memory, so 4 cards x 4 DIMMs x 2 gigabytes =
32 gigabytes of total system memory. The memory is DDR400 ECC.
The panic was very easily reproducible. I just had to issue enough reads
to the system until the faulty memory was accessed.
Strangely, I can run memtest86+ with the DDR setting on and it finds no
errors whatsoever.
Adding the line
hint.lapic.2.disabled=1
to /boot/loader.conf immediately mitigates the error on FreeBSD. So here
is my conclusion:
If the system can be made stable by disabling one core on one CPU card:
1) The other cards and their memory must be OK.
2) The mainboard must be OK, since the other core on that CPU is still
running without panics.
3) The CPU core with APIC ID 2 is probably also OK; it is on the same
chip as a core that is not disabled.
4) It is likely down to a bad DIMM.
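For reference, a minimal sketch of how the loader hint mentioned above
could be appended without overwriting the rest of the file. The function
name add_loader_hint is hypothetical (not a FreeBSD tool), and the
default path is only an assumption taken from this post; the second
argument lets you try it on a scratch file first.

```shell
#!/bin/sh
# Append a hint line to a loader.conf-style file unless it is already there.
# add_loader_hint is a hypothetical helper name used for illustration only.
add_loader_hint() {
    hint="$1"
    conf="${2:-/boot/loader.conf}"

    # -q: quiet, -x: match the whole line, -F: fixed string (no regex).
    if [ -f "$conf" ] && grep -qxF -- "$hint" "$conf"; then
        return 0    # already present, nothing to do
    fi

    # ">>" appends; a single ">" would truncate the file and drop
    # every other setting in it.
    printf '%s\n' "$hint" >> "$conf"
}

# Example (as root): add_loader_hint 'hint.lapic.2.disabled=1'
```

Running it twice leaves a single copy of the line, so it is safe to
repeat. The hint only takes effect after a reboot.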
Instead of mindlessly trying to find the culprit by swapping DIMMs, I
would really like to identify the CPU, card and memory module from the OS.
Info here:
http://pastebin.com/jqufNKck
Thank you for your time and help.
--
Med venlig hilsen / with regards
Nikolaj Hansen