FreeBSD intermittent crash!

No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless; it just delays the boot process for a while during the retries.
I saw that, but I didn't think 10.1-RELENG had that problem. Now I realize it was introduced sometime after 8.4. Sorry for the misdirection.
 
Hi again, I'm back after a long time. So yes, we've moved to new Dell R510 hardware now. Here are the specs:

DELL R510
2 x L5520
64GB RAM
12 x 3TB, RAID striping + mirroring (HBA LSI-9211, firmware version 19.00)
FreeBSD cw009.tunefiles.com 10.2-RELEASE-p14 FreeBSD 10.2-RELEASE-p14 #0: Wed Mar 16 20:46:12 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

After 9 days of uptime, the server crashed again with the following error in the crash log:

http://pastebin.com/baShWuMP

I am really depressed now; there's a lot of pressure on me from my company :( . Please help us resolve this crash issue.
 
SirDice, we really need help now. We've discarded the Supermicro servers and moved to a Dell R510, but the errors look the same: 'Internal Timer error'. I am so depressed and have no idea where to go from here. :(
 
The errors look like hardware errors. Apparently you're not that lucky when it comes to hardware.

As a side note, one of my clients has around 25 SuperMicro servers (old, new, various sizes) and they all run FreeBSD (mostly 9.3, some 10.1) without any issues.
 
Could a machine check exception be caused by software or an application? Maybe we're troubleshooting the wrong side and the culprit is our application, not the hardware? The servers run the following programs:

- NGINX + PHP-FPM (uploading videos and streaming them to end users)
- FFMPEG (encoding the uploaded videos to make them ready for streaming)
 
Very depressing thread; it's not hard to feel sympathy for the OP. But it's hard to believe that they got faulty hardware back to back. I can't buy this.

Shahzaib, what's the response from Dell on this though?
Also, instead of Debian 8, I'd suggest CentOS 7.1/7.2 for a test on one of the servers.

Regards.
 
When you said you moved to the new HW, does that mean brand new? Meaning no parts were moved from server to server (RAM, for example)?
Also, who installed those servers? Was proper procedure followed?

I can also imagine how stressful it has to be to deal with HW behaving like this. It seems way too unlikely to get so much faulty HW.
It might be worth moving those parts around in order to single out the faulty item. Remove all but one CPU and keep it running with the lowest number of memory modules possible. And do real stress tests: put quite a load on those servers.
 
Could a machine check exception be caused by software or an application? Maybe we're troubleshooting the wrong side and the culprit is our application, not the hardware?

Since you said you are now running on 100% new hardware, and that the new server has none of the hardware from the old server, that makes me suspicious that somehow a software issue appears to be a hardware issue. So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway. :) See if it still crashes.
 
Since you said you are now running on 100% new hardware, and that the new server has none of the hardware from the old server, that makes me suspicious that somehow a software issue appears to be a hardware issue. So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway. :) See if it still crashes.
This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...
 
Have you checked my last post? With HT disabled, the error changed:

Here is the crash dump:

http://prntscr.com/b1mgj3
This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...

I experienced the same (M$ background here). The BSoD I kept getting was PAGE_FAULT_IN_NON_PAGED_AREA. I replaced all the DIMMs and it stopped blue screening on me. Maybe it's the same in this case.

*EDIT* I re-read what you said. It could be the application causing a memory leak, but I would rather blame the hardware, as that resolves the issue a lot quicker than asking devs to look into it.
 
Hi,

After disabling HT, things have improved. Now it crashes after about 1 month instead of every week, but the crash is not fixed. The log of the most recent crash is attached. Please take a look if you can help diagnose the problem.

Thanks for following up.
 

Attachment: core.txt.txt (1.3 MB)
Code:
Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 16
MCA: CPU 9 UNCOR PCC internal timer error
MCA: Address 0x8018cfa2c
MCA: Misc 0x0
panic: Unrecoverable machine check exception
This is a hardware error.
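
For anyone who wants to read the Status value from the panic by hand, here is a minimal Python sketch that splits it into the architectural fields of an IA32_MCi_STATUS register, using the bit layout from the Intel SDM. The function name `decode_mci_status`, the `STATUS_FLAGS` table, and the dictionary keys are made up for this illustration and are not part of FreeBSD or any library.

```python
# Sketch only: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM machine-check chapter; the names
# decode_mci_status and STATUS_FLAGS are hypothetical helpers.

# Flag bits in the top byte of the status register.
STATUS_FLAGS = [
    (63, "VAL"),    # register holds valid error information
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register contents are valid
    (58, "ADDRV"),  # ADDR register contents are valid
    (57, "PCC"),    # processor context corrupt
]


def decode_mci_status(status):
    """Return the set flags and the two 16-bit error code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Bank 5 status value copied from the kernel message buffer above.
    info = decode_mci_status(0xBE00000000800400)
    print(info["flags"])
    print(hex(info["mca_error_code"]), hex(info["model_specific_code"]))
```

For the value in this crash, the UC and PCC flags are set and the low 16 bits come out as 0x0400, which matches the "UNCOR PCC internal timer error" line the kernel printed.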
 