FreeBSD intermittent crash!

leebrown66 · Feb 21, 2016

Terry_Kennedy said:
No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.

I saw that, but I didn't think 10.1-RELENG had that problem. Now I realize it was introduced sometime after 8.4. Sorry for the misdirection.

shahzaib · Feb 21, 2016

Terry_Kennedy said:
No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.

Hi, so you're saying that shouldn't be the cause of the crash?

Terry_Kennedy · Feb 22, 2016

shahzaib said:
Hi, so you're saying that shouldn't be the cause of the crash?

Correct. It is a harmless (other than the delay) set of messages during boot, and also when a rescan of the bus is requested.

shahzaib · Apr 18, 2016

Hi again, I'm back after a long time. So yes, we've moved to new Dell R510 hardware now. Here are the specs:

DELL R510
2 x L5520
64GB RAM
12 x 3TB RAID striping + mirroring (HBA LSI-9211, fw version 19.00)
FreeBSD cw009.tunefiles.com 10.2-RELEASE-p14 FreeBSD 10.2-RELEASE-p14 #0: Wed Mar 16 20:46:12 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

After 9 days of uptime, the server crashed again with the following error in the crash log:

http://pastebin.com/baShWuMP

I am so depressed now; there's a lot of pressure on me from my company. Please help us resolve this crash issue.

shahzaib · Apr 19, 2016

Is FreeBSD 10.2 a stable release? Do I need to upgrade to 10.3 using the freebsd-update utility?

SirDice · Apr 19, 2016

FreeBSD 10.2 is supported until the end of this year. So yes, it's a stable release.

https://www.freebsd.org/security/security.html#sup

shahzaib · Apr 19, 2016

SirDice, we really need help now. We've discarded the Supermicro and moved to a Dell R510, but the errors look to be the same: 'Internal Timer error'. I am so depressed and have no idea where to go from here.

SirDice · Apr 19, 2016

The errors look like hardware errors. Apparently you're not that lucky when it comes to hardware.

As a side note, one of my clients has around 25 SuperMicro servers (old, new, various sizes) and they all run FreeBSD (mostly 9.3, some 10.1) without any issues.

shahzaib · Apr 19, 2016

SirDice, please don't demoralize me more, I am already very depressed. I don't know what to do. Someone on another FreeBSD mailing list suggested updating the CPU microcode, but it looks like Intel doesn't provide a microcode file for FreeBSD:

- Install microcode updates and hope that fixes it

Intel offers a microcode update for many CPUs.
https://downloadcenter.intel.com/download/25512/Linux-Processor-Microcode-Data-File?v=t

shahzaib · Apr 19, 2016

Could a machine check exception be caused by software or an application? Maybe we're troubleshooting in the wrong direction and the culprit is our application, not the hardware? The servers run the following programs:

- NGINX + PHP-FPM (uploading videos and streaming them to end users)
- FFmpeg (encoding the uploaded videos to make them ready for streaming)

CurlyTheStooge · Apr 19, 2016

Very depressing thread, and it's not hard to feel sympathy for the OP. But it's hard to believe that they got faulty hardware back to back; I can't buy this.

Shahzaib, what's the response from Dell on this though?
Also, instead of Debian 8, I'd suggest CentOS 7.1/7.2 for a test on one of the servers.

Regards.

shahzaib · Apr 20, 2016

CurlyTheStooge, thanks for showing sympathy. Well, that's all Dell has to say for now:

http://en.community.dell.com/support-forums/servers/f/956/p/19682411/20901711#20901711

_martin · Apr 20, 2016

When you said you moved to the new HW, does that mean brand new? Meaning no parts moved from server to server (e.g. RAM)?
Also, who installed those servers? Was the proper procedure followed?

I can imagine how stressful it must be to deal with HW behaving like this. It seems far too unlikely to get so much faulty HW.
It might be worth swapping parts around in order to single out the faulty item. Remove all but one CPU and keep the server running with the lowest possible number of memory modules. And do real stress tests: put quite a load on those servers.

CurlyTheStooge · Apr 20, 2016

shahzaib said:
CurlyTheStooge, thanks for showing sympathy. Well, that's all Dell has to say for now:

http://en.community.dell.com/support-forums/servers/f/956/p/19682411/20901711#20901711

I can understand your situation, been there.
Well, Dell has also stated that they didn't validate Debian on these servers. It might be good to try CentOS 7.1 / 7.2, I think?

Cheers from neighborhood.
Regards.

shahzaib · May 8, 2016

Well, after disabling logical cores on the servers, the situation got much more stable. However, there was a recent crash of FreeBSD 10.2 on the Dell with a different error, panic: page fault. The following guide suggests grabbing the value of the "instruction pointer", but the value was not found, even after omitting digits:

https://www.freebsd.org/doc/faq/advanced.html

Here is the crash dump :

http://prntscr.com/b1mgj3

RedShift1 · May 8, 2016

matoatlantis said:
When you said you moved to the new HW, does that mean brand new? Meaning no parts moved from server to server (e.g. RAM)?

shahzaib · May 9, 2016

RedShift1, exactly!! No parts were moved.

PacketMan · May 9, 2016

shahzaib said:
Could a machine check exception be caused by software or an application? Maybe we're troubleshooting in the wrong direction and the culprit is our application, not the hardware?

Since you said you are now running on 100% new hardware, that the new server has none of the hardware from the old server then that makes me suspicious that somehow a software issue appears to be a hardware issue. So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway.

See if it still crashes.

shahzaib · May 9, 2016

Have you checked my last post? With HT disabled, the error changed:

Here is the crash dump :

http://prntscr.com/b1mgj3

sizigee · May 9, 2016

shahzaib said:
Have you checked my last post? With HT disabled, the error changed:

Here is the crash dump :

http://prntscr.com/b1mgj3

Panic: page fault... hmmm... try removing some DIMMs and work your way up. It might be a faulty memory module. But I could be completely wrong too...

I read the whole thread and I feel for you.

RedShift1 · May 10, 2016

PacketMan said:
Since you said you are now running on 100% new hardware, that the new server has none of the hardware from the old server then that makes me suspicious that somehow a software issue appears to be a hardware issue. So stop using the application software you were using on one of the other servers, install something else on it, and let it run. Put a game server port on it, you need to de-stress a bit anyway. See if it still crashes.

This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...

sizigee · May 10, 2016

shahzaib said:
Have you checked my last post? With HT disabled, the error changed:

Here is the crash dump :

http://prntscr.com/b1mgj3

RedShift1 said:
This might have some merit to it. A long time ago I had to manage a Windows application on a bunch of servers and the application itself had to run as Administrator. The reason for that is because it did some funky memory stuff and it made Windows crash every now and then, regardless of what hardware it was running on. So maybe the business application shahzaib is running might be doing the same...

I experienced the same (M$ background here). The BSoD I kept getting was PAGE_FAULT_IN_NON_PAGED_AREA. I replaced all the DIMMs and it stopped blue screening on me. Maybe it's the same in this case.

*EDIT* I re-read what you said. It could be the application causing a memory leak, but I would rather blame the hardware, as that resolves the issue a lot quicker than asking devs to look into it.

sizigee · Jun 23, 2016

shahzaib, how are things going? Have you managed to resolve the issue?

shahzaib · Jun 23, 2016

Hi,

After disabling HT, things have improved. It now crashes after about a month instead of every week, but the crash is not fixed. The log from the most recent crash is attached. Please have a look if you can help diagnose the problem.

Thanks for following up.

SirDice · Jun 23, 2016

Code:

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 16
MCA: CPU 9 UNCOR PCC internal timer error
MCA: Address 0x8018cfa2c
MCA: Misc 0x0
panic: Unrecoverable machine check exception

This is a hardware error.
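
For anyone who wants to see where "UNCOR PCC internal timer error" comes from, here is a rough sketch that picks apart the Status value from the log above. It assumes the architectural IA32_MCi_STATUS bit layout from Intel's manuals; the function name and the dictionary keys are made up for illustration and are not part of any FreeBSD tool.

```python
# Flag bit positions, assuming the architectural IA32_MCi_STATUS layout.
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # the Misc register holds valid data
    "ADDRV": 58,  # the Address register holds valid data
    "PCC": 57,    # processor context may be corrupted
}

# Compound MCA error code for "internal timer error" (0000 0100 0000 0000).
INTERNAL_TIMER_ERROR = 0x0400


def decode_mci_status(status):
    """Split a 64-bit machine check status value into its main fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF            # bits 15:0, architectural error code
    model_code = (status >> 16) & 0xFFFF  # bits 31:16, model-specific code
    return {
        "flags": flags,
        "mca_error_code": mca_code,
        "model_specific_code": model_code,
        "internal_timer_error": mca_code == INTERNAL_TIMER_ERROR,
    }


if __name__ == "__main__":
    # Status value copied from the "MCA: Bank 5" line in the log above.
    result = decode_mci_status(0xBE00000000800400)
    print(result["flags"])
    print(hex(result["mca_error_code"]), result["internal_timer_error"])
```

The UC and PCC flags are what the kernel prints as "UNCOR PCC", and with PCC set the processor state cannot be trusted, which would explain why the kernel panics rather than trying to continue.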
