FreeBSD intermittent crash!

shahzaib · Dec 26, 2015

We've again encountered a panic on another Supermicro X5690 server. It crashed and rebooted automatically:

Code:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 17 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 25 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 16 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 24 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 17 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 25 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 16 BANK 5
MISC 0 ADDR 802bf6c60
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 24 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44

We think it may be related to a power supply issue, because the X5690 is a heavy CPU that requires more power when encoding videos with ffmpeg, but we're not sure. Is there any way to find the specific hardware component that is causing the panic?

CPU 17 BANK 5 -- what does it mean?

shahzaib · Dec 31, 2015

I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:

-----------------------------------
Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead….work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
-----------------------------------

According to the vendor, the issue could be software-related. What should we do now? The servers are crashing a lot with the same crash dump. Today we saw slightly different behaviour: the server crashed but no crash dump was created. We're really stuck on this.

One more point worth mentioning: we installed these servers by restoring a FreeBSD backup image onto them with Clonezilla.

Now I am going to disable MCA permanently. What would be the disadvantages of disabling it?

hw.mca.enabled="0"

Terri_Kennedy · Dec 31, 2015

shahzaib said:
We've again encountered a panic on another Supermicro X5690 server. It crashed and rebooted automatically:

Code:

HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor

We think it may be related to a power supply issue, because the X5690 is a heavy CPU that requires more power when encoding videos with ffmpeg, but we're not sure. Is there any way to find the specific hardware component that is causing the panic?

It is possible it is power-related, but (unless you have an easily swappable power supply), I'd suspect the CPU chip, CPU cooler, motherboard first and then chassis cooling / memory / power supply if I didn't find the issue in the first set of parts I replaced.

I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:

-----------------------------------
Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead….work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
-----------------------------------

That sounds like the vendor being discussed in this post. They refused to take responsibility even after they demonstrated the problem on CentOS. The customer in that case ended up returning the system for a partial refund (just to get away from that vendor) and buying one from a vendor who is specifically FreeBSD-friendly (iXsystems) and things went fine except for being fumbled repeatedly by the "remote hands" in the data center they were using.

Supermicro lists FreeBSD as one of their supported operating systems here, by the way. Products without a checkmark for FreeBSD were probably not tested, but are compatible. In some cases, FreeBSD may not support some particular onboard device - for example, integrated WiFi - and that is why there is no checkmark. FreeBSD would boot on such a board and operate properly, but would be less useful due to not having drivers for some of the onboard hardware.

More below.

Now I am going to disable MCA permanently. What would be the disadvantages of disabling it?

hw.mca.enabled="0"

If you successfully disable MCA, then when the system has a fault that MCA would normally handle, the system will continue to run (possibly with corrupt data) until it either locks up or the fault manifests at a higher level (like a panic from a corrupted instruction stream). Either way, the actual cause will be much harder to isolate.

Intel publishes a list of instructions their CPUs support and the requirements for each instruction. It doesn't matter what operating system you use, or if you write a program on "the bare metal" - everybody has to play by the same set of rules or the CPU will reject it. There are occasionally errata published by Intel which describe a particular instruction sequence that either a) should be permissible, but which has unexpected results / side effects (example - FDIV bug), or b) should be rejected, but is incorrectly accepted (example - f00f bug). It may be that some operating systems are affected to a greater or lesser extent due to their particular code paths, but the problem is still in the hardware. The vast majority of code bugs will be visible much higher up, with a trap due to referencing unallocated memory, etc. MCA happens at a much lower level inside the CPU. Data paths are checked, and currently-inactive units may run self-tests. If a machine check happens, it is pretty much certain that the problem is within the tightly-coupled area of the CPU and memory controller (integrated on the CPU for some time now). I'd go on, but hopefully I've made my point.

shahzaib · Jan 7, 2016

Hi,

I showed the crash logs to our hosting team, which is very good with hardware, and this is what they replied:

-------------------------------------------------
I've set bios to optimized defaults, (this changed the multiplier back to 26) and disabled the following OC / powersaving features:

- Intel Turbo Boost
- Intel C-STATE Tech
- Intel EIST tech
- C1E Support

I have to mention that FreeBSD is not meant to work flawless on any hardware, we had situations in the past were our client had to switch from FreeBSD to other OS as the hardware - software combination was causing unexpected crashes.

-------------------------------------------------------

_martin · Jan 7, 2016

Did you overclock the hardware? If so, I'm sorry, but FreeBSD has nothing to do with it (the MCA here points at the hardware, not at a faulty OS).
An internal timer error indicates a problem with the CPU, its clock, and related components.
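
For anyone who wants to check the on-screen flags against the raw STATUS value in the dump, here is a minimal sketch in Python. It assumes the standard IA32_MCi_STATUS layout from Intel's machine-check architecture documentation (flag bits 63 down to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function name `decode_mci_status` is made up for illustration, and only the internal timer code is mapped; this is not how the FreeBSD kernel decodes it internally.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (assumed from Intel's
# machine-check architecture documentation).
FLAG_BITS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# MCA error code for "Internal timer error": 0000 0100 0000 0000b
INTERNAL_TIMER = 0x0400


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    mca_error_code = status & 0xFFFF               # bits 15:0
    model_specific_code = (status >> 16) & 0xFFFF  # bits 31:16
    return {
        "flags": flags,
        "mca_error_code": mca_error_code,
        "model_specific_code": model_specific_code,
        "is_internal_timer": mca_error_code == INTERNAL_TIMER,
    }


# STATUS value copied from the crash output in the first post.
result = decode_mci_status(0xBE00000000800400)
for flag in result["flags"]:
    print(flag)
print(hex(result["mca_error_code"]), result["is_internal_timer"])
```

The flags this yields line up with what the console printed (uncorrected, enabled, MISC and ADDR valid, processor context corrupt), which suggests the text in the dump is just a decoded view of that one register.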

shahzaib · Jan 7, 2016

I don't know whether it was overclocked, because I didn't change anything except reducing the CMOS multiplier ratio from the max (26) to 18. Is there some other option in the BIOS for overclocking? Is Intel Turbo Boost a form of overclocking?

Sorry, I am new to this overclocking stuff.

_martin · Jan 8, 2016

You didn't mention what board you have.

The last three are basically power saving settings. If you put them into Google you'll find good explanations of them.
Turbo Boost - http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html

As I mentioned earlier - I would check for BIOS updates or known problems with these technologies on the board you have. Interesting that power saving options are off in opt settings. Maybe manufacturer knows why.

Intel does have CPU diagnostic tools on their site. Some are for Windows only, but still worth putting a Windows-installed disk there for that purpose temporarily.

shahzaib · Jan 8, 2016

The Supermicro motherboard is an "X8DT3" and it is already running the latest BIOS.

matoatlantis said:
...
Interesting that power saving options are off in opt settings. Maybe manufacturer knows why.

The power saving options were not off; our support team has only now disabled them.

shahzaib · Jan 8, 2016

Unfortunately the server went down again with the following error on screen:

http://prntscr.com/9nlzvr

It also didn't generate any crash dump under the /var/crash directory. Could you please let me know why the server failed to generate a crash dump?
On further checking the OC / power saving options in the BIOS, I found that the Intel EIST option had not been disabled by our support guy, even though he said in his reply that he had disabled it (maybe he forgot). So I've disabled EIST now and rebooted the server to monitor it further.

http://prntscr.com/9nm1yp

shahzaib · Jan 9, 2016

We are testing FreeBSD with hw.mca.enabled: 0 on one of the same Supermicro servers. Here is its uptime since then:

Code:

[root@cw002 /var/crash]# uptime
1:52PM  up 8 days,  2:36, 1 user, load averages: 2.62, 0.73, 0.50

So ignoring MCA hasn't caused any downtime so far. I don't know whether we should disable MCA on all servers.

RedShift1 · Jan 9, 2016

shahzaib said:
We are testing FreeBSD with hw.mca.enabled: 0 on one of the same Supermicro servers. Here is its uptime since then:

Code:

[root@cw002 /var/crash]# uptime
1:52PM  up 8 days,  2:36, 1 user, load averages: 2.62, 0.73, 0.50

So ignoring MCA hasn't caused any downtime so far. I don't know whether we should disable MCA on all servers.

By flipping hw.mca.enabled to 0, the kernel is deliberately ignoring a signal from the processor that it detected an error in its operation. That signal is serious, and its root cause must be determined.

I would start with the basics: run Memtest86+ (http://www.memtest.org/). Because you experience the problem only very intermittently (multiple days pass between crashes), don't trust the result of just one pass; do multiple passes. Also run the bit fade test.

Furthermore, you suspected the power supplies before but can you post full system specs so the power requirements can be assessed and determined if inadequate?

Keep collecting as much information about the crashes as you can (screenshot of the on-screen messages, output of mcelog, etc...), so we can see if there's a pattern.

So do the memtest and let's take it from there.
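
To make the "look for a pattern" step concrete, here is a rough Python sketch that tallies machine-check records by CPU, bank, and STATUS value. It assumes the console text has been saved to a file or string in the same layout as the dump in the first post (a `CPU n BANK m` line followed later by a `STATUS ...` line); the function name `tally_mca_records` is hypothetical and this is not a replacement for mcelog.

```python
import re
from collections import Counter

# Matches "CPU 17 BANK 5" but not "CPUID Vendor ..." lines.
CPU_BANK = re.compile(r"^CPU (\d+) BANK (\d+)")
# Matches "STATUS be00000000800400 MCGSTATUS 4" and captures the hex value.
STATUS = re.compile(r"^STATUS ([0-9a-fA-F]+)")


def tally_mca_records(log_text: str) -> Counter:
    """Count machine-check records keyed by (cpu, bank, status)."""
    counts = Counter()
    current = None  # (cpu, bank) of the record being read, if any
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = CPU_BANK.match(line)
        if match:
            current = (int(match.group(1)), int(match.group(2)))
            continue
        match = STATUS.match(line)
        if match and current is not None:
            counts[(current[0], current[1], match.group(1).lower())] += 1
            current = None
    return counts


# Two records in the layout shown in the first post.
SAMPLE = """\
CPU 17 BANK 5
MISC 0 ADDR 802bf6c60
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
CPU 16 BANK 5
MISC 0 ADDR 802bf6c60
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
"""

for key, count in sorted(tally_mca_records(SAMPLE).items()):
    print(key, count)
```

If the same CPU/bank pair keeps showing up across crashes and across servers, that is worth reporting alongside the screenshots.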

leebrown66 · Jan 9, 2016

shahzaib said:
I don't know whether we should disable MCA on all servers.

You indicate you have more than 1 server:

Is this problem happening on all of them?
Do they all have the same CPU, motherboard and chassis?
Do they all have identical BIOS settings?
Are they all running the same version of FreeBSD?

shahzaib · Jan 10, 2016

RedShift1 said:
By flipping hw.mca.enabled to 0, the kernel is deliberately ignoring a signal from the processor that it detected an error in its operation. That signal is serious, and its root cause must be determined.

I would start with the basics: run Memtest86+ (http://www.memtest.org/). Because you experience the problem only very intermittently (multiple days pass between crashes), don't trust the result of just one pass; do multiple passes. Also run the bit fade test.

Furthermore, you suspected the power supplies before but can you post full system specs so the power requirements can be assessed and determined if inadequate?

Keep collecting as much information about the crashes as you can (screenshot of the on-screen messages, output of mcelog, etc...), so we can see if there's a pattern.

So do the memtest and let's take it from there.

Hi, we actually have a total of 5 Supermicro servers with the same specs, all doing the same job of video encoding and serving over port 80. The reason I don't suspect memory is that it's highly unlikely that all 5 servers have faulty memory, and all of them go down intermittently, but I guess I'll try memtest on one of them. Could you please guide me a bit on how to test with memtest? Do I need to boot the server from the memtest ISO and let it run for days, or is there another way that avoids the downtime?

RedShift1 said:
...
Furthermore, you suspected the power supplies before but can you post full system specs so the power requirements can be assessed and determined if inadequate?

All servers are built from the following components:

2 x Intel X5690 @ 3.47GHz (12 cores, 24 threads)
12 x 3TB SATA 7200rpm
12 x 8GB DIMM (96GB memory)
2 x 800W Redundant PS
X8DT3 Board

RedShift1 said:
...
Keep collecting as much information about the crashes as you can (screenshot of the on-screen messages, output of mcelog, etc...), so we can see if there's a pattern.

Here is the most recent crash, which occurred last night when the load average on the server was 0.2:

http://pastebin.com/vCxypF2z

shahzaib · Jan 10, 2016

leebrown66 said:
You indicate you have more than 1 server:

Is this problem happening on all of them?

Do they all have the same CPU, motherboard and chassis?

Do they all have identical BIOS settings?

Are they all running the same version of FreeBSD?

Yes, we have 5 servers with the same specs doing the identical job of encoding videos and storing them; you could call it a cluster of servers.

Yes, the problem is happening on all of them: same CPU, motherboard, chassis, and identical BIOS settings.

leebrown66 said:
...
Are they all running the same version of FreeBSD?

Yes, we restored the same FreeBSD image from Clonezilla on all 5 servers, so they all have the same configs.

leebrown66 · Jan 10, 2016

Considering they all exhibit the same problem, I'd put the BIOS to default settings first. Then if you still experience problems, do exactly what the vendor suggests, throw a different OS on there and run the same code. If it still crashes, it's obviously not OS related therefore must be hardware (per the MCA message, although mis-configured BIOS can cause hardware problems, hence setting it to defaults).

Was this a self-build or did you purchase them pre-built? If self-build, the vendor is going to be your best source of support. If purchased, you have a good case for returning them as it indicates they did something wrong.

shahzaib · Jan 10, 2016

One point worth mentioning is the model of the PS, which is "Supermicro PS- 902-1R 900W". I've edited that into my first post now.

shahzaib · Jan 18, 2016

After lots of testing, we've decided to install Debian on one of these Supermicro servers. If it stops crashing after that, we'll conclude that FreeBSD has issues supporting this Supermicro hardware. In that case we're thinking of buying a Dell R510 server with an LSI-9211 controller and installing FreeBSD on it. Could anyone please let us know whether Dell supports FreeBSD? In the following thread a user reports being unable to run FreeBSD smoothly on a Dell due to some issue with the LSI-9211 controller, although it worked fine on other systems, e.g. Red Hat:

http://hardforum.com/showthread.php?t=1681334

If you guys could give us some suggestions on whether we should go with Dell for FreeBSD, it would be really kind. The following will be our config with Dell:

Dell R510
2 x X5675 (12 cores, 24 threads)
RAM 64GB
12 x 3TB SATA Raid-10 (HBA LSI-9211)

Terri_Kennedy · Jan 19, 2016

shahzaib said:
Could anyone please let us know whether Dell supports FreeBSD? In the following thread a user reports being unable to run FreeBSD smoothly on a Dell due to some issue with the LSI-9211 controller, although it worked fine on other systems, e.g. Red Hat:

http://hardforum.com/showthread.php?t=1681334

I'm running FreeBSD on an R710 and an NX3100 (which is based on something else in the Rx10 series - possibly even the R510). If you have 12 drives in that R510, there is a SAS expander in there (on the drive backplane). SATA drives behind a SAS expander are normally a bad idea. It is too long to go into here, though. The Dell H700 controller for that R510 is a nice controller and supported in FreeBSD by the mfi(4) driver. Since it is a Dell controller, it will either warn or flat-out refuse to work with non-Dell-certified drives (newer controller firmware just warns).

I also have a number of Supermicro systems running FreeBSD without any problems. They're all X8-series motherboards.

shahzaib · Jan 19, 2016

Terry, thanks for the detailed reply, I'll go through it. But first, here is a summary of the recent diagnosis, though nothing is confirmed yet.

We went with Debian 8 on one of the Supermicro servers. Things were pretty stable for the first two days, though we saw lots of CPU overheating messages in the kernel log; as long as the server didn't crash, we didn't bother Googling them. But this morning around 6:00am the server crashed and rebooted automatically. On reading the kernel logs, we found the same overheating messages, which had been appearing on Debian 8 from the beginning, right before the crash. The logs boil down to:

Code:

Jan 19 05:16:29 cws004 kernel: [362404.826424] mce: [Hardware Error]: Machine check events logged
Jan 19 05:16:29 cws004 mcelog: Processor 17 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted
Jan 19 05:16:29 cws004 mcelog: Processor 5 heated above trip temperature. Throttling enabled.
Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted
Jan 19 05:16:29 cws004 mcelog: Processor 5 below trip temperature. Throttling disabled
Jan 19 05:16:29 cws004 mcelog: Processor 17 below trip temperature. Throttling disabled

-----------------------------------------------------------

On reporting this issue to the DC, this is what they did to fix the heating issue:

The airflow for CPU1 was obstructed by a "feature" of the plastic cover that should had been removed in a 2 CPU scenario. We removed it, and now both CPU's are cooled correctly.
------------

Now we're not sure whether FreeBSD was also crashing due to that heating issue, as the logs were different on FreeBSD (MCA: Internal timer error), while Debian's logging was quite different and explicit enough to debug the issue on the spot. We'll monitor the servers for a few days and update the thread on the situation.
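
For keeping an eye on this while monitoring, a small Python sketch like the following could count the thermal trip messages per processor. It assumes syslog lines in exactly the mcelog wording shown above; the function name `count_thermal_trips` is made up for illustration, and the log file location on any given system is not assumed here.

```python
import re
from collections import Counter

# Matches the "heated above trip temperature" lines in the format shown above.
TRIP = re.compile(r"mcelog: Processor (\d+) heated above trip temperature")


def count_thermal_trips(lines) -> Counter:
    """Count thermal trip events per logical processor number."""
    trips = Counter()
    for line in lines:
        match = TRIP.search(line)
        if match:
            trips[int(match.group(1))] += 1
    return trips


# Lines copied from the Debian log in this post.
SAMPLE_LINES = [
    "Jan 19 05:16:29 cws004 mcelog: Processor 17 heated above trip temperature. Throttling enabled.",
    "Jan 19 05:16:29 cws004 mcelog: Please check your system cooling. Performance will be impacted",
    "Jan 19 05:16:29 cws004 mcelog: Processor 5 heated above trip temperature. Throttling enabled.",
    "Jan 19 05:16:29 cws004 mcelog: Processor 5 below trip temperature. Throttling disabled",
    "Jan 19 05:16:29 cws004 mcelog: Processor 17 below trip temperature. Throttling disabled",
]

for processor, count in sorted(count_thermal_trips(SAMPLE_LINES).items()):
    print(f"processor {processor}: {count} trip(s)")
```

If the counts drop to zero after the airflow fix, that would at least confirm the cooling change took effect, even if it does not prove the FreeBSD crashes had the same cause.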

fossette · Jan 21, 2016

Sounds like a problem I encountered when I started with FreeBSD. Demanding too much of the CPU may freak out its heat sensors.
https://forums.freebsd.org/threads/one-solution-to-kernel-panic-computer-reboot.51941/

shahzaib · Feb 19, 2016

Hi, I'm back after a long time. The issue is not solved, and we'll soon replace the hardware, though I noticed some more errors when the server came back online after a crash (no crash log though). Here is the log, something related to the LSI SAS controller I guess:

http://pastebin.com/piDcMszC

Could that be the cause of the crash? It looks like the errors were generated during boot rather than before the crash, though. I'm still not sure and need to learn how to read these logs.

Thanks to all for your help !!

leebrown66 · Feb 20, 2016

Stating the obvious, you have a disk problem. Command 12 is (as the log states) an INQUIRY command (reference here), which is failing. The SCSI command reference can be found via a link on the bottom of that page. The driver is issuing an INQUIRY command in order to fetch data about the disk(s) and there's some kind of problem. I've only seen this type of fault with disks that have read faults due to age, but I wouldn't rule out a bad card or cables either. Try swapping cables or moving the disks to different ports to see if the error moves, which would indicate a faulty disk. If the error sticks to the same port, I'd guess a bad card. Try another card if you have one...

shahzaib · Feb 20, 2016

leebrown66 said:
Stating the obvious, you have a disk problem. Command 12 is (as the log states) an INQUIRY command (reference here), which is failing. The SCSI command reference can be found via a link on the bottom of that page. The driver is issuing an INQUIRY command in order to fetch data about the disk(s) and there's some kind of problem. I've only seen this type of fault with disks that have read faults due to age, but I wouldn't rule out a bad card or cables either. Try swapping cables or moving the disks to different ports to see if the error moves, which would indicate a faulty disk. If the error sticks to the same port, I'd guess a bad card. Try another card if you have one...

Thanks for the reply. We're using 12 x 3TB HDDs, striped and mirrored, behind an LSI-9211 HBA controller. Could you please answer some of my newbie questions?

- When you said moving disks to different ports, does that mean moving the disks to different ports of the controller?
- As it looks like these logs were generated at boot time rather than at the crash, do you think they could be the cause of the crash? If they were the cause, they should have been generated just before the crash event, though I still need clarity on that.

leebrown66 · Feb 20, 2016

...moving disks to different ports, does that mean moving the disks to different ports of the controller?

Yes, either the controller or the expander backplane. You are really trying to see whether the mps0:0:3:0 and mps0:0:3:1 entries change. I am not familiar with the mps driver, so I don't know how it indexes the disks, but if you can identify which disk is 0:3:0, swap it with, for example, 0:3:2. If you still get an error on 0:3:0, it looks like the card; if the error moves to 0:3:2, it must be the disk or cable (or the expander, I suppose).
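
The swap test above boils down to a small decision rule, sketched here in Python. The function name `interpret_swap` and the plain-string slot labels are made up for illustration; how the mps driver actually numbers targets is not assumed.

```python
from typing import Optional


def interpret_swap(slot_before: str, disk_moved_to: str,
                   slot_after: Optional[str]) -> str:
    """Interpret where the error shows up after swapping a suspect disk.

    slot_before   -- slot that logged the error before the swap (e.g. "0:3:0")
    disk_moved_to -- slot the suspect disk was moved to (e.g. "0:3:2")
    slot_after    -- slot that logs the error after the swap, or None
    """
    if slot_after is None:
        return "no error seen after the swap: inconclusive, keep watching"
    if slot_after == slot_before:
        return "error stayed on the same port: looks like the card"
    if slot_after == disk_moved_to:
        return "error followed the disk: disk or cable (or expander)"
    return "error appeared somewhere else: inconclusive"


print(interpret_swap("0:3:0", "0:3:2", "0:3:0"))
print(interpret_swap("0:3:0", "0:3:2", "0:3:2"))
```

Writing the outcome down this way before doing the swap makes it harder to misread the result afterwards.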

...do you think they could be the cause of the crash

That's impossible to say really. For example, it's possible a disk has been silently corrupted during a write operation (imagine faulty controller on the disk itself). If it's a mirrored RAID you may have had a read from a good disk with no crash and at another time a read from a bad disk causing a crash, making the crash random. That's also assuming it's code that got corrupted. I would have thought you'd get a different error from the RAID controller though, not an INQUIRY.
The LSI's BIOS should have a check disk operation you could try if you can isolate a single disk. As you are using ZFS, try scrubbing.

Terri_Kennedy · Feb 21, 2016

leebrown66 said:
Stating the obvious, you have a disk problem. Command 12 is (as the log states) an INQUIRY command (reference here), which is failing.

No, it is a regression in FreeBSD. See the FreeBSD-stable@ thread I started here. It is harmless, it just delays the boot process for a while during the retries.