Critical vmpfw status !!

shahzaib · Sep 12, 2018

Hi,

We're using FreeBSD 11.2 and, for the first time, we've been hit by a critical error that changes the state of all Nginx processes to "vmpfw" and jams them. We tried the actions listed below to recover from this error, but all in vain.

This is the error: http://prntscr.com/ktnxex

The following actions were performed:

- killall -9 nginx (it doesn't kill the processes; the command seems to be ignored)
- service nginx restart (this hangs and we have to press Ctrl+C to abort it)

The only way we can recover is by hard rebooting the server. It then runs for a few hours and goes back to the "vmpfw" state again. While in this state, nginx stops serving any requests.

Looking for urgent help.

Thanks!

shahzaib · Sep 12, 2018

In the kernel logs we're seeing the following messages:

http://prntscr.com/kto11i

And on the iDRAC console, the following logs are on screen:

http://prntscr.com/kto253

leebrown66 · Sep 13, 2018

It appears that da9, da6 and da2 are having problems reading and writing (the last screenshot).

From that I'd have to guess nginx isn't able to perform some of its work, either at all or in a timely enough manner.
Then I'd guess you've got more requests (connections) coming into nginx (penultimate screenshot) that are now piling up, waiting for those earlier operations to complete. That queue eventually fills up.

You can't kill the processes because they are blocked waiting for the disk to respond.
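To check that the workers really are stuck in an uninterruptible wait, one option is to list each process's state and wait channel. The sketch below is only an illustration: the function names are made up, and it assumes the `pid`, `state`, `wchan` and `command` keywords of FreeBSD's ps(1). A state starting with `D` marks a disk (or other short-term, uninterruptible) wait, and a signal such as SIGKILL is only acted on once that wait ends.

```python
import subprocess


def processes_waiting_on(ps_output, wchan):
    """Return (pid, state, command) for every row whose wait channel equals wchan.

    ps_output is the text printed by `ps -ax -o pid,state,wchan,command`.
    """
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header and any malformed line: a data row starts with a PID.
        if len(fields) == 4 and fields[0].isdigit() and fields[2] == wchan:
            rows.append((int(fields[0]), fields[1], fields[3].strip()))
    return rows


def list_stuck_processes(wchan="vmpfw"):
    """Run ps and return the processes sleeping on the given wait channel."""
    result = subprocess.run(
        ["ps", "-ax", "-o", "pid,state,wchan,command"],
        capture_output=True, text=True, check=True,
    )
    return processes_waiting_on(result.stdout, wchan)
```

Calling `list_stuck_processes()` while the problem is occurring should list the nginx workers if they are all sleeping on `vmpfw`.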

You have a hardware problem. Try replacing the cables to those disks, then replacing/moving those particular disks.

shahzaib · Sep 13, 2018

SirDice, hope you're doing fine. I wanted to bring this post to your notice.

Bobi B. · Sep 13, 2018

Did you run sysutils/smartmontools diagnostics on those disks? Normally when I/O hangs, regardless of whether it is a local disk or an NFS mount (it actually depends on the mount options), the user-space process hangs as well.
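A minimal sketch of collecting those diagnostics in a script, assuming `smartctl` from sysutils/smartmontools is installed and prints the usual ATA attribute table for `smartctl -A`. The function names are hypothetical, and the ten-column layout is an assumption that may not hold for SAS drives or drives behind some RAID controllers.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the ATA attribute table from `smartctl -A` into {name: raw_value}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[fields[1]] = int(raw) if raw.isdigit() else raw
    return attrs


def read_smart_attributes(device):
    """Run smartctl on one device and return its parsed attribute table."""
    # smartctl's exit status is a bitmask and can be non-zero even when the
    # table was printed, so the return code is deliberately not checked here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)
```

For example, `read_smart_attributes("/dev/da2")` (run as root) would return a dictionary keyed by attribute name for one of the disks named earlier in the thread.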

shahzaib · Sep 13, 2018

Hi,

Yeah, we have 12 HDDs in RAID 10, and all of these drives show high, incrementing values for UDMA_CRC_Error_Count:

https://pastebin.com/41aW1HPY

However, we couldn't find much about how critical this parameter is.

leebrown66 · Sep 13, 2018

I can't find an official S.M.A.R.T. document, but the wiki indicates that it is:
199 UltraDMA CRC Error Count - The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

Which could be:

- bad cable (kinked or damaged internally)
- badly seated cable, either on the disk or controller side
- electrical interference (e.g. cable wrapped around a power supply)

Given all the zeros, it doesn't look like the disks themselves are the problem. I'm guessing they are fairly new, but the hours are not posted.
As all of them are indicating that, either ALL the cables are problematic, which I think is unlikely, or the controller is the issue.

Bobi B. · Sep 13, 2018

Do you mind providing the hardware specs for your system, as well as full SMART data for at least one hard drive?

Including the current configuration (controller and OS) might also help, such as how you set up your storage.

I take it you've verified that the power supply can support all the hardware?

SirDice · Sep 14, 2018

leebrown66 said:
Which could be:

- bad cable (kinked or damaged internally)
- badly seated cable either on the disk or controller side
- Electrical interference (eg cable wrapped around a power supply)

You can add a dodgy or broken port extender to that list too. I've had this happen; it caused so many errors that ZFS constantly marked random drives as bad.

shahzaib · Oct 5, 2018

Hi Guys,

I am back with the change we made to this hardware. We replaced its backplane, and today we were hit by the vmpfw problem again. Here are the latest kernel messages; can you please explain what they mean?

https://pastebin.com/VsK7JnaL

CyberCr33p · Oct 5, 2018

When the issue happens, can you log in remotely using SSH? I have a similar issue and I want to see if it's the same.

leebrown66 · Oct 6, 2018

My search-fu is coming up short. Is this using the original cables?

If you can identify drives that consistently experience the problem and some that don't, swap the cables between them and see if the problem moves. If it does, that'll point to the cables for sure.
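Because UDMA_CRC_Error_Count is a cumulative counter, what matters for the swap test is which drives' counts grow afterwards, not their absolute values. A rough sketch of comparing two snapshots follows; the function name and the idea of keeping snapshots as plain dictionaries of drive name to raw count are assumptions made for illustration.

```python
def crc_growth(before, after):
    """Return {drive: increase} for drives whose CRC error count rose.

    before and after map a drive name (e.g. "da2") to the raw value of
    UDMA_CRC_Error_Count at two points in time.
    """
    return {
        drive: after[drive] - before[drive]
        for drive in after
        if drive in before and after[drive] > before[drive]
    }
```

Take one snapshot before swapping the cables and another after the system has run under load for a while. If the growth follows the cable to a different drive, the cable is the likely culprit; if it stays with the same bay or shows up on every drive, something shared such as the controller is more suspect.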
