Critical vmpfw status !!

Hi,

We're running FreeBSD 11.2 and, for the first time, we've been hit by a critical error that changes the state of all nginx processes to vmpfw and hangs them. We tried the actions listed below to recover from it, but all in vain.

This is the error: http://prntscr.com/ktnxex

These are the actions we performed:

- killall -9 nginx (it doesn't kill the processes; the command appears to be ignored)
- service nginx restart (this hangs, and we have to press Ctrl+C to abort it)

The only way we can recover is by hard rebooting the server. It then runs for a few hours and goes back to the "vmpfw" state. While in this state, nginx stops serving requests.

Looking for urgent help.

Thanks!
 
It appears that da9, da6 and da2 are having problems reading and writing (the last screenshot).

From that I'd guess nginx isn't able to complete some of its work, either at all or in a timely enough manner.
Then I'd guess you've got more requests (connections) coming into nginx (penultimate screenshot) that are now piling up, waiting for those earlier operations to complete. Eventually that queue is exhausted.

You can't kill nginx because its processes are stuck waiting on the disk.

You have a hardware problem. Try replacing the cables to those disks, then replacing/moving those particular disks.
 
I can't find an official S.M.A.R.T. document, but the wiki describes the attribute as:
199 UltraDMA CRC Error Count - The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

Which could be:
  • bad cable (kinked or damaged internally)
  • badly seated cable either on the disk or controller side
  • Electrical interference (eg cable wrapped around a power supply)
Given all the zeros, it doesn't look like the disks themselves are the problem. I'm guessing they are fairly new, but the hours are not posted.
As all of them are showing these errors, either ALL the cables are problematic, which I think is unlikely, or the controller is the issue.
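
To compare drives quickly, something like the sketch below could pull the raw value of attribute 199 out of the attribute table. It assumes the text comes from `smartctl -A /dev/daN` (smartmontools) and that the raw value is the tenth whitespace-separated field, which is the usual layout. The function name and the sample text are made up for illustration; they are not the poster's data.

```python
from typing import Optional

# Hypothetical excerpt in the shape of `smartctl -A` output; the values are invented.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""


def crc_error_count(smart_text: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is not found."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the first is the attribute ID.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])  # RAW_VALUE column
            except ValueError:
                return None  # raw value was not a plain integer
    return None


if __name__ == "__main__":
    print(crc_error_count(SAMPLE_OUTPUT))
```

Running it against each disk and putting the numbers side by side would show whether every drive on the controller is accumulating CRC errors or only a few.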
 
Do you mind providing the hardware specs for your system, as well as the full SMART data for at least one hard drive?

Including the current configuration (controller and OS), and how you set up your storage, might also help.

I take it you've verified that the power supply can support all the hardware?
 
Which could be:
  • bad cable (kinked or damaged internally)
  • badly seated cable either on the disk or controller side
  • Electrical interference (eg cable wrapped around a power supply)
You can add a dodgy or broken port extender to that list too. I've had this happen; it caused so many errors that ZFS constantly marked random drives as bad.
 
Hi Guys,

I am back with an update on the change we made to this hardware. We replaced its backplane, and today we were hit by the vmpfw problem again. Here are the latest kernel messages; can you please explain what they mean?

https://pastebin.com/VsK7JnaL
 
My search-fu is coming up short. Is this using the original cables?

If you can identify drives that consistently experience the problem and some that don't, swap the cables between them and see if the problem moves. That'll point to the cables for sure.
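
The swap test can be written down as a small decision rule. This is only a sketch of the reasoning: the function name and the result labels are hypothetical, and it assumes you note the CRC error count (attribute 199) of both drives before the swap and again after some run time, since that counter is normally cumulative and only the increase matters.

```python
def diagnose_swap(new_errors_suspect: int, new_errors_control: int) -> str:
    """Interpret CRC errors gained after swapping cables between two drives.

    new_errors_suspect: increase on the drive that used to show errors
                        (now on the other drive's cable).
    new_errors_control: increase on the previously clean drive
                        (now on the suspect's old cable).
    """
    if new_errors_suspect > 0 and new_errors_control > 0:
        # Both paths are failing: look at what they share.
        return "shared"  # controller, backplane, expander, power
    if new_errors_control > 0:
        # The problem moved with the cable.
        return "cable"
    if new_errors_suspect > 0:
        # The problem stayed put despite a different cable.
        return "drive_or_port"
    return "inconclusive"  # no new errors yet, keep watching


if __name__ == "__main__":
    print(diagnose_swap(0, 7))
```

If the result is "shared" for every pair you try, that fits the earlier suspicion that the controller rather than any single cable is at fault.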
 