Critical vmpfw status !!

shahzaib

Active Member

Hi,

We're running FreeBSD 11.2 and, for the first time, we've been hit by a critical error that changes the state of all nginx processes to vmpfw and hangs every one of them. We tried the actions below to recover from it, but all in vain.

This is the error: http://prntscr.com/ktnxex

These are the actions we performed:

- killall -9 nginx (it doesn't kill the processes; the command seems to be ignored)
- service nginx restart (this hangs as well, and we have to press Ctrl+C to abort it)

The only way we can recover is to hard-reboot the server. It then runs for a few hours and goes back into the "vmpfw" state. While in this state, nginx stops serving requests.

Looking for urgent help.

Thanks!
 

leebrown66

Well-Known Member

It appears that da9, da6 and da2 are having problems reading and writing (see the last screenshot).

From that, I'd guess nginx can't complete some of its work, either at all or in a timely enough manner.
Then I'd guess you've got more requests (connections) coming into nginx (penultimate screenshot) that are piling up, waiting for the earlier ones to complete, until that queue is exhausted.

You can't kill nginx because it is stuck waiting on the disk.

You have a hardware problem. Try replacing the cables to those disks, then replacing/moving those particular disks.
 

Bobi B.

Well-Known Member

Did you run sysutils/smartmontools diagnostics on those disks? Normally when I/O hangs, regardless of whether it is a local disk or an NFS mount (actually, that depends on the mount options), the user-space process hangs as well.
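
To see which processes are actually stuck, one option is to filter ps output by wait channel. The following is only a rough Python sketch under assumptions: it expects text shaped like the output of `ps -axo pid,state,wchan,comm`, the sample rows are invented for illustration, and `stuck_processes` is a hypothetical helper name.

```python
# Invented sample in the shape of `ps -axo pid,state,wchan,comm` output.
SAMPLE_PS = """\
  PID STAT WCHAN   COMMAND
  812 Is   select  sshd
 1401 D    vmpfw   nginx
 1402 D    vmpfw   nginx
"""


def stuck_processes(ps_output, wchan="vmpfw"):
    """Return (pid, command) pairs for processes sleeping on `wchan`."""
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header line and any malformed rows.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, _state, channel, command = fields
        if channel == wchan:
            stuck.append((int(pid), command.strip()))
    return stuck


print(stuck_processes(SAMPLE_PS))
```

If every nginx worker shows up in that list while unrelated daemons do not, that would fit the picture of processes blocked on I/O rather than a problem inside nginx itself.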
 

shahzaib

Active Member

Hi,

Yeah, we have 12 HDDs in a RAID 10 array, and all of these drives show high, incrementing values for UDMA_CRC_Error_Count:

https://pastebin.com/41aW1HPY

However, we couldn't find much about how critical this parameter is.
 

leebrown66

Well-Known Member

I can't find an official S.M.A.R.T. document, but the wiki describes it as:
199 UltraDMA CRC Error Count - The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

Which could be:
  • bad cable (kinked or damaged internally)
  • badly seated cable either on the disk or controller side
  • electrical interference (e.g. a cable wrapped around a power supply)
Given all the zeros, it doesn't look like the disks themselves are the problem. I'm guessing they are fairly new, but the power-on hours weren't posted.
As all of them are indicating that, either ALL the cables are problematic, which I think is unlikely, or the controller is the issue.
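
As far as I know, the raw value of attribute 199 is cumulative, so what matters is whether it is still climbing. A minimal Python sketch of that check follows. It assumes text in the usual `smartctl -A` table layout (raw value in the tenth column); the sample values are invented, not taken from the pastebin, and the helper names `crc_error_count` and `growing_counts` are hypothetical.

```python
# Invented sample in the shape of `smartctl -A /dev/daN` output.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1523
"""


def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def growing_counts(before, after):
    """Map disk -> increase for disks whose count rose between two snapshots."""
    return {
        disk: after[disk] - before[disk]
        for disk in before
        if disk in after and after[disk] > before[disk]
    }


print(crc_error_count(SAMPLE_SMART))
print(growing_counts({"da2": 1500, "da6": 40}, {"da2": 1523, "da6": 40}))
```

Taking one snapshot per disk, waiting a few hours under load, and taking another would show which links are still producing errors and which only carry an old count.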
 

Bobi B.

Well-Known Member

Do you mind providing the hardware specs for your system, as well as the full SMART data for at least one hard drive?

Details of the current configuration (controller and OS), such as how you set up your storage, might also help.

I take it you've verified that the power supply can support all the hardware?
 

SirDice

Administrator
Staff member
Moderator

leebrown66 said:
Which could be:
  • bad cable (kinked or damaged internally)
  • badly seated cable either on the disk or controller side
  • Electrical interference (eg cable wrapped around a power supply)

You can add a dodgy or broken port extender to that list too. I've had this happen; it caused so many errors that ZFS constantly marked random drives as bad.
 

shahzaib

Active Member

Hi Guys,

I'm back with an update on the change we made to this hardware. We replaced its backplane, but today we were hit by the vmpfw problem again. Here are the latest kernel messages; can you please explain what they mean?

https://pastebin.com/VsK7JnaL
 

CyberCr33p

Well-Known Member

When the issue happens, can you log in remotely using SSH? I have a similar issue and I want to see if it's the same.
 

leebrown66

Well-Known Member

My search-fu is coming up short. Is this using the original cables?

If you can identify drives that consistently experience the problem and some that don't, swap the cables between them and see if the problem moves. That'll point to the cables for sure.
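
The swap test can be reduced to a small decision rule. Below is a hedged Python sketch of it: `classify_swap` is a hypothetical helper, the disk names and counts in the example are invented, and `new_errors` is assumed to hold the number of errors each disk logged after the swap. Note that if the cables are only exchanged at the disk end, the controller port moves along with the cable, so "moved" implicates both.

```python
def classify_swap(suspect, control, new_errors):
    """Interpret errors seen after swapping cables between two disks.

    suspect:    disk that had errors before the swap
    control:    disk that was clean before the swap
    new_errors: dict of disk -> errors logged since the swap
    """
    suspect_bad = new_errors.get(suspect, 0) > 0
    control_bad = new_errors.get(control, 0) > 0
    if control_bad and not suspect_bad:
        # The fault followed the cable (and whatever it is plugged into).
        return "moved_with_cable"
    if suspect_bad and not control_bad:
        # The fault stayed put, so the cable is unlikely to be the cause.
        return "stayed_with_disk"
    # Both or neither disk erred: this single swap does not settle it.
    return "inconclusive"


# Example with made-up numbers: da2 was the bad disk, da4 the clean one.
print(classify_swap("da2", "da4", {"da2": 0, "da4": 17}))
```

An "inconclusive" result on every pair would be more consistent with something shared by all the drives than with individual cables.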
 