Systems hang on bhyve guest (11.0)

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

Hello!

I've encountered a strange problem. At 3 am mail server (bhyve guest) crashed with 100% CPU usage. After hard reset in single user mode I had the following inconsistency file system.

Code:
# fsck -fy
** /dev/vtbd0p2
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=401653  OWNER=root MODE=100600
SIZE=0 MTIME=Nov 11 23:47 2016
CLEAR? yes

UNREF FILE I=401654  OWNER=root MODE=100600
SIZE=0 MTIME=Nov 11 18:26 2016
CLEAR? yes

** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes

SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

685864 files, 42943666 used, 477579059 free (246891 frags, 59666521 blocks, 0.0% fragmentation)

***** FILE SYSTEM IS CLEAN *****

***** FILE SYSTEM WAS MODIFIED *****

At 3 am works backup job on host (packing and copying VM files). I know at 3 am certain system jobs work. I'm sure the system was clean before from any file system errors.


Host: FreeBSD 11.0p3 with UFS
Guest: FreeBSD 11.0p3 with UFS
Type of hard drive: SSD and HDD (I had the problem on both, guests work on both as well)

SSD specification:
Code:
% sudo diskinfo -v /dev/nvd0
/dev/nvd0
        512             # sectorsize
        512110190592    # mediasize in bytes (477G)
        1000215216      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset

HDD specification:
Code:
sudo diskinfo -v /dev/stripe/stripe0
/dev/stripe/stripe0
        512             # sectorsize
        8001573879808   # mediasize in bytes (7.3T)
        15628073984     # mediasize in sectors
        131072          # stripesize
        0               # stripeoffset

Any ideas where should I look for the solution?
No physical errors on both hard drives.
 

SirDice

Administrator
Staff member
Administrator
Moderator

Reaction score: 12,323
Messages: 38,841

A hard system reset is bound to create filesystem errors, especially if the system was trying to write at the exact moment. Note though that normally a system doesn't crash, no matter how long it's been pushing the CPU to 100%. So something else is going on too. I would definitely re-schedule things a bit to spread out the load a bit. Especially the nightly system jobs tend to be quite I/O intensive (lots of find(1) actions). It would be a bad idea to schedule another I/O intensive job (like backups) at the same time. Either schedule the backups before or after the nightly jobs or move the nightly jobs to another time.
 
OP
IPTRACE

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

Ok, thanks.
I'll try to "spread" the time.
 
OP
IPTRACE

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

I've changed the time and I know there is no backups ware working, especially for VMs which still freeze.
At 3am some of my bhyve VM freeze. So every morning I have to check if everything works.
At 3am when I was connected via SSH to the VM I was disconnected everytime du to no pipe or pipe error.
I'm trying to run console session to look up "bad" VM but I have a lot of VM so It's very difficult to find a proper system.

Code:
# fsck -fy
** /dev/vtbd0p2
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes

SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

331858 files, 3339912 used, 6813570 free (92482 frags, 840136 blocks, 0.9% fragmentation)

***** FILE SYSTEM IS CLEAN *****
 
OP
IPTRACE

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

I've caught the system during freezing...

On bhyve guest (frozen):
Code:
last pid: 27698;  load averages:  0.50,  0.35,  0.32                                                                                    up 6+02:58:09  03:01:28
103 processes: 2 running, 100 sleeping, 1 waiting
CPU 0:  2.4% user,  0.0% nice, 53.9% system,  0.4% interrupt, 43.3% idle
CPU 1:  1.6% user,  0.0% nice, 20.9% system,  0.4% interrupt, 77.2% idle
Mem: 41M Active, 3931M Inact, 928M Wired, 541M Buf, 3025M Free
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root          2 155 ki31     0K    32K RUN     0 291.7H 120.59% [idle]
27696 root          1  52    0 10388K  4448K biord   1   0:10  73.92% find -sx / /dev/null -type f ( ( ! -perm +010 -and -perm +001 ) -or ( ! -perm +020 -and -
   20 root          1  21    -     0K    16K vlruwt  0   0:12   6.45% [vnlru]
   12 root         16 -64    -     0K   256K WAIT   -1   7:51   1.29% [intr]
21463 root          1  20    0 20120K  3812K CPU0    0   0:04   0.08% top -SPa
    6 root          1 -16    -     0K    16K -       0   1:44   0.04% [rand_harvestq]
27506 myuser        1  20    0 19120K  5912K kqread  1   0:00   0.03% imap: [myuser 192.168.1.10 IDLE] (imap)
  310 root          1 -16    -     0K    16K pftm    0   2:36   0.03% [pf purge]
  326 _pflogd       1  20    0 10672K  3084K bpf     0   1:17   0.02% pflogd: [running] -s 116 -i pflog0 -f /var/log/pflog (pflogd)
21450 myuser        1  20    0 83092K  8068K select  1   0:01   0.01% sshd: myuser@pts/0 (sshd)
   19 root          2 -16    -     0K    32K psleep  0   0:27   0.01% [bufdaemon]
   15 root          2 -16    -     0K    32K psleep  0   0:15   0.00% [pagedaemon]
  303 root          1  20    0  9512K  4972K select  1   0:00   0.00% /sbin/devd
  527 opendkim      4  20    0 45904K 11172K select  1   0:06   0.00% /usr/local/sbin/opendkim -l -u opendkim -P /var/run/milteropendkim/pid -x /usr/local/etc/
  580 mailnull      3  20    0 27084K  3956K select  0   0:04   0.00% /usr/local/sbin/opendmarc -l -P /var/run/opendmarc/pid -c /usr/local/etc/mail/opendmarc.c
   18 root          1  20    -     0K    16K -       1   0:07   0.00% [bufspacedaemon]
   21 root          1  16    -     0K    16K syncer  0   8:39   0.00% [syncer]
26952 root          1  20    0 11584K  3496K kqread  0   0:18   0.00% /usr/local/sbin/dovecot -c /usr/local/etc/dovecot/dovecot.conf
  471 root          1  20    0 10464K  2400K select  1   0:17   0.00% /usr/sbin/syslogd -s
26954 root          1  20    0 11520K  3272K kqread  0   0:16   0.00% dovecot/log
40452 root          1  20    0 11612K  3980K kqread  0   0:07   0.00% dovecot/config
  710 root          1  52    0 12564K  2432K nanslp  1   0:06   0.00% /usr/sbin/cron -s
  573 _ntp          1  20  -20 14240K  3252K select  0   0:05   0.00% ntpd: ntp engine (ntpd)
41384 root          1  20    0 42368K  8964K kqread  1   0:05   0.00% /usr/local/libexec/postfix/master -w
26953 dovecot       1  20    0 11516K  3236K kqread  0   0:04   0.00% anvil: [14 connections] (anvil)
    0 root         14 -16    -     0K   224K swapin  0   0:02   0.00% [kernel]
41386 postfix       1  20    0 44480K  9048K kqread  1   0:01   0.00% qmgr -l -t unix -u
41542 postfix       1  20    0 48656K  9124K kqread  1   0:01   0.00% tlsmgr -l -t unix -u
  575 root          1  20  -20 14232K  3096K select  0   0:00   0.00% /usr/local/sbin/ntpd -sv -f /usr/local/etc/ntpd.conf
    8 root          1 -16    -     0K    16K -       1   0:00   0.00% [soaiod2]
   14 root          1 -16    -     0K    16K -       1   0:00   0.00% [soaiod4]
    9 root          1 -16    -     0K    16K -       1   0:00   0.00% [soaiod3]
    7 root          1 -16    -     0K    16K -       1   0:00   0.00% [soaiod1]
27505 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
21451 myuser        1  21    0 29844K 11224K pause   1   0:00   0.00% -csh (csh)
27417 dovecot       1  20    0 33864K  8324K kqread  1   0:00   0.00% auth: [0 wait, 0 passdb, 0 userdb] (auth)
    1 root          1  25    0  5388K   992K wait    1   0:00   0.00% [init]
27534 root          1  52    0  8392K  2096K piperd  1   0:00   0.00% mail -E -s mail.domain.my daily run output root
25706 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
26131 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
26631 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
   13 root          3  -8    -     0K    48K -       1   0:00   0.00% [geom]
25707 myuser        1  20    0 17084K  4748K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
21447 root          1  20    0 83092K  7956K select  1   0:00   0.00% sshd: myuser [priv] (sshd)
26132 myuser        1  20    0 17084K  4748K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
   17 root          1 155 ki31     0K    16K pgzero  0   0:00   0.00% [pagezero]
27280 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
25846 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
27281 myuser        1  20    0 17084K  4748K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
26632 myuser        1  20    0 17084K  4748K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
25847 myuser        1  20    0 17084K  4960K kqread  1   0:00   0.00% imap: [myuser 192.168.1.10 IDLE] (imap)
26125 dovenull      1  20    0 20080K  5784K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
26202 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
25708 dovenull      1  20    0 20080K  5788K kqread  1   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
27278 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
26127 dovenull      1  20    0 20080K  5784K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
26633 dovenull      1  20    0 20080K  5788K kqread  0   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
27282 dovenull      1  20    0 20080K  5788K kqread  1   0:00   0.00% imap-login: [1 connections (1 TLS)] (imap-login)
25850 postfix       1  20    0 44428K  8964K kqread  1   0:00   0.00% pickup -l -t unix -u
27279 myuser        1  20    0 17084K  5152K kqread  0   0:00   0.00% imap: [myuser 192.168.1.11 IDLE] (imap)
  706 root          1  20    0 55676K  6960K select  0   0:00   0.00% /usr/sbin/sshd
26126 myuser        1  20    0 17084K  4748K kqread  1   0:00   0.00% imap: [myuser 192.168.1.12] (imap)
27533 root          1  52    0 13144K  2692K wait    1   0:00   0.00% /bin/sh /usr/sbin/periodic LOCKED daily
27630 postfix       1  20    0 44428K  8964K kqread  0   0:00   0.00% showq -t unix -u
26128 myuser        1  20    0 17072K  4720K kqread  1   0:00   0.00% imap: [myuser 192.168.1.12] (imap)
21462 root          1  22    0 46364K  3916K select  0   0:00   0.00% sudo top -SPa
  322 root          1  20    0 10672K  2496K sbwait  1   0:00   0.00% pflogd: [priv] (pflogd)
26203 myuser        1  20    0 17072K  4644K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
27691 root          1  52    0 13144K  2688K piperd  1   0:00   0.00% /bin/sh - /etc/periodic/security/110.neggrpperm
25709 myuser        1  20    0 17072K  4644K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
  759 root          1  52    0 10460K  2088K ttyin   1   0:00   0.00% /usr/libexec/getty Pc ttyv2
26634 myuser        1  20    0 17072K  4644K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
27511 root          1  21    0 13144K  2684K wait    0   0:00   0.00% /bin/sh - /usr/sbin/periodic daily
27283 myuser        1  20    0 17072K  4644K kqread  1   0:00   0.00% imap: [myuser 192.168.1.11] (imap)
  764 root          1  52    0 10460K  2088K ttyin   1   0:00   0.00% /usr/libexec/getty Pc ttyv7
  762 root          1  52    0 10460K  2088K ttyin   1   0:00   0.00% /usr/libexec/getty Pc ttyv5
27647 root          1  52    0 13144K  2684K wait    1   0:00   0.00% /bin/sh /usr/sbin/periodic LOCKED security
  765 root          1  52    0 10460K  2088K ttyin   0   0:00   0.00% /usr/libexec/getty std.9600 ttyu0
  757 root          1  52    0 10460K  2088K ttyin   0   0:00   0.00% /usr/libexec/getty Pc ttyv0
27518 root          1  52    0 13144K  2684K wait    0   0:00   0.00% /bin/sh /usr/sbin/periodic LOCKED daily
  760 root          1  52    0 10460K  2088K ttyin   1   0:00   0.00% /usr/libexec/getty Pc ttyv3
27418 root          1  20    0 11512K  3064K kqread  0   0:00   0.00% dovecot/ssl-params
  758 root          1  52    0 10460K  2088K ttyin   0   0:00   0.00% /usr/libexec/getty Pc ttyv1
27635 root          1  52    0 13144K  2688K wait    0   0:00   0.00% /bin/sh /etc/periodic/daily/450.status-security
  763 root          1  52    0 10460K  2088K ttyin   0   0:00   0.00% /usr/libexec/getty Pc ttyv6
  761 root          1  52    0 10460K  2088K ttyin   0   0:00   0.00% /usr/libexec/getty Pc ttyv4
27654 root          1  47    0 13144K  2692K wait    0   0:00   0.00% /bin/sh /usr/sbin/periodic LOCKED security
27698 root          1  52    0 12592K  2148K piperd  0   0:00   0.00% wc -l
27640 root          1  52    0 13144K  2684K wait    1   0:00   0.00% /bin/sh - /usr/sbin/periodic security
On bhyve host (CPU usage is still changing 95%-105%):
Code:
 5308 root         12  20    0  9248M  4628M kqread 21 245:53  99.11% bhyve: mail.domain.my (bhyve)
So the problem is with find command. It freezes bhyve guests, I think.
Is here somebody who can answer on the question, why?

Anyway, I've sent to the guest signal 15 (kill -15 ...) from host and on the host console I saw rebooting process.
Code:
Stopping devd.
Waiting for PIDS: 303.
Writing entropy file:.
Writing early boot entropy file:.
.
Terminated
Jan 21 03:01:39 mail syslogd: exiting on signal 15
pflog0: promiscuous mode disabled
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `bufdaemon' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 done
All buffers synced.
Uptime: 6d2h58m27s
acpi0: Powering system off
Filesystem consistency was ok.
Code:
# fsck -Cfy
** /dev/vtbd0p2
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
332396 files, 3388563 used, 6764919 free (93463 frags, 833932 blocks, 0.9% fragmentation)

***** FILE SYSTEM IS CLEAN *****

***** FILE SYSTEM WAS MODIFIED *****
 
  • Thanks
Reactions: Oko
OP
IPTRACE

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

Unfortunately, I have still the problem. It especially hits NS and SMTP/IMAP servers.
I'm investigating two commands because of whom systems hang.

Code:
grep -R "find -sx" /etc/periodic
/etc/periodic/security/110.neggrpperm:  n=$(find -sx $MP /dev/null -type f \
/etc/periodic/security/100.chksetuid:   find -sx $MP /dev/null -type f \

Code:
find -sx / /dev/null -type f \( \( ! -perm +010 -and -perm +001 \) -or \( ! -perm +020 -and -perm +002 \) -or \( ! -perm +040 -and -perm +004 \) \) -exec ls -liTd \{\} \+ | tee /dev/stderr | wc -l
find -sx / /dev/null -type f \( -perm -u+x -or -perm -g+x -or -perm -o+x \) \( -perm -u+s -or -perm -g+s \) -exec ls -liTd \{\} \+
 
  • Thanks
Reactions: Oko

forquare

Well-Known Member

Reaction score: 180
Messages: 329

I've had similar problems on my 11.0 box running CentOS 7 in bhyve, though I'm using ZFS/ZVOL as storage (already tuned down memory useage for ZFS)

I've been running some video encrypting tools that do a lot of read/writing the the FS.
I've noticed that if I run the encryption tools either on local files, or have a destination of a local FS, then it is likely that my box will crash. However, doing it all over Samba (file source and destination both on a NAS) seems fine.

Unfortunately I've not been able to spend much time of the crash at work due to other commitments, and I don't know if the about is at all useful to anyone more knowledgeable. The annoying thing is that neither the host or guest show being anywhere max RAM usage, and nothing seems to get written to logs.
 
OP
IPTRACE

IPTRACE

Well-Known Member

Reaction score: 24
Messages: 321

I was thinking about the problem and I'm confused about disk labelling.
Some of my VM has the same GPTID usidng glabel status .

Is that the reason of my problem?
 
Top