How to fix an "unresponsive" system

Hello, I have an "unresponsive" system where zfs commands don't respond, and:
Code:
anders@backupbsd:~ % ps -A | wc -l
  8940

Obviously there is something wrong, but I am not sure where to start. I could restart the machine, but I thought this could be a good learning experience. Does anyone have a good place to start?

The machine is a backup machine for another machine and runs Samba, zfs receive, and some zfs snapshot and destroy scripts. Thanks for any help :)

Some other status commands:

Code:
anders@backupbsd:~ % ps -A | grep "zfs snapshot" | wc -l
     830
anders@backupbsd:~ % ps -A | grep "samba" | wc -l
       1
anders@backupbsd:~ % ps -A | grep "smbd" | wc -l
    5751
 
If you can run ps, that's still a responsive system! Assuming you don't just want to reboot and wait for it to happen again...

I'd look to see if you:
  1. have run out of disk
  2. have run out of swap
  3. have run out of inodes
  4. are pegged on i/o
  5. are pegged on cpu
  6. have processes waiting on disk (status D)
  7. have processes waiting on lock (status L)
  8. are increasing/decreasing the number of processes
  9. are completing/blocked on jobs
I'd probably just kill the samba backups and see if the zfs jobs will finish on their own.
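
For reference, these are roughly the commands I'd use for those checks on FreeBSD (a rough sketch, adjust to your setup):

Code:
df -hi                                        # disk space and inodes per filesystem
swapinfo                                      # swap usage
gstat                                         # per-disk I/O load (%busy column)
top -S                                        # CPU load, including kernel threads
ps -axo pid,stat,wchan,command | grep ' D'    # processes waiting on disk
ps -axo pid,stat,command | grep ' L'          # processes waiting on a lock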
 
Thanks

I ran some commands to find out:

Code:
anders@backupbsd:~ % top > top
anders@backupbsd:~ % cat top 
last pid: 29813;  load averages:  0.11,  0.11,  0.08  up 34+06:10:22    20:40:56
8935 processes:1 running, 8933 sleeping, 1 zombie

Mem: 11M Active, 5610M Inact, 25G Wired, 12M Cache, 272M Free
ARC: 17G Total, 2864M MFU, 8781M MRU, 3260M Anon, 181M Header, 2696M Other
Swap: 14G Total, 14M Used, 14G Free


  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  651 root          1  20    0 14500K  1728K select  0   3:20   0.00% powerd
  590 root          1  20    0 26124K 18044K select  3   1:30   0.00% ntpd
  528 root          1  20    0 14520K  1768K select  7   0:51   0.00% syslogd
  684 root          1  20    0 61312K  3940K select  3   0:43   0.00% sshd
  701 root          1  20    0 24152K  4208K select  2   0:27   0.00% sendmail
36835 root          1  52    0 20592K  2768K zfs     1   0:18   0.00% find
36445 root          1  52    0   309M 19508K zfs     6   0:14   0.00% smbd
  708 root          1  20    0 16624K  1156K nanslp  4   0:10   0.00% cron
37131 root          1  52    0   316M 20436K scl->s  2   0:06   0.00% smbd
36424 root          1  52    0   309M 19244K zfs     6   0:04   0.00% smbd
39585 root          1  37    0   310M 19876K scl->s  3   0:04   0.00% smbd
36270 root          1  27    0   309M 19484K scl->s  7   0:04   0.00% smbd
46076 root          1  20    0   311M   688K lockf   5   0:03   0.00% smbd
39355 root          1  20    0   309M   688K lockf   7   0:03   0.00% smbd
36415 root          1  52    0   309M 19420K zfs     4   0:03   0.00% smbd
46425 root          1  20    0   311M   688K lockf   3   0:03   0.00% smbd
43928 root          1  20    0   310M   688K lockf   1   0:03   0.00% smbd
40495 root          1  20    0   310M   688K lockf   5   0:03   0.00% smbd

So it looks like CPU is fine and swap is fine; it is using almost all the RAM, but there is still 272M free, so I guess that's fine too.

Code:
anders@backupbsd:~ % df -hi
Filesystem                              Size    Used   Avail Capacity iused ifree %iused  Mounted on
startroot/ROOT/default                   71G    1.8G     70G     2%     73k  146M    0%   /
devfs                                   1.0K    1.0K      0B   100%       0     0  100%   /dev
startroot/tmp                            70G    1.0M     70G     0%      28  146M    0%   /tmp
startroot/usr/home                       70G    312K     70G     0%      23  146M    0%   /usr/home
startroot/usr/ports                      70G    192K     70G     0%       7  146M    0%   /usr/ports
startroot/usr/src                        70G    786M     70G     1%     71k  146M    0%   /usr/src
startroot/var/audit                      70G    192K     70G     0%       9  146M    0%   /var/audit
startroot/var/crash                      70G    192K     70G     0%       8  146M    0%   /var/crash
startroot/var/log                        70G    2.5M     70G     0%      70  146M    0%   /var/log
startroot/var/mail                       70G    1.0M     70G     0%      14  146M    0%   /var/mail
startroot/var/tmp                        70G    192K     70G     0%       8  146M    0%   /var/tmp
backuppool/winbackup                    268G     22G    246G     8%    244k  516M    0%   /winbackup
backuppool/winbackup/Dagminator         246G     96K    246G     0%       9  516M    0%   /winbackup/Dagminator
backuppool/winbackup/Dagminator/C       246G     96K    246G     0%       7  516M    0%   /winbackup/Dagminator/C
backuppool/winbackup/Dagminator/D       913G    667G    246G    73%    336k  516M    0%   /winbackup/Dagminator/D
backuppool/winbackup/Master-35          250G    3.7G    246G     1%     63k  516M    0%   /winbackup/Master-35
backuppool/winbackup/Master35           292G     47G    246G    16%    352k  516M    0%   /winbackup/Master35
backuppool/winbackup/Norglass_srv       258G     12G    246G     5%     96k  516M    0%   /winbackup/Norglass_srv
backuppool/winbackup/PRODUKSJON-PC      272G     26G    246G    10%    227k  516M    0%   /winbackup/PRODUKSJON-PC
backuppool/winbackup/VENSTREBUTIKK-P    305G     59G    246G    19%    370k  516M    0%   /winbackup/VENSTREBUTIKK-P
backuppool/winbackup/hoyrebuttik-PC     319G     73G    246G    23%    322k  516M    0%   /winbackup/hoyrebuttik-PC
backuppool/winbackup/macpc              247G    557M    246G     0%     26k  516M    0%   /winbackup/macpc
backuppool/winbackup/master35-pc        246G     96K    246G     0%       7  516M    0%   /winbackup/master35-pc
startroot                                70G    192K     70G     0%       7  146M    0%   /zroot

There is enough disk space. I am running ZFS, and I think I have heard that the df and du commands don't report things quite right with ZFS.
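I think zfs list -o space is supposed to show the real space accounting per dataset (including snapshot usage), so that is probably a better check than df here:

Code:
zfs list -o space backuppool
zfs list -o space startroot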

Code:
sudo ps -A | grep " -  D " | wc -l
    6003
anders@backupbsd:~ % sudo ps -A | grep " -  D " | head -n 20
  178  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_18:00
  193  -  D         0:00.01 zfs list -o name -H -t snapshot -S creation -r backuppool/winbackup
  197  -  D         0:00.01 zfs snapshot -r backuppool/winbackup@zfssnap2017-02-18_18:01
  307  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_19:00
  386  -  D         0:00.02 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  387  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  400  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  403  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  434  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_20:00
  557  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_21:00
  680  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_22:00
  698  -  D         0:00.02 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  699  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  702  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  703  -  D         0:00.03 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
  816  -  D         0:00.01 zfs snapshot -r startroot@zfssnap2017-02-18_23:00
  939  -  D         0:00.00 zfs snapshot -r startroot@zfssnap2017-02-19_00:00
  959  -  D         0:00.00 zfs list -o name -H -t snapshot -S creation -r backuppool/winbackup
  963  -  D         0:00.00 zfs snapshot -r backuppool/winbackup@zfssnap2017-02-19_00:01
 1065  -  D         0:00.04 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf

So I guess there are some processes waiting on disk? But why?
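I guess one way to find out would be to look at where one of them is blocked in the kernel; I believe procstat -kk dumps the kernel stack of a process, e.g. for the first zfs snapshot PID in the list above:

Code:
sudo procstat -kk 178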

Code:
anders@backupbsd:~ % gstat > gstat && cat gstat | head -n 20
dT: 1.102s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0  da0
    0      0      0      0    0.0      0      0    0.0    0.0  ada0
    0      0      0      0    0.0      0      0    0.0    0.0  ada0p1
    0      0      0      0    0.0      0      0    0.0    0.0  ada0p2
    0      0      0      0    0.0      0      0    0.0    0.0  ada0p3
    0      0      0      0    0.0      0      0    0.0    0.0  ada0p4
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap
    0      0      0      0    0.0      0      0    0.0    0.0  gpt/3TbBoot
    0      0      0      0    0.0      0      0    0.0    0.0  gpt/3TbPlass
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_hourly-2016-08-26-08h00
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_minut-2016-08-27-15h00
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_daily-2016-08-25-00h07
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_minut-2016-08-27-19h15
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_minut-2016-08-27-15h45
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_hourly-2016-08-25-22h00
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_minut-2016-08-27-21h15
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@zfs-auto-snap_weekly-2016-08-21-00h14
    0      0      0      0    0.0      0      0    0.0    0.0  zvol/backuppool/bsdserver/oceanpool/swap@forvidar

There doesn't seem to be much going on with the disks, so why are the processes waiting?

Code:
anders@backupbsd:~ % ps -A | wc -l
    8956

And more processes keep getting started.

Thanks for all the help so far, I am learning a lot :);)
 
I found it:
Code:
anders@backupbsd:~ % ps aux | awk '{ print $8 " " $2 }' | grep -w Z | cut -d " " -f 2
99952
anders@backupbsd:~ % ps -A | grep 99952
99952  -  Z         0:00.01 <defunct>
30145  6  S+        0:00.00 grep 99952

But I can't kill it? And I don't get any answer when trying to find the parent either.

Code:
anders@backupbsd:~ % pgrep -P 99952
anders@backupbsd:~ % ps -A | grep 99952
99952  -  Z         0:00.01 <defunct>
 
So it looks like CPU is fine and swap is fine; it is using almost all the RAM, but there is still 272M free, so I guess that's fine too.

That looks like reasonable memory usage for this system. The wired count is big because your ZFS ARC is big, but that's fine.

One problem is that your script(s) starting the zfs snapshot(s) aren't checking if the previous script is still running. You should write a pid file into /var/run and check for it before starting.
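
A minimal sketch of what I mean, assuming the snapshot script is plain sh (the names and paths here are just examples):

Code:
#!/bin/sh
# refuse to start if a previous run is still alive
pidfile="/var/run/zfssnap.pid"            # example path, pick your own
if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    echo "previous zfssnap run still active, exiting" >&2
    exit 1
fi
echo $$ > "$pidfile"
zfs snapshot -r "backuppool/winbackup@zfssnap$(date +%Y-%m-%d_%H:%M)"
rm -f "$pidfile"

FreeBSD's lockf(1) wrapper would achieve the same thing with less code, if you prefer that.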

The other thing is that your Samba backups look to be stuck in a lockf state. This could be their fault, it could be the zfs snapshots' fault, it could be the zombie process's fault. Take a look at the lsof command (I think you'll need to add it from packages/ports); it can help you figure out which specific file all of these processes are waiting on. Here are two Stack Exchange threads dealing with a similar situation:

http://serverfault.com/questions/189612/apache-processes-all-stuck-in-lockf-state-via-top
http://serverfault.com/questions/429882/apache-httpd-processes-in-lockf-state
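
For example, to see which files one of the stuck smbd processes has open (using PID 46076 from your top output purely as an example):

Code:
sudo lsof -p 46076

The base-system procstat -f 46076 should show the open descriptors too, if you'd rather not install lsof.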
 
Hmm, I have been looking at lsof, and yes, it looks like Samba is most of the problem. But what do I do now? Is there a way I can kill some of the processes and get the machine running like it should again, or is the only option to change some Samba config and reboot the machine?

I added some lsof output in case that clears some of it up :)

Code:
anders@backupbsd:~ % cat lsof | grep W | head -n 50
smbd        386  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        386  nobody    4wW   VREG     115,1740636303                 21  85354 / -- msg.lock/386
smbd        387  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        387  nobody    4wW   VREG     115,1740636303                 21  85356 / -- msg.lock/387
smbd        400  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        400  nobody    4wW   VREG     115,1740636303                 21  85359 / -- msg.lock/400
smbd        403  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        403  nobody    4wW   VREG     115,1740636303                 20  85361 / -- msg.lock/403
smbd        698  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        698  nobody    4wW   VREG     115,1740636303                 21  85461 / -- msg.lock/698
smbd        699  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        699  nobody    4wW   VREG     115,1740636303                 21  85463 / -- msg.lock/699
smbd        702  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        702  nobody    4wW   VREG     115,1740636303                 20  85467 / -- msg.lock/702
smbd        703  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd        703  nobody    4wW   VREG     115,1740636303                 21  85469 / -- msg.lock/703
smbd       1065  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1065  nobody    4wW   VREG     115,1740636303                 20  85582 / -- msg.lock/1065
smbd       1086  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1086  nobody    4wW   VREG     115,1740636303                 21  85585 / -- msg.lock/1086
smbd       1089  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1089  nobody    4wW   VREG     115,1740636303                 20  85587 / -- msg.lock/1089
smbd       1092  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1092  nobody    4wW   VREG     115,1740636303                 20  85589 / -- msg.lock/1092
smbd       1373  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1373  nobody    4wW   VREG     115,1740636303                 21  85687 / -- msg.lock/1373
smbd       1374  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1374  nobody    4wW   VREG     115,1740636303                 21  85689 / -- msg.lock/1374
smbd       1375  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1375  nobody    4wW   VREG     115,1740636303                 21  85691 / -- msg.lock/1375
smbd       1378  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1378  nobody    4wW   VREG     115,1740636303                 20  85693 / -- msg.lock/1378
smbd       1380  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1380  nobody    4wW   VREG     115,1740636303                 20  85698 / -- msg.lock/1380
smbd       1381  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1381  nobody    4wW   VREG     115,1740636303                 21  85700 / -- msg.lock/1381
smbd       1382  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1382  nobody    4wW   VREG     115,1740636303                 20  85702 / -- msg.lock/1382
smbd       1383  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1383  nobody    4wW   VREG     115,1740636303                 20  85704 / -- msg.lock/1383
smbd       1386  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1386  nobody    4wW   VREG     115,1740636303                 21  85706 / -- msg.lock/1386
smbd       1387  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1387  nobody    4wW   VREG     115,1740636303                 21  85708 / -- msg.lock/1387
smbd       1400  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1400  nobody    4wW   VREG     115,1740636303                 20  85711 / -- msg.lock/1400
smbd       1402  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1402  nobody    4wW   VREG     115,1740636303                 21  85715 / -- msg.lock/1402
smbd       1403  nobody  txt     VREG     115,1740636303              54470  43181 / -- libLIBWBCLIENT-OLD-samba4.so
smbd       1403  nobody    4wW   VREG     115,1740636303                 21  85717 / -- msg.lock/1403
 
Did ZFS report any errors (zpool status)? Oftentimes a dying hard drive with very high latencies can bring a system to a crawl without any resource shortages. I have seen this behaviour several times, most recently with a dying SSD that was used for L2ARC...

DTrace is perfect for tracking down obscure performance issues. /usr/share/dtrace has a small collection of useful DTrace scripts, e.g. one for disk latency.

DTrace has to be enabled on FreeBSD; I highly recommend having it enabled by default on any production system. I've only been using DTrace for a few months and have merely scratched the surface of what it is capable of, but I already use it almost every time I analyze performance issues or 'weird behaviour' on a FreeBSD or SmartOS machine.
The handbook and wiki give a rough introduction as well as links to additional reading material:
https://www.freebsd.org/doc/handbook/dtrace.html
https://wiki.freebsd.org/DTrace
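
As a rough example, a one-liner using the io provider that counts which processes are issuing disk I/O (you may need to load the DTrace modules first):

Code:
kldload dtraceall                                      # as root, if the modules aren't loaded yet
dtrace -n 'io:::start { @io[execname] = count(); }'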

Regarding the *many* smbd processes: I'd try probing for context switches caused by far too many concurrent threads. If the Samba file server is under heavy load from many users, the resulting concurrent disk access might be a problem for your underlying storage.
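
A crude first look at the context-switch rate is plain vmstat (watch the cs column); a DTrace sched-provider one-liner would then show which processes are being switched off the CPU most often:

Code:
vmstat 1
dtrace -n 'sched:::off-cpu { @switches[execname] = count(); }'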
 