Debugging high server load

Hi, I need to take care of six FreeBSD servers at my new job. The guy who set things up doesn't work there any more, so I have to figure things out myself, since I'm the only one with some basic Unix knowledge (I've been using Linux for a few years).

They run either the 7.1 or the 6.2 release.

I have a problem with one of them: it runs too slowly, and I have to wait a few seconds to launch mc or to log in via PuTTY. The machine is used for routing and Samba shares.

So, here is some information which I think is relevant.

top
Code:
last pid: 67454;  load averages:  1.07,  1.49,  2.62                                                                                  up 4+16:58:41  10:24:28
286 processes: 3 running, 282 sleeping, 1 stopped
CPU:  0.9% user,  0.0% nice, 42.7% system,  0.0% interrupt, 56.4% idle
Mem: 680M Active, 103M Inact, 159M Wired, 45M Cache, 111M Buf, 4240K Free
Swap: 1473M Total, 1431M Used, 42M Free, 97% Inuse, 48K In, 2048K Out

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 3248 nobody      1  46    0 11856K  1508K RUN    1  31.1H 16.06% smbd
63604 smmsp       1  -4    0  5876K  1072K ufs    0   1:36  0.39% sendmail
30861 smmsp       1  -4    0 18164K  1212K ufs    1  30:18  0.20% sendmail
39815 smmsp       1  -4    0 13044K  1176K CPU0   0  29:25  0.20% sendmail
41324 smmsp       1  -4    0 21236K  1204K ufs    1  13:47  0.20% sendmail
60571 smmsp       1  -4    0  5876K  1116K ufs    1   2:53  0.20% sendmail
67417 smmsp       1  -4    0  5876K  2520K ufs    1   0:00  0.20% sendmail
34375 smmsp       1  -8    0 17140K  1196K biord  0  31:48  0.10% sendmail
54309 smmsp       1  -4    0  6900K  1052K ufs    1   4:02  0.10% sendmail
 2794 smmsp       1  -4    0 52980K  1748K ufs    1  81:20  0.00% sendmail
 3524 smmsp       1  -4    0 52980K  1792K ufs    1  74:36  0.00% sendmail
 6447 smmsp       1  -4    0 38644K  1180K ufs    0  73:32  0.00% sendmail
11320 smmsp       1  -4    0 25332K  1192K ufs    1  57:29  0.00% sendmail
 9855 smmsp       1  -4    0 27380K  1176K ufs    1  57:17  0.00% sendmail
13863 smmsp       1  -4    0 23284K 10420K ufs    1  51:24  0.00% sendmail
 6095 smmsp       1  -4    0 52980K  1720K ufs    1  48:02  0.00% sendmail
15328 smmsp       1  -4    0 25332K  1196K ufs    1  46:35  0.00% sendmail
 9122 smmsp       1  -4    0 34548K  1336K ufs    1  46:20  0.00% sendmail
12410 smmsp       1  -4    0 31476K  1188K ufs    1  44:27  0.00% sendmail
10589 smmsp       1  -4    0 34548K  1372K ufs    1  43:27  0.00% sendmail
14229 smmsp       1  -4    0 31476K  1188K ufs    1  42:43  0.00% sendmail
 9474 smmsp       1  -4    0 34548K  1284K ufs    1  41:59  0.00% sendmail
13136 smmsp       1  -4    0 32500K  1172K ufs    1  40:07  0.00% sendmail
12775 smmsp       1  -4    0 34548K  1292K ufs    1  39:03  0.00% sendmail
10954 smmsp       1  -4    0 34548K  1260K ufs    1  38:52  0.00% sendmail
11682 smmsp       1  -4    0 34548K  1292K ufs    1  38:30  0.00% sendmail
17169 smmsp       1  -4    0 33524K  1200K ufs    1  38:21  0.00% sendmail
 7563 smmsp       1  -4    0 52980K  1860K ufs    1  37:59  0.00% sendmail
21172 smmsp       1  -4    0 25332K  1172K ufs    1  37:07  0.00% sendmail
15695 smmsp       1  -4    0 33524K  1204K ufs    1  36:30  0.00% sendmail
16798 smmsp       1  -4    0 31476K  1176K ufs    1  36:13  0.00% sendmail
18990 smmsp       1  -4    0 31476K  1180K ufs    1  35:04  0.00% sendmail
35539 smmsp       1  -4    0 13044K  1216K ufs    1  34:24  0.00% sendmail
16430 smmsp       1  -4    0 33524K  1196K ufs    1  34:15  0.00% sendmail
17533 smmsp       1  -4    0 32500K  1204K ufs    1  33:32  0.00% sendmail
31634 smmsp       1  -4    0 18164K  1196K ufs    1  32:47  0.00% sendmail
29330 smmsp       1  -4    0 18164K  1212K ufs    1  32:01  0.00% sendmail
33188 smmsp       1  -4    0 19188K  1196K ufs    1  30:57  0.00% sendmail
16062 smmsp       1  -4    0 34548K  1280K ufs    1  30:03  0.00% sendmail
23736 smmsp       1  -4    0 25332K  1172K ufs    1  29:47  0.00% sendmail
18262 smmsp       1  -4    0 34548K  1264K ufs    1  29:45  0.00% sendmail
20445 smmsp       1  -4    0 33524K  1208K ufs    1  29:36  0.00% sendmail
24470 smmsp       1  -4    0 23284K 10400K ufs    1  29:28  0.00% sendmail
18628 smmsp       1  -4    0 34548K  1308K ufs    1  29:26  0.00% sendmail
26400 smmsp       1  -4    0 20212K  1172K ufs    0  29:01  0.00% sendmail
32021 smmsp       1  -4    0 20212K  1192K ufs    1  28:58  0.00% sendmail
26762 smmsp       1  -4    0 20212K  1168K ufs    1  28:48  0.00% sendmail
32794 smmsp       1  -4    0 21236K  1172K ufs    1  27:45  0.00% sendmail
33974 smmsp       1  -4    0 20212K  1176K ufs    1  27:09  0.00% sendmail
30093 smmsp       1  -4    0 21236K  1220K ufs    1  26:36  0.00% sendmail
20810 smmsp       1  -4    0 33524K  1208K ufs    0  26:35  0.00% sendmail

As you can see, almost all of the swap is in use, and there are a lot of sendmail processes.
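A rough way to see just how many there are, counting both the processes and the queued messages (the find over the queue directory can take a while with this many files):
Code:
# count the running sendmail processes (the [s] keeps grep from counting itself)
ps axww | grep -c '[s]endmail'
# count the entries in the client submission queue -- slow, but it works
find /var/spool/clientmqueue | wc -l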

uptime
Code:
10:25AM  up 4 days, 17 hrs, 2 users, load averages: 9.70, 4.26, 3.58

Is this load high for this machine? This particular server is used by only a few people.

Mail: why does root have so many messages? What should I do with them?

Code:
modem_bp# mail
Mail version 8.1 6/6/93.  Type ? for help.
"/var/mail/root": 198532 messages 198532 new
>N  1 root@modem_bp         Tue Jun  1 15:37  26/1058  "Cron <root@modem_bp> /root/mount_ilc"
 N  2 root@modem_bp         Tue Jun  1 15:38  25/971   "Cron <root@modem_bp> /root/mount_ilc"
 N  3 root@modem_bp         Tue Jun  1 15:39  25/971   "Cron <root@modem_bp> /root/mount_ilc"
 N  4 root@modem_bp         Tue Jun  1 15:40  25/971   "Cron <root@modem_bp> /root/mount_ilc"
 N  5 root@modem_bp         Tue Jun  1 15:41  25/971   "Cron <root@modem_bp> /root/mount_ilc"

Each mail comes from cron:
Code:
From root@modem_ns Fri Jun 18 08:26:01 2010
Date: Fri, 18 Jun 2010 08:26:01 GMT
From: root@modem_ns (Cron Daemon)
To: root@modem_ns
Subject: Cron <root@modem_ns> /root/mount_ilc
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin>
X-Cron-Env: <HOME=/var/log>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>

/root/mount_ilc: not found

rc.conf
Code:
modem_bp# cat /etc/rc.conf
ifconfig_em0="inet 192.168.2.244  netmask 255.255.255.0"
ifconfig_bge0="inet 192.168.3.244  netmask 255.255.255.0"

keymap="pl_PL.ISO8859-2"

defaultrouter="192.168.3.242"

hostname="modem_bp"
gateway_enable="YES"
sshd_enable="YES"
inetd_enable="YES"
usbd_enable="NO"

sendmail_enable="NO"
linux_enable="YES"
moused_type="NO"
moused_enable="NO"
webmin_enable="YES"

nmbd_enable="YES"
smbd_enable="YES"

cupsd_enable="YES"
dhcpd_enable="YES"

tomcat41_enable="YES"

I need your help to figure out what's wrong and how to fix it. I can provide any additional information if needed.
 
I suggest verifying your webmin and/or tomcat applications. It's possible someone is abusing a web service on your machine to send out lots of mail (probably spam).
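If that's the case, a quick look at what is listening and at one of the queued messages usually narrows it down (a sketch using base-system tools; the queue file picked below is just whatever find returns first):
Code:
# what is listening for connections on this box?
sockstat -4l
# look at the headers of one queued message to see what is generating the mail
qf=$(find /var/spool/clientmqueue -name 'qf*' | head -n 1)
head -n 30 "$qf"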
 
You are low on memory, so the OS keeps swapping pages out and back in, which causes heavy I/O on the hard disk.
Processes then sit waiting for that I/O, and this drives the load average up even though the CPU is 56% idle.
So add RAM, and probably a faster hard disk.
As an optimization you may also want to tune sendmail, or replace it with a lighter MTA.
In any case, that many mails for root may indicate that your sendmail configuration is incorrect. Maybe some botnet is trying to exploit you. Restrict things, or your IP will end up on SpamCop. =)
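To confirm that it really is swap I/O, the base-system tools are enough (a quick sketch):
Code:
# current swap usage
swapinfo
# watch paging: sustained non-zero "pi"/"po" (page-ins/outs) means active swapping
vmstat 5
# per-disk busy percentage (Ctrl-C to quit)
gstat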
 
You may also have a cron job running every minute (and opening sendmail to mail the output to root) that takes much longer than a minute to complete.
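The crontabs are the place to look for that kind of entry (a quick sketch):
Code:
# root's personal crontab
crontab -l -u root
# the system-wide crontab (entries here have an extra "user" field)
grep -v '^#' /etc/crontab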
 
OK, some things are cleared up now. This machine had an incorrect entry in root's crontab; the referenced file does not exist. I removed all the mails for root that refer to that failing crontab entry. By the way, no Tomcat service seems to be running. Tomorrow I'll try to figure out what all those sendmail processes are up to.
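For anyone facing the same thing: deleting ~200,000 messages from inside mail(1) takes forever, so if nothing in the mailbox is worth keeping, simply truncating it is much quicker (one way to do it):
Code:
# throw away everything in root's mailbox in one go
cp /dev/null /var/mail/root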

Thanks for any help or clarifications, and sorry for my imperfect English.
 
I agree with SirDice: "It's possible someone is abusing a web service on your machine to send out lots of mail (probably spam)." Yes, it's spam.
 
I stopped the sendmail daemon and killed the remaining "sendmail -Ac" processes. Almost all of the swap was freed, but some processes still use too much CPU time:

Code:
last pid: 81216;  load averages:  1.70,  1.59,  1.49                                                                                  up 5+20:52:44  14:18:31
110 processes: 7 running, 102 sleeping, 1 stopped
CPU:     % user,     % nice,     % system,     % interrupt,     % idle
Mem: 108M Active, 581M Inact, 207M Wired, 80K Cache, 111M Buf, 96M Free
Swap: 1473M Total, 23M Used, 1450M Free, 1% Inuse

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
78592 root        1  -4    0 39992K  4368K RUN    1  59:10 25.88% find
80176 root        1  70    0 28728K 26792K RUN    0  21:18 24.46% find
80128 root        1  -4    0 28728K 26780K RUN    1  24:39 21.09% find
80154 root        1  53    0 28728K 26792K CPU0   1  25:37 11.57% find
 3248 nobody      1  97    0 11916K  2340K RUN    1  37.8H  8.69% smbd
75227 root        1  -4    0 55352K  1192K RUN    1  23:08  2.69% find
81213 nobody      1  45    0 11920K  3068K select 0   0:00  0.20% smbd
  804 root        1  44    0  7968K   940K select 1   0:49  0.00% nmbd
  706 root        1  44    0  3184K   408K select 1   0:18  0.00% syslogd
  809 root        1  44    0 11656K  1004K select 1   0:04  0.00% smbd
  772 root        1   4    0  5788K  1016K kqread 1   0:04  0.00% cupsd
 1224 dhcpd       1  44    0  3128K   584K select 1   0:03  0.00% dhcpd
80410 root        1  44    0  8428K  2924K select 1   0:00  0.00% sshd
  828 root        1  20    0 11680K   936K pause  1   0:00  0.00% smbd
80459 root        1  44    0  3536K  1676K select 1   0:00  0.00% screen
72020 root        1   8    0  3212K   316K nanslp 1   0:00  0.00% cron
78625 root        1  44    0 11904K  1040K select 0   0:00  0.00% smbd
80461 root        1  20    0  5484K  2696K pause  1   0:00  0.00% csh
80414 root        1  20    0  5484K  2180K pause  0   0:00  0.00% csh
80750 root        1  20    0  3536K  1452K pause  1   0:00  0.00% screen
 1268 root        1  44    0  5752K   432K select 1   0:00  0.00% sshd
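(For reference, stopping sendmail went roughly like this; the exact behaviour of the rc script depends on the sendmail_* knobs in rc.conf, see rc.sendmail(8):)
Code:
# check what is actually running before killing anything
pgrep -fl sendmail
# stop whatever the rc script started (onestop works even with sendmail_enable="NO")
/etc/rc.d/sendmail onestop
# kill any client-queue runners that survive
pkill -f 'sendmail -Ac'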

What these "finds" do and why they may be running?

Code:
modem_bp# ps ax | grep "find"
75227  ??  R     23:25.82 find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +
78592  ??  R     60:40.17 find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +
80128  ??  D     25:07.65 find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +
80154  ??  D     27:14.81 find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +
80176  ??  R     22:06.81 find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +

Another problem is /var/spool/clientmqueue.
As far as I know I can simply remove the files there, but there must be so many of them that I can't even list them; it takes too much time. Is there a more efficient way to remove them?
Should the command rm -v * start displaying verbose output immediately?

I don't know what file system this machine is using. I will post it if you tell me how to check.
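(A quick way to check, for the record: the file system type appears in the output of mount.)
Code:
# the type is shown in parentheses after each mount point
mount
# disk usage per file system
df -h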
 
SirDice said:
Those finds are probably part of periodic(8).

I see some files in /etc/periodic/daily (and in weekly and monthly), like these:

Code:
modem_bp# ls -l
total 54
-rwxr-xr-x  1 root  wheel  1280 Jan  1  2009 100.clean-disks
-rwxr-xr-x  1 root  wheel  1571 Jan  1  2009 110.clean-tmps
-rwxr-xr-x  1 root  wheel  1100 Jan  1  2009 120.clean-preserve
-rwxr-xr-x  1 root  wheel   703 Jan  1  2009 130.clean-msgs
-rwxr-xr-x  1 root  wheel  1064 Jan  1  2009 140.clean-rwho
-rwxr-xr-x  1 root  wheel   593 Jan  1  2009 150.clean-hoststat
-rwxr-xr-x  1 root  wheel  1752 Jan  1  2009 200.backup-passwd
-rwxr-xr-x  1 root  wheel  1004 Jan  1  2009 210.backup-aliases
-rwxr-xr-x  1 root  wheel   687 Jan  1  2009 300.calendar
-rwxr-xr-x  1 root  wheel  1219 Jan  1  2009 310.accounting
-rwxr-xr-x  1 root  wheel   718 Jan  1  2009 330.news
-rwxr-xr-x  1 root  wheel   524 Jan  1  2009 400.status-disks
-rwxr-xr-x  1 root  wheel   665 Jan  1  2009 404.status-zfs
-rwxr-xr-x  1 root  wheel   720 Jan  1  2009 405.status-ata-raid
-rwxr-xr-x  1 root  wheel   588 Jan  1  2009 406.status-gmirror
-rwxr-xr-x  1 root  wheel   583 Jan  1  2009 407.status-graid3
-rwxr-xr-x  1 root  wheel   582 Jan  1  2009 408.status-gstripe
-rwxr-xr-x  1 root  wheel   582 Jan  1  2009 409.status-gconcat
-rwxr-xr-x  1 root  wheel   556 Jan  1  2009 420.status-network
-rwxr-xr-x  1 root  wheel   695 Jan  1  2009 430.status-rwho
-rwxr-xr-x  1 root  wheel  1436 Jan  1  2009 440.status-mailq
-rwxr-xr-x  1 root  wheel   776 Jan  1  2009 450.status-security
-rwxr-xr-x  1 root  wheel  1688 Jan  1  2009 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel  1377 Jan  1  2009 470.status-named
-rwxr-xr-x  1 root  wheel   491 Jan  1  2009 480.status-ntpd
-rwxr-xr-x  1 root  wheel   728 Jan  1  2009 500.queuerun
-rwxr-xr-x  1 root  wheel   720 Jan  1  2009 999.local

which seem to be shell scripts. Could the "find" processes mentioned before still be running because of the huge number of files in /var/spool/clientmqueue?
 
bagheera said:
May "finds" mentioned before be running still because a lot of files in /var/spool/clientmqueue?
That's quite possible. The more files you have, the longer it'll take to complete.
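Those particular find commands match the setuid check in the daily security run, and if /var lives on the root file system each pass has to crawl every one of those queue files, so a run can easily take longer than a day and several runs end up overlapping. Once the queue is cleaned out they should finish normally again. If the scan is still too heavy for this box, it can be switched off in /etc/periodic.conf (a sketch; check periodic.conf(5) for the exact knob names on your release):
Code:
# /etc/periodic.conf -- skip the daily security sweep (the big setuid "find")
daily_status_security_enable="NO"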
 
Which would be faster (and why)?

# cd /var/spool/clientmqueue && rm -v *
or
# rm -rf /var/spool/clientmqueue

I ran the first command under screen and detached it. I hope it will be done by tomorrow.
 
The command
# rm -v *

from my previous post did not work because the argument list was too long, so instead I launched this in screen:
# rm -rf /var/spool/clientmqueue

I hope that will work.
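If rm -rf turns out to be too slow, letting find do the deleting avoids the argument-list problem entirely. Note also that rm -rf removes the directory itself, which sendmail expects to exist (owned smmsp:smmsp, mode 0770), so it would have to be recreated afterwards (a sketch):
Code:
# delete the queue entries without ever building a huge argument list
find /var/spool/clientmqueue -type f -delete
# if the directory itself was removed, recreate it for sendmail
mkdir -p /var/spool/clientmqueue
chown smmsp:smmsp /var/spool/clientmqueue
chmod 770 /var/spool/clientmqueue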

The next problem is connected with Perl and Webmin. I receive mails in the root account with the following content:
Code:
Message 20373:
From root@modem Tue Oct 19 11:01:05 2010
Date: Tue, 19 Oct 2010 11:01:05 +0200 (CEST)
From: root@modem (Cron Daemon)
To: root@modem
Subject: Cron <root@modem> /usr/local/etc/webmin/bandwidth/rotate.pl
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>

syslog-ng: not found
Out of memory during request for 4084 bytes, total sbrk() is 535412736 bytes!

The current server status is:
Code:
last pid: 82018;  load averages:  0.40,  0.34,  0.29                                                                                 up 91+19:31:15  12:59:55
121 processes: 1 running, 120 sleeping
CPU:  0.2% user,  0.0% nice,  3.0% system,  0.8% interrupt, 96.1% idle
Mem: 122M Active, 315M Inact, 177M Wired, 35M Cache, 111M Buf, 343M Free
Swap: 1473M Total, 33M Used, 1440M Free, 2% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
43616 nobody        1  73    0 15376K  8112K select 1 611:27 16.55% smbd
 9105 root          1   4    0  7872K  2948K kqread 0 265:23  0.00% cupsd
  638 root          1  44    0  3184K   884K select 1  66:25  0.00% syslogd
40008 nobody        1  44    0 41852K 30336K select 0  28:46  0.00% squid
  429 _pflogd       1 -58    0  3340K   808K bpf    0   2:48  0.00% pflogd
 9192 root          1  44    0 15628K  4648K select 0   2:34  0.00% httpd
 9141 root          1  44    0 10088K  4748K select 1   2:33  0.00% perl5.8.8
21844 root          1  44    0  8048K  2724K select 1   2:04  0.00% nmbd
 9019 root          1  44    0  3312K   532K select 0   1:33  0.00% proftpd
 9274 root          1  44    0  5876K  1032K select 0   1:30  0.00% sendmail
21851 root          1  44    0 12684K  6180K select 1   0:27  0.00% smbd
 9204 root          1  44    0  5752K   420K select 0   0:26  0.00% sshd
 9312 root          1   8    0  3212K   376K nanslp 1   0:12  0.00% cron

What would cause it to run out of memory? Some limits? Should I change them?
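The ~510 MB sbrk() figure in the error suggests the process is hitting a per-process data-size limit rather than the machine itself running out of memory (login.conf resource limits, plus the kern.maxdsiz tunable, which if this is an i386 box likely defaults to 512 MB). Checking the current values could look like this (a sketch):
Code:
# resource limits for the current login class ("datasize" is the one to look at)
limits
# hard cap on any process's data segment; if this sysctl is missing on an older
# release, the value is set with kern.maxdsiz in /boot/loader.conf
sysctl kern.maxdsiz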
 
Solved

I figured out why listing the files in /var/spool/clientmqueue takes so long. By default the

# ls

command sorts its output; using

# ls -f

works much faster :)

I listed all the files into a text file and used a shell script to delete each one of them, which took several hours anyway, but it's done.
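For reference, the script was just something along these lines (nothing clever):
Code:
# list without sorting -- the sort is what makes a plain ls crawl here
ls -f /var/spool/clientmqueue > /root/mqueue_files.txt

# remove each entry, skipping the . and .. that ls -f also prints
while read f; do
    case "$f" in .|..) continue ;; esac
    rm -f "/var/spool/clientmqueue/$f"
done < /root/mqueue_files.txt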

About the Webmin bandwidth module: this machine handles quite a lot of traffic, and running rotate.pl manually consumed more and more memory until it ran out.

I'll just turn that module off.

At this point I can consider the load problems on this machine solved.

Thanks for all the tips and help.
 