High CPU utilization according to the load average values while no process in top output shows high CPU load?

I have four scripts running on my single-core server which in total execute 30 snmpset commands and 4 snmpget commands per second. Once I start those scripts, the server becomes slow, and after five minutes the load average is almost 5:

Code:
[s@ ~]$ for i in {1..10}; do top -b -S -C 10; sleep 1; done
last pid: 27453;  load averages:  4.99,  3.72,  2.35  up 158+23:45:09  16:08:50
219 processes: 7 running, 195 sleeping, 17 waiting

Mem: 366M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 62M Free
Swap: 1996M Total, 218M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  4.69% idle
15836 root  1  44  0  323M  298M select 128:14  1.17% Xorg
27443 s  1  73  0  6680K  3408K RUN  0:00  0.78% snmpset
27448 s  1  72  0  6680K  3200K RUN  0:00  0.78% snmpget
27445 s  1  72  0  5656K  2936K RUN  0:00  0.49% snmpget
27451 s  1  69  0  5656K  2536K RUN  0:00  0.29% snmpget
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage
  3 root  1  -8  -  0K  8K -  28:17  0.00% g_up

last pid: 27459;  load averages:  4.91,  3.72,  2.36  up 158+23:45:10  16:08:51
215 processes: 7 running, 191 sleeping, 17 waiting

Mem: 368M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 60M Free
Swap: 1996M Total, 218M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  4.05% idle
27451 s  1  96  0  8728K  5208K RUN  0:00  1.66% snmpget
15836 root  1  44  0  323M  298M select 128:14  0.98% Xorg
27456 s  1  96  0  6680K  3444K RUN  0:00  0.78% snmpset
27458 s  1  96  0  6680K  3192K RUN  0:00  0.59% snmpset
27457 s  1  96  0  5656K  2788K RUN  0:00  0.49% snmpset
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage
  3 root  1  -8  -  0K  8K -  28:17  0.00% g_up

last pid: 27469;  load averages:  4.91,  3.72,  2.36  up 158+23:45:11  16:08:52
215 processes: 7 running, 191 sleeping, 17 waiting

Mem: 374M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 54M Free
Swap: 1996M Total, 218M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  3.76% idle
27467 s  1  96  0  8728K  5020K RUN  0:00  1.86% snmpget
27461 s  1  96  0  8728K  5040K RUN  0:00  1.86% snmpset
27465 s  1  76  0  8728K  5188K RUN  0:00  1.76% snmpset
27456 s  1  76  0  8728K  5188K RUN  0:00  1.56% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.88% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage
  3 root  1  -8  -  0K  8K -  28:17  0.00% g_up

last pid: 27479;  load averages:  4.91,  3.72,  2.36  up 158+23:45:12  16:08:53
215 processes: 8 running, 190 sleeping, 17 waiting

Mem: 376M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 52M Free
Swap: 1996M Total, 218M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  3.37% idle
27477 s  1  76  0  8728K  5188K RUN  0:00  1.86% snmpset
27474 s  1  76  0  8728K  5188K RUN  0:00  1.66% snmpset
27462 s  1  96  0  8728K  5216K RUN  0:00  1.66% snmpset
27467 s  1  96  0  8728K  5208K RUN  0:00  1.56% snmpget
27478 s  1  96  0  7704K  4196K RUN  0:00  1.37% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.68% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

last pid: 27486;  load averages:  4.91,  3.72,  2.36  up 158+23:45:13  16:08:54
213 processes: 7 running, 189 sleeping, 17 waiting

Mem: 376M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 53M Free
Swap: 1996M Total, 218M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  2.98% idle
27482 s  1  76  0  8728K  5188K RUN  0:00  1.86% snmpset
27481 s  1  76  0  8728K  5188K RUN  0:00  1.86% snmpset
27485 s  1  96  0  8728K  5180K RUN  0:00  1.66% snmpset
27483 s  1  76  0  8728K  5188K RUN  0:00  1.66% snmpset
27484 s  1  96  0  6680K  3492K RUN  0:00  0.98% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.78% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

last pid: 27496;  load averages:  4.91,  3.72,  2.36  up 158+23:45:14  16:08:55
213 processes: 7 running, 189 sleeping, 17 waiting

Mem: 372M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 57M Free
Swap: 1996M Total, 217M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  2.59% idle
27493 s  1  76  0  8728K  5188K RUN  0:00  1.76% snmpset
27489 s  1  76  0  8728K  5188K RUN  0:00  1.66% snmpset
27494 s  1  96  0  6680K  3428K RUN  0:00  0.98% snmpset
27490 s  1  96  0  6680K  3540K RUN  0:00  0.88% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.78% Xorg
27495 s  1  96  0  5656K  2800K RUN  0:00  0.39% snmpset
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

last pid: 27502;  load averages:  4.91,  3.74,  2.37  up 158+23:45:15  16:08:56
213 processes: 7 running, 189 sleeping, 17 waiting

Mem: 375M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 54M Free
Swap: 1996M Total, 217M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  2.29% idle
27495 s  1  96  0  8728K  5044K RUN  0:00  1.66% snmpset
27499 s  1  76  0  8728K  5188K RUN  0:00  1.66% snmpset
27500 s  1  76  0  8728K  5188K RUN  0:00  1.66% snmpset
27498 s  1  96  0  7704K  4040K RUN  0:00  1.07% snmpset
27501 s  1  96  0  6680K  3444K RUN  0:00  0.98% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.68% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

last pid: 27509;  load averages:  4.91,  3.74,  2.37  up 158+23:45:16  16:08:57
213 processes: 7 running, 189 sleeping, 17 waiting

Mem: 374M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 55M Free
Swap: 1996M Total, 217M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  1.95% idle
27506 s  1  96  0  8728K  5008K RUN  0:00  1.76% snmpset
27501 s  1  76  0  8728K  5188K RUN  0:00  1.56% snmpset
27504 s  1  96  0  8728K  5028K RUN  0:00  1.56% snmpset
27507 s  1  96  0  6680K  3500K RUN  0:00  0.98% snmpset
27508 s  1  96  0  6680K  3428K RUN  0:00  0.78% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.49% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

last pid: 27518;  load averages:  4.91,  3.74,  2.37  up 158+23:45:17  16:08:58
213 processes: 6 running, 190 sleeping, 17 waiting

Mem: 367M Active, 213M Inact, 320M Wired, 21M Cache, 110M Buf, 62M Free
Swap: 1996M Total, 217M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4  1.76% idle
27514 s  1  96  0  6680K  3624K RUN  0:00  1.07% snmpset
27516 s  1  96  0  6680K  3568K RUN  0:00  1.07% snmpset
27515 s  1  96  0  6680K  3540K RUN  0:00  0.88% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.59% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage
  3 root  1  -8  -  0K  8K -  28:17  0.00% g_up
33044 s  6  44  0  226M  198M ucond  27:57  0.00% firefox-bin

last pid: 27526;  load averages:  4.91,  3.74,  2.37  up 158+23:45:18  16:08:59
215 processes: 7 running, 191 sleeping, 17 waiting

Mem: 373M Active, 213M Inact, 321M Wired, 21M Cache, 110M Buf, 56M Free
Swap: 1996M Total, 217M Used, 1778M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
27521 s  1  76  0  8728K  5180K RUN  0:00  1.76% snmpget
27517 s  1  96  0  8728K  5040K RUN  0:00  1.56% snmpset
  11 root  1 171 ki31  0K  8K RUN  2873.4  1.46% idle
27525 s  1  96  0  6680K  3540K RUN  0:00  1.07% snmpset
27524 s  1  96  0  6680K  3472K RUN  0:00  0.88% snmpset
27523 s  1  96  0  6680K  3444K RUN  0:00  0.78% snmpset
15836 root  1  44  0  323M  298M select 128:14  0.59% Xorg
  12 root  17 -60  -  0K  136K WAIT  468:54  0.00% intr
  17 root  1  44  -  0K  8K syncer 168:30  0.00% syncer
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage

[s@ ~]$

As seen above, the top command is in "raw CPU" mode, which means it shows the utilization of each process over the sample interval (1 second). However, the overall utilization seen in the CPU column is fairly low.

If I compare the top output above with the top output from a time when the scripts are not running, the idle process gets a lot more CPU time:

Code:
last pid: 23983;  load averages:  0.74,  0.88,  1.01  up 158+23:37:40  16:01:21
207 processes: 2 running, 188 sleeping, 17 waiting

Mem: 359M Active, 212M Inact, 320M Wired, 21M Cache, 110M Buf, 71M Free
Swap: 1996M Total, 219M Used, 1777M Free, 10% Inuse


  PID USERNAME  THR PRI NICE  SIZE  RES STATE  TIME  CPU COMMAND
  11 root  1 171 ki31  0K  8K RUN  2873.4 50.98% idle
  12 root  17 -60  -  0K  136K WAIT  468:53  0.20% intr
  17 root  1  44  -  0K  8K syncer 168:29  0.10% syncer
15836 root  1  44  0  323M  298M select 127:59  0.00% Xorg
1234 root  1  44  0  3804K  1028K select  53:01  0.00% hald-addon-storage
  3 root  1  -8  -  0K  8K -  28:17  0.00% g_up
33044 s  7  44  0  227M  199M ucond  27:43  0.00% firefox-bin
1204 haldaemon  1  44  0  7236K  1868K select  25:23  0.00% hald
  4 root  1  -8  -  0K  8K -  24:52  0.00% g_down
  13 root  1 -16  -  0K  8K -  21:00  0.00% yarrow

What might cause the high CPU utilization indicated by the load average values while there are no processes in the top output that load the CPU? Packet interrupts?
 
CPU usage isn't the only factor when calculating the load. You can have very low CPU usage and still get a high load. If I recall correctly the total number of processes also impacts the load.
 

As far as I know, those processes need to be either in the running state or in the ready/waiting state. Obviously, in my case there are very few processes in the running state, as CPU utilization is very low. Or are processes in a blocked state (for example, waiting for an I/O operation to complete) also counted in the load average values in FreeBSD?
 
Or are processes in a blocked state (for example, waiting for an I/O operation to complete) also counted in the load average values in FreeBSD?
To be honest, I'm not sure how the load is calculated. I do know it's a combination of factors, and I would imagine processes blocked by I/O would be one of them. I rarely look at the load figure, as it's not a good indicator just by itself. I mainly look at CPU, memory usage, and how busy the I/O is.
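For what it's worth, a rough sketch of the sort of checks I mean (the commands and intervals here are only illustrative):
Code:
top -S            # CPU states and memory summary in the header
iostat -x 1       # extended per-device I/O statistics every second
gstat             # live GEOM-level view of how busy the disks are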
 

I would also assume that processes in a blocked state are taken into account when calculating the load average. On the other hand, is it possible that blocked processes make the system less responsive? At the time the four scripts described in my initial post are running, it is not just the load average that ramps up: the machine becomes very slow, and the CPU counter for user processes in the header of the top output is around 90%. As I already described, the CPU utilization in the CPU column, on the other hand, is very low (less than 5% in total across all processes).

In addition, once the scripts are running (30 snmpset commands and four snmpget commands per second), I checked the interrupts caused by the increased network traffic with vmstat -i, and there is no correlation between the interrupt rate and the heavy load on the server.
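A minimal sketch of that check (the sample count is arbitrary): vmstat -i shows totals and rates per interrupt source, while a plain vmstat with a one-second interval shows the run queue (r), blocked (b), and waiting (w) process counts together with the interrupt rate (in) for each second.
Code:
# per-interrupt-source totals and rates
vmstat -i
# ten one-second samples: watch the r/b/w columns and the "in" column under faults
vmstat 1 10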
 
Looking at your top output, all the SNMP get and set processes are in the RUN state. Don't worry about the CPU percentages shown per process over short intervals: it takes a while for those numbers to go up even if the process is 100% compute bound, because the values are computed as averages over time (try it). Here's a link to how the load is calculated (hint: it only counts processes that are running or ready to run, not processes that are stopped, waiting for I/O, etc.).
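If you want to see that relationship directly, here is a rough sketch (not exact, since ps and the kernel's load sampler do not run at the same instants) comparing an instantaneous count of runnable processes with the kernel's decayed load averages:
Code:
# Rough illustration: compare an instantaneous count of runnable processes
# (ps state beginning with 'R') with the kernel's load averages (vm.loadavg).
while true; do
    printf '%s  runnable: %s  loadavg: %s\n' \
        "$(date '+%H:%M:%S')" \
        "$(ps axo state | grep -c '^R')" \
        "$(sysctl -n vm.loadavg)"
    sleep 5
done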
 
Don't worry about the CPU percentages shown per process over short intervals: it takes a while for those numbers to go up even if the process is 100% compute bound, because the values are computed as averages over time (try it).

I see. I thought that if I start top with top -s 1 -C, the CPU usage measurement interval would be one second. I tested this with while true; do burnP6 & sleep 1; pkill burnP6; sleep 1; done, and the CPU usage of the burnP6 process in the top output never exceeded 60-70%. As you said, it takes a while for those numbers to go up: two to three seconds to show the actual 99-100% CPU usage of the burnP6 process.

One additional observation: once I start the four scripts, multiple new processes are created every second (I can see this in the "last pid" field of top -s 1 -C). So probably each of those snmpset commands (30 per second) utilizes the CPU at nearly 100% for only a fraction of a second, which is why its usage in the CPU column is so low, while cumulatively they cause the ~90% CPU usage? In addition, there is probably extra work for the kernel process scheduler to create and destroy all those small processes, plus some additional context switching between them.
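If that theory is right, it should be reproducible without SNMP at all. A rough sketch, assuming that ~30 short-lived CPU bursts per second is representative of what the scripts do:
Code:
# Rough reproduction sketch: spawn ~30 short-lived processes per second, each burning
# only a tiny amount of CPU before exiting, and watch the load average climb.
while true; do
    for i in $(jot 30); do
        sh -c 'n=0; while [ $n -lt 20000 ]; do n=$((n+1)); done' &
    done
    sleep 1
    uptime      # prints the three load averages
done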
 
I'm not 100% sure how FreeBSD calculates load averages, but traditionally the load average has measured the number of processes waiting to run. Usually, processes being made to wait for the CPU indicates a high percentage of CPU usage by one or more processes. Basically, if a few "runaway" processes are eating all your CPU, then other processes will be made to wait and the load average goes up.

However, it may also be possible that many processes are using a tiny amount of CPU. In that case each process could display a low CPU %, but with many processes running it would cause a long line-up for the CPU and load average would go up.

It may help to think of load average as a line-up of people. The amount of time you have to wait in line is the load average. The wait in line can be long if either
A) There are just a few people in line, but one person is taking a long time.
B) There are many people in the line all waiting for the same thing.

The top output above shows over 200 processes in total, with 7 running and 17 waiting. That suggests to me that those 24 processes in the running/waiting state are all competing. Perhaps each needs only a little at a time, but they all want to run at once, and that is spiking your load average.
 

In my case it appears that there are many small processes which each require so little CPU time that utilities measuring CPU load, like top or ps, do not even register it, i.e. the process exits before its CPU usage in those utilities has a chance to go up (which takes at least 1-2 seconds). Is there a utility that can measure CPU usage at micro- or millisecond intervals?
 

Yes, at least for some situations. The time command launches a process and records the amount of CPU time the process used. If you have a bunch of processes being started by a script, consider launching them with the time command and redirecting the output to a log file. Something like
Code:
touch mylog.txt
# time(1) writes its report to stderr, so capture that in the log as well
{ time myscript; } >> mylog.txt 2>&1
If you look at the total (real) time each process took compared with the amount of CPU time it actually spent, that should give you an indication of how much of the CPU each process was using.
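For the snmp* processes specifically, one option (a sketch only; the snmpset host, community, and OID below are placeholders, not your actual values) is to wrap each invocation in /usr/bin/time inside the scripts, so the per-process CPU time gets logged even though each process lives for only a fraction of a second. On FreeBSD, /usr/bin/time -l also dumps the rusage counters (maximum RSS, voluntary/involuntary context switches, etc.).
Code:
# Sketch: wrap one of the short-lived snmpset calls; host, community and OID are placeholders.
# time(1) writes its report to stderr, hence the 2>> redirection.
/usr/bin/time -l snmpset -v2c -c private 192.0.2.1 .1.3.6.1.2.1.1.5.0 s newname \
    2>> /tmp/snmp-cpu.log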
 