FreeBSD 8.2 gets slower over time.

olav · Mar 1, 2011

After upgrading to FreeBSD 8.2 I'm seeing strange behaviour: after a few days of uptime my system starts throttling and eventually stops responding.

While it's throttling I can use the system, though it takes like 10 seconds before what I type will show on screen. I can't find any process which is under heavy load, nor are there any memory leaks. I can't find anything in the logs either.

The really strange thing, though, is that when I connect a screen directly to the server I see the flying chuck screensaver, which runs smoothly. The system responds fast right up until the moment I try to log in. After I've typed in the username and hit enter, the system stops responding.

Has anyone had a similar problem before? What can I do?

Zare · Mar 1, 2011

It's a hardware issue, probably. Keep in mind that you get disk I/O while trying to login. Have you checked SMART attributes of your HDD?

olav · Mar 1, 2011

Yeah, they seem fine. I've even mirrored them.

User23 · Mar 1, 2011

This sounds a little bit like a deadlock. The strange thing is that the system slows down gradually. Maybe you should build a kernel with debugging options to figure it out.

oliverh · Mar 1, 2011

This sounds like ... almost no data. Dmesg, logs (if possible), config ... etc. pp. Otherwise it's just wild guessing.

zeissoctopus · Mar 1, 2011

How did you upgrade your base system? Using buildworld or freebsd-update? Are you sure any system scripts and configuration files in /etc are up-to-date?

nekoexmachina · Mar 1, 2011

Hello, olav!
I've had a similar problem on my desktop with a Radeon (x1950; it's r500, if I remember correctly) in KDE4 (both with and without compositing), but not in KDE3 or non-DE WMs.

phoenix · Mar 1, 2011

If you don't have a monitor connected to the system, disable the console screen savers. All they are doing is wasting CPU/RAM/video resources. No point, if you can't see them. Just use blank_saver.ko if you really need one, or simply configure the BIOS to turn off the video output after 15 minutes or whatever.

To help diagnose this, you should connect a monitor to the system, disable all screen savers and power saving, then log in on separate virtual consoles and leave running:

nothing, this is to catch console messages
top(1)
gstat(8)
net-mgmt/iftop
tail(1) -f of logs like /var/log/messages
misc/gnu-watch running every 10-15 seconds outputting vmstat -i
anything else that may be helpful

That way, when things slow down, you can just flip through the virtual consoles (ALT+F1 through ALT+F7) to get a snapshot of how the system is running, without having to log in.
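
If leaving a monitor attached isn't practical, the same kind of snapshot can be appended to a log file instead. A minimal sketch, assuming a POSIX shell; the helper name `log_snapshot` and the log path are made up for illustration and are not part of FreeBSD. Keep in mind that if the disks themselves are saturated, writing the log may stall too.

```shell
# log_snapshot LOGFILE COMMAND [ARGS...]
# Append a timestamped header and the command's output to LOGFILE.
# (Hypothetical helper, not part of FreeBSD.)
log_snapshot() {
    logfile=$1
    shift
    {
        printf '=== %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
        "$@" 2>&1
        echo
    } >> "$logfile"
}

# Example loop, commented out so that sourcing this file has no side effects.
# The log path is an assumption; pick one on a disk that is not affected.
# while :; do
#     log_snapshot /var/log/slowdown.log vmstat -i
#     log_snapshot /var/log/slowdown.log swapinfo
#     sleep 15
# done
```

After a slowdown, the last few entries in the log show what the interrupt counters and swap usage looked like just before things went bad.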

olav · Mar 2, 2011

I used freebsd-update. I don't think there is any special configuration in /etc causing this.
It is a pure server, with no X server.

Hey, I like the flying chuck screen saver. Every time I see him, I feel proud as a FreeBSD user.

My server is mostly idling and that screen saver doesn't steal that many CPU cycles.

I've configured different virtual consoles as you suggested and will come back with more info when it happens again.

jb_fvwm2 · Mar 2, 2011

Maybe not relevant, but if that server motherboard has onboard graphics, there's a *slight* chance the situation will improve if you put in an aftermarket video card.

olav · Mar 2, 2011

Okay, it happened again just now. gstat showed me that the two mirrored OS disks were at 100% load. I rebooted and now it's fine again. What could be causing this? gmirror status said that the mirror was okay.

aragon · Mar 3, 2011

Flakey disks and/or controller?

I guess doing some SMART self tests with sysutils/smartmontools is a start.

olav · Mar 3, 2011

I don't believe so, as I use two different controllers and the SMART tests don't report anything.

Pushrod · Mar 3, 2011

It's not fsck running, or another disk thrasher, is it?

olav · Mar 4, 2011

I have no idea, how can I check that? Wouldn't fsck show in the log?

_martin · Mar 4, 2011

As @phoenix mentioned - what did gstat report when you hit 100% disk utilization (which FS was busy)? What did the top output say during that time? Did you verify the time when this started (maybe cron or periodic related)?

You can use:
$ ps ax | grep '[f]sck'
to verify if fsck is running (the brackets keep grep from matching its own command line).

olav · Mar 5, 2011

It's the swap partition which is causing this problem. Should I try to disable it?

_martin · Mar 5, 2011

I would not do that if I were you. Rather check what is actually using your swap. Sort the top output by size:

# top -o size
and check what is eating so much memory.

You can use # ps auxwww | awk '$8 ~ /.W.*/ { print $0}' to check swapped processes (I once found this command on the FreeBSD mailing lists).
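
For anyone wondering what that filter does: column 8 of `ps auxwww` is STAT, and in FreeBSD's ps(1) a `W` among the flags means the process is swapped out. A small sketch of the same idea that prints only PID, STAT and command name; the function name `swapped_procs` is made up for illustration, and it assumes the column layout shown in this thread.

```shell
# Read `ps auxwww` output on stdin and print PID, STAT and command name
# for every process whose STAT field carries the W (swapped out) flag.
# The header line is skipped automatically because "STAT" has no W
# after its first character.
swapped_procs() {
    awk '$8 ~ /.W/ { print $2, $8, $11 }'
}

# Usage:
#   ps auxwww | swapped_procs
```

A process being swapped out is not a problem by itself; idle daemons get swapped out routinely. It only matters when many active processes show up here.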

Pushrod · Mar 5, 2011

Is the swap partition being used heavily? If so, you have something (or many things) using more memory than you have in the machine. You will definitely notice a slowdown if so.

What does this machine do all day?

olav · Mar 6, 2011

The thing is, top shows no activity. There are no visible processes causing the swap partition to overload. The server is mostly idle; it runs a few jails, dns, ldap, ssh. Only the dns and ssh jails are exposed to the internet. It also acts as a fileserver with ZFS. The server has 6GB RAM, and I've configured /boot/loader.conf with the vm.kmem_size="9G" property.

I get this output when I check swapped processes

Code:

[olav@zpool ~]$ ps auxwww | awk '$8 ~ /.W.*/ { print $0}'
root    124  0.0  0.0  2804     0  ??  IWs  -         0:00.00 adjkerntz -i
root   1044  0.0  0.0 16652     0  ??  IW   -         0:00.00 /usr/local/sbin/smartd -p /var/run/smartd.pid -c /usr/local/etc/smartd.conf
root   1591  0.0  0.0 38228     0  ??  IWs  -         0:00.00 sshd: olav [priv] (sshd)
smmsp  2602  0.0  0.0 12192     0  ??  IWs  -         0:00.00 sendmail: Queue runner@00:30:00 for /var/spool/clientmqueue (sendmail)
root   2609  0.0  0.0  8012     0  ??  SWs  -         0:00.00 /usr/sbin/cron -s

top -o size shows this:

Code:

last pid: 18456;  load averages:  0.05,  0.01,  0.00  up 0+14:25:52  11:11:44
36 processes:  1 running, 35 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 2228K Active, 56K Inact, 1164M Wired, 8640K Cache, 623M Buf, 144M Free
Swap: 4096M Total, 15M Used, 4081M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1600 olav          1  44    0 38228K   548K select  1   0:00  0.00% sshd
 1591 root          1  44    0 38228K     0K sbwait  0   0:00  0.00% <sshd>
17370 root          1  44    0 34332K   704K select  0   0:01  0.00% smbd
 1068 root          1  44    0 34112K   160K select  0   0:00  0.00% smbd
 1072 root          1  44    0 34112K   116K select  0   0:00  0.00% smbd
 2756 root          1  44    0 26336K   132K select  1   0:00  0.00% winbindd
 1114 root          1  44    0 26308K   120K select  1   0:00  0.00% winbindd
18441 root          1  59    0 26260K  1004K select  0   0:00  0.00% sshd
 1073 root          1  44    0 26208K   176K select  0   0:00  0.00% winbindd
 2757 root          1  44    0 26196K   120K select  0   0:00  0.00% winbindd
 1062 root          1  44    0 24108K   608K select  0   0:02  0.00% nmbd
 1044 root          1  44    0 16652K     0K nanslp  0   0:00  0.00% <smartd>
 1601 olav          1  47    0 13356K     0K wait    0   0:00  0.00% <bash>
 2596 root          1  44    0 12192K   540K select  0   0:01  0.00% sendmail
 2602 smmsp         1  44    0 12192K     0K pause   0   0:00  0.00% <sendmail>
18454 olav          1  44    0  9408K   968K CPU0    0   0:00  0.00% top
  888 root          1  44    0  8012K   112K select  1   0:00  0.00% rpcbind
 2609 root          1  53    0  8012K     0K nanslp  0   0:00  0.00% <cron>
  866 root          1  44    0  7084K   156K select  0   0:00  0.00% syslogd
 1195 root          1  76    0  7020K    56K select  1   0:00  0.00% rsync
 1003 root          1  44    0  6952K    72K select  0   0:00  0.00% mountd
 2681 root          1  76    0  6952K    72K ttyin   0   0:00  0.00% getty

This is the information available when the system starts throttling.
I should also mention that I've now noticed that the /usr partition also shows some activity when the system overuses the swap partition.

After a reboot, top shows something interesting:

Code:

last pid:  3277;  load averages:  0.05,  0.01,  0.00   up 0+00:39:05  12:21:15
80 processes:  1 running, 79 sleeping
CPU:  0.0% user,  0.0% nice,  0.4% system,  0.8% interrupt, 98.9% idle
Mem: 71M Active, 40M Inact, 1558M Wired, 428K Cache, 30M Buf, 4187M Free
Swap: 4096M Total, 4096M Free

aragon · Mar 6, 2011

Well, something is truly strange. You have 6 GB of RAM, but your first top output doesn't indicate more than about 2 GB...
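
To put numbers on that, the categories on top's Mem: line can be added up. A rough sketch, assuming the line format shown in this thread; the function name `mem_total_mb` is made up, and skipping Buf is an assumption on my part (as far as I know it overlaps the other categories rather than adding to them).

```shell
# Read top(1) output on stdin, sum the categories on the "Mem:" line
# and print the total in MB. Buf is skipped (assumed to overlap the rest).
mem_total_mb() {
    awk '
    /^Mem:/ {
        total = 0
        # Fields come in value/label pairs, e.g. "1164M Wired,".
        for (i = 2; i < NF; i += 2) {
            value = $i
            label = $(i + 1)
            sub(/,$/, "", label)
            if (label == "Buf") continue
            unit = substr(value, length(value))
            num = substr(value, 1, length(value) - 1) + 0
            if (unit == "K") num /= 1024
            else if (unit == "G") num *= 1024
            total += num
        }
        printf "%.0f\n", total
    }'
}

# While throttling:
echo 'Mem: 2228K Active, 56K Inact, 1164M Wired, 8640K Cache, 623M Buf, 144M Free' | mem_total_mb   # prints 1319
# After reboot:
echo 'Mem: 71M Active, 40M Inact, 1558M Wired, 428K Cache, 30M Buf, 4187M Free' | mem_total_mb      # prints 5856
```

Even if the 623M of Buf is counted on top, the first snapshot stays under 2 GB, while the post-reboot one accounts for nearly all of the 6 GB.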

_martin · Mar 6, 2011

Indeed, it seems you've "lost" some memory between reboots. I bet you have a bloody lot of swapping due to ZFS and very low memory.
Check if your system detects memory correctly each time:

#  grep -i "real memory" /var/log/dmesg.*

You can also use sysutils/dmidecode from ports to check how the system sees the memory banks and modules.

e.g. you can use:
# dmidecode --type=16,17
to list memory banks (Physical Memory Array) and their modules (Memory Device).

You should reseat the memory modules and run a memtest86+ check to verify you have no (further) HW problem.
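
To compare the detected memory across boots at a glance, the MB figure can be pulled out of the "real memory" lines. A small sketch, assuming the usual `real memory  = <bytes> (<n> MB)` format of the boot messages; the function name `real_memory_mb` is made up, and the sample input below is invented for illustration, not taken from the OP's machine.

```shell
# Read dmesg text on stdin and print the MB figure from each
# "real memory" line, one per line. Other lines are ignored.
real_memory_mb() {
    sed -n 's/^real memory *= *[0-9]* *(\([0-9]*\) MB).*/\1/p'
}

# Made-up sample input:
printf 'real memory  = 6442450944 (6144 MB)\navail memory = 6189744128 (5903 MB)\n' | real_memory_mb   # prints 6144

# Against the saved boot logs (path as in the grep command above):
#   grep -hi "real memory" /var/log/dmesg.* | real_memory_mb | sort -u
```

If the last command prints more than one value, the machine did not detect the same amount of RAM on every boot.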

Galactic_Dominator · Mar 7, 2011

aragon said:
Well, something is truly strange. You have 6 GB of RAM, but your first top output doesn't indicate more than about 2 GB...

Yes, finally data that was asked for so long ago.

Usually this type of symptom can be resolved by a BIOS update.

aragon · Mar 7, 2011

Considering the OP is using ZFS and Samba on 8.2, could the problem be (1b) on this?

Galactic_Dominator · Mar 7, 2011

Well, if they haven't disabled sendfile it's a guarantee. And that patch doesn't resolve all ZFS sendfile issues; it should still be disabled. There was a recent thread on stable@ for anyone interested. However, that would have nothing to do with the limited amount of RAM made available to the system, which is a separate problem, pretty common on re-purposed Dells but not limited to them.

That's why, when dmesg is requested and not given, it greatly extends the time to resolution.