FreeBSD 10.1 extremely slow I/O, cured by reboot

My company hosts a number of websites on various FreeBSD servers, and I have personally been running FreeBSD for over 15 years.

Recently an issue has cropped up that I have not seen before; initially I believed it to be hardware related but am now seeing the same issue on FreeBSD instances inside Google Compute Engine.

In a nutshell, the OS works perfectly, as expected - but every day or so, every I/O operation takes much longer to complete than usual (for example, a simple "ls" will take 3-4 seconds!). The system does not recover from this unless it is rebooted.

We do have some cron jobs that run around the time this happens: basic PHP scripts that resize a number of images (roughly 200 per night on average).

We've seen heavy PHP scripts bring a server to its knees before, but killing them or waiting for them to finish should allow the server to recover - not so in this case.

During the most recent slow I/O event, I killed every process on the system - including Apache, MySQL, Sendmail - leaving just the essential core system processes. However, I/O still remained impossibly slow.

Here is the output of top and gstat during this event - as you can see, the system thinks it's idle:

top:
Code:
last pid: 66883;  load averages:  0.12,  0.08,  0.08                                                                                                                             up 0+15:34:48  06:22:30
11 processes:  1 running, 10 sleeping
CPU:  0.0% user,  0.0% nice,  0.1% system,  0.0% interrupt, 99.9% idle
Mem: 28M Active, 13G Inact, 1305M Wired, 11M Cache, 1542M Buf, 763M Free
Swap: 1024M Total, 1024M Free
  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  454 root          1  20    0 14500K  1576K select  3   0:00   0.00% syslogd
66460 simon         1  20    0 86472K  7384K select  3   0:00   0.00% sshd
66471 root          1  20    0 23572K  3488K pause   0   0:00   0.00% csh
  896 root          1  20    0 61204K  4796K select  1   0:00   0.00% sshd
66443 root          1  20    0 86472K  7364K select  1   0:00   0.00% sshd
66883 root          1  20    0 21916K  2848K CPU1    1   0:00   0.00% top
  372 root          1  20    0 13164K  4116K select  1   0:00   0.00% devd
66462 simon         1  20    0 47704K  2704K wait    1   0:00   0.00% su
66461 simon         1  20    0 17064K  2584K wait    2   0:00   0.00% sh
  268 root          1  52    0 14624K  1608K select  0   0:00   0.00% dhclient
  316 _dhcp         1  23    0 14624K  1704K select  3   0:00   0.00% dhclient

gstat:
Code:
dT: 1.041s  w: 1.000s
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| da0
    0      0      0      0    0.0      0      0    0.0    0.0| da1
    0      0      0      0    0.0      0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0    0.0| da0p1
    0      0      0      0    0.0      0      0    0.0    0.0| da0p2
    0      0      0      0    0.0      0      0    0.0    0.0| da0p3
    0      0      0      0    0.0      0      0    0.0    0.0| da1p1
    0      0      0      0    0.0      0      0    0.0    0.0| da2p1
    0      0      0      0    0.0      0      0    0.0    0.0| gpt/bootfs
    0      0      0      0    0.0      0      0    0.0    0.0| gptid/5132952f-9169-11e4-bb20-05a1851c816d
    0      0      0      0    0.0      0      0    0.0    0.0| gpt/swapfs
    0      0      0      0    0.0      0      0    0.0    0.0| gpt/rootfs
As stated before, this has happened both on a dedicated server (16GB E3), and now on Google Compute Engine. Both running 10.1-RELEASE.

It could be triggered by our PHP scripts - but it's very hard to replicate when we run them manually, and it happens at random. At the end of the day, they are simple scripts that have been in service for years without issue; the problem started to occur (we believe) when we started using 10.1.

Also, why does system I/O remain slow even when there is nothing running? It seems bizarre to have to reboot to fix it. Very un-FreeBSD-esque, in my experience.

Today, I've upgraded our GCE instance to 10.2-RELEASE to see if it helps.

Has anybody seen anything like this? Any other good troubleshooting techniques for when it's in this state?

Cheers
 
Just a guess - could it be related to the filesystem type? Have you tried both UFS and ZFS? It seems like some resource (such as open files) is being exhausted.
 
All filesystems are UFS. The scripts do behave themselves, closing files when not in use. The output of fstat and sysctl -a | grep files is normal and well within the limits. We're nowhere near the inode limit either.
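
For reference, roughly the checks involved (a sketch only; the PID below is just a placeholder, not one of our processes):

Code:
# System-wide open files vs. the global and per-process limits
sysctl kern.openfiles kern.maxfiles kern.maxfilesperproc

# Files held open by a specific process (placeholder PID)
fstat -p 12345 | wc -l

# Descriptor limit for the current shell (in sh)
ulimit -n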
 
Hi,

Have a look at this: Profiling tools, tips and tricks. IMHO it's a good starting point for tracking down your problem.

Thanks for this - I've already done most of the suggested performance troubleshooting, and it's especially useful for identifying bottlenecks on a busy system. Everything reports as normal/idle and nothing abnormal is logged, yet anything I/O-related is very slow.

In the particular case above, there was nothing running except the processes essential to the OS and networking. Had our scripts caused the problem, everything (memory, open files, etc.) should have been freed once those scripts terminated.
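
For anyone following along, this is roughly what's worth checking while the box is in that state (a sketch only; the PID is a placeholder for whatever command is hanging):

Code:
# UMA (kernel memory) zone usage - look for zones pinned at their limit
vmstat -z

# Kernel stack / wait channel of a stuck command, e.g. a slow ls
procstat -kk 12345

# Disk and paging activity while the problem is occurring
iostat -x 1
vmstat 1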

I'm working on a way to replicate the problem every time, then I may be able to get to the bottom of it, or at least provide more clues.

Cheers
 
sdiv, let us know the result of your experiment. The problem is interesting and seems to be related to your PHP scripts, but logically, even if they have some serious bugs, the OS should not be harmed. I would also experiment with a different filesystem type, such as ZFS, just to make sure it is not related to the FS.
 
I'm not sure why I didn't think of this before, but we're hitting the vnode ceiling. Increasing kern.maxvnodes alleviates the problem.

I assumed that vnodes would be released once the scripts were done, but I guess I don't know enough about the internals of FreeBSD to comment further.
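
For anyone else who runs into this, a rough sketch of the check and the workaround (the value below is only an example, not what we actually set):

Code:
# Compare the live vnode count with the ceiling
sysctl vfs.numvnodes vfs.freevnodes kern.maxvnodes

# Raise the ceiling at runtime (example value only)
sysctl kern.maxvnodes=500000

# Persist the change across reboots
echo 'kern.maxvnodes=500000' >> /etc/sysctl.conf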
 