My company hosts a number of websites across various FreeBSD servers, and I have been doing so personally for over 15 years.
Recently an issue has cropped up that I have not seen before; initially I believed it to be hardware-related, but I am now seeing the same issue on FreeBSD instances inside Google Compute Engine.
In a nutshell, the OS works perfectly as expected - but every day or so, all I/O operations take much longer to complete than usual (for example, a simple "ls" takes 3-4 seconds!). The system does not recover from this unless it is rebooted.
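To quantify "slow", this is the kind of quick check I mean (example paths; /tmp/io-test is just a scratch file):
Code:
# Time a trivial directory listing - normally instant, 3-4 seconds during an event:
time ls /usr/local/www

# Simple sequential write to a scratch file, to see raw throughput:
dd if=/dev/zero of=/tmp/io-test bs=1m count=256
rm /tmp/io-test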
We do have some cron jobs that run around the time this happens: basic PHP scripts that resize a number of images (about 200 per night on average).
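For context, the cron entries look roughly like this (illustrative only - the actual script names and schedule differ):
Code:
# /etc/crontab excerpt (hypothetical names/times):
30 2 * * * www /usr/local/bin/php /usr/local/www/scripts/resize-images.php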
We've seen heavy PHP scripts bring a server to its knees before, but killing them or waiting for them to finish should allow the server to recover - not so in this case.
During the most recent slow I/O event, I killed every process on the system - including Apache, MySQL, Sendmail - leaving just the essential core system processes. However, I/O still remained impossibly slow.
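For completeness, this is roughly how I verified nothing was left running and looked for stuck processes (standard tools; <pid> is a placeholder):
Code:
# Wide listing of every remaining process:
ps -auxww

# Show process state and wait channel - anything parked in disk wait ("D" state) would stand out:
ps -axo pid,state,mwchan,command

# Kernel stack of a suspect process (substitute a real <pid>):
procstat -kk <pid>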
Here is the output of top and gstat during this event - as you can see, the system thinks it's idle:
top:
Code:
last pid: 66883; load averages: 0.12, 0.08, 0.08 up 0+15:34:48 06:22:30
11 processes: 1 running, 10 sleeping
CPU: 0.0% user, 0.0% nice, 0.1% system, 0.0% interrupt, 99.9% idle
Mem: 28M Active, 13G Inact, 1305M Wired, 11M Cache, 1542M Buf, 763M Free
Swap: 1024M Total, 1024M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
454 root 1 20 0 14500K 1576K select 3 0:00 0.00% syslogd
66460 simon 1 20 0 86472K 7384K select 3 0:00 0.00% sshd
66471 root 1 20 0 23572K 3488K pause 0 0:00 0.00% csh
896 root 1 20 0 61204K 4796K select 1 0:00 0.00% sshd
66443 root 1 20 0 86472K 7364K select 1 0:00 0.00% sshd
66883 root 1 20 0 21916K 2848K CPU1 1 0:00 0.00% top
372 root 1 20 0 13164K 4116K select 1 0:00 0.00% devd
66462 simon 1 20 0 47704K 2704K wait 1 0:00 0.00% su
66461 simon 1 20 0 17064K 2584K wait 2 0:00 0.00% sh
268 root 1 52 0 14624K 1608K select 0 0:00 0.00% dhclient
316 _dhcp 1 23 0 14624K 1704K select 3 0:00 0.00% dhclient
gstat:
Code:
dT: 1.041s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
0 0 0 0 0.0 0 0 0.0 0.0| da2
0 0 0 0 0.0 0 0 0.0 0.0| da0p1
0 0 0 0 0.0 0 0 0.0 0.0| da0p2
0 0 0 0 0.0 0 0 0.0 0.0| da0p3
0 0 0 0 0.0 0 0 0.0 0.0| da1p1
0 0 0 0 0.0 0 0 0.0 0.0| da2p1
0 0 0 0 0.0 0 0 0.0 0.0| gpt/bootfs
0 0 0 0 0.0 0 0 0.0 0.0| gptid/5132952f-9169-11e4-bb20-05a1851c816d
0 0 0 0 0.0 0 0 0.0 0.0| gpt/swapfs
0 0 0 0 0.0 0 0 0.0 0.0| gpt/rootfs
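So the disks report 0% busy while a simple "ls" still takes seconds, which suggests the stall is somewhere above the GEOM layer rather than in the devices themselves. Next time it happens I plan to capture something like this as well (a sketch - I have not yet confirmed any of it points at the culprit):
Code:
# Include system processes and threads in top - a spinning kernel thread would show up here:
top -SH

# Per-device interrupt rates:
vmstat -i

# Extended per-device I/O statistics, refreshed every second:
iostat -x 1

# Buffer cache state - dirty buffers that never flush would be telling:
sysctl vfs.numdirtybuffers vfs.hidirtybuffers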
As stated before, this has happened both on a dedicated server (16 GB RAM, E3 CPU) and now on Google Compute Engine, both running 10.1-RELEASE.
It could be triggered by our PHP scripts - but it's very hard to replicate when we run them manually, and it happens very randomly. At the end of the day, these are simple scripts that have been in service for years without issue. The problem started to occur (we believe) when we moved to 10.1.
Also, why does system I/O remain slow even when nothing is running? It seems bizarre to have to reboot to fix it - very un-FreeBSD-esque, in my experience.
Today, I've upgraded our GCE instance to 10.2-RELEASE to see if it helps.
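(For reference, the upgrade was along the lines of the standard freebsd-update(8) procedure - sketch below, your steps may differ:)
Code:
freebsd-update -r 10.2-RELEASE upgrade
freebsd-update install
shutdown -r now
# ...and after the reboot, run it again to finish the userland update:
freebsd-update install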
Has anybody seen anything like this? Any other good troubleshooting techniques for when it's in this state?
Cheers