Hi all,
I'm trying to troubleshoot a problem that's been going on for a while on a production FreeBSD server I'm running, and recently the issue has been getting worse.
Essentially, the problem is that after a certain period of time the server completely freezes. It still responds to ping, but you can't access the web service it's running (served via a Java engine) and you can't SSH into the box.
The freezes can occur anywhere from monthly to daily. The server is racked in a data centre, but putting a keyboard/screen on it reveals nothing. I can't get any input/output to the screen.
A reboot brings everything back to life.
I've tried:
- Hardware checks
- New PSU
- New MB/CPU/RAM (ECC RAM)
- New drives
Everything except swapping out the application itself, or rebuilding the server.
The box has FreeBSD-10.2-RELEASE on it.
It has 16GB of RAM and an 8-drive ZFS array (7 disks in the zpool as RAIDZ3, with 1 spare). It has a dedicated drive for the boot OS, and now a dedicated 250GB SATA disk for swap. I added the swap disk because swap usage seemed to grow and the box was originally swapping on ZFS, which is not ideal, so I thought that might be the issue.
The server is running a backup utility called Syncrify, which runs on Java and is basically a fancy HTTPS wrapper around rsync. When a lot of clients are backing up at once there is a lot of random disk I/O (reads and writes), plus high CPU usage as the blocks on disk are checked.
I had a PuTTY session open to the box today, with top running, when it froze. I'm not sure whether that can help any experts shed light on things.
The machine never comes back to life on its own; it needs a full power off/on.
I checked dmesg (and dmesg.yesterday) earlier today after this afternoon's reboot, and also 'messages', and none of them had anything resembling an error or issue.
Any troubleshooting steps I should take before I throw this out the window would be most appreciated!