Periodic disk activity and system 'hangs' while waiting for IO to complete

I still use physical spinning disks and can hear when they're busy. A while ago I had an issue where a periodic job was running and writing to disk every few seconds. AFAIK, I am no longer running that job, yet something is still writing to disk every few seconds.

Whenever it does, the system appears to hang. The keyboard and mouse are non-responsive, and sometimes keyboard input is not even captured during that time. So, if I was typing while the system was out to lunch, the keystrokes were silently dropped.

If I open a new terminal window, it takes about a second for it to load (I have a bunch of things it does). After a few minutes when everything is cached, opening a new tab is instantaneous.

I am running iotop with a refresh every second, but I don't see anything standing out. I recently switched to a different hard drive when I rebuilt my system. I should say that all my drives are in some state of failure, and I think this one might be worse off. My workstation and router are running on the same physical box in 2 separate jails, and I recently set up rctl to limit the resources the jails can use.

My router uses fairly minimal resources. I have it set to 50% CPU and 2G of RAM, and when I list the resources it consumes, it is well below those limits. For the workstation, I have it set to 300% CPU and 16G of RAM. I only approach 16G of RAM when using go fix on some larger projects or when digikam is running.

I have no other resource limits set.

htop shows a fairly minimal load on the system in terms of both CPU and memory. Perhaps I don't know how to read iotop, but nothing stood out there either.

What other tool(s) should I use to investigate this? For reference, my other system did not have rctl set up, but I noticed the pausing even before using rctl, so I don't believe that is the culprit or a factor. Perhaps it is indeed the drive; I can always swap over to the other system for comparison. If the drive were failing, would dmesg show that, or perhaps SMART tools?

EDIT #1:
drive A:
raw read error rate: 51334312
seek error rate: 444635657

drive B:
raw read error rate: 4270608
seek error rate: 498599102

Drive B has a lower raw read error rate, but higher seek error rate. It was powered on for about 2000 more hours.

EDIT #2:
If I look at iostat -w1x, I see the tout, KB/t, and tps numbers increase every 5 seconds, which seems to correspond to the hard drive noise I can hear. My CPU is an i5-3470 and I'm using the onboard GPU, not an external unit. I know the onboard GPU's performance isn't great, but I can generally watch full HD videos without the system pausing. Beyond that, I notice pauses and the number of dropped frames increases.

iostat -wx1
2 237 16.4 74 1.19 0.0 0 0.00 0.0 0 0.00 2 0 0 0 98

I'm mainly wondering if there is a way to reduce this pausing, which seems to have cropped up recently. Perhaps I should try disabling resource limits to see if that has an effect.

I disabled rctl and re-enabled it, and that is where I can see the difference. For example, whenever I open a new terminal, it sets up an ssh-agent if it needs to. With rctl enabled, it waits for a lock; with it disabled, I can open many tabs concurrently and they all complete quickly. It seems rctl is affecting lock files? I need to investigate more.

My idea for using rctl was to prevent a jail from bringing down the host by consuming too many resources. However, the only settings I'm touching are CPU and memory.

EDIT #3:
I am not certain the perceived hangup has anything to do with the disk. It doesn't seem like there is that much activity presently. I came across another post that suggests that the system hanging could actually be the monitor:

In my case, I notice it now with watching videos. A video with a bandwidth of 1958 kb/s plays fine, but 4416 kb/s is choppy. Both are the same framerate, codec, and resolution.
 
I think the hangups are due to the device starting to fail. This is just speculation, but I am monitoring the raw error rate and I can see it steadily increasing:

while true; do smartctl -a /dev/ada0 | grep Raw_Read_Error_Rate; sleep 15; done

I'm getting roughly 3000 errors every 15 seconds, which seems a bit high to me. When I started monitoring this:

81797472

and now:

86395168

It seems the drive may die today or at least become unusably slow.
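To put a number on that trend, the growth between two readings can be worked out with plain shell arithmetic. This is only a rough sketch: the helper names raw_delta and per_second are made up for illustration, and it treats the raw value as a simple counter, which is an assumption about how this drive reports it.

```shell
#!/bin/sh
# raw_delta OLD NEW: print how much the raw counter grew between two samples.
# (Hypothetical helper; assumes the raw value only ever counts upward.)
raw_delta() {
    echo $(( $2 - $1 ))
}

# per_second DELTA SECONDS: integer growth per second over the sample interval.
per_second() {
    echo $(( $1 / $2 ))
}

# The two readings quoted in this post.
raw_delta 81797472 86395168   # prints 4597696

# The "3000 every 15 seconds" estimate expressed per second.
per_second 3000 15            # prints 200
```

Feeding it consecutive samples from the monitoring loop would show whether the rate is steady or accelerating.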
 
"It was powered on for about 2000 more hours."
That's not really old.

Here's one of mine:
Code:
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1353733405
  9 Power_On_Hours          0x0032   022   022   000    Old_age   Always       -       68570

"...which seems to correspond to the hard drive noise I can hear."
Weird noises? Could it be the drive is powering up and you're hearing the initialization "rattle" of the heads? Some power-saving option that turns the drive off? Then gets woken up when accessed? That could cause a slight delay too.
 
I think it is disk access noise. It isn't terribly loud and lasts roughly 250 ms. For instance, when I run digikam to scan my media collection, that makes a ton of that noise for as long as the scan takes place. I believe that is the head moving back and forth. My media collection is on a different disk entirely.

I don't believe it is powered down, as I would hear it spin up first. The drives aren't idle long enough for that to happen: every 5s, I hear about 250 ms worth of disk access. It is more pronounced when the system first boots, as the cache is empty.

The raw error rate is now:
96648400

I was just thinking we're comparing whose flesh wound is more serious. There should be a parody of the Black Knight from Monty Python and the Holy Grail, but for computers.

I think I made it spike a bunch, as I decided to rebuild my system just in case this drive kicks the bucket. That process pulls from the old system, using it as a package cache and as the source for the git projects and ZFS volumes to restore, so it puts a bit of strain on the disk. My current cold disk image is about 1 week old.
 
I should say, those error counts are increasing drastically because I am making another system from the current system. It is pulling git and ZFS data from the current system. But yes, I back up nightly; it is just the restoration part that might be tricky without a live copy.
 
I swapped out the drive for a slightly 'better' one that has a lower error count, and I don't hear the disk activity that I heard with the other one (the original drive started at about 8M errors and last had 14M when powered down; this one is at about 2.2M but of similar age). I still think something is going on, but I'm not sure what. The system has been up for a little while, so things should be cached. There is still considerable latency when opening a new terminal window. While I have a bunch of scripts that get loaded, I don't believe that is the cause, since it used to be quick and I have made no changes recently. While that new tab opens, the keyboard and mouse don't respond.

Perhaps I need to do a memory test?
 
I moved the drive back over to my backup system and am experiencing the same pausing there after I put it in the main machine. So, perhaps it isn't a hardware issue, or if it is, it is identical, which I find extremely unlikely. The only change I made recently (in the past week) was adding rctl for the jails to limit the amount of resources a jail can use. I will disable that and see if that is the culprit. The original system had periodic disk access every 5s that wasn't a major event, but enough for me to hear. Normally, I don't hear much disk activity unless I'm doing a ton of IO.

Perhaps my configuration for rctl is too low for my workstation jail:
pcpu: 300
memory: 16G

I think I sorted out the pausing on the new system: limiting my workstation to 3 CPUs is too restrictive. It needs more for smooth operation, even if the system isn't completely pinned.
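For anyone following along, limits like the pcpu and memory values above would map onto rctl rules roughly as sketched here. The jail name workstation is a placeholder, and the use of the memoryuse resource with the deny action is an assumption, since the actual rules are not shown in this thread. The helper only prints the rule strings; loading them with rctl -a is left as a comment.

```shell
#!/bin/sh
# rctl_rules JAIL PCPU MEMORY: print rctl rule strings for a jail.
# Hypothetical helper; "memoryuse" and "deny" are assumptions, not the
# poster's confirmed configuration.
rctl_rules() {
    printf 'jail:%s:pcpu:deny=%s\n' "$1" "$2"
    printf 'jail:%s:memoryuse:deny=%s\n' "$1" "$3"
}

# Print the rules for the workstation limits mentioned above.
rctl_rules workstation 300 16G

# To load them (as root, on the host), each line could be passed to rctl -a:
#   rctl_rules workstation 300 16G | while read -r rule; do rctl -a "$rule"; done
# Current usage for the jail can then be compared against the limits with:
#   rctl -hu jail:workstation
```

Raising the pcpu value and re-checking usage with rctl -hu is one way to see how close the jail actually gets to its limit.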
 
One other change I made recently was to record the output from SMART at the time I provision a machine so that I can at least have some data points for reference. I'm also considering incorporating this into a periodic job that compares the history of the device to see if there has been a large change.
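A minimal sketch of what that comparison might look like, assuming the baseline is saved smartctl -A output and that the raw value sits in the tenth column, as in the attribute lines quoted earlier in the thread. The function names, file path, and threshold are all hypothetical.

```shell
#!/bin/sh
# smart_raw ATTRIBUTE: read `smartctl -A` style output on stdin and print the
# raw value (tenth column) of the named attribute.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# changed_a_lot OLD NEW LIMIT: succeed if the raw value grew by more than LIMIT.
changed_a_lot() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Possible use in a periodic job (path and threshold are placeholders):
#   old=$(smart_raw Raw_Read_Error_Rate < /var/db/smart/ada0.baseline)
#   new=$(smartctl -A /dev/ada0 | smart_raw Raw_Read_Error_Rate)
#   changed_a_lot "$old" "$new" 1000000 && echo "ada0: large change since baseline"
```

What counts as a "large change" depends on the drive, so the threshold would need tuning against the recorded history.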
 
No, they're in 2 separate physical boxes (Active and a "Build" Box). I keep cold spares of my system drive. Drive A was the active one in use and Drive B was a cold spare. I warmed it up and replaced drive A with it while making 2 other cold spares and labeling drive A as "about to fail".
 