ZFS Help me track down random CRAZY SLOW IO

astounding · Jun 15, 2019

I have a system exhibiting bizarre slow IO (or paused IO as if there was a deadlock) at random times that appears to affect almost any process doing disk IO. Here's a quick example from the command-line where I had to execute a time dd if=/dev/zero of=io_test.txt bs=1M count=1000 four times to encounter the randomly occurring crazy slow IO problem plaguing the system. The first three runs look normal. Then the fourth run. Forty-six plus seconds?!?!?!?!?:

Code:

user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.236960 secs (4425109928 bytes/sec)

real    0m0.238s
user    0m0.000s
sys    0m0.238s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.232177 secs (4516271740 bytes/sec)

real    0m0.235s
user    0m0.000s
sys    0m0.235s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.241685 secs (4338612840 bytes/sec)

real    0m0.352s
user    0m0.000s
sys    0m0.352s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.273421 secs (3835019690 bytes/sec)

real    0m46.998s
user    0m0.000s
sys    0m0.285s
user@host:~$

NOTICE: The last run in the above completes in 46+ seconds. That run encountered the CRAZY SLOW IO issue I'm referring to. It's unpredictable. It can hit ANY process doing disk IO to the ZFS pool in question. One can't predict when.

The filesystem is ZFS. There are NO logs showing up in syslog indicating any timeouts or errors from disks (SATA

The system is running FreeBSD, a 12.0-RELEASE-p3 GENERIC kernel on an AMD Ryzen 7 8-core CPU with a pair of LSI 8-port SATA cards feeding 6-pairs of ZFS mirrored drive pairs in a 52 TB (raw drive capacity, actual is lower) storage pool.

CPU/MEMORY:

Code:

CPU: AMD Ryzen 7 2700X Eight-Core Processor          (3700.08-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x800f82  Family=0x17  Model=0x8  Stepping=2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr>
  SVM: (disabled in BIOS) NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics
real memory  = 68719476736 (65536 MB)
avail memory = 66828918784 (63733 MB)

ZFS storage pool:

Code:

user@host:~$ zpool status
  pool: storagepool
state: ONLINE
  scan: scrub repaired 0 in 3 days 04:55:15 with 0 errors on Wed Jun 12 18:26:42 2019
config:

    NAME                                    STATE     READ WRITE CKSUM
    storage                                 ONLINE       0     0     0
      mirror-0                              ONLINE       0     0     0
        gpt/bay01_cableA1_a1_oct2015_6tb    ONLINE       0     0     0
        gpt/bay06_cableC2_a2_oct2015_6tb    ONLINE       0     0     0
      mirror-1                              ONLINE       0     0     0
        gpt/bay11_cableB3_b1_sep2016_8tb    ONLINE       0     0     0
        gpt/bay02_cableC1_b2_sep2016_8tb    ONLINE       0     0     0
      mirror-2                              ONLINE       0     0     0
        gpt/bay09_cableA3_g1_may2017_8tb    ONLINE       0     0     0
        gpt/bay14_cableC4_g2_may2017_8tb    ONLINE       0     0     0
      mirror-3                              ONLINE       0     0     0
        gpt/bay08_cableD2_d2_17jul2017_8tb  ONLINE       0     0     0
        gpt/bay15_cableB4_d1_17jul2017_8tb  ONLINE       0     0     0
      mirror-4                              ONLINE       0     0     0
        gpt/bay13_cableA4_e1_aug2018_14tb   ONLINE       0     0     0
        gpt/bay04_cableD1_e2_aug2018_14tb   ONLINE       0     0     0
      mirror-5                              ONLINE       0     0     0
        gpt/bay07_cableB2_f1_apr2017_8tb    ONLINE       0     0     0
        gpt/bay12_cableD3_f2_apr2017_8tb    ONLINE       0     0     0

errors: No known data errors
user@host:~$

Again, in system logs, there are NO SATA drive errors, ALL drives look good with SMART data, the recent full storage pool scrub found ZERO errors, though it took a VERY LONG time to complete.

The root filesystem, base system, and installed packages (/usr/local) all reside on a single ZFS filesystem, a separate ZFS storage pool, residing on a pair of mirrored SSD drives. That filesystem does NOT exhibit the problem. Running the timed dd test thousands of times never resulted in taking longer than less than 1/3 of a second. The SSDs connect via motherboard SSD ports, and so don't use the LSI cards.

Without ZERO logged errors letting me know where to look, what suggestions might you have to help me track down the source if these bizarre randomly occurring IO pauses slowing down every process doing drive IO? Do I need to compile a new kernel and enable DTRACE and catch a process in the act? (I've never used it before, so I'd be a newborn DTRACE infant--but I'm willing to learn, and tend to pick up things reasonably quickly.)

Thanks for any and all ideas!

-- Aaron

rootbert · Jun 15, 2019

I see nothing slow ... 1GB in 1/4th of a second seems ok. it equals more than 600 MB per second per mirror. maybe your drives are in an energy saving state and have their heads parked and their rotations of the spinning disks throttled?

ralphbsz · Jun 16, 2019

Two things. I see nothing particularly slow. Instead, I see four measurements that are within 5% of each other. That's perfectly repeatable.

Second, you are actually not doing any disk IO at all. Instead, you are just writing to file system buffer cache, meaning memory. Look, you are reading from /dev/zero (which is de-facto infinitely fast), and writing to a file, without doing any sync. Do you understand the semantics of buffered file system writes? They don't actually go to disk right away, they stay in memory. To make things even more amusing, you are actually overwriting the same file. The moment the second dd starts, the file system could (if it wanted to) throw all the results form the first dd away, and never even bother writing them to disk, since they will guaranteed never be read again.

You also have enough memory that all this can be held in memory, if ZFS wants.

Let me ask you a question: What is your actual workload? What is the purpose of this computer? Is it running fast enough for its workload and intended use? What are you trying to benchmark here? Just measuring random things out of the blue is not going to be productive.

astounding · Jun 16, 2019

Sorry, I didn't make it clear. The first three runs of the test work just fine as you both noted. Please check out the fourth and final run. That one hit the random, unpredictable issue. It took 46+ full seconds to complete: real 0m46.998s I will re-edit my post so readers will see that the first time through.

The throughput of the dd command appears normal, the bits written per second. So once the IO begins, it appears quite normal. But something delayed either the beginning or the end of the process (likely the beginning) so dd took a complete 46-seconds plus to complete as measured by time. I noticed that the fourth test was taking a while, so I logged into the system with a separate SSH session, changed to the test directory, and attempted to ls -l io_test.txt prior to the completion. The command instantly hung, and even CTRL-C failed to interrupt it. Then suddenly the dd command completed and instantly the ls -l nearly simultaneously completed as well.

I'm guessing some contention issue. But with zero error logs to point me somewhere, I'm at a loss.

VladiBG · Jun 16, 2019

Do not use /dev/zero (/dev/null) as test. Try with RANDOM and with bigger file size over your raid cache. Your test is useless.
While you are performing write / read operation use another console and monitor the zpool iostat statistic

astounding · Jun 16, 2019

VladiBG said:
Do not use /dev/zero (/dev/null) as test. Try with RANDOM and with bigger file size over your raid cache. Your test is useless.
While you are performing write / read operation use another console and monitor the zpool iostat statistic

I will respectfully disagree. My test showed the problem clearly. The first three runs were normal. The last run took over 46 seconds to complete. Using /dev/zero.

I agree that for testing throughput on a normal system, a bigger file size, and some other source of data would be appropriate. But for what I'm highlighting, this simple test works just fine.

Note that it is NOT the bits/second that is the problem. Apparently once the read/write begins, the tests I posted (all four) are well within the range of normal expected speeds. It is, however, the fact that the dd process total execution time seemed to "hang" and take more than 46 seconds to finish what should have been less than a single second. Note in my second post above in this thread, another ls process simultaneously appeared to "hang" as well in attempting to interact with the same file/directory.

ralphbsz · Jun 16, 2019

Oh sorry, didn't look at the real time spent. That's indeed odd. Why would dd hang before or after writing 1GB? And why for such a long time? In ~45 seconds, it could do a lot of work, much more than writing a 1GB file requires.

Do two things to debug this. First, run some IO statistics gathering tool while dd is running, for starters "iostat 1" is good enough. That will tell us whether it is hanging before or after. Second, run the dd command with system call tracing and times turned on, by putting "truss -d" in front of the dd command. That will tell us what system call dd is executing when it hangs.

My guess is on some oddity in ZFS involving buffer cache management. The situation you create in this test is truly bizarre: You write 1gig of memory (a sizable amount, although a small fraction of your total memory), and then before the file system has time to even think about beginning to write it to disk, you invalidate that buffer cache by recreating the file (which in effect turns into a call to open with the O_CREAT and O_TRUNC flags, telling the file system that the previous gig of memory is garbage now). So within a second or so, you invalidate 3gig of memory, but then try to allocate a 4th 1gig for the buffer cache of the 4th run. That is a truly bizarre memory usage workload. Maybe ZFS needs to run some sort of "garbage collector" to deal with this? Maybe this triggers a long-running memory cleanup operation?

You did observe that during the ~45 second gap, ZFS appeared hung. That sort of justifies my theory that ZFS has to do some internal cleanup.

One more question, related to VladiBG's comment: Do you have compression or deduplication enabled? Trying to compress a 1gig file of all zeroes is trivial, but very unusual (people don't write that much zero data). On the other hand, reduplicating that much memory is insanely efficient (since you are writing something where every buffer is a duplicate). Maybe that triggers the long-running operation. Try this with random data, and see whether the behavior changes.

astounding · Jun 16, 2019

Some more data doing time truss -d dd if=/dev/zero of=io_test1.txt bs=1M count=1000 in one SSH shell and time truss -d dd if=/dev/zero of=io_test2.txt bs=1M count=1000 in a separate SSH shell, both in the same user's home directory both locked up.

A third simple echo "text" > out.txt executed in a different directory (and filesystem but in the same ZFS pool) completed normally while the two processes were hung up. However subsequent testing in that same different directory using dd when once again the other two appeared to lock up, locked up in exactly the same manner and place, then all completed.

I'm assuming the entire truss -d output isn't necessary as the point at which the delay hits is very clear:

Code:

...
0.001754451 sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
0.001774651 sigprocmask(SIG_SETMASK,{ },0x0)     = 0 (0x0)
0.001797491 sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },{ }) = 0 (0x0)
0.001817421 sigprocmask(SIG_SETMASK,{ },0x0)     = 0 (0x0)
0.001850700 mmap(0x0,4096,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366414848 (0x80065e000)
0.001887671 open("/usr/share/locale/en_US.UTF-8/LC_CTYPE",O_RDONLY,013720646105) = 3 (0x3)
0.001907930 fstat(3,{ mode=-r--r--r-- ,inode=725148,size=41192,blksize=41472 }) = 0 (0x0)
0.001931480 mmap(0x0,41192,PROT_READ,MAP_PRIVATE,3,0x0) = 34366418944 (0x80065f000)
0.001951620 close(3)                 = 0 (0x0)
0.001973660 mmap(0x0,86016,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366464000 (0x80066a000)
0.002018410 munmap(0x80065f000,41192)         = 0 (0x0)
0.002039130 mmap(0x0,4096,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366418944 (0x80065f000)
0.002078620 openat(AT_FDCWD,"/dev/zero",O_RDONLY,00) = 3 (0x3)
0.002097759 fstat(3,{ mode=crw-rw-rw- ,inode=19,size=0,blksize=4096 }) = 0 (0x0)
0.002119010 ioctl(3,FIODTYPE,0x7fffffffd844)     = 0 (0x0)
0.002144069 cap_rights_limit(3,{ CAP_READ,CAP_SEEK_TELL }) = 0 (0x0)
     >>>>>> HERE IS THE DELAY:
55.039338825 openat(AT_FDCWD,"io_test1.txt",O_RDWR|O_CREAT|O_TRUNC,0666) = 4 (0x4)
55.039399595 fstat(4,{ mode=-rw-r--r-- ,inode=66126,size=0,blksize=131072 }) = 0 (0x0)
55.039474625 lseek(4,0x0,SEEK_CUR)         = 0 (0x0)
55.039508044 cap_rights_limit(4,{ CAP_READ,CAP_WRITE,CAP_SEEK_TELL,CAP_FTRUNCATE,CAP_IOCTL }) = 0 (0x0)
55.039537844 cap_ioctls_limit(0x4,0x7fffffffd970,0x2) = 0 (0x0)
55.039577064 cap_rights_limit(0,{ CAP_READ,CAP_SEEK_TELL,CAP_FCNTL,CAP_FSTAT,CAP_EVENT,CAP_IOCTL }) = 0 (0x0)
55.039604014 cap_ioctls_limit(0x0,0x7fffffffd8f0,0x3) = 0 (0x0)
55.039636713 cap_fcntls_limit(0,CAP_FCNTL_GETFL) = 0 (0x0)
55.039661744 cap_rights_limit(1,{ CAP_WRITE,CAP_SEEK_TELL,CAP_FCNTL,CAP_FSTAT,CAP_EVENT,CAP_IOCTL }) = 0 (0x0)
55.039683993 cap_ioctls_limit(0x1,0x7fffffffd8f0,0x3) = 0 (0x0)
55.039705944 cap_fcntls_limit(1,CAP_FCNTL_GETFL) = 0 (0x0)
55.039730203 cap_rights_limit(2,{ CAP_WRITE,CAP_SEEK_TELL,CAP_FCNTL,CAP_FSTAT,CAP_EVENT,CAP_IOCTL }) = 0 (0x0)
55.039752993 cap_ioctls_limit(0x2,0x7fffffffd8f0,0x3) = 0 (0x0)
55.039783123 cap_fcntls_limit(2,CAP_FCNTL_GETFL) = 0 (0x0)
...

And a smaller tidbit from the other locked-up execution:

Code:

...
0.001710872 openat(AT_FDCWD,"/dev/zero",O_RDONLY,00) = 3 (0x3)
0.001726541 fstat(3,{ mode=crw-rw-rw- ,inode=19,size=0,blksize=4096 }) = 0 (0x0)
0.001744601 ioctl(3,FIODTYPE,0x7fffffffd854)     = 0 (0x0)
0.001766261 cap_rights_limit(3,{ CAP_READ,CAP_SEEK_TELL }) = 0 (0x0)
     >>>>>> HERE IS THE DELAY:
c52.478419133 openat(AT_FDCWD,"io_test2.txt",O_RDWR|O_CREAT|O_TRUNC,0666) = 4 (0x4)
52.478470333 fstat(4,{ mode=-rw-r--r-- ,inode=66107,size=0,blksize=131072 }) = 0 (0x0)
52.478511592 lseek(4,0x0,SEEK_CUR)         = 0 (0x0)
...

Each time it happens, it's always after

 

0.001826642 cap_rights_limit(3,{ CAP_READ,CAP_SEEK_TELL }) = 0 (0x0)

completes, and is before the

 

60.565595829 openat(AT_FDCWD,"io_test1.txt",O_RDWR|O_CREAT|O_TRUNC,0666) = 4 (0x4)

finally executes (whatever the actual different filenames I'm for the separate processes).

I mixed it up and used /dev/urandom as a data source, and the results were pretty much the same.

More fiddling confirmed hangs on test processes that were executing on separate ZFS filesystems that were part of the same storage pool, storagepool/data mounted at /data (where various data sets reside) and storagepool/home mounted at /home (where the user's home directory resides).

What information shall I gather next?

Thanks,
-- Aaron

ralphbsz · Jun 16, 2019

OK, so the open() call itself (specialized version of opening the file) is the one that is slow; the truss line is printed when the call finishes. The dd executable probably calls openat() at 0.00214 (right after the previous call finishes), and openat() completes at 55.09.

Which tells us that the problem is really in the file system layer. So either the VFS layer in the kernel or ZFS itself is blocked.

Try the following: Use four different file names for the four tests. Like that, ZFS doesn't have to clear all the buffer cache first when the open() happens. Also, try changing the amount of data for each file. For example, instead of 1000 MB per file, doe 667MB, then the problem should happen on the fifth file (after four), not on the fourth, because 4x667 = 3x1000.

rootbert · Jun 16, 2019

interesting ... maybe a bug/locking problem. I suggest to inform the devs and write to the mailing list

ralphbsz · Jun 16, 2019

Definitely agree. While we can do more research and debugging to figure out what causes the long delay, it clearly is not good behavior. It may be irrelevant, because the situation the OP is creating just doesn't happen in the real world (people don't create GB size files over and over, and throw their content away immediately), in which case the developers will probably not put much effort into a fix. But once more details are available, and e-mail to the developer mailing list, or a PR, would be a good idea.

astounding · Jun 17, 2019

ralphbsz said:
...While we can do more research and debugging to figure out what causes the long delay, it clearly is not good behavior. It may be irrelevant, because the situation the OP is creating just doesn't happen in the real world (people don't create GB size files over and over, and throw their content away immediately), ...

This is a very real world problem. These non-real world tests were created to try to find the actual problem. This affects vi sessions, has made a PLEX server in a jail pretty much useless, slowed down email services on a Postfix and Dovecot server running in another jail.

Do NOT discount the issue as NON-real-world, please!

And thank you VERY MUCH for helping me gather more information. I really appreciate your help.

ralphbsz · Jun 17, 2019

In that case, it would be a good idea to write up a description of a real-world use case that causes the problem. Don't need to do that now, don't need to share it with us here on the forum, but it would be a good idea to include when mailing the developer mailing list or filing a PR.

Here's why. I used to be a file system developer. I used to deal with problem reports coming from real users. When I got a bizarre problem that only occurs in unrealistic benchmarks, and we never heard of it occurring in real production use, I wouldn't put much work into it, because it's probably not important. On the other hand, when I got the same bizarre problem report that seriously affects real production of paying customers, I would drop everything else I was doing and work long hours to fix it. You see the reasoning?

astounding · Jun 17, 2019

Thing is, I don't have a clue what causes the problem. Just that random open()s appear to hang sometimes. I can't consistently control when it happens, and don't know why. I am attempting to gather more information so I can narrow it down. You are very correct, reporting a randomly occurring problem leaves ZFS and/or filesystem devs with insufficient information. Having done various types of software development over the years, I'm acutely aware of how a vague report is quite useless.

I likely will file a PR once I've gathered more information. Let me repeat my thanks for tips helping me gather information and narrow things down. I really do appreciate that a lot! Thanks, ralphbsz, rootbert, and VladiBG, for pitching in to help me.

I'll update this thread as I find out more.

PMc · Jun 17, 2019

45 seconds should be long enough to figure out with the naked eye what the machine is doing.
Is this reproducible in single-user? Does the machine account some CPU load or is it idling during the delay?

_martin · Jun 17, 2019

What state is process in when the delay occurs ? ps output should be sufficient. I'd keep zpool iostat 1 running in other session to see the current state of IOs.
Dtrace could help here. I'm angry with myself I didn't use it much yet, but FreeBSD's wiki has some good tutorials. I'd go for the timestamps when entering and leaving syscalls first. You could use sysutils/ltrace to get the similar timestamps (diff between entering and leaving calls (note: library function calls)):

Code:

ltrace -ST dd if=/dev/zero of=/tmp/tst bs=1024k count=4

When the delay occurs - can other processes access this and/or other filesets from the pool? What is the swap usage on the system? Run vmstat in parallel to observe the page out and swapping behavior.

astounding · Jun 19, 2019

_martin said:
What state is process in when the delay occurs ? ps output should be sufficient.

From ps while the dd process is stuck on open():

 user     41606    0.0  0.0   10852   2468  4  DX+  14:04        0:00.00 dd if=/dev/zero of=io_test1.txt bs=1M count=1000

_martin said:
When the delay occurs - can other processes access this and/or other filesets from the pool? What is the swap usage on the system? Run vmstat in parallel to observe the page out and swapping behavior.

I've got more work to do to test exactly what's going on with other processes, other filesystems. My early tests showed that when a test dd froze in one window, and I attempted another dd on a different filesystem on the same ZFS storage pool, it too froze, and both unfroze at about the same moment.

This system has no swap. It's got 64 GB RAM which I believe is currently sufficient.

vmstat -- I'm not familiar enough to know if it's showing me anything useful (and obviously there's no swapping taking place):

 

procs  memory       page                    disks     faults         cpu

r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1   in    sy    cs us sy id

0 0 0  24G  3.9G  2477   1   2   0  3170 13161   0   0 1597  7812 13820  4  1 95

0 0 0  24G  3.9G     1   0   0   0     0  190   0   0  139  1417  2050  0  0 100

0 0 0  24G  4.3G   303   0   0   0 97591  159   0   0 1556 22063 18754  1  2 98

0 0 0  24G  4.3G    48   0   0   0   101   30   0   0  684  1381  5021  0  0 100

0 0 0  24G  4.3G   104   0   0   0   218   29   0   0  507  1174 10337  0  1 99

1 0 0  24G  3.8G  2651   0   0   0   573   24   0   0 6016 68681 69426  1 11 88

1 0 0  24G  3.8G  1573   0   0   0    93   25   0   0 5089 36253 45408  0  4 95

1 0 0  24G  3.2G  2474   0   0   0   738   27   0   0 10096 61119 80047  0 11 88

0 0 0  24G  3.1G  1618   0   0   0   697   26   0   0 3203 43254 30883  1  8 92

*** NOTE: The above line printed, then the dd process in the other window froze for about 3 seconds

0 0 0  24G  3.1G    11   0   0   0    21   23   0   0 8625   356 59784  0  8 92

1 0 0  24G  3.1G     9   0   0   0   219   24   0   0 5256  7126 37933  0  5 94

6 0 0  24G  3.1G  1627   0   0   0   171   28   0   0 3233 47429 53766  1  5 94

*** NOTE: dd process in other window unfroze about here

0 0 0  24G  3.1G  1174   0   0   0   795   23   0   0 8764 34281 66179  1  9 90

1 0 0  24G  3.1G    37   0   0   0     0   26   0   0 1176  1772 26200  1  1 98

...

And now with zpool iostat 1 output:

 

               capacity     operations    bandwidth

pool        alloc   free   read  write   read  write

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T    464    187  50.0M  5.36M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T     47    966  3.88M  24.8M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T     21      3  1.82M   349K

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0      1      0   170K

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0      3      0   364K

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      1      5   241K   511K

*** NOTE: Somewhere around here the other window's dd process froze on open()...

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      3  1.80K   140K  55.0M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0  3.50K      0   446M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      2  3.38K  16.0K   432M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      4    593   304K  69.5M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0     15      0  1.99M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0    641      0  4.92M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0  2.28K      0   292M

--------------  -----  -----  -----  -----  -----  -----

*** NOTE: And about here, the other window's dd proces unfroze and begin reading/writing

storagepool     42.0T  5.21T      0  2.93K  3.98K   375M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      4  2.36K   309K   297M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0    818   128K  18.9M

--------------  -----  -----  -----  -----  -----  -----

storagepool     42.0T  5.21T      0    115  7.52K   493K

--------------  -----  -----  -----  -----  -----  -----

...

In my attempts to trigger the freeze, I've found I have to run my dd test several times. I've attempted reducing the amount it writes and just running it repeatedly in a shell loop until it freezes. If I don't write the nearly 1 GB output test files, the freeze-up is far, far less likely to take place.

PMc · Jun 19, 2019

astounding said:
From ps while the dd process is stuck on open():
user 41606 0.0 0.0 10852 2468 4 DX+ 14:04 0:00.00 dd if=/dev/zero of=io_test1.txt bs=1M count=1000

The "D"readed flag was to be expected.

This system has no swap. It's got 64 GB RAM which I believe is currently sufficient.

I don't think that is relates to the issue, but this is a common misconception.
On linux (I suppose) the swap is added to the phys memory. So you would add it as a (very slow) memory extension, only if such is needed.
On Berkeley (and on any unix with a proper virtual memory management that I know) the swap is NOT added, but mapped. Memory is usually file-backed, and memory that does not just hold some copy from the file-system, is backed by the swap. The only reason why going without swap does work at all, is that the system will use the file-backing only when memory gets low. But basically the system expects that any memory page points to a place on file to where this memory page could be evicted in case mem gets low.
So I don't think it is generally a good idea to go without swap.

In my attempts to trigger the freeze, I've found I have to run my dd test several times. I've attempted reducing the amount it writes and just running it repeatedly in a shell loop until it freezes. If I don't write the nearly 1 GB output test files, the freeze-up is far, far less likely to take place.

As a workaround for now, one might try to lower the txg timeout.
# sysctl vfs.zfs.txg.timeout=2
or even 1.

I don't really understand what is happening here. I am familiar with these kind of lockups from ZFS, but I have seen them on old machines with low memory, and Your's is a high end with ample memory. Shouldn't happen.

_martin · Jun 21, 2019

Swap is not an issue here then. What was confirmed (and expected) process is waiting for Disk.

My hunch is that the issue occurs when you are actually writting to the disks - when buffers are flushed. It is somewhat aligned to the output of the zpool iostat - you do see fair amount of I/Os and writers when this happens.

Just brainstorming what I'd check:
- what type of port /speed are HBAs (LSI cards) put in ? Can you create (in theory) bottlebeck on bus (pci-e/pci-x) ?
- do you use JBOD setup (or even plain JBOD fw) on these HBAs? Do those HBAs have memory on board which cache the I/Os ?
- how is the pool built ? from HA perspective it makes sense that left side is on one HBA, right side is on the other

Is there are way for you to test this also without one HBA ?

Eric A. Borisch · Jun 21, 2019

Anything intersting from gstat -pdo before / during /after the slowdown? (Interesting = large queue values or imbalanced activity across the drives backing the pool.)

t1066 · Jun 22, 2019

Your pool is quite full. How fragment is it?

Eric A. Borisch · Jun 22, 2019

Oh, that is pretty full, which is known to be taxing on performance: https://serverfault.com/a/556892

Still interested in seeing gstat output before & during slow-down.

zsb · Jan 7, 2021

Hi,

I'm facing a very similar issue.
This is a 12.1-RELEASE-p12 server.

The simptom is that all IOs are extremely slow. Sometimes even starting 'vim' takes 15 sec or listing directories takes ages. The mysql server starts for 2-3 min (but it's not a big database).

The issue is constant but happens randomly (sometimes the IO is acceptable but mostly terrible).

I have performed the 'truss -d' tests mentioned above and creating a 1GB file takes 2-76 seconds.
This is the result of one of the tests (took 45 sec):

[...]
0.277764819 read(3,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,1048576) = 1048576 (0x100000)
30.382648068 write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,1048576) = 1048576 (0x100000)
[...]
30.927167136 read(3,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,1048576) = 1048576 (0x100000)
42.356276973 write(4,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,1048576) = 1048576 (0x100000)
[...]

But there were tests that run only for 2 secs as well.

During the tests, 'gstat -ocsa' shows either nothing or the '%busy' value jumps between 80% and 1000%+ (yes, 1000%+)

It is important to mention that this is a quemu guest server with
- 8GB RAM
- 1 disk with 1 zpool
- arc max is set to 3GB
- vfs.zfs.txg.timeout is set to 2
- about 70% of the space is being used but this issue started almost immediately after the server got installed
- fragmentation is 25% (according to 'zpool list')
- there is 2GB swap but currently 100% free
- 'top -aHs1' shows nothing interesting during the test, the whole system is idle, there is no load, etc.

The VM provider has already relocated my VM to different storages a couple of times but it didn't help. Also there are several linux servers at the very same provider with no IO problems at all.

This is a mail server with ~20 virtual users using mysql as a backend and there are lot's of infrequently used static files (photos, docs, etc. uploaded to nextcloud).

This is my very first FreeBSD server and I'm totally clueless what's going on.

Any help is appreciated.

SirDice · Jan 7, 2021

zsb said:
The VM provider has already relocated my VM to different storages a couple of times but it didn't help.

What virtualization device is used? virtio_blk(4), virtio_scsi(4)? Or something else?

zsb · Jan 7, 2021

Neither 'pciconf -lv | grep -i virt' nor 'grep -i virtio /boot/loader.conf' returns anything.
Am I looking at the wrong place?

ZFS Help me track down random CRAZY SLOW IO

Administrator