I have a system exhibiting bizarrely slow IO (or paused IO, as if there were a deadlock) at random times, and it appears to affect almost any process doing disk IO. Here's a quick example from the command line, where I had to execute a
time dd if=/dev/zero of=io_test.txt bs=1M count=1000
four times to encounter the randomly occurring, crazy slow IO problem plaguing the system. The first three runs look normal. Then the fourth run takes forty-six-plus seconds:
Code:
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.236960 secs (4425109928 bytes/sec)
real 0m0.238s
user 0m0.000s
sys 0m0.238s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.232177 secs (4516271740 bytes/sec)
real 0m0.235s
user 0m0.000s
sys 0m0.235s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.241685 secs (4338612840 bytes/sec)
real 0m0.352s
user 0m0.000s
sys 0m0.352s
user@host:~$ time dd if=/dev/zero of=io_test.txt bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 0.273421 secs (3835019690 bytes/sec)
real 0m46.998s
user 0m0.000s
sys 0m0.285s
user@host:~$
NOTICE: The last run in the above completes in 46+ seconds. That run encountered the CRAZY SLOW IO issue I'm referring to. It's unpredictable. It can hit ANY process doing disk IO to the ZFS pool in question. One can't predict when.
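For anyone who wants to reproduce this without retyping the command, the manual test above could be looped. This is only a minimal sketch: the function name time_write_runs, its arguments, and the temporary file name are made up for illustration, and timing uses whole seconds from date, which is coarse but enough to spot a multi-second stall.

```shell
#!/bin/sh
# Sketch: repeat a timed write test and report only the runs that stall.
# Usage (hypothetical helper): time_write_runs DIR RUNS SIZE_MB THRESHOLD_SECS
time_write_runs() {
    dir=$1; runs=$2; size_mb=$3; threshold=$4
    file="$dir/io_test.$$"      # scratch file, overwritten on every run
    slow=0
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        # Same pattern as the manual test: write zeros in 1 MiB blocks.
        dd if=/dev/zero of="$file" bs=1048576 count="$size_mb" 2>/dev/null
        end=$(date +%s)
        elapsed=$((end - start))
        # Log a timestamp for each slow run so it can be matched against other logs.
        if [ "$elapsed" -ge "$threshold" ]; then
            slow=$((slow + 1))
            echo "$(date '+%Y-%m-%d %H:%M:%S') run $i took ${elapsed}s"
        fi
        i=$((i + 1))
    done
    rm -f "$file"
    echo "slow runs: $slow of $runs"
}
```

Something like `time_write_runs /path/on/the/pool 500 1000 5` would then print a timestamped line for every run of 5 seconds or more, followed by a summary count.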
The filesystem is ZFS. There are NO logs showing up in syslog indicating any timeouts or errors from disks (SATA
The system is running FreeBSD 12.0-RELEASE-p3 with a GENERIC kernel on an AMD Ryzen 7 8-core CPU, with a pair of LSI 8-port SATA cards feeding six ZFS mirrored drive pairs in a 52 TB storage pool (raw drive capacity; usable capacity is lower).
CPU/MEMORY:
Code:
CPU: AMD Ryzen 7 2700X Eight-Core Processor (3700.08-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x800f82 Family=0x17 Model=0x8 Stepping=2
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr>
SVM: (disabled in BIOS) NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
TSC: P-state invariant, performance statistics
real memory = 68719476736 (65536 MB)
avail memory = 66828918784 (63733 MB)
ZFS storage pool:
Code:
user@host:~$ zpool status
pool: storagepool
state: ONLINE
scan: scrub repaired 0 in 3 days 04:55:15 with 0 errors on Wed Jun 12 18:26:42 2019
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/bay01_cableA1_a1_oct2015_6tb ONLINE 0 0 0
gpt/bay06_cableC2_a2_oct2015_6tb ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
gpt/bay11_cableB3_b1_sep2016_8tb ONLINE 0 0 0
gpt/bay02_cableC1_b2_sep2016_8tb ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
gpt/bay09_cableA3_g1_may2017_8tb ONLINE 0 0 0
gpt/bay14_cableC4_g2_may2017_8tb ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
gpt/bay08_cableD2_d2_17jul2017_8tb ONLINE 0 0 0
gpt/bay15_cableB4_d1_17jul2017_8tb ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
gpt/bay13_cableA4_e1_aug2018_14tb ONLINE 0 0 0
gpt/bay04_cableD1_e2_aug2018_14tb ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
gpt/bay07_cableB2_f1_apr2017_8tb ONLINE 0 0 0
gpt/bay12_cableD3_f2_apr2017_8tb ONLINE 0 0 0
errors: No known data errors
user@host:~$
Again, there are NO SATA drive errors in the system logs, ALL drives look good according to their SMART data, and the recent full storage pool scrub found ZERO errors, though it took a VERY LONG time to complete (over three days, per the scan line above).
The root filesystem, base system, and installed packages (/usr/local) all reside on a single ZFS filesystem in a separate ZFS storage pool, on a pair of mirrored SSD drives. That filesystem does NOT exhibit the problem. Running the timed dd test there thousands of times, every run finished in less than 1/3 of a second. The SSDs connect via motherboard SSD ports, and so don't use the LSI cards.
With ZERO logged errors letting me know where to look, what suggestions might you have to help me track down the source of these bizarre, randomly occurring IO pauses that slow down every process doing drive IO? Do I need to compile a new kernel, enable DTrace, and catch a process in the act? (I've never used it before, so I'd be a newborn DTrace infant, but I'm willing to learn, and tend to pick up things reasonably quickly.)
Thanks for any and all ideas!
-- Aaron