Solved rsync blocks ZFS operations and input devices, system unusable

I know I'm an early adopter of FreeBSD
That's funny right there.

BSD has existed for about 40 or 50 years. FreeBSD for about 25 or 30. Quite a few people on this forum have run it for that long. I'm only at about year 15 (although I used other BSD-derived OSes in the 1980s).

Your post here is still completely unhelpful.

Here would be my suggestion if you want help debugging this: First, turn off the GUI entirely and perform the same operations from the CLI. Make sure no other processes are running. While doing these operations, do some monitoring, like some combination of top, vmstat and iostat. Save the output of those. And then post exactly what you were doing, how many IOs were actually running, and what the memory pressure is. You might be having an IO scheduling problem, or you might have some other problem entirely.
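As a minimal sketch (file names, intervals and counts here are arbitrary), something like this captures about a minute of data in the background while you reproduce the problem:

Code:
# capture roughly one minute of system activity while the slowdown is happening
top -b -S -d 12 -s 5 > top.out &
vmstat -w 5 -c 12 > vmstat.out &
iostat -x -w 5 -c 12 > iostat.out &
wait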
 
I still can't figure out what filesystems are involved. Is there a second ZFS pool? What about the ext4 mentioned in the first place? Is the rsync target an internal SSD or also USB?
 
You are all not helpful.

Specific solution: Disable the sync on the dataset

rsync tries to be safe. It often asks the filesystem: "Are you SURE this is written to disk?" (fsync).
When ZFS receives this request, it must write to the ZIL (ZFS Intent Log) immediately.

You can tell ZFS: "Ignore the safety checks. Just cache it in RAM and write it when you can."

The Fix:
Set this on the Target Dataset (where you are copying TO):

Code:
zfs set sync=disabled zroot/your/target/dataset

(Warning: If the power cuts out during the copy, you lose the last 5 seconds of data. For a home copy, this is usually acceptable).
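To check what a dataset is currently using, and to go back to the default once the bulk copy is done:

Code:
# check the current setting (the default is "standard")
zfs get sync zroot/your/target/dataset
# restore the default behaviour after the copy
zfs set sync=standard zroot/your/target/dataset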

All this tells us that IOPS (Input/Output Operations Per Second) and latency, not bandwidth, are the culprits.

Keeping sync disabled is usually not a problem. You might lose files written in the last 5 seconds before an abrupt power loss, and your browser profile might get corrupted. But it is definitely not desirable to disable sync on the entire zroot, although doing so will not corrupt your system.

Or does it make sense to disable sync on the entire zroot home folder permanently for desktop users? It's such a small fix, with little risk, that prevents your entire machine from freezing/locking up during write-intensive tasks, which are prevalent for desktop users who most likely write a lot into their home folders.

[By the way, nothing else worked, I tried so many different things from rtprio to kernel zfs settings, etc etc; only disabling sync works, and it makes sense now too]
 
If you “dd if=usb-device bs=1m of=/dev/null” while the heavy writes are hitting your SSD ZFS pool, do you see a similar slowdown, or is it fast?

You can also play with hw.usb.xhci.use_polling and a couple of other sysctls to see if things improve. You can also play with scheduling parameters such as sysctl kern.sched.steal_thresh…. Do a web search for settings for better interactive performance on FreeBSD. Or ask Gemini!
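For example (da1 is only a guess at the USB device name, substitute your own):

Code:
# read the USB drive flat out while the heavy writes are hitting the SSD pool
dd if=/dev/da1 of=/dev/null bs=1m count=10000
# look at the current values of the knobs mentioned above
sysctl hw.usb.xhci.use_polling kern.sched.steal_thresh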
 
You are all not helpful.
People who argue while asking for help don’t help their cause. Better to acknowledge others who take time to respond & are trying to help you even if you don’t find their suggestions helpful.
bursts of writes to my zroot SSD ranging from 50-85 MB/s or so
You can use zpool iostat 3, for example, to get IO stats reported every 3 seconds. What is the raw speed of your SSD? 50-80 MB/s seems a bit low. I guess this is limited by your USB HDD?
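For example (assuming the SSD is ada1):

Code:
# quick, naive raw transfer-rate benchmark of the SSD itself
diskinfo -t /dev/ada1
# per-vdev pool statistics every 3 seconds
zpool iostat -v zroot 3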
By the way, nothing else worked, I tried so many different things from rtprio to kernel zfs settings, etc etc; only disabling sync works
Your immediate problem may be solved but I suspect this works around the underlying cause….
 
rsync tries to be safe. It often asks the filesystem: "Are you SURE this is written to disk?" (fsync).
It does not do this by default. It only does this if you use the --fsync option on rsync. Did you use that option? If yes, it seems a bit silly first to ask rsync to write every file synchronously, and then ask ZFS to ignore the sync requests you just ordered rsync to make.

But you also said early on that the target disk (the one with the zroot using ZFS) is an SSD (you wrote SDD). On an SSD, the latency penalty for an fsync request should be just one random write, which should be about a ms; on a spinning HDD that would be about 10ms. The fsync request only happens once per file, so the distribution of file sizes being copied would be helpful to know. If all your files are large (for example 100MB each), the one extra fsync should make very little difference; if all your files are extremely small (say 2KB), the throughput penalty of waiting for that one extra write would reduce your rsync bandwidth.
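If you want a rough idea of that distribution, a quick sketch (the source path is a placeholder):

Code:
# crude file-size histogram of the tree being rsync'ed
find /path/to/source -type f -print0 | xargs -0 stat -f %z | awk '
  $1 < 4096                  { small++ }
  $1 >= 4096 && $1 < 1048576 { medium++ }
  $1 >= 1048576              { large++ }
  END { printf "<4KB: %d   4KB-1MB: %d   >=1MB: %d\n", small, medium, large }'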

You also said early on that while the rsync was running (very slowly), your system became unusable because mouse and keyboard were not responsive. Making the file system effectively faster (because it no longer waits for sync writes) should not affect mouse and keyboard at all. That's why I asked for some monitoring data: could it be that mouse and keyboard are causing other processes to perform disk-intensive operations?

Or does it make sense to disable sync on the entire zroot home folder permanently for desktop users?
How valuable is your data to you?

[By the way, nothing else worked, I tried so many different things from rtprio to kernel zfs settings, etc etc; only disabling sync works, and it makes sense now too]
It would help if you shared the measurements you took while you "tried so many things". And I don't agree with the statement that "it makes sense".
 
gstat -pdoBI5s output (ctrl-c to end) during the slowdown can be useful to see what devices are doing.

If the SSD wasn’t trimmed before installing FreeBSD, it can significantly slow down performance on a pre-used drive. Take a look at zpool-trim(8).
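For example:

Code:
# kick off a manual trim and wait for it to finish
zpool trim -w zroot
# or check trim status on the pool's devices
zpool status -t zroot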
 
If the SSD wasn’t trimmed before installing FreeBSD, it can significantly slow down performance on a pre-used drive.
I trimmed it, and have cron set to trim weekly.
if you use the --fsync option on rsync. Did you use that option?
No.
while the rsync was running (very slowly), your system became unusable because mouse and keyboard were not responsive.
Rsync is running at full throttle, not slowly, and yes, input devices would freeze intermittently for a while.
It would help if you shared the measurements you took
I can replicate the problem at any point, and I will, and I will try to get some things that others have suggested a little while later.
 
L(q) for the zroot is around 10-20. Is that a lot?
Depends on the write speed. How many [ms] per write are reported? If it is, say, 50 ms, then you are looking at about a second of latency for writes to the pool. This is optimized for throughput, not latency.
 
Depends on the write speed. How many [ms] per write are reported? If it is, say, 50 ms, then you are looking at about a second of latency for writes to the pool. This is optimized for throughput, not latency.
I'd say average about 50-60ms.

I get that it's optimized for throughput. What's the way to optimize it for latency?
 
L(q) for the zroot is around 10-20. Is that a lot?
That seems high. What does some of the gstat -pdoBI5s look like under this load? At the very least, it seems your data source is significantly outmatching your destination, so it's getting a large queue waiting to be serviced, which (since it is your system drive) can lead to lots of things being slow to react. It's trying to optimize throughput at the expense of latency (this is the server vs. desktop bias showing through), would be my take on it.
 
L(q) for the zroot is around 10-20. Is that a lot?
Depends. As SirDice said: for a system that is trying to optimize the throughput of each disk, and the disks are spinning disks (not SSDs) that is not a lot. The underlying reason is this: disks are one of the few things in the universe that "work better" (get more efficient) the more overloaded they are. That's because if you give the IOs to the disk one at a time, the disk has to move the head to the correct track for each IO, and wait for a partial rotation for each of them. This is by its nature slow, since these are mechanical processes: A seek takes on average 10ms, and for normal 7200 RPM disks, waiting on average half a rotation takes another 4ms. Compared to that, the actual read/write time is small, typically a fraction of a ms to 10ms (for the largest IOs that occur in practice, about 2MB).

Now in contrast, if you give the disk 20 IOs at once, and tell it "do them in any order, just be efficient about it", the disk will make a map of all IOs that need to be done (on the platter), and find an optimal path that minimizes seek distance and rotational waits. This typically makes the total throughput of the disk better by roughly a factor of 2 or 3. So yes, overloading a disk (giving it lots of IOs to do, meaning a large L(q) = queue length) makes it MUCH faster, in the sense that each individual IO takes half the time.

There is a price to pay: These 10 or 20 IOs were probably submitted by 10 or 20 different programs / processes / threads, which may or may not be interconnected. If you are a human user or program/process/thread that is waiting for any one IO to finish, you don't know where in the queue that IO is, and you may have to wait on average for 5 to 10 other IOs to get done.

The real question here is this: What are these 10-20 IOs in the queue? If a few are coming from a latency-critical interactive workload, while most are from a throughput-sensitive batch-processing workload, then you can see that the interactive user will be unhappy and impatient.

I'd say average about 50-60ms.
That is awfully long. A normal spinning disk should on average take 10ms per IO (reads perhaps a little faster), maybe twice as much for very large IOs (MB or so). SSDs should take about 1ms per IO. I wonder where that 50-60ms per IO comes from.

I get that it's optimized for throughput. What's the way to optimize it for latency?
That is REALLY hard to do in general. On FreeBSD, the only thing to adjust is the nice parameter. Linux has way more knobs to tune, but unless you put weeks of effort into it (BTDT), the results will be bad or mixed. And highly tuning your IO subsystem for one workload tends to screw over other workloads.

The other thing which makes this hard is that you need to be super clear on your goals: Are you optimizing to minimize the latency for ONE workload, and ignoring all other workloads? Or maximizing the throughput for ONE workload (independent of latency, which only makes sense if that workload is multi-threaded or parallel, but most important ones are these days), while ignoring all other workloads? Or do you need some combination of "good for one, decent for the others"? Or are you doing an optimization of $/<some metric>, to spend the least amount of money to meet your SLOs?

For amateur operations, the best way to perform this optimization is to only run one workload at a time, and banish "unimportant" or "only throughput relevant" workloads to off times (like the middle of the night).
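In this thread's specific case, one other low-effort option is to throttle the bulk copy itself instead of tuning the IO stack; the limit below is an arbitrary example value and the paths are placeholders:

Code:
# cap the copy at roughly 20 MB/s so the destination's write queue never builds up
rsync -a --bwlimit=20000 /path/to/source/ /path/to/target/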
 
What's zpool iostat -w zroot look like? And I'm still curious about gstat -pdoBI5s, too
I'll get that in a bit.
What kind (vendor/model) of SSD is this?
The SSD is an older one. Here's camcontrol identify for it:
Code:
pass2: <TOSHIBA THNSNJ256GCSU JURA0101> ACS-2 ATA SATA 3.x device
pass2: 600.000MB/s transfers (SATA 3.x, UDMA5, PIO 8192bytes)

protocol              ACS-2 ATA SATA 3.x
device model          TOSHIBA THNSNJ256GCSU
firmware revision     JURA0101
serial number         64FS112HTB4W
WWN                   500080d91016da62
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       500118192 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA5
media RPM             non-rotating
Zoned-Device Commands no
 
Ok, interestingly, I'm also getting some occasional stutter when I `zfs send | pv | zfs receive` between two external USB HDDs. gstat says they are both 100% busy, and I'm getting L(q) jumping to 10-20 on the device that's being written to. It's minor stutter, but I don't know why IO between two external drives would be stuttering/blocking even once in a while. Again, this is with sync on.

EDIT: this small stutter may have been due to me recompressing during the send by accident and going heavier on CPU. The big blocking issue remains.
 
zpool iostat -w zroot:
Code:
zroot        total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0     17     18      0      0      0      0      0
255ns           0      0      0      0  1.62K  1.18K     63     51      0      0      0
511ns           0      0      0      0  6.54K  6.07K    360    236      1      0      0
1us             0      0      0      0  3.94K  11.9K    438    597      2      0      0
2us             0      0      0      0  1.88K  6.51K    135    611      1      0      0
4us             0      0      0      0    480  1.05K     20    467      0      0      0
8us             0      0      0      0     65    120      6    140      0      0      0
16us            0      0      0      0      5     29      0    208      1      0      0
32us           12     82     13    204      3     44      1    465      2      0      0
65us          152  5.67K    179  9.64K     10    100     13  1.03K      3      0      0
131us       1.75K  7.41K  1.79K  10.9K     25    164     33  1.29K      9      0      0
262us       4.51K  3.21K  4.57K  2.72K     45    289     47  2.16K     16      0      0
524us       6.08K  6.76K  6.17K  7.69K      8    421     97  2.77K     13      0      0
1ms         2.48K  7.11K  2.62K  12.9K     15    717    104  1.96K      5      0      0
2ms           677  4.22K    487  11.9K      3  1.36K    116  1.20K     10      0      0
4ms           504  5.78K    362  17.9K      8  2.50K     85  2.38K      6      0      0
8ms            60  9.28K     10  3.86K      0  3.31K     26  3.73K      6      0      0
16ms           10  9.66K      7  5.10K      0  4.52K      0  3.34K      1      0      0
33ms           25  8.38K     25  15.7K      0  5.01K      0  2.10K      0      0      0
67ms            8  8.04K      8  5.21K      0  5.76K      0  1.43K      0      0      0
134ms           1  6.48K      1     42      0  4.79K      0  1.25K      0      0      0
268ms           0  5.71K      0      0      0  4.17K      0  1.09K      0      0      0
536ms           0  6.48K      0      0      0  5.25K      0   1004      0      0      0
1s              0  4.99K      0      0      0  4.33K      0    469      0      0      0
2s              0  3.88K      0      0      0  3.25K      0    528      0      0      0
4s              0    770      0      0      0    546      0    172      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
gstat -pdoBI5s:
Code:
dT: 5.011s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  cd0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  ada0
    0    209      1    102  0.521    197  17469   20.3      0      0  0.000     11    1.0   40.6  ada1
    0     10      0      0  0.000     10    124   78.4      0      0  0.000      0  0.000   40.7  md0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  da0
    1    176    176  11914    7.9      0      0  0.000      0      0  0.000      0  0.000   91.4  da1
dT: 5.002s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  cd0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  ada0
    0    210      1    102    1.2    198  16519    7.3      0      0  0.000     11    2.3   29.3  ada1
    0     11      0      0  0.000     11    175   45.9      0      0  0.000      0  0.000   24.6  md0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  da0
    0    187    187  12043    7.6      0      0  0.000      0      0  0.000      0  0.000   89.6  da1
dT: 5.008s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  cd0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  ada0
    0    197      1     77  0.457    187  16564    1.1      0      0  0.000     10  0.945    4.4  ada1
    0      9      0      0  0.000      9    105    7.0      0      0  0.000      0  0.000    3.2  md0
    0      0      0      0  0.000      0      0  0.000      0      0  0.000      0  0.000    0.0  da0
    2    182    182  11370    8.1      0      0  0.000      0      0  0.000      0  0.000   92.2  da1
 
You have quite a few drive IOs (disk_wait write; the I/O time excluding queue waits) that are falling well north of 10ms, pushing the average up to 3ms, but without a baseline who knows if that is abnormal for your particular drive.

You could install sysutils/smartmontools and see what you get out of smartctl -x /dev/ada1. Each vendor seems to have their own stuff they stick in there for SSDs, but if this (you mentioned it is old) drive has "seen things", there might be some hints in there. You could also issue a zpool trim -w zroot (be prepared to wait, and expect IO to be impacted unless you rate-limit with -r) and see if that helps at all, but you indicated a trim had been performed at some point.

One item on your sync=disabled query from earlier: if you are loading bulk data onto a user's filesystem (like a large rsync to transfer over from a different machine, where the data is reproducible should something unexpected happen) you can certainly set sync=disabled on the filesystem to facilitate getting things rolling. For example, I do system upgrades into mounted boot environments, and I set sync=disabled during the installation process to speed things up. (And again, this is a situation where if power were cut, I would just run make install again, and I wouldn't be impacted by any lost writes to the boot environment being installed.)
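As a rough sketch of that workflow (the boot environment name below is made up):

Code:
# before installing into the mounted boot environment
zfs set sync=disabled zroot/ROOT/14.1-upgrade
# ... run the installation / bulk copy ...
# afterwards, back to the default
zfs set sync=standard zroot/ROOT/14.1-upgrade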
 
Thanks.
smartctl -x /dev/ada1
Smartctl is fine. I've been running tests on it for years, to make sure it's okay. It's got only 72% of lifetime remaining.
One item on your sync=disabled query from earlier: if you are loading bulk data onto a user's filesystem (like a large rsync to transfer over from a different machine, where the data is reproducible should something unexpected happen) you can certainly set sync=disabled on the filesystem to facilitate getting things rolling. For example, I do system upgrades into mounted boot environments, and I set sync=disabled during the installation process to speed things up. (And again, this is a situation where if power were cut, I would just run make install again, and I wouldn't be impacted by any lost writes to the boot environment being installed.)
This. It's the crux of the matter. Why do *I* have to manage that? Why is there not a scheduler to balance the load in an extremely minimal way, only so that if your IO is maxing out you can allow up to 1-2% of system IO through in some balanced way? That way you preserve the throughput everyone so much desires while ensuring that throughput does not block the OS. I mean, this is dangerous.

For now, one way is to create a separate dataset for all the reproducible large data storage that's moving around in crazy IO throughput ways and set sync=disabled on just that dataset. I think that's referred to as "scratch space"? It's sort of like what you suggest but without turning sync on and off constantly.
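Something like this, for example (the dataset name and mountpoint are placeholders):

Code:
# dedicated scratch dataset: async writes only, easy to destroy later
zfs create -o sync=disabled -o mountpoint=/home/me/scratch zroot/scratch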

Linux' BTRFS is so awful that it has sync off by default, and you can't even turn it on, you can only set it during mount. (small rant). 💩
 
Smartctl is fine. I've been running tests on it for years, to make sure it's okay. It's got only 72% of lifetime remaining.
That leaves the question of why the SSD is so slow when writing. If you look at the gstat data, ada1 sometimes has a mean (!) write latency of 20ms, way more than an SSD should have for individual writes (it corresponds to only 50 IOPS, which is very wrong). If you look at the zpool iostat data, a significant fraction of writes are over 100ms, extending all the way to over a second. With a mean queue depth of 10-20, that is even worse than the mean 20ms suggests. For a spinning disk, those numbers would be bad; for an SSD they are awful.

Why do *I* have to manage that? ... only so that if your IO is maxing out you can allow up to 1-2% of system IO through in some balanced way. ... I mean, this is dangerous.
It is not dangerous at all. On the contrary. Most server-class systems try to maximize IO throughput, both for long streaming IOs, and small random IOs. They really want 100% of the system's hardware resources to be utilized, and utilized efficiently. As I explained above, getting high efficiency out of disks usually means having long IO queues, and server applications are usually multi-threaded and handle that just fine.

Why do you have to manage that? Because the OS can't know what your desires and goals are, and what workload is important to you (in what fashion) when there are multiple.

Linux' BTRFS is so awful that it has sync off by default, and you can't even turn it on, you can only set it during mount. (small rant). 💩
There are a lot of other things that are awful about BTRFS. The basic design (the modifiable CoW B-tree) is pretty nice, and closely matches what other systems (WAFL, ZFS) use. The real-world implementation is a train wreck (or at least was as of a few years ago). I used to refer to it as a "machine for losing data", and my colleagues would knowingly chuckle.
 
The real-world implementation is a train wreck (or at least was as of a few years ago). I used to refer to it as a "machine for losing data", and my colleagues would knowingly chuckle.
In 2009 a Linux friendly sysadmin at a large corp was telling me how BTRFS was soon going to wipe the floor with ZFS. The reality distortion field around all cults is rather strong....
 
Most server-class systems try to maximize IO throughput, both for long streaming IOs, and small random IOs.
But on a root partition?
That leaves the question of why the SSD is so slow when writing.
I think it's that whole ZIL (intent log) queue, and the going back and forth between it and the actual writes. The ZIL writes metadata, which is much worse than sequential writes, and the sync option forces it to keep interrupting the sequential writes? ZFS even has an option to use a dedicated ZIL drive (sounds like an enterprise-level configuration) to alleviate this issue (BTRFS doesn't 🤭 even though the entire Meta/Facebook is on BTRFS).
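If someone wanted to try that, adding a separate log vdev is a one-liner; the device name below is a placeholder, and it should really be a fast SSD with power-loss protection:

Code:
# attach a dedicated ZIL/SLOG device to the pool
zpool add zroot log /dev/nda0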

SSDs also have this crazy thing with fast burst writes: as I understand it, they first do a quick "shallow" write into a cache area where each cell stores just 1 bit (which is fast), and later re-write that data "deeply" at 3 bits per cell (which takes longer and is a major slowdown). Or something like that. Correct me if I'm wrong please.

BTRFS. The basic design (the modifiable CoW B-tree) is pretty nice, and closely matches what other systems (WAFL, ZFS) use. The real-world implementation is a train wreck (or at least was as of a few years ago). I used to refer to it as a "machine for losing data"
To be fair, I never lost data on BTRFS except the "negligible" data due to no sync. But ZFS is so much better. You can just sense the solidness of ZFS.
 