Slow NVMe ZFS Mirror Pool w/ Samsung 980 1TB Drives

Hi all,
I'm having an issue on my system with my Samsung 980 1TB NVMe drives. They're set up in a ZFS mirror as my root pool. Whenever I push a lot of I/O through the pool (e.g., an rsync of a lot of data, either large files or smaller ones), the system slows to a crawl until I stop the process doing the disk activity. Firefox halts (attempting to visit or refresh a page just spins and spins), trying to open a terminal in i3 doesn't respond, etc. When I stop the process doing the disk activity, after a while (depending on how much was being copied) the system starts responding as if nothing happened. I can watch "zpool iostat zroot 3" and see the copy happening while the system is slow, and when the writes drop back down, that's when the system starts responding again. "top" shows no process obviously consuming CPU or memory, and load doesn't increase much.

I generally don't notice it because I don't do a ton of copying on this pool, but I recently started doing more and am now noticing it. I did see the issue a few months ago when I was rsyncing files to this host, but just shrugged it off since the system became responsive again after the copy.

I saw something about switching from the nvd driver to nda. Would that be beneficial? It's probably not going to be an easy switch, though, since my pool is currently set up on nvd0 and nvd1.

Any thoughts on what could cause this and/or the solution?

Some data:

Code:
Motherboard: ASRock B560 Pro4 LGA 1200

hw.model: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz

$ pciconf -lv
nvme0@pci0:2:0:0:    class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa809 subvendor=0x144d subdevice=0xa801
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller 980'
    class      = mass storage
    subclass   = NVM

Thanks!
 
I find gstat -pdo to be informative. The d/s column (and the two that follow) are for deletes (TRIMs), and o/s and ms/o are waits for flushes.
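
For reference, roughly what I run (flags per gstat(8); the column notes are just my shorthand):

Code:
# -p: only physical providers (disks), no partitions/labels
# -d: add delete/TRIM columns -- d/s, kBps, ms/d
# -o: add "other" ops (flushes) -- o/s, ms/o
gstat -pdo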

Do you have some form of TRIM being performed? Either with zpool set autotrim=on poolname or periodically (via cron) with zpool-trim(8)? If not, put one of those in place so that freed space can be reaped ahead of time by the drive.

If you have autotrim=on, you may (depending on the hardware) see high d/s and ms/d counters; if so, consider moving to periodic (nightly, when not busy) calls to zpool trim and setting autotrim=off (see the sketch after the quoted excerpt). As described in zpoolprops(7):

Be aware that automatic trimming of recently freed data blocks
can put significant stress on the underlying storage devices.
This will vary depending on how well the specific device handles
these commands. For lower-end devices it is often possible to
achieve most of the benefits of automatic trimming by running an
on-demand (manual) TRIM periodically using the zpool trim
command.
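
For reference, a rough sketch of the two approaches (using zroot, the pool name from your iostat command; adjust as needed):

Code:
# Option A: automatic trimming as blocks are freed
zpool set autotrim=on zroot

# Option B: leave autotrim off and trim on demand (or from cron)
zpool set autotrim=off zroot
zpool trim zroot
zpool status -t zroot   # shows per-vdev trim state/progress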

That said, it sounds like a write buffer is filling up and eventually bottlenecking all I/O on the root pool. Addressing TRIM as described may improve write performance to the drive (depending on the hardware and I/O patterns) enough to avoid this; if not, you may want to consider setting sync=always on the destination dataset (it doesn't have to be the entire pool), as this keeps ZFS from over-promising write performance and hopefully avoids the whole-system slowdowns. (In general this isn't recommended for "normal" data, as it disables many of ZFS's performance features.)
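
Something along these lines, assuming the rsync destination lives on its own dataset (the dataset name here is just a placeholder):

Code:
# Force synchronous semantics only on the destination dataset
zfs set sync=always zroot/usr/home/backups
# Revert to the default later if it gets in the way of normal use
zfs set sync=standard zroot/usr/home/backups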

A final option, if all else fails (or causes other system performance issues), would be to use --bwlimit=RATE with your rsync(1) call to avoid overrunning the write performance of the system.
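
A hedged example (the rate is arbitrary; --bwlimit takes KiB/s unless you give it a suffix, per rsync(1)):

Code:
# Cap the transfer at roughly 100 MB/s so writes can't swamp the pool
rsync -a --bwlimit=100000 /source/ /destination/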
 
I'll also note that the 980s aren't exactly speed demons.

Since this device relies on the host memory buffer (it doesn't have its own DRAM), you might also consider bumping hw.nvme.hmb_max.
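
If you want to experiment with that, it's a loader tunable; something like the following in /boot/loader.conf (the size is only an example, and I'm assuming it's specified in bytes -- check nvme(4) on your system):

Code:
# /boot/loader.conf -- allow the NVMe driver to grant a larger host memory buffer
hw.nvme.hmb_max="268435456"   # example value: 256 MiB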

Or, as you say, try the nda driver instead; there are some general guidelines in nvme(4), but it always comes down to "depends on the hardware and the workload", so try it and see. ;)
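
If you do try it, my understanding is that it's just a loader tunable (sketched below), and ZFS should still find the mirror on reboot since it imports by the on-disk labels rather than the nvd0/nvd1 device names -- but double-check against nvme(4) and have a fallback (e.g. a boot environment) handy:

Code:
# /boot/loader.conf -- expose NVMe namespaces as nda(4) devices instead of nvd(4)
hw.nvme.use_nvd="0"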
 
Thanks, Eric. I don't have autotrim enabled or a cron-based trim job. I did just run a manual trim, though, and will enable autotrim on the root pool to see if that has any effect. Based on gstat, it's doing a lot of TRIMming, so I'll wait for it to finish and then give it another test.

Sadly, it's not just rsync that's causing the issue -- other processes that do a lot of writes to the pool also cause the slowdown.

I also forgot to mention this is on FreeBSD 13.1 w/ 48GB RAM, just for reference.
 
I set autotrim=on for the root pool after the "zpool trim" work finished. I then tried the rsync again and it had none of the issues from before. I then tried the other workload and it was humming along for the majority of it, but then went back to the sluggishness in Firefox and I could see the write speed drop considerably. After the writes finished, the sluggishness was gone, and I saw "gstat -pdo" showing some TRIMs happening. With some more testing it was as smooth as could be, without the sluggishness, for both workloads -- and during that time gstat showed TRIMs happening periodically as well.

I'm not sure what the fix is for this, but it's definitely improved since enabling autotrim on the pool.
 
The improvement is likely much more from the completed manual trim than from the autotrims currently being issued. If you have a known downtime when you can tolerate running zpool trim, you may find the best balance by running that periodically and leaving autotrim off. With a drive of that size, it likely takes a significant period of time (measured in days or more, not minutes or even hours) to reach the point where a full trim is needed again. Try scheduling a weekly trim and see if that keeps things reasonable. (Again, assuming this is a system where you can, not a laptop, for example.)
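
A rough sketch of what that might look like in root's crontab (day and time are arbitrary):

Code:
# crontab -e (as root): weekly manual trim, Sunday at 03:00
0 3 * * 0 /sbin/zpool trim zroot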
 
I've noticed this slowdown on SSDs with buffer RAM/cache once the buffer gets full: greased lightning for a while at the start (of doing lots of file operations) and then "clunk", down into treacle mode.
Yeah, this one has no RAM; it uses the host memory buffer, which is why I mentioned that tuning it might be worthwhile. The obvious caveat is that if the power goes out, or the system crashes, you're looking at potential issues. I assume it honors things like sync(), so that when ZFS thinks data has been written to non-volatile storage it actually has been, but I don't know for sure.
 
Samsung NVMe drives are notorious for running (much) hotter than most other vendors' and throttling *really* hard at higher temperatures. It shouldn't be as bad as with earlier drives, but depending on the environment (e.g., fanless and/or very compact systems) they still tend to run very hot and show considerable performance drops...
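
If you want to rule that out, the health log page reports the composite temperature; something like the following against the controller from your pciconf output (see nvmecontrol(8)):

Code:
# Log page 2 is SMART / Health Information, which includes temperature
nvmecontrol logpage -p 2 nvme0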
 