zfs synchronous write performance

Hi Folks,

I am experimenting with ZFS on the latest FreeBSD release and looking at its performance, particularly synchronous write performance. I am using iozone for testing.

Hardware Setup/Tuning
=====================
Memory = 16GB
Default ARC settings (~15 GB)
L2ARC = 240 GB on SSD1, partition 1
ZIL/SLOG = 16 GB on SSD1, partition 2
HDD = 2 partitions of 924 GB each, from different disks, striped (the default). No RAID.
Compression off
Dedup off
The rest of the settings are mostly default (a rough sketch of the pool layout commands is below).
======================
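
For reference, roughly how such a pool would be laid out. The pool name "tank" and the device names (ada0p1/ada1p1 for the HDD partitions, ada2p1/ada2p2 for the SSD partitions) are placeholders, not my actual devices:

    zpool create tank ada0p1 ada1p1    # two HDD partitions, striped by default
    zpool add tank log ada2p2          # 16 GB SLOG on SSD1 partition 2
    zpool add tank cache ada2p1        # 240 GB L2ARC on SSD1 partition 1
    zfs set compression=off tank
    zfs set dedup=off tank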

With this I get the following numbers for a single-stream write of a 32 GB file with a 512 KB record size (512 KB per write):
Sequential write without O_SYNC = 211 MB/s
Sequential write with O_SYNC = 112 MB/s

Random write without O_SYNC = 199 MB/s
Random write with O_SYNC = 103 MB/s
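
Roughly, the runs were of this form (sketch only; /tank/testfile is a placeholder and this is not necessarily the exact command line):

    iozone -i 0 -i 2 -s 32g -r 512k -f /tank/testfile      # sequential + random write, async
    iozone -i 0 -i 2 -s 32g -r 512k -o -f /tank/testfile   # same, files opened with O_SYNC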

Sync write performance is almost half of the async numbers. While monitoring disk usage and performance during the test, I noticed that data is flushed from the ZIL in bursts; the ZIL seems to accumulate about 990 MB of data before flushing it out.
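
I was watching the burst behaviour with something like the following (pool name and device filter are placeholders):

    zpool iostat -v tank 1    # per-vdev throughput, 1-second samples
    gstat -f ada              # per-provider I/O at the GEOM level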

I played around a bit with the tunables mentioned in https://wiki.freebsd.org/ZFSTuningGuide, but could not improve sync write performance much.
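
(For example, one way to check how much of the gap comes from the ZIL path at all is to temporarily disable sync on a scratch dataset and rerun the benchmark. The dataset name is a placeholder, and disabling sync drops the durability guarantee, so this is purely a diagnostic:)

    zfs set sync=disabled tank/test    # rerun the benchmark, then revert
    zfs set sync=standard tank/test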

Could you please suggest best practices and tuning tips to improve sync write performance?

Thanks
 
I'll be maximally obnoxious. Not because I'm trying to hurt your feelings, but to make a point. So please don't take it personally. Let me ask a few questions:

1A. Why are you writing synchronously? What exactly are the atomicity needs of your application that make this necessary?

1B. If you don't state the performance needs of your application, I will refuse to help you, because life is too short to polish round balls that are already shiny. If you tell me that you have an application that absolutely needs to write at 400 MB/s because it's connected to data acquisition hardware that will otherwise lose data, and that you need to write synchronously because you have implemented your own distributed database on top of the file system but didn't bother to implement your own transaction logging for consistency, then that's something that we can work with (in which case the answer is going to be: forget it). On the other hand, if you tell us that the purpose of your system is to serve music files over a 100BASE-T Ethernet network, and you really only need it to run at that speed, then the difference between 100 and 200 MB/s is irrelevant, since either is already 10x better than what you really need (and the answer is going to be: ignore it).

2. What is your hardware? 100 MB/s could be excellent (if you have first-generation SSDs from 4 years ago, and two 3600 RPM IDE disks), or it could be horrible (if you have Violins for SSDs, and a JBOD with 84 15K RPM SAS drives connected via a half dozen 6 Gbit/s SAS cables). Your system is probably in the middle between these ridiculous extremes. In particular with synchronous writes, the latency of your SSDs and the RPM of your disks will make a huge difference to performance. If we don't know where the bottleneck is (which can only be guessed if we know the hardware), then we can't help you tune to widen the bottleneck.

3. About the measurements you report: What parameters did you use for IOzone? Which tests did you run? When you say GB or MB/s, do you mean decimal or binary giga/megabytes (GB/MB or GiB/MiB)? Were the tests at the inner or outer edge of the disk, in the middle, or all over? How big and how full is the file system?

In general, I fully expect the sync write performance to be much slower than non-sync, unless your SSDs are very very fast (and consumer-type SSDs are probably not fast enough), and the working set of your write workload fits into the write cache provided by the SSDs (which it doesn't: your ZIL is only 16 GB and your working set is 32 GB). As a matter of fact, with 512 KiB writes on two disks, your 103 - 112 MB/s number is pretty good. I just went and, for fun, measured a modern fast disk (10K RPM SAS 2.5" 300GB), and for 512 KiB synchronous sequential writes at a queue depth of 1, it only gets 54.70 MiB/s at the outer edge of the disk, fundamentally because it misses a rotation after each write. With your two disks, your performance is in the ballpark, and given that your disks are probably considerably slower than 10K RPM disks, your performance might actually be very good.
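
Back of the envelope (rough numbers): at 10K RPM one rotation takes 60/10000 s = 6 ms. 54.70 MiB/s with 512 KiB writes is about 109 writes per second, i.e. roughly 9.1 ms per write, which is about one missed rotation (6 ms) plus the ~3 ms it takes to actually transfer 512 KiB at outer-edge media speed. So the measurement is consistent with missing a rotation after each write.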
 
While it's possible to put L2ARC and SLOG onto the same device, it's generally considered bad practice to do so, due to the vastly different performance characteristics of the two (one is write-only, the other read-mostly). This is especially true if you are benchmarking right after the system has booted or the pool has been imported, as the L2ARC goes into "write-boost" mode to try to fill the cache as quickly as possible.

An SSD has a finite number of IOPS. For the absolute best performance, use it for one task only: either L2ARC or SLOG.
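
If you want to try that with your current hardware, the cache vdev can simply be removed so the SSD serves only as a SLOG (pool and device names below are placeholders for your pool and your L2ARC partition):

    zpool remove tank ada2p1    # detaches the L2ARC (cache) partition from the pool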
 