Optane 905p as ZFS SLOG device?

Hi,

these are offered rather cheaply out of China on eBay.

I have never built a server with a dedicated SLOG device.
I will need to build a server on HP hardware with > 100 TB.
The current server uses 1.2 TB disks (48 of them, in 6 RAIDZ2 vdevs) but I will need to at least double that.

We would need a lot of expansion enclosures and the HP VAR has advised us to go with bigger disks. That dramatically reduces my number of spindles of course and I fear it's all going to be very slow.

So, I was thinking about speeding up writes (lots of small writes from thousands of devices sending syslogs....) with the above SSDs (I'd use the HHHL PCIe cards).

The alternative is to just configure one of those Supermicro servers with 45+ top-loading disks and use that - but by definition that is a single system which cannot easily be made highly available, and so far HPE servers have proven more reliable over time (while being much more expensive and potentially less powerful than Supermicro - you don't get more bang for the buck anywhere else than with Supermicro).
 
Intel Optane drives have high endurance and good random I/O performance, so they are a good SLOG choice for those reasons. I don't recall which models have power loss protection built in, but that is worth looking into as well; it's usually a battery or capacitor that keeps the drive powered long enough for the final writes to finish. It is still wise to set up redundancy, since a SLOG failure at the wrong moment (together with a crash or power loss) costs you the most recent synchronous writes. It's also recommended to confirm there is a need for one before adding it.

Use case also matters. A SLOG is only used when writes need to be synchronous. If your programs don't request that, they won't use it. Pending asynchronous writes go from RAM to the main pool, skipping the ZIL, whether it lives in the pool's main storage or on a separate device. I'm not sure, but I doubt syslog daemons request synchronous writes.
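
If you want to check that before buying hardware: on FreeBSD the ZIL counters are exposed as sysctls (the names below are as I remember them from the OpenZFS kstats, so treat them as an assumption), and zpool iostat has a latency-histogram mode. If the commit counters barely move while the loghost is busy, the workload is effectively asynchronous and a SLOG would sit idle. "datapool" is just an example pool name.

# ZIL activity counters; watch whether they grow under load
sysctl kstat.zfs.misc.zil.zil_commit_count kstat.zfs.misc.zil.zil_commit_writer_count

# Latency histograms per pool; the syncq_wait columns show how much
# synchronous queueing actually happens
zpool iostat -w datapool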

More spindles 'may' mean more speed. All other things being equal, as platter density increases you usually get higher throughput, because more bytes pass under the head per rotation; seek time is not improved in the same way, so larger drives mainly buy you sequential throughput. If you use bigger disks without filling the extra space, data is more likely to sit on the faster (outer) parts of the platters, so speed can go up further. When an I/O depends on several disks, the transfer normally takes about as long as the worst seek among them (or close to it), and the more disks are involved, the more likely one of them is near its worst-case seek; fewer disks means fewer chances that one of them drags the whole transfer down. Fewer spindles of comparable disks also make a disk failure statistically less likely.
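
To put a rough number on that "worst disk wins" effect (the 10% figure is purely an illustrative assumption, not a measurement): if each drive independently has a 10% chance of landing in its slow-seek range, a transfer that has to touch 6 drives waits on a slow seek with probability 1 - 0.9^6 ≈ 47%, while one that only touches 2 drives hits that case with probability 1 - 0.9^2 = 19%.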
 
When contemplating fast auxiliary storage...

If you don't have a lot of synchronous writing, then a SLOG won't help. Not many people have synchronous writing (and those who do generally know it).

An L2ARC might help, depending on your load.

A special VDEV dedicated to the metadata and (optionally) small files might also help, possibly a lot. But sizing special VDEVs is a black art.

But, you can't manage what you don't measure. You need to test...

And remember, the redundancy of those optional VDEVs needs to be at least as good as the zpools they serve.
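
For reference, adding those optional vdevs looks roughly like this; the pool name "tank" and the nda* device names are placeholders, so adjust to your own system before copying anything:

# Mirrored SLOG - only ever helps synchronous writes
zpool add tank log mirror nda0 nda1

# Mirrored special vdev for metadata (and, optionally, small blocks)
zpool add tank special mirror nda2 nda3

# L2ARC - no redundancy needed; losing it only loses cached copies
zpool add tank cache nda4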
 
How much is a lot?

(loghost </root>) 0 # zpool iostat -w


datapool total_wait disk_wait syncq_wait asyncq_wait
latency read write read write read write read write scrub trim rebuild
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1ns 0 0 0 0 0 0 0 0 0 0 0
3ns 0 0 0 0 0 0 0 0 0 0 0
7ns 0 0 0 0 0 0 0 0 0 0 0
15ns 0 0 0 0 0 0 0 0 0 0 0
31ns 0 0 0 0 0 0 0 0 0 0 0
63ns 0 0 0 0 0 0 0 0 0 0 0
127ns 0 0 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 28.4K 140K 11.5M 12.1K 631K 0 0
511ns 0 0 0 0 719K 6.57M 292M 3.18M 14.4M 0 0
1us 0 0 0 0 8.92M 53.2M 203M 78.9M 358M 0 0
2us 0 0 0 0 10.3M 1.54M 27.2M 236M 67.0M 0 0
4us 0 0 0 0 516K 191K 2.28M 22.1M 6.15M 0 0
8us 0 0 0 0 464K 39.3K 261K 3.01M 1.21M 0 0
16us 0 0 0 0 12.4K 2.56K 208K 4.64M 397K 0 0
32us 867K 0 1.10M 0 2.17K 471 372K 8.33M 448K 0 0
65us 360M 6.66K 362M 57.6K 900 10 721K 16.0M 595K 0 0
131us 105M 24.3M 107M 250M 1.23K 1 1.38M 29.1M 851K 0 0
262us 53.9M 34.2M 65.7M 1.29G 2.60K 5 2.38M 51.1M 1.15M 0 0
524us 17.2M 146M 30.4M 1.01G 5.32K 3 3.98M 54.2M 2.18M 0 0
1ms 24.2M 59.3M 105M 684M 11.5K 3 6.86M 27.7M 3.57M 0 0
2ms 58.3M 26.9M 543M 463M 23.8K 1 10.6M 34.1M 6.10M 0 0
4ms 125M 47.0M 834M 1.08G 53.2K 1 15.1M 68.5M 9.37M 0 0
8ms 193M 166M 1024M 4.11G 123K 0 16.6M 190M 13.2M 0 0
16ms 118M 484M 451M 1.98G 203K 0 17.3M 460M 16.9M 0 0
33ms 91.0M 924M 128M 280M 254K 0 15.7M 848M 22.4M 0 0
67ms 61.6M 1.35G 19.5M 27.3M 253K 0 9.58M 1.26G 34.4M 0 0
134ms 69.4M 2.00G 3.34M 3.94M 181K 1 7.36M 1.98G 56.9M 0 0
268ms 108M 3.60G 731K 233K 106K 0 6.79M 3.54G 99.3M 0 0
536ms 175M 1.69G 99.3K 16.0K 39.9K 0 7.53M 1.60G 167M 0 0
1s 255M 82.2M 15.1K 1.89K 11.9K 0 9.15M 77.5M 245M 0 0
2s 330M 26.6M 1.26K 0 7.94K 0 11.3M 26.5M 319M 0 0
4s 374M 51.2M 0 0 6.97K 0 11.5M 51.2M 362M 0 0
8s 350M 100M 0 0 4.60K 0 8.23M 100M 341M 0 0
17s 311M 162M 0 0 558 0 2.96M 162M 308M 0 0
34s 358M 85.1M 0 0 228 0 597K 85.1M 358M 0 0
68s 135M 32.5M 0 0 2 0 21.3K 32.5M 135M 0 0
137s 61.2K 98.8M 0 0 0 0 0 98.8M 61.1K 0 0
---------------------------------------------------------------------------------------

zroot total_wait disk_wait syncq_wait asyncq_wait
latency read write read write read write read write scrub trim rebuild
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1ns 0 0 0 0 0 0 0 0 0 0 0
3ns 0 0 0 0 0 0 0 0 0 0 0
7ns 0 0 0 0 0 0 0 0 0 0 0
15ns 0 0 0 0 0 0 0 0 0 0 0
31ns 0 0 0 0 0 0 0 0 0 0 0
63ns 0 0 0 0 0 0 0 0 0 0 0
127ns 0 0 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 4.50K 8.38K 648 3.03K 0 0 0
511ns 0 0 0 0 806K 425K 129K 721K 0 0 0
1us 0 0 0 0 3.62M 6.09M 115K 6.09M 4 0 0
2us 0 0 0 0 1.08M 1.40M 29.5K 11.3M 6 0 0
4us 0 0 0 0 15.0K 95.7K 1.51K 4.23M 0 0 0
8us 0 0 0 0 4.61K 16.6K 224 216K 0 0 0
16us 0 0 0 0 875 1.23K 300 286K 0 0 0
32us 5 0 7 0 240 131 714 515K 1 0 0
65us 63.3K 0 64.1K 0 77 4 2.11K 981K 4 0 0
131us 4.10M 3.01M 4.11M 3.50M 103 0 5.58K 2.14M 0 0 0
262us 275K 3.64M 275K 15.1M 328 0 5.41K 1.34M 4 0 0
524us 63.5K 2.99M 65.6K 9.36M 846 5 2.82K 1.79M 4 0 0
1ms 53.5K 2.75M 53.3K 4.95M 2.46K 3 3.41K 2.15M 3 0 0
2ms 96.6K 2.27M 103K 4.52M 1.38K 2 4.29K 2.52M 11 0 0
4ms 277K 3.10M 295K 27.2M 5.23K 4 6.83K 3.89M 21 0 0
8ms 543K 10.7M 573K 78.0M 18.1K 21 7.47K 10.1M 49 0 0
16ms 213K 32.1M 228K 69.2M 26.4K 19 4.10K 27.8M 61 0 0
33ms 163K 48.7M 161K 10.8M 17.4K 0 3.31K 43.4M 110 0 0
67ms 81.8K 68.5M 63.7K 1.10M 6.39K 2 3.16K 62.3M 75 0 0
134ms 14.4K 43.8M 7.25K 35.8K 876 0 2.20K 33.0M 66 0 0
268ms 5.80K 2.22M 1.11K 10.9K 395 0 3.99K 1.12M 41 0 0
536ms 9.46K 2.07K 157 82 624 0 8.63K 1.37K 10 0 0
1s 15.1K 0 2 0 1.22K 0 13.9K 0 0 0 0
2s 11.6K 0 0 0 281 0 11.2K 0 0 0 0
4s 8.34K 0 0 0 0 0 8.22K 0 0 0 0
8s 0 0 0 0 0 0 0 0 0 0 0
17s 0 0 0 0 0 0 0 0 0 0 0
34s 0 0 0 0 0 0 0 0 0 0 0
68s 0 0 0 0 0 0 0 0 0 0 0
137s 0 0 0 0 0 0 0 0 0 0 0
---------------------------------------------------------------------------------------
 
I don't recall which models have power loss protection built in, but that is worth looking into as well; it's usually a battery or capacitor that keeps the drive powered long enough for the final writes to finish.
3D XPoint (Intel brand name: Optane) is non-volatile memory: additional PLP (power-loss protection) is not needed. It was fast when it launched, but didn't scale as well as the now-ubiquitous flash memory (think density and speed at current PCIe Gen4 & 5 speeds and beyond). It combines its inherent non-volatility with low latency and high write endurance. Have a look, for example, at Glorious Complexity of Intel Optane DIMMs and Micron Exiting 3D XPoint.

Use case also matters. A SLOG is only used when writes need to be synchronous. If your programs don't request that, they won't use it.
It is not so much that writes need to be synchronous, but rather that you have a lot of them, relative to the normal read & write disk I/O and to what the main storage can handle. A lot of synchronous writes arise in specific circumstances, like databases, VMs and NFS. It is about offloading ZIL writes to a much faster device (the aspect most often mentioned) and about keeping ZIL writes off the data disks so they don't compete with normal read and write operations; the two usually go hand in hand.

From "Understanding OpenZFS SLOGs", September 22, 2021, by Dru Lavigne:
-- So, Why a SLOG?
Having the ZIL reside on the storage disks can result in contention: in other words, ZIL writes and reads must compete with other disk activity. This can cause some performance issues, especially on a system with a lot of small, random writes. On a busy pool limited by disk seek speeds, ZIL performance gets slower as pool activity increases.
[...]
-- Putting it all Together
OpenZFS provides several mechanisms to ensure that data gets written to disk. On a busy system that utilizes synchronous writes, moving the ZIL to faster SLOG media can reduce contention and might yield a performance boost.
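
A quick way to see whether a pool already has a separate log device and how much traffic it attracts ("tank" again being a placeholder pool name):

# A pool with a SLOG shows a separate "logs" section in its vdev tree
zpool status tank

# Per-vdev I/O every 10 seconds; log devices are listed apart from the
# data vdevs, so ZIL traffic is visible on its own
zpool iostat -v tank 10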
 
AFAIK, nothing really. Optane became the SLOG go-to as a more robust (power-wise) alternative to battery-backed, RAM-based secondary storage. In its early days, IIRC, Optane beat flash memory on every level except price. Optane may be old as a technology that hasn't managed to scale as expected and hoped for, but as a SLOG it is certainly not obsolete. Your use case with lots of spinning rust seems like a good place to opt for an Optane SLOG.
 
OK, thanks.
I thought I was missing something because Optane seems to be EOL just about everywhere.
Not even sure HPE is still selling any - I have asked now.
The actual quote for the servers wasn't as outrageous as the list prices, thank god.
 
Synchronous writes go to the ZIL; later they also go to the pool, just like any other asynchronous write. Asynchronous writes go straight to the pool, skipping the ZIL, but are buffered in RAM first. The ZIL is only read when it has to be replayed, such as after a power outage. Synchronous write calls only return once the data is on stable storage; the ZIL lets that happen sooner by not waiting for the usual organizational work (better-organized writes, compression, etc.). Asynchronous writes return while the data is still only buffered in RAM waiting to go to disk; it gets reordered, compressed, etc. and reaches the disk in a more optimized form at a later point in time (up to vfs.zfs.txg.timeout, but forced sooner if the buffer gets too full).
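
On FreeBSD both of those knobs are visible as sysctls (shown here read-only; the 5-second default is what I'd expect from stock OpenZFS, so verify on your own box):

# Maximum age of a transaction group before it is forced out to disk
sysctl vfs.zfs.txg.timeout

# Dirty-data ceiling that forces a txg out earlier than the timeout
sysctl vfs.zfs.dirty_data_max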

You can override this behaviour by changing sync=standard to sync=always, which forces all writes to be synchronous, or to sync=disabled, which never treats writes as synchronous. logbias=throughput will also bypass the log device.
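
For example, with a hypothetical dataset tank/syslog:

# See the current settings
zfs get sync,logbias tank/syslog

# Force every write through the ZIL/SLOG path...
zfs set sync=always tank/syslog
# ...or never wait for stable storage (data-loss window on power failure)
zfs set sync=disabled tank/syslog

# Keep sync semantics but steer ZIL blocks to the main pool, not the SLOG
zfs set logbias=throughput tank/syslog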

A ZIL always exists, whether inside the main pool or on a separate SLOG. If it is on separate media, the main pool media spends no I/O on ZIL writes, nor on ZIL reads when it is replayed after a crash or outage. A ZIL on the same media as the pool therefore turns synchronous writes into I/O amplification, and unoptimized I/O at that.

A fast SLOG + sync=always does not accelerate asynchronous writes. An asynchronous write returns immediately, reporting success while the data is still only in RAM, and RAM is faster than any SLOG. The SLOG is also not there to enlarge the RAM cache; more specifically, ZFS doesn't benefit from a SLOG larger than the ZFS ARC. Data goes from RAM to the pool without reading the ZIL, unless the ZIL has to be replayed, such as after a power outage.

So why use a SLOG? Because you are doing synchronous writes and need them committed to stable storage sooner than the main array can manage, want to reduce the write amplification those writes cause on the data disks, and want them safely on disk sooner. For asynchronous writes, a slow main pool stays slow. Consider faster disks, controllers and more ARC RAM if your bottleneck isn't synchronous writes taking too long to return, or the ZIL's write (or, rarely, read) amplification and I/O.

If anyone says something contradicting this - such as a ZIL sized greater than the ARC now being fully used, or the ZIL also receiving asynchronous writes to temporarily cache some of the write I/O - could you please point to a source (source code, OpenZFS pull request, official documentation, etc.)? I'd be happy to learn that such additional optimizations have become a thing.
 
We spec'ed a Supermicro server and originally they offered 3D XPoint (DC P5800X 400GB NVMe PCIe 4.0 x4)
https://www.intel.com/content/www/u...b-2-5in-pcie-x4-3d-xpoint/specifications.html
but later realized it's no longer on the pricelist.
Now, they offer these:

It looks like these "only" have half the write IOPS of the Intel.

If only I knew if I need them...
 
So why use a SLOG? Because you are doing synchronous writes and need them committed to stable storage sooner than the main array can manage, want to reduce the write amplification those writes cause on the data disks, and want them safely on disk sooner. For asynchronous writes, a slow main pool stays slow. Consider faster disks, controllers and more ARC RAM if your bottleneck isn't synchronous writes taking too long to return, or the ZIL's write (or, rarely, read) amplification and I/O.
Agree. I have also experimented with a SLOG and eventually dropped it. In most cases it is a waste of a drive and makes the system more vulnerable: if it is a single drive, an accidental disconnect or drive failure puts the pool at risk. A fast cache drive, on the other hand, always seems useful for speeding up the system, and it is safe.
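
Worth noting in that context: log and cache devices can be taken out again without rebuilding the pool, which keeps experiments with them fairly low-risk. The names below are placeholders, taken from what zpool status would report:

# Drop an L2ARC device; only the cached copies are lost
zpool remove tank nda4

# Remove a log vdev by the name zpool status shows for it (e.g. mirror-1)
zpool remove tank mirror-1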
 
In 90%+ of all cases where someone thinks they 'need' a SLOG vdev, they actually need a SPECIAL vdev.

A special vdev considerably speeds up all ZFS housekeeping and metadata operations (e.g. listing a few hundred snapshots on spinning rust can take several dozen seconds, up to minutes...).
Additionally, a carefully adjusted special_small_blocks property for the pool or for individual datasets massively improves small/random I/O.
Those two are usually the major bottlenecks with spinning-rust pools, so adding an NVMe-based special vdev makes such pools much more bearable to use and maintain.

It goes without saying that special devices have to be redundant, because as with any other vdev without redundancy, if that vdev fails, the whole pool is gone.
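
A sketch of what that can look like in practice; the pool/dataset names, the nda* devices and the 64K threshold are placeholders, and the threshold has to stay below the dataset's recordsize or every data block ends up on the special vdev:

# Three-way mirror to match the two-disk fault tolerance of raidz2 data vdevs
zpool add tank special mirror nda0 nda1 nda2

# Send blocks of 64K and smaller (metadata goes there regardless) to the special vdev
zfs set special_small_blocks=64K tank/syslog

# Keep an eye on how full it gets; once full, new blocks spill back to the data vdevs
zpool list -v tank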
 