ZFS: Sharing an NVMe between the root partition and the ZIL (SLOG)?

Hey guys,

I searched the forums and mailing lists (both FreeNAS and FreeBSD) and couldn't find a definitive answer.
If this was already resolved, please link me the thread and I will read through it! ^^

Anyways.

I am preparing to build a storage machine for myself, and among the things I am trying to figure out is this one:
Should I get an NVMe drive (since we ported DragonflyBSD's support just recently, thanks guys! ^^) and then partition it (they don't seem to come smaller than 128 GB), using one partition to host my root folder (basically the whole system) and a section of 15 GB or so as a ZIL (SLOG) for the main storage pool?

I am planning to start with one vdev that will be RAIDZ1 with 4x 4 TB drives.
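Something like this is what I have in mind, just as a rough sketch (device names are made up, and the actual root-on-ZFS install would obviously need bootcode and the usual installer dance on top):

  # carve the NVMe into a system partition and a small SLOG partition
  gpart create -s gpt nvd0
  gpart add -t freebsd-zfs -s 100G -l system nvd0
  gpart add -t freebsd-zfs -s 15G  -l slog   nvd0

  # root pool on the big partition, storage pool as one RAIDZ1 vdev,
  # then attach the small partition as a dedicated log device
  zpool create zroot gpt/system
  zpool create tank raidz1 da0 da1 da2 da3
  zpool add tank log gpt/slog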

So has anyone done this successfully? Is it a good idea?
P.S. I found that one should not let the ZIL and L2ARC share a drive due to how the L2ARC activates, so that idea is out. But if I can share the ZIL and the system on the NVMe, it would be worth buying a 140-buck device as opposed to the smallest Corsair Neutron I can find just for the ZIL, since write speed is all that counts for it to be good, not the space.

Thanks in advance! ^^
 
I dunno about a shared root and ZIL combined, but I can say that if writes are important, look carefully at the specs. The smaller-sized modules have much lower write speeds.
For instance, the spec sheet (PDF) for the Toshiba XG3 series NVMe has this:

1 TB and 512 GB: 2400 MB/s read, 1500 MB/s write
256 GB: 2400 MB/s read, 1100 MB/s write
128 GB: 2100 MB/s read, 590 MB/s write
 
Thank you for the feedback! ^^
I was planning on purchasing the Plextor M8Pe, but in the 128 GB flavor.
I will keep an eye on what you said. I suppose if I can't get the crazy write speed, why bother? There is still the matter of concurrent writes via multi-queue scheduling, though (which was a big selling point for NVMe). I wonder how much difference that would make? (As far as I recall, FreeBSD supports that functionality.)

So has anyone else done what I am trying to do? Maybe not even with NVMe but just with a normal SSD (splitting it and using one partition as the ZIL and another partition as /).
 
P.S. I found that one should not let ZIL and L2ARC share a drive due to how L2ARC activates, so that idea is out.

As NVMes have multiple and very deep command queues (65535 x 65535) compared to the single 32-command queue on AHCI drives, there is no reason not to use one device for both L2ARC and SLOG, especially in a relatively low-load scenario like a home storage system. As long as the SATA link (or HBA) doesn't impose too big a bottleneck for the given workload, placing L2ARC and SLOG on the same SSDs is also fine.

I'm using a 128GB Samsung NVMe as 123GB/5GB L2ARC/SLOG in my desktop at home. My storage system at home currently uses 2x 120GB SATA SSDs as 110GB mirrored L2ARC and 2x10GB SLOG. My desktop at work uses a single SSD and our production virtualization host (smartOS) and storage system (FreeBSD) both use 2 SSDs (one on each HBA) for L2ARC (mirror) and SLOG.
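For reference, a split like the 123GB/5GB one on my desktop basically boils down to something like this (device, label and pool names are just placeholders):

  gpart create -s gpt nvd0
  gpart add -t freebsd-zfs -s 123G -l l2arc nvd0
  gpart add -t freebsd-zfs -s 5G   -l slog  nvd0
  zpool add tank cache gpt/l2arc
  zpool add tank log   gpt/slog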

All my SLOGs are way too big, I think - I've never actually seen the SLOG using more than a few MBs on my home storage system, except when AMANDA is collecting backups - although I'm not sure why, as these should be asynchronous writes. The only scenario where I maxed out the SLOG was when benchmarking with random sync writes on a zvol exposed as an FC target to my desktop.
The smartOS host runs at ~15-20MB SLOG usage during the day, mainly caused by a horribly configured MSSQL server (and an inferior application using it) on a win2k8 VM, causing a s***load of (unnecessary) synchronous writes.
This may differ on single-vdev pools due to their lower write speed, but ZFS (now) generates backpressure if the SLOG grows too fast and the disks can't keep up with the periodic commit of the SLOG. So a SLOG is NOT a cure for slow disk/vdev performance on big writes; it is mainly there to speed up synchronous writes, like those from databases or VMs.
So if you don't have any workloads with heavy synchronous writes, you may be just fine without any SLOG, especially on a pure fileserver where the bottleneck is the LAN. To speed up asynchronous writes, just add more RAM; that's where the transaction groups go.
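If you want to see how much of a SLOG actually gets used under your workload, watching the log device in zpool iostat is the easiest way (pool name is just an example):

  zpool iostat -v tank 5
  # the "alloc" column of the log device shows how much of the SLOG is currently in use,
  # and its write columns show how much synchronous traffic actually hits it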

What you should always consider: the L2ARC and especially the SLOG impose a very high write load on a flash drive, so you should go at least for high-quality/high-endurance consumer drives or ideally for server-grade SSDs, which can handle a high GB/day write load. I once used 2 dirt-cheap 60GB SSDs in my storage system at home for testing last year - within 3 months SMART attribute 231 (SSD_Life_Left) went down to <50%.
I'm currently using Samsung 850 and SM951 or Intel DC series drives in my 'normal' systems; for servers/production systems only Intel DCs. The Samsung drives tend to get quite hot under load and start to throttle, especially the SM951 NVMe, which drops to ~50% of its normal speed after a few dozen GB without additional cooling. So you might want to attach a small heatsink to your NVMe and provide sufficient airflow.
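To keep an eye on the wear, smartmontools (sysutils/smartmontools) or the base system's nvmecontrol should show the relevant counters (device names are just examples):

  smartctl -A /dev/ada1            # SATA SSD: watch the life-left/wear attribute (231 on the drives mentioned above)
  smartctl -a /dev/nvme0           # NVMe: "Percentage Used" / "Data Units Written" in the health log
  nvmecontrol logpage -p 2 nvme0   # same SMART/health log page via the base system tool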

Also make sure to provide enough RAM for the L2ARC or you might even hurt performance by adding it. Rule of thumb: 25 MB of RAM per 1 GB of L2ARC, so a 100 GB L2ARC will use 2.5 GB of RAM. If the system is already running low on memory, don't add an L2ARC; add more RAM. Generally speaking, you should first add as much RAM as possible before adding/increasing the L2ARC. Everything read from the ARC in RAM will be faster by several orders of magnitude than anything in the L2ARC, even if it sits on NVMe.
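To see what the L2ARC headers actually cost in RAM, the ARC kstats can be queried directly (numbers will of course differ per system):

  sysctl kstat.zfs.misc.arcstats.size          # current ARC size in RAM
  sysctl kstat.zfs.misc.arcstats.l2_size       # amount of data held in the L2ARC
  sysctl kstat.zfs.misc.arcstats.l2_hdr_size   # RAM consumed by the L2ARC headers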
 
sko
Thank you for quite a detailed write-up, man! ^^
There is apparently a ton of contradictory information on the subject = \
I believe your experience. I am curious what performance gains you were able to achieve with those setups?

I understand the issue regarding wear and tear. But as you stated, isn't the SLOG supposed to write at most a gig or so at a time under high write load, therefore writing maybe double its expected data per day (most SSDs seem to be rated for around 50 GB of writes per day)? It would follow that the lifetime gets cut in half, which is fine - it's still a couple of years. Or does doubling the load have more of a quadratic impact on the SSD's lifespan? Oo
Currently I was looking at maybe cheaping out and buying an OCZ Vector 180 (120 GB) to add as the SLOG or as system/SLOG, as it has the lowest latencies of all consumer-grade SSDs coupled with the 4th/5th highest write speeds even in its smallest 120 GB version (the speed tends to rise with capacity, according to all the benchmarks it seems).
Other options were the Plextor NVMe 128 GB drive or the Intel DC S3700 100 GB (but that one is NOT cheap = \)

So has anyone else had experience with the SLOG being the roommate of a system drive or an L2ARC cache drive?

p.s.
So I suppose a bit of storage array background will show my situation better. Originally I just didn't want to make this a "help me with my build" thread.
My final goal is to build this:
3 vdevs of 4x 4 TB SAS drives in RAIDZ1 configuration + 1 SLOG device, on 32 GB of RAM and a hexa-core AMD processor (FX-6300).

All this is to host not only files, but hordes of VMs, databases and whatever else I can think of doing with 30 TB of storage, haha. VM hosting will certainly be present almost from day one. I am exploring any possible "cheap" options to wire up this storage facility to my muscular VM hosting ...errr, machine at 4GbE.

*EDIT*
So I can't multiply, apparently... heh. It seems I would be running 36 TB of space backed by 32 GB of RAM. I might shave the 3rd vdev down to 3x 4 TB in RAIDZ1 to arrive at 32 TB and 32 GB of RAM, which is fine in terms of space. But that means I wouldn't get much use out of an L2ARC, would I? Since my RAM won't exceed the recommended minimum?
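For reference, my math this time (RAIDZ1 loses one drive per vdev):

  3 vdevs of 4x 4 TB RAIDZ1:                        3 x (4 - 1) x 4 TB = 36 TB usable
  2 vdevs of 4x 4 TB + 1 vdev of 3x 4 TB (RAIDZ1):  (2 x 3 + 1 x 2) x 4 TB = 32 TB usable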

It is all pretty confusing... especially with so many sources telling different stories. Most of my knowledge on the subject comes from ZFS dev blogs, personal experience using ZFS for the past 3 years, and now a week of Google information hunting.
 
Did you already buy this hardware? The CPU is intended for desktops - so no ECC RAM, which should be a complete no-go for a ZFS storage system, or any server at all...
Also, if you want to run VMs (bhyve) on this machine, you'd be better off with an Intel CPU (Xeon E3/E5), as the FX-6300 has no VT-x/VT-d/EPT support.

For this size of pool, 32GB RAM is a rather minimal configuration. Rule of thumb: ~1 GB of RAM per 1 TB of pool size. The system itself and the running services will also need enough RAM, so for this pool size I'd go _at least_ for 64GB. If you want to run "hordes of VMs", go for at least 128GB. If your budget is limited, the priority should be RAM, not L2ARC or SLOG devices.
 
sko

I think you might have misunderstood me on the "hordes of VMs". This machine will only act as an iSCSI target for another machine running said hordes. This machine will not run anything outside of the FreeBSD OS, some monitoring tools and maybe one jail that will spend most of its time idling until I stick a USB stick in to get data off it. Said data will be automatically dumped into the jail, scanned and then moved out of the jail. So I doubt that jail will use any resources - at least not according to the jails I run now (it seems that when a jail idles it uses barely any CPU time/RAM).

As far as the RAM limitation goes: yeah, that's the one thing that is sketchy, but as I've stated, I can shave off the last vdev. But that mobo does not support more than 32 GB = \ I would have gone with something else, but the trouble is that the AMD mobo + CPU costs 150 bucks... I honestly doubt I'll find anything new at that price with those capabilities and confirmed support. = \

Also, please please don't take this as an attack or anything like that, but on non-ECC RAM:
http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
So it is somewhat of a contested topic, in my humble opinion = \
 
I know this article (rant) about ECC RAM/ZFS - it's completely based on assumptions, without a single bit of actual data. Quite amusing to read, though.
I've had several bad/malfunctioning/dying RAM modules in the past, producing the weirdest, most annoying and always data-corrupting errors. Often these errors went unrecognized for a long time, resulting in massive data corruption.
ZFS has the ability to protect the data from notoriously unreliable spinning rust (and its firmware), but if the RAM is also a potential source of errors, ZFS can't do its job, especially when RAM is also scarce. On a storage system whose only purpose is to reliably store and retain data, there is no point in using ZFS without ECC RAM and with such a low RAM limit.

If budget is limited, have a look at the ASUS P9D-E/4L (or the P9D series in general). I've built my home storage system and 2 branch servers with this board. My home server ran on a humble Celeron G1820 for the first few months, which costs a mere 30 EUR and runs with ECC RAM. Except when building ports or doing heavy (de)compression jobs, this CPU was quite up to the normal daily workload - especially because ZFS doesn't add that much CPU load.
So a system with ECC and a relatively decent memory limit can be built on a tight budget without resorting to unsuitable hardware. But building a storage system of this size on top of old/crippled desktop hardware with severe limitations is bound to fail - especially with a RAM limit this low.

Instead of saving on the board/CPU/RAM side, I'd rather start with fewer disks. These can be added later on a running system when the budget is available (and the storage actually needed), but the main system hardware can't be changed without a (major) outage.


update:
corrected the mention of ECC "support" by LGA1150 Celerons. They don't *support* ECC but will run with ECC RAM installed; this can be used to get a system running on a budget and later upgrade to an ECC-capable CPU (e.g. an E3-12xx).
 
Sorry to come thread-crashing. I almost asked the exact same question but found this.
Let me dump my plan here in case I missed anything for my build - a supercharged version.
This post had some ratios from sko that I can really use. I forgot this thread existed.
##################################################################

I want to build a disk array of 2 TB Samsung hard drives (NOS).
My first ZFS build outside NAS4Free. I have around 10 of them, with maybe 2 more held as spares.

I want to put an NVMe in front of some disk drives to act as a super-fast cache.
Mirroring the NVMes is what I was wondering about, and it sounds wise.

Are there ratios? SLOG, L2ARC, ZIL relative to the zpools? How much front-end NVMe cache to backing storage?
With 2x 512 GB NVMe my cache is pretty big. I would like to run zroot from it too.

So let's reverse the question: how much backing storage should I use for a mirrored 512 GB NVMe front end?
I could drop that to two 256 GB NVMe drives, but they have slower writes.
I will go for 2 redundant zpools of disk drives, 4 or 5 drives each.
Maybe I will use 128 GB SSDs instead of HDDs. I have a Chenbro 24-bay SAS2 chassis.

General front/back ratios or suggestions? Good reading?
Am I expecting too much from FreeBSD here? ZFS can do this, right?
Managing storage between NVMe and the backing disk pools.

I am messing with 10GbE and I need to fill them pipes.
 
As already stated: the SLOG can be pretty small for most workloads and can often be completely omitted. Our main storage server, which hosts some jails (gitlab, NIS, ldap, ansible, pkg-mirror...) and a bunch of NFS shares, including all user /home directories, currently uses a whopping 2.7M (!) of its SLOG...

With two 512GB NVMes I'd probably use ~150G for the zroot (mirrored), depending on how much is added to the base system, how frequently snapshots are taken and how long they are kept.
Another 100-150G (striped/total size) for the L2ARC, but this depends heavily on available RAM. With only 64G or even 32G I'd go with much less (max 50G).
Our storage server has 96GB of RAM with 200G L2ARC (2x100G - there is no need to mirror the L2ARC) and an ARC max size of 64G, which is very rarely exceeded. Usually the ARC size hovers around 40-50G during working hours and often drops to <30G during weekends.
The L2ARC is filled to 99% after a few days, but the data is pretty "cold" and the L2ARC isn't accessed very much except after huge transfers that flush the ARC (e.g. laptop backup images), which are relatively rare. These values could be tweaked and are far from "perfect", but they have worked well and without any issues for ~2 years now. With what I know today I'd probably allocate a much smaller L2ARC on this system (100-150G).
Regarding mirroring/striping the L2ARC: always use stripes, for performance reasons. If an L2ARC provider fails, the data is just missing from the cache and ZFS goes to disk as usual, so no data is lost. A mirror just hurts performance, which is the last thing we want on a cache.

Finally, a max of 1G for the SLOG (mirrored!) and the rest for a separate "fast" pool, e.g. for build jails or *very* performance-critical databases. This pool (as well as the zroot) won't need any additional SLOG or L2ARC - NVMe is already the next-fastest storage tier after RAM, so there is no need to add an additional (slower) caching layer.
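A rough sketch of how that carve-up could look (sizes, labels and pool names are just examples; the mirrored zroot itself would be set up by the installer):

  # on each of the two NVMes (nvd0 shown; same layout on nvd1 with ...1 labels)
  gpart create -s gpt nvd0
  gpart add -t freebsd-zfs -s 150G -l zroot0 nvd0
  gpart add -t freebsd-zfs -s 100G -l cache0 nvd0
  gpart add -t freebsd-zfs -s 1G   -l slog0  nvd0
  gpart add -t freebsd-zfs         -l fast0  nvd0    # rest of the drive

  # striped L2ARC, mirrored SLOG, and a separate mirrored "fast" pool
  zpool add tank cache gpt/cache0 gpt/cache1
  zpool add tank log mirror gpt/slog0 gpt/slog1
  zpool create fast mirror gpt/fast0 gpt/fast1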

So as said before: the single most important performance factor for ZFS is RAM. If you have to direct your budget either to NVMe or to RAM, always go for RAM first until you reach the maximum capacity of the system, especially if you have to saturate 10G links (which isn't that much of a problem with ZFS even with spinning disks, given you don't screw up the pool configuration - I'm saturating 8G FC links for several minutes with a bunch of old SATA disks in 4 mirrors on my home test server...).


In case you haven't already stumbled over this recommendation: get the books on ZFS by Michael W. Lucas and Allan Jude; they cover everything you might want (or not want) to know about ZFS and everything you might need for day-to-day operations and troubleshooting:
https://www.tiltedwindmillpress.com/?product=fmzfs
https://www.tiltedwindmillpress.com/?product=fmaz
 
Thanks so much.
So, without the benefit of the great books you recommended: what about the really low-power E3-1220L V2 or V3?
Does ZFS under load take many CPU cycles, or is a quick NFS fileserver pretty much just ECC-RAM-intensive?
These are pathetic 2-core CPUs (Ivy Bridge and Haswell) but they sip power. I want to dial my fans way back.
With 2 disk controllers, 2 NVMes and a slot for 10GbE, I probably need more PCIe lanes.
I had no idea they sold a Celeron with ECC support. Thanks for that tip too.
 
I was using a tiny Celeron G1820 for a while in my home/test server. An E3-1220 should be perfectly fine for a storage system, as ZFS itself doesn't really have high CPU demands apart from compression (which isn't very CPU-demanding in the case of LZ4 and can be tweaked/disabled if it should pose a bottleneck).
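Compression is a per-dataset property, so it is easy to check and adjust later if it ever does become a bottleneck (pool/dataset names are just examples):

  zfs set compression=lz4 tank
  zfs get compressratio tank
  zfs set compression=off tank/scratch   # only for a dataset that really doesn't benefit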

Depending on what jails/services you want to run and what your compression and de/encryption demands are, I'd go with an E3-1231, as it is/was the "best bang for the buck" of the low-end LGA1150 Xeons (the smallest/cheapest SKU to offer HT).
OTOH, I don't know if I'd still go for a legacy LGA1150 platform today, except if you have plenty of DDR3 RAM lying around that you could use on this platform. Otherwise I'd opt either for a Xeon-D-based system or - considering the massive amount of f*ck-ups from Intel regarding their chip design and IME flaws - even an AMD-based one. Although the EPYC series doesn't really have any counterparts for the low-end E3 Xeons, and EPYC 3000 embedded systems haven't hit the market yet...


I had no idea they sold a Celeron with ECC support. Thanks for that tip too.
Sorry, this was quite horribly worded by me. What I meant is: it *works* with ECC, but it doesn't actually support or use the ECC features. You can put ECC RAM in the system (if the board/chipset supports it!) and it will run. This can be used as an intermediate step before a later upgrade to a proper ECC-capable CPU (Xeon E3) in case you run out of budget (happened to me when my main desktop machine and my storage server both died within ~3 months). I tried ECC with Pentiums and they refused to boot, so I was surprised when the Celeron worked just fine with ECC RAM installed, and that's what I wanted to emphasize.
 