ZFS + PostgreSQL - with a twist

All,

Looking to implement a dedicated server to perform data processing/analysis, and it needs some fairly significant I/O. The "twist" is that the data's lifespan is the duration of time it takes to import, process, analyze and report on - then the database and the data are deleted. Storage need is on the order of about 800GB-1TB at any given time (max). Current notion is to use 4 x U.2 Optane (400GB) NVMe drives attached to an LSI 9500-16i [HBA] controller. It's practically double the needed space, but under RAID0 it would maximize the I/O. Implementation would be based on adding a new rc script that checks the zpool and, if necessary, re-creates the pool (minus any failed disks) as RAID0, followed by initializing the Postgres data directory; otherwise it boots as normal. This oversimplifies things but provides a sense of how "disposable" the data is and of the target level of performance. OS and application(s) would be installed on a separate pair of NVMes (not on the LSI bus) in a [ZFS] mirror configuration.
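
Roughly what I have in mind for that rc script, as an untested sketch (the pool name "scratch", the nda* device pattern, and the PostgreSQL paths are placeholders; error handling and old-label cleanup are omitted):

    #!/bin/sh
    #
    # PROVIDE: scratchpool
    # REQUIRE: zfs
    # BEFORE: postgresql

    . /etc/rc.subr

    name="scratchpool"
    rcvar="scratchpool_enable"
    start_cmd="scratchpool_start"
    stop_cmd=":"

    scratchpool_start()
    {
        # Pool already imported and usable: boot as normal.
        if zpool list scratch >/dev/null 2>&1; then
            return 0
        fi

        # Otherwise (re)create it as a plain stripe from whichever Optane
        # devices are currently present (device pattern is a placeholder).
        disks=$(sysctl -n kern.disks | tr ' ' '\n' | grep '^nda' | sort)

        zpool create -f -o ashift=12 \
            -O recordsize=8K -O compression=lz4 -O atime=off \
            scratch ${disks}

        # A fresh pool means a fresh PostgreSQL data directory.
        zfs create -o mountpoint=/var/db/postgres scratch/pgdata
        chown postgres:postgres /var/db/postgres
        su -m postgres -c '/usr/local/bin/initdb -D /var/db/postgres/data'
    }

    load_rc_config $name
    run_rc_command "$1"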

Questions:
  • While a RAID controller may be "easier" in some regards, I would think that ZFS is a better option, no? (see the later note about disk I/O amplification)
  • Any reason to consider controller-based RAID0 over ZFS RAID0? ("sanity check" question)
  • Given RAID0 and the use case, it seems that ZIL/SLOG are irrelevant and unneeded, correct?
  • One significant advantage is the ability to exactly match recordsize to page size (e.g. Postgres 8K pages on an 8K recordsize, with the later option of moving Postgres to a larger page size, destroying the pool and re-creating it with a matching recordsize - a RAID controller might not provide an exact match for alignment; see the quick check sketched below). This should avoid a significant amount of disk I/O amplification, which would otherwise impact the overall lifespan of the drive(s), amongst other things.
As a validation point: if you were looking for nightmarishly fast storage with 800GB-1TB of usable space and trying to keep the storage cost in check, would you do this any differently? Open to ideas and suggestions that aren't astronomical in cost (this does have a budgetary constraint). Outside of the controller and associated drives, the rest is built and waiting. Just need to validate that this is the 'right' answer, or potentially go down another path.
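
For reference, the exact match in the recordsize point above is easy to verify once the pool is up, along these lines (dataset name is a placeholder):

    # ZFS side: recordsize of the dataset holding the data directory
    zfs get recordsize scratch/pgdata

    # PostgreSQL side: the page size it was compiled with (8K by default)
    psql -U postgres -c 'SHOW block_size;'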

Thanks!
 
My order of consideration would be ZFS (it has known performance limitations/enhancements and overhead), then UFS (good too if ZFS features are not needed/wanted, and it comes with less overhead), then software RAID (more overhead than hardware RAID but can offer more features, less dependency on a particular card/vendor, etc.), then hardware RAID. Remember that RAID is not a replacement for backing up data; it is a tool to join the space of multiple disks, with or without redundancy, and with performance impacts depending on how it is done. RAID redundancy helps maintain uptime during a disk failure, but ZFS knows whether the data on any array is correct, while other hardware and software redundancy layers usually just say, "we lost sync, copy all sectors of one disk to the other or recreate all parity data now" without knowing which disk is right or wrong, except when a full disk replacement is provided.
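
That self-checking is also something you can exercise on demand, e.g. (pool name is whatever you end up using):

    # Walk the whole pool and verify every block against its checksum
    zpool scrub scratch

    # The CKSUM column and any "errors:" section show what, if anything, was found
    zpool status -v scratch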

The only reasons I know of to consider a separate hardware RAID controller are things like battery backup for power outages (though I'd say that is better implemented in the drives themselves, or in their power source), the fact that a controller may offer more, faster, or different types of ports than the system has built in, and the ability to offload RAID processing from the main system to the controller card (usually not an issue).

I'd leave out ZIL/SLOG unless you diagnose an actual need. You also need such cache/log drives to be faster than the main media, or a workload where they spread the load in some significant way. Some of your RAM is used to keep track of a cache device too, so you end up with less RAM for ARC, and RAM is your fastest cache. If your drives are fast, just use the drives without extra caching drives. Even if there are now drives with higher total throughput than Optane, you need them to be faster at random I/O, and for a ZIL they need to be durable; mirroring (or more) a SLOG is normally done because losing it can lose the most recent sync writes (and on older pool versions could take out the pool with it).
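
If you ever do diagnose a need, a SLOG can be added and removed after the fact, so there's no reason to build it in up front; roughly (device names are hypothetical):

    # First watch per-vdev activity to see whether sync writes are actually a bottleneck
    zpool iostat -v scratch 5

    # If they are, a mirrored log vdev can be attached later...
    zpool add scratch log mirror nda4 nda5

    # ...and removed again if it doesn't help (use the vdev name shown by zpool status)
    zpool remove scratch mirror-1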

I seem to recall that some ZFS RAID layouts end up with a smallest effective block along the lines of the sector size (2^ashift) times the number of drives. ashift impacts small record sizes and usable space on raidz; I don't remember its impact on a plain stripe, but I'd expect something similar there too. A RAID controller may let you work past that without manually adjusting each drive's ashift to a lower setting, but performance will be impacted just as much as a bad ashift value would have been. With ashift forcing 4K sectors (unless you override that, but it's common and recommended for most modern drives), 4 drives would then mean a 16K smallest record.

If you need random reads of small block sizes from compressed (or encrypted?) ZFS records, then smaller record sizes mean smaller reads and less decompression/decryption processing to get at the data you wanted, but it also means ZFS compression is less beneficial. If the reads are larger, a larger recordsize (and maybe comparable database tuning?) would perform better. The more writes are large and/or sequential, the better they are handled by comparably larger record sizes. As you stated, you can always retune this since you will be destroying and re-creating the database regularly, and you may find an in-between value of 16K, 32K, or 64K recordsize is better than 8K or 128K+ for your particular workload.
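
Checking what you actually ended up with, and experimenting with recordsize, is cheap (pool/dataset names are placeholders; note recordsize only applies to newly written files, so re-run initdb into the new dataset for a fair comparison):

    # Confirm the ashift the pool's vdevs were created with
    zdb -C scratch | grep ashift

    # Try a different recordsize on a fresh dataset; existing files keep their old block size
    zfs create -o recordsize=16K scratch/pg16k
    zfs get recordsize scratch/pg16k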

Compared to consumer SSDs, Optane drives are much more durable and faster with random/smaller I/O. Having more SSD than you need is good for improving SSD life, though that matters more with lower-end SSD technologies; having room to grow is always a good plan regardless. There are drives faster than Optane for sequential I/O, but I haven't heard of anything surpassing its random I/O. RAM beats any drive, so it is always good to make sure you have enough; extra RAM means more cache space.
 