New server hardware - EPYC 9004?

For the longest time, we just bought HPE servers.
Getting the HBA in the latest Gen10 servers to work was a long and arduous task.

Now, I get to buy new servers and it looks like HP just doesn't really offer what I need.

So, I turned to Supermicro:


I intend to fit it with eight 960 GB NVMe SSDs and two boot drives in the rear.
I also intend to put a RAID-Z2 on the 8 NVMes - is this advisable? Or just get two very large NVMes and mirror them?

I only need gigabit Ethernet, so I intend to get the i350 AIO card.

Is there anything to be on the lookout for?

It seems that performance improvements have been made:

 
There are already NVMe Gen 5 devices listed in the compatibility specification for this system, but there are no 960 GB versions (yet?). With newer generations comes more bandwidth, but whether the devices can actually saturate these links is another question.
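A rough back-of-the-envelope comparison (my numbers, approximate, protocol overhead ignored):

  PCIe 4.0 x4  ≈  7.9 GB/s per drive
  PCIe 5.0 x4  ≈ 15.8 GB/s per drive
  1 Gbps NIC   ≈  0.125 GB/s

So with the gigabit uplinks mentioned in the OP, either generation is orders of magnitude beyond what the clients can actually request.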
I also intend to put a RAID-Z2 on the 8 NVMes - is this advisable? Or just get two very large NVMes and mirror them?
What is more important? Speed, capacity, fault tolerance? What is the application of this system? For virtualization, a mirrored-stripe would be my preferred setup to achieve maximum performance. NVMe devices are very fast, so I don't think there is a major benefit to using more, smaller devices instead of fewer, larger ones.
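For illustration only, a minimal sketch of such a striped-mirror layout over the eight drives from the OP; the pool name and device names (nda0-nda7) are assumptions and will differ on the actual system:

  # Four 2-way mirror vdevs striped together (names are placeholders):
  zpool create tank \
      mirror nda0 nda1 \
      mirror nda2 nda3 \
      mirror nda4 nda5 \
      mirror nda6 nda7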
 
I also intend to put a RAID-Z2 on the 8 NVMes - is this advisable? Or just get two very large NVMes and mirror them?
I don't have any real experience in the server domain, and I don't know if you are new to ZFS. Looking at the hardware, I imagine you'll be deploying U.2 (U.3?) NVMes. Depending on your intended usage there could very well be a significantly different impact on your performance. Perhaps these might help: Choosing The Right ZFS Pool Layout & Scaling ZFS for NVMe - Allan Jude - EuroBSDcon 2022
 
I also intend to put a RAID-Z2 on the 8 NVMes - is this advisable? Or just get two very large NVMes and mirror them?
I'd *always* prefer mirrors over any raidz configuration except if I *really, really* have to, e.g. due to physical space constraints.
Mirrors beat raidz in maintainability, migratability and flexibility any time. Also, resilvering is *much* faster (we're talking minutes/hours vs. hours/days on SSDs; on spinning rust raidz vdevs can easily take a week or longer to resilver...).

Depending on what data you are putting on that pool (and what interconnect the host has) I'd go for fewer and bigger drives. There are very few use cases where even a single NVMe would pose a bottleneck, let alone a mirror of 2 drives. So unless you have 50 Gbps+ interconnects and clients that can actually ingest data at such speeds, that pool won't be anywhere near its bandwidth limits even with a single vdev.
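To put the maintainability point in concrete terms, a minimal sketch of swapping out a failing mirror member; pool and device names are made up for illustration:

  # Replace a failing mirror member in place (only allocated data is resilvered):
  zpool replace tank nda3 nda8
  # ...or temporarily widen the mirror first, then drop the bad drive afterwards:
  zpool attach tank nda2 nda8    # nda2 is the healthy side of that mirror
  zpool detach tank nda3         # remove the failing drive once resilvering is done
  zpool status tank              # check resilver progress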
 
I just did a quick check for SAS/SATA/U.2 SSDs based on price per TB - the current "sweet spot" seems to be at ~2-4 TB for SATA drives, followed by some of the huge (15+ TB) U.2 NVMe drives and then 2-4 TB U.2 NVMe.
So even from a price standpoint going for bigger drives is the better solution - e.g. instead of 8x960GB simply use 2x4TB in a mirror and call it a day. Power consumption (and hence cooling requirements) will also be lower for that configuration. Given you only have 1 Gbps links, even SATA drives would easily suffice, so you can save even more with those; plus you won't have to immediately buy a tri-mode or dedicated NVMe HBA for that new system and can simply use the onboard controller or a cheap LSI SAS9300-based HBA.
 
Hi, thanks for the feedback - I'll take a look at the presentation.

This is to replace a server hosting possibly hundreds of different php-fpm instances - shared hosting basically.

I want to use more smaller devices due to the tendency of SSDs to "just die" - I'm even contemplating mixing two different types.

It's currently not possible to virtualize this, because it relies on ZFS quotas and virtualized ZFS is usually slow.
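For context, the kind of per-site quota setup meant here might look something like this; the dataset names are purely illustrative (only the pool name "datapool" appears later in the thread):

  # One dataset per hosted site, each with its own quota (names are made up):
  zfs create -p datapool/hosting/site001
  zfs set quota=20G datapool/hosting/site001
  zfs get quota,used datapool/hosting/site001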

I'm currently resilvering a 1.2 TB SAS HDD...

  pool: datapool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 14 14:18:28 2024
        41.1T / 42.0T scanned at 49.0M/s, 7.88T / 9.23T issued at 9.40M/s
        545G resilvered, 85.37% done, no estimated completion time

In the past, I found it very difficult to make this thing "highly available" without sacrificing performance or creating other (S)POFs - thus, the individual system needs to have a lot of redundancy built in.

I actually only need maybe 4-6 TB of space at most, so the 8x960 minus the RAID-Z2 overhead would be "ok" in my book.
I was under the impression that I don't need an HBA at all if I don't want to create a hardware RAID?

I was also asking myself whether the overhead of maintaining the RAID-Z2 on these devices is noticeable vs. "just" having a mirror?
 
Exactly as sko says. What's the reason you're considering an (8-drive, no less) RAID-Z2 option in a 1U server?
8 disks with mirroring: You get 4 disks' worth of capacity, and can tolerate simultaneous failure of any 1 disk. If you are unlucky, a failure of 2 disks (or 1 disk + one sector on a second disk) may cause data loss.

8 disks with RAID-Z2: You get 6 disks' worth of capacity, and can tolerate simultaneous failure of any 2 disks. It would require 3 disk failures (or 2 disks + one sector on a third disk) to cause data loss. This means that dual-fault tolerant RAID is much more reliable.
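Putting numbers on that with the 8 x 960 GB drives from the OP (raw capacity; ZFS metadata and padding overhead not included):

  striped mirrors : 4 x 960 GB ≈ 3.84 TB usable, survives any 1 disk failure
  RAID-Z2         : 6 x 960 GB ≈ 5.76 TB usable, survives any 2 disk failures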

In the above, the term "simultaneous" means: The failures happen so quickly that the first failure hasn't finished resilvering yet. This in particular favors RAID-Z2, because to get 3 disk failures, you start with 1 disk failure. If you get lucky, that first failure has finished resilvering before the second disk fails. If you are unlucky, the second disk fails before the first has finished. But now only very few pieces of data have two damaged areas (no data loss yet), and resilvering will work on them first, so it is very likely they will have been repaired before the 3rd disk fails. For this reason, RAID-Z2 is exceedingly more reliable than single-fault tolerant RAID-Z1 or mirroring, even more than the above argument alone would predict.
 
Thanks, usually, when I have more than a boot-disk, I go with RAID-Z2 (usually with 8 disks in a vdev). It just "feels" safer.
 
[...] Or just get two very large NVMes and mirror them?
compared with your follow up messages later in the thread:
[...] thus, the individual system needs to have a lot of redundancy built in.
[...] usually, when I have more than a boot-disk, I go with RAID-Z2 (usually with 8 disks in a vdev). It just "feels" safer.
feels a bit contradictory to me where it relates to the desired level of redundancy. You mentioned a mirror of two drives in your OP, i.e. a 2-way mirror; that looked to me like an option you were actually considering, as it did to sko I imagine.

When you want safety by means of extra redundancy then you'd be comparing a 3-way mirror against a RAIDZ2 solution to be on equal redundancy footing; a 2-way mirror does not compare favourably with RAIDZ2, as ralphbsz explained in detail. I can imagine that then you'd opt for RAIDZ2 over a 3-way mirror.

However, for the RAIDZ2 variant I'd also consider a 6 (or 7) drive RAIDZ2 deployment (with a slight loss in efficiency, depending on the properties of the stored files) instead of the 8-drive solution. I also see no price advantage of the 8-drive version over a 6-drive version, as sko's quick check showed. This 1U server only comes with 8 drive slots at the front (optionally extended with 4 bays); populating only 6 gives more options:
  • with the default 8 bays, you're able to replace a failing drive online while it is still resilvering
    - somewhat faster, safer resilvering
  • 2 bays immediately available for RAIDZ expansion in the near future (see the sketch after this list)
    - RAIDZ expansion has been merged into OpenZFS and is likely to become available once FreeBSD updates to it, probably with OpenZFS 2.3 (14.1-RELEASE is at OpenZFS 2.2.4)
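Purely as a forward-looking sketch (not possible on OpenZFS 2.2.x yet; pool, vdev and device names are made up), expansion then boils down to attaching one more disk to the existing raidz vdev:

  # Expected raidz expansion usage once OpenZFS 2.3 lands in FreeBSD:
  zpool attach datapool raidz2-0 nda6   # grows the existing 6-disk raidz2 to 7 disks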

I want to use more smaller devices due to the tendency of SSDs to "just die" - I'm even contemplating mixing two different types.
Any idea why these tend to "just die"?
(and I have no idea what you mean by two different types)

EDIT: is this observed perhaps specifically in combination with ZFS?
and could you be a bit more specific:
- completely inaccessible as if they were not present at all
- an unexpected, ever-increasing number of errors reported to the OS
 
Any idea why these tend to "just die"?
I’ve read this in a number of places but don’t have any experience or real data; this thread sums up some of what I’ve read in the past:


But I would be very interested to know what “the truth” is. SSDs have been very reliable for me so far across a number of platforms.
 
I prefer a drive to just die and shut up rather than limping on and causing various failure modes... SATA drives are, as always, by far the worst here - you get anything from happily accepting all writes but returning only garbage (ZFS throws those out pretty quickly and the system is fine); to stalling for minutes just to fake being alive again for a few seconds and going catatonic again, dragging the whole system to a crawl because all I/O is stalling and has to time out; up to completely locking up the whole controller/expander they are connected to, even preventing the system from booting... Almost always the useless "SMART health status" still blatantly lies to your face and returns "OK".
SAS drives are usually much better at handling failures - they accept they are failing, return errors and allow the system to keep operating. (However, I haven't had a SAS SSD fail yet)
NVMes usually just go dark - they might still be recognized as PCIe devices, sometimes even the controller responds, but they are dead and don't interfere with the system.

That being said, I had *far* fewer drive failures with any kind of flash than with spinning rust over the years. Especially when it comes to ageing drives, where HDDs tended to fail due to mechanical wear, SSDs are usually completely fine as long as they are still well below their rated TBW. We have some Intel S4500 and S3510 in less-important systems which are near or even over 70k Power_on_hours and they are perfectly fine, not showing a single reallocated sector or other signs that they will fail soon (they are between 10% and 50% media wearout). Most of the SATA SSDs in our servers are Kingston DC series; not a single failure yet.
With NVMe I can only remember 2 failures, Micron 7400s in very short succession, which just went silent: they still showed up as PCIe devices but the controller wasn't responding any more (I still suspect a firmware issue; the replacements had another firmware version and are still running fine). In both cases the failing drive caused a panic and system reboot; afterwards the drive was just silent. Otherwise I have had no failures with "server rated" NVMe (M.2) yet.
The ~20 1.92 TB (HPE-branded) Toshiba and SanDisk SAS SSDs, which we have been running for ~5 years now, had zero failures.
We run ZFS mirrors everywhere. Those old Intels are running in less important systems and usually have at least one newer drive in their mirror(s) (some are 3-way mirrors), and I deliberately keep some of those old things running just to see how far you can push them.

When it comes to consumer/desktop drives, we used Samsung 850 Pro SATA drives for a long time with only 1 or 2 failures, which were due to wearout over many years and simply showed failing sectors (of course, running Windows/NTFS this led to massive data corruption and hard crashes); but after those SATA drives Samsung has been the worst for us - their 9xx NVMes were especially terrible, with several failures within the first few months. Given that their TBW ratings have gone down while all other vendors' ratings are usually going up, and the fact that Samsungs are by far the most power-hungry space heaters, we/I avoid them like the plague nowadays... We now mostly deploy WD Blues in all of our client systems (NUCs) as well as in low-power/low-load appliances (e.g. branch VPN routers) and to date have had zero issues with them, with the oldest SN520s being ~5 years of age now. They have been so uneventful over the years, I even use them as boot drives in some (non-critical) servers and network appliances.
Again: *much* fewer (near-zero) failures with SSDs than back with spinning rust. Usually they only get replaced because they get too small...


Over the last 15 years I encountered 4 catastrophic storage failures leading to outages and/or data loss - one was thanks to a hardware bug in the RAID controller (i.e. the vendor cheaping out and using some chipset with weird 12-bit registers that overflowed, returning wrong addresses and overwriting existing data); one was thanks to Seagate drives (HDDs) which were dying like flies almost exactly after 3 years, in very short succession - i.e. 3 out of 6 drives failed within 2 weeks, and 2 more followed in the weeks after...
The other 2 failures were caused by dying SAS-HBAs (both SAS2008) and involved ZFS pools that got their metadata corrupted.
So except for those Seagates (I have never bought a single drive from them since...), basically none of the "catastrophic failures"/outages caused by storage were related to dying disks. However, my sample set isn't terribly huge with currently ~10 running servers, ~20 smaller appliances and ~50 clients, so YMMV.
 