Traveling salesman and other discussions.

Chipping in on the cluster filesystem topic (in this case GlusterFS):
At my skydiving club I have a setup exactly like that:
two old commodity PCs (Dells with an i5 and 8 GB RAM each, 3 HDDs each --> ada0 = OS, ada1 and ada2 as bricks) running FreeBSD as the OS plus GlusterFS.
The Gluster volume is set up as a 2x2 distributed replica (brick1 on S1 replicated with brick1 on S2, and so on).
Meaning: if one server goes down, all files are still accessible on the other server (and I actually did get into exactly that situation! Worked as advertised!)
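For reference, a minimal sketch of how such a 2x2 distributed-replicate volume is typically created with the gluster CLI (the hostnames s1/s2, the volume name and the brick paths are placeholders, not the actual setup described above). Bricks are paired into replica sets in the order they are listed, which gives exactly the brick1-on-S1 / brick1-on-S2 pairing:

    # run on s1, once both peers can reach each other
    gluster peer probe s2
    gluster volume create skyvol replica 2 \
        s1:/bricks/brick1 s2:/bricks/brick1 \
        s1:/bricks/brick2 s2:/bricks/brick2
    gluster volume start skyvol
    gluster volume info skyvol   # should report Type: Distributed-Replicate, 2 x 2 = 4 bricks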

On a performance note: I didn't notice any grave difference in read or write speed (and I would have noticed, since the Gluster volume is mainly used for video files that average about 1.5-2.0 GB in size). At a guess I'd put the performance/speed loss at maybe up to 5%.

Lessons learned:
1) Use UFS for the bricks, since Gluster doesn't play well with ZFS on FreeBSD (this is even mentioned in the official Gluster documentation) --> a sound backup strategy is still needed (but then, why would we even need to discuss something as essential as a backup strategy?)
2) Setting up Samba to export the Gluster volume was a PITA: I initially used the binary packages (pkg install samba) and missed that GlusterFS support is OFF by default, so I had to build Samba from ports instead (a sample share definition is sketched below).
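For anyone hitting the same wall: once Samba is built with GlusterFS support, the share talks to the volume through the vfs_glusterfs module, along these lines (the share and volume names below are placeholders, not the actual setup):

    [gvol]
        path = /
        read only = no
        kernel share modes = no
        vfs objects = glusterfs
        glusterfs:volume = skyvol
        glusterfs:logfile = /var/log/samba/glusterfs-skyvol.log

With vfs_glusterfs the path is relative to the root of the Gluster volume, and Samba talks to the volume directly instead of going through a FUSE mount.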
 
Jose asked (in a different place): "What is your opinion of HAST?"

Answer:

I've never used it, but know how it works. For a small system with a fast network and moderate performance needs, it might work.

It has a few massive costs. First, it doubles the amount of disk space you need, since it forces RAID-1 on everything. In theory, you can get some of that space usage back, if you are using RAID too (like ZFS). But that assumes that HAST has as good read error handling as ZFS does, including being tied in with the CRC system ... which it doesn't (if ZFS detects a CRC error on copy #1, it needs to read copy #2; HAST doesn't do that). So to protect against silent data corruption you still need some form of RAID on top of it, so you end up with something like RAID51, and at that point your space overhead is annoyingly high.
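A toy sketch (Python, not actual HAST or ZFS code, with the checksum handling heavily simplified) of the difference: a checksum-aware mirror can notice that one copy is silently corrupt, serve the good copy, and repair the bad one; a plain block-level mirror just returns whichever copy the read happens to hit.

    import zlib

    def checked_read(replicas, stored_crc):
        """ZFS-style mirror read: verify the checksum, fall back to another
        copy, and heal the damaged one. A checksum-unaware RAID-1 layer has
        no stored_crc to compare against in the first place."""
        for data in replicas:
            if zlib.crc32(data) == stored_crc:
                for i, other in enumerate(replicas):
                    if zlib.crc32(other) != stored_crc:
                        replicas[i] = data      # self-heal the bad copy
                return data
        raise IOError("all copies corrupt -- restore from backup")

    # demo: copy #1 silently corrupted, copy #2 intact
    good = b"block contents"
    crc = zlib.crc32(good)
    copies = [b"blockXcontents", good]
    assert checked_read(copies, crc) == good
    assert copies[0] == good                    # the bad copy was repaired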

Second, it has a nasty performance cost, because you need to wait for every write to be done twice. But because of its extent-based metadata, that's not on a per-sector basis, but per extent. And if you are doing truly random IO above HAST, that will lead to massive false write sharing (a 4K sector write turns into a 2M extent write on recovery). In default mode, it is also not completely safe, since it doesn't harden the extent bitmaps to disk immediately. I suspect it might even have "unstable reads" if you switch over between primary and backup server frequently enough (because the extent bitmaps might get lost); if not careful, that could wreak havoc with the file system on top.
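To put a rough number on that false-sharing effect, here is a toy calculation: the 4 KiB sector and 2 MiB extent sizes are the ones mentioned above, while the write count and the 1 TiB volume size are made-up illustration numbers.

    import random

    SECTOR = 4 * 1024            # one random write
    EXTENT = 2 * 1024 * 1024     # granularity of the dirty-extent bitmap

    def resync_bytes(dirty_offsets):
        """Bytes that must be copied on resync when dirtiness is tracked
        per extent rather than per sector."""
        dirty_extents = {off // EXTENT for off in dirty_offsets}
        return len(dirty_extents) * EXTENT

    # 1000 random 4 KiB writes scattered over a 1 TiB volume: nearly every
    # write lands in its own extent, so ~2 GiB gets resynced for ~4 MiB of
    # actually changed data -- roughly 500x write amplification on recovery.
    random.seed(1)
    writes = [random.randrange(0, 1 << 40, SECTOR) for _ in range(1000)]
    print(resync_bytes(writes) / (len(writes) * SECTOR))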

For a small system, where the factor-of-2 space overhead is not that important and write speeds are low, it might work great. But here another problem comes in: HAST needs to be integrated with a control mechanism above it that switches between primary and backup nodes. For a small-system user (typically a hobbyist), getting that right in the presence of all possible failure scenarios is SUPER hard. Distributed consistency is a nasty problem, in particular in two-node clusters (where you can't use quorum algorithms to force consistency, since with just two voters there is never a clear majority). So for a genuinely safe system, one would probably have to implement something like Disk Paxos on top of HAST ... and implementing Paxos or Chandra-Toueg is exceedingly hard, and finding pre-cooked packages that implement it is nasty too.
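A tiny sketch of why two nodes are the awkward case for quorum (plain majority voting, nothing HAST-specific):

    def has_quorum(votes_present, total_votes):
        """Majority quorum: strictly more than half of all votes."""
        return votes_present > total_votes // 2

    # 3-node cluster: lose one node and the surviving pair still holds 2 of
    # 3 votes, so exactly one side of any partition can keep serving.
    assert has_quorum(2, 3)

    # 2-node cluster: after a network partition each node sees only its own
    # vote, 1 of 2, which is not a majority. Neither side may safely promote
    # itself -- and if both do anyway, you have split brain.
    assert not has_quorum(1, 2)

This is exactly why two-node setups end up needing a third tie-breaker (a witness/quorum device) or something much heavier like Disk Paxos.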

Deep underneath this criticism are two observations. One is still true, the other only partially. HAST is fundamentally a RAID layer (it gives redundancy). The first observation: to work best, RAID needs to be tightly integrated with the file system. RAID needs to know which blocks are live (so it doesn't waste effort tracking / resilvering unallocated space), the RAID implementation needs to be tuned to the IO pattern of the file system (log structured vs. extent based vs. traditional allocating, with matching block sizes), and checksums need to be end-to-end.

The second: RAID can't be done over the network, it's too slow. That statement has become partially wrong today, because (a) fast networks (InfiniBand and 10G+ Ethernet) have latency comparable to disk latency, so going over the network "only" doubles the latency, and (b) some of the performance hit of RAIDing over the network can be fixed by tightly integrating the RAID layer with both the network and the file system layer, so the cost of RAID updates is amortized and the RAID data structures match the file system data structures. For example, you can redesign your file system so that metadata is stored mirrored (even more than 2-way mirrored; I've seen 5-way replication be a good thing), while file data is stored very efficiently parity encoded. That way the cost of degraded operation and resilvering hits areas that are less latency-sensitive, and the throughput-vs-latency tradeoff can be made deliberately.
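A back-of-the-envelope version of that last tradeoff (the 5% metadata share and the 8+2 parity layout are illustrative assumptions, not numbers from any real system):

    # Raw bytes consumed per logical byte stored, for two layouts.
    metadata_fraction = 0.05        # share of stored bytes that is metadata
    metadata_copies   = 3           # metadata mirrored 3-way
    data_overhead     = 10 / 8      # data erasure coded as 8 data + 2 parity

    mixed = (metadata_fraction * metadata_copies
             + (1 - metadata_fraction) * data_overhead)   # ~1.34x
    mirror = 2.0                                          # mirror everything, HAST-style

    print(f"mixed layout: {mixed:.2f}x, full mirror: {mirror:.2f}x")

So the integrated layout keeps heavy redundancy where it matters most for latency (the metadata) at a fraction of the space cost of mirroring everything.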

But: HAST is not integrated in the file system, and it runs over the network without any remediation for the shortcomings. So it's not going to be great.
 
What about Ceph?
I'm so sorry! I completely forgot that Ceph exists for FreeBSD. It is an excellent system for large cluster storage. I particularly like that you can use it to get both file storage and object storage on the same system; that's really important if you have existing file-based workflows that still need to be supported while you are also trying to migrate to more efficient object storage. I don't know how easy it is to set up Ceph on a small system, or how well it works in the minimal configuration (a 2-node cluster).

And just to be clear: HAST and other such solutions are a block-based RAID layer (just a networked one). Ceph and friends are file systems (just networked ones).
 
All I know about glusterfs is that it exists. I don't know any of its internal details.

And just to be clear: when I said above "really hard to find cluster file systems on FreeBSD", what I mean is the following: cluster file systems are hard to write, are typically developed by large commercial companies (Oracle, IBM, EMC, Ceph -> Red Hat, in earlier days Sun and Digital), and are typically used in large commercial or academic institutions. They are not typically found in hobbyist or small commercial settings.
 
I know Ceph. In theory it is a cool concept. In the real world it sucks due to high latency and terrible read/write performance. I did a Ceph PoC on a cluster, but even with tuning, special advice from experts, and some custom deployment scripts, Ceph was about as fast as a slimy octopus in fresh concrete. Ceph is 💩 (crap). Even bigger deployments of Ceph suck! At a company I worked for we had Ceph. The company wasted $500,000 on the hardware and then another $200,000 on the "best" NVMe SSDs on the market at the time to improve performance, but Ceph stayed slow as ... 🤬
Due to some missing but important information I even suffered data loss once, because the documentation forgot to mention a parameter that makes Ceph stop working without any message (not even an error in the logs).
I actually gave Sage an earful about that, but he did not care. He sold his company to Red Hat for 20 million. Not even a lawsuit could make him worry, because Red Hat, now IBM, is responsible for Ceph aka crap (as I call it).
 
Yeah... But you know, this upgradability is not always such a good thing!
Windows 10, for example, is SO UPGRADABLE that my friend at work can't find a way to BLOCK the updates it forces on him. The machine runs some very, very $$important software. Win 10 keeps pushing its updates... right!
Only the update install ends in a black screen on the next reboot, with no way out but a triple reset. 'Known problem', yes, 100 fixes published, but none of them works for that particular machine...
So much for easy upgradability. So... you'd better stay where you are and say thank you :D
 
I experienced this upgradability fetishism with hardware, too. Sometimes an upgrade brings you only a few more IOPS but costs as much as all the installed server mainboards together. In that case it's better to wait and buy completely new hardware (i.e. upgrade PCIe 3 -> PCIe 4). But sometimes even throwing hardware at the problem won't fix it, simply because the software you use is crap (examples: Windows*, Ceph, Linux in some cases, like systemd). Then better switch to FreeBSD. Back to the roots ;) Or get different software (for the applications), open source, or write it yourself / have your developers write it.

But there is a problem: if your boss is a stupid, greedy one, then change jobs! Over 14 years ago I had a meeting with the company's boss. He seriously asked me what the point was of buying a few expensive IBM x86 servers (dual socket, ECC RAM, SAS ...), because he was sure he could handle the entire workload with ten cheap no-name consumer low-end PCs (no ECC RAM) from Walmart. Why buy 900 GB 15k SAS-12 HDDs with a proper Broadcom 9400 HBA for $500 each when you can buy an 18 TB 7.2k SATA HDD (where you can store more data) for $450 and connect it to the mainboard of your PC? Why use RAID at all when you can restore from tape backup if a drive fails -- why buy things twice, right? 😁

This guy was so smart he would happily have run Windows 95 (32-bit) on EPYC 7xx3 CPUs with 64 cores, 1 TB RAM, 12 NVMe drives, 12 SAS 15k drives, a 10 Mbit NIC with a BNC connector, and a 14.4k dial-in modem by Motorola as the company's new high-end Windows file share server. 😁

What did I do? I left that company. 🤣
 
I think FreeBSD does have the potential to be a usable desktop if I put in the effort it takes. I don't see that as 'being out of comfort zone' for FreeBSD.
But is it out of your comfort zone?

This is when I like to show a shot of my X61 at 306 days uptime:
[attachment: the_red_and_the_black.png]


The W520 that took its place is at 127 days, and 306 looks likely... barring a power outage.
 