HA cluster + encrypted storage

Hi,

I'm looking for advice and tested solutions.

I need to build a cluster (automatic failover with one floating IP, active/standby) with 2 nodes running Apache and MySQL. Unfortunately, the MySQL datadir and the Apache files in the docroot have to be encrypted.

For MySQL I can use master/master replication with the datadir on an encrypted ZFS pool, decrypted manually with a passphrase. Maybe there are other ideas?
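Roughly what I have in mind for my.cnf, with placeholder IDs and paths (node 2 would mirror this with server-id = 2 and auto_increment_offset = 2):

    [mysqld]
    server-id                = 1
    log_bin                  = mysql-bin
    auto_increment_increment = 2               # two writers, keep auto-increment keys from colliding
    auto_increment_offset    = 1
    datadir                  = /secure/mysql   # dataset on the encrypted pool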

Do you have any ideas how to deal with the docroot files? HAST with an encrypted ZFS pool? How should failover be handled?

Thanks in advance.
 
> You can use geli encryption.
Geli only makes sense if you need protection for systems that are shut down (powered off). A typical use is a powered-down laptop that might get stolen. Another use is that geli-protected hard disks do not need to be sanitized once the key has been removed.

But as long as a geli device is attached (after the passphrase has been entered), it is decrypted blockwise and thus readable to anyone with access to the system.
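If that threat model is acceptable, the setup itself is simple. A minimal sketch, with a placeholder partition name; geli attach is the manual "enter the passphrase" step:

    # one-time setup of the provider that will hold the pool
    geli init -s 4096 /dev/ada1p1            # asks for a passphrase
    geli attach /dev/ada1p1                  # creates /dev/ada1p1.eli
    zpool create secure /dev/ada1p1.eli      # pool lives on the encrypted provider

    # after every reboot (or failover) somebody has to do this by hand:
    geli attach /dev/ada1p1
    zpool import secure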
 
> Geli only makes sense if you need protection for systems that are shut down (powered off). A typical use is a powered-down laptop that might get stolen. Another use is that geli-protected hard disks do not need to be sanitized once the key has been removed.
>
> But as long as a geli device is attached (after the passphrase has been entered), it is decrypted blockwise and thus readable to anyone with access to the system.
I am aware of that. I was under the impression that the OP wants exactly that.

I also believe that Oracle Solaris works in a similar way.
 
> Why? (I'm just curious.)
Because HAST is a network mirror that exposes its resources to whatever sits on top of it. So, imagine the following scenario:

You have 3 disks in each host. You create a raidz1 using /dev/hasta, /dev/hastb and /dev/hastc. Now, for some reason, disk1 dies on machine A. You would expect ZFS to be aware of it. It will not be: running zpool status will show you that the pool is fully functional and not in a degraded state.

Why? Because the pool will be using disk1 from machine B, and disk2 and disk3 from machine A.
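To make that concrete, this is roughly what such a setup looks like (host names, addresses and resource names are made up; HAST devices appear under /dev/hast/<name>):

    # /etc/hast.conf, the same on both machines, one resource per disk
    resource disk1 {
            on machine-a {
                    local /dev/ada1
                    remote 192.0.2.2
            }
            on machine-b {
                    local /dev/ada1
                    remote 192.0.2.1
            }
    }
    # ...resources disk2 and disk3 defined the same way...

    # on both machines: initialize metadata and start hastd
    hastctl create disk1 disk2 disk3
    service hastd onestart

    # on the primary only
    hastctl role primary all
    zpool create tank raidz1 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3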
 
I'm using pefs on top of datasets for encrypted home directories. Because pefs sits on top of ZFS, even the data within ARC/L2ARC is encrypted. Snapshots, as well as incremental and deduplicated ZFS backups, work just as expected: datasets with encrypted pefs data never have to be decrypted for any ZFS operation.
Automatic decryption could be performed, e.g., with pull-based configuration management (Chef or even Ansible) when the machine comes up.
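The per-directory setup is only a few commands. A rough sketch (dataset and paths are just examples; pefs is mounted over the dataset's own mountpoint):

    zfs create tank/home/alice                       # ordinary dataset underneath
    pefs addchain -fZ /tank/home/alice               # one-time: store the key chain
    pefs mount /tank/home/alice /tank/home/alice     # stack pefs over the directory
    pefs addkey -c /tank/home/alice                  # enter passphrase; -c checks it against the chain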
 
> Because HAST is a network mirror that exposes its resources to whatever sits on top of it. So, imagine the following scenario:
>
> You have 3 disks in each host. You create a raidz1 using /dev/hasta, /dev/hastb and /dev/hastc. Now, for some reason, disk1 dies on machine A. You would expect ZFS to be aware of it. It will not be: running zpool status will show you that the pool is fully functional and not in a degraded state.
>
> Why? Because the pool will be using disk1 from machine B, and disk2 and disk3 from machine A.

Isn't that what is needed in an HA scenario, though: the ability to fail over transparently? Granted that it hides a bad disk in a subtle way, isn't there another way to detect the failed drive, like a Nagios check?

Alternately, how would one build an HA ZFS storage, if not using HAST?
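To make the Nagios idea concrete, something like this run on each node is roughly what I have in mind; it asks hastctl directly (since zpool status on the pool built over /dev/hast/* keeps reporting everything as healthy) and assumes hastctl marks a resource with a failed local provider as "degraded":

    #!/bin/sh
    # hypothetical per-node Nagios-style check for HAST resource health
    status=$(hastctl status 2>&1) || { echo "UNKNOWN: hastctl failed"; exit 3; }
    if echo "$status" | grep -q degraded; then
            echo "CRITICAL: HAST resource degraded"
            exit 2
    fi
    echo "OK: all HAST resources complete"
    exit 0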
 
This is a fascinating subject. However, my only real experience is with a hidden facility built into Windows NT 4 Server to mirror data between machines (unfortunately it would get into grand battles with my defragmenter), and in more recent years manually running the cross-platform app Unison. But I need to learn this stuff.

There are many roads to HA. Having gone the N+1 power, RAID, hot-plug drives, and redundant fans route (and spending quite a bit of money on each server) my current approach is a pair of cheap machines (ECC RAM being my one bow to hardware redundancy) and then mirroring the data between machines. Distributed file systems have their place, but that's not my current need and so it sounds like HAST is out (at least for me). If something happens to a machine I want it out of the picture, not pretending that all is okay because it's borrowing its twin's HDD. I'll pull the dead machine from the rack and fix the problem. And if I need to do a major upgrade I can do it one machine at a time.

I'm a complete newbie with BSD (using 90s-era Solaris as a graphical workstation and the "AppleBSD" command line doesn't count). A couple of months ago I saw Mason's YouTube video on FreeNAS, and then re-watched it last night. On second watch I expanded the vid to full screen and noticed there are two file replication utilities supported by FreeNAS: BTSync and Syncthing. There are three approaches to file replication: on a schedule, on demand asynchronously, and on demand synchronously. NT 4 Server, Unison, and BTSync all work on a schedule (e.g. once an hour, or whatever). Not sure about Syncthing. Wikipedia has a handy article comparing file replication utilities.

I think I'll try Syncthing for my new LAN servers first. Got my itty-bitty SSDs (for boot) yesterday, then discovered I need power "Y" cables because my cheap servers of choice are too cheap to have more than one SATA power connector. I await the cables' arrival.

The answer is out there.
 
Because ZFS already handles the failure of single or multiple disks, the better path (IMHO) is to multipath the disk access with 2 controller nodes. I've never been a fan of stacking a whole network, its protocols and abstractions, and yet another layer of storage protocols and abstractions on top of each other for HA storage. I tested GlusterFS and Ceph on our network a while ago. Both were relatively slow and error-prone. If a connection to a node drops for a longer period of time, you mostly have to get it back up and running manually. Both were miles away from anything I would even think about putting data on, let alone in production... I suspect you'd need a horde of admins to always rub it in the right places, and a farm to provide the daily sacrificial goats, if you want to run one of them at production scale...

For HA storage within a LAN, have a look at the BeaST project [1]. It is built around a lot of very nice concepts, and its development has produced a lot of patches that remove pitfalls or bugs that can come up with such a setup. As said: drive failure is already handled by ZFS, so why put a lot of other abstraction layers on top and/or underneath and take that away?

[1] https://mezzantrop.wordpress.com/portfolio/the-beast/
 
Distributed file systems are fascinating, but they don't make any sense unless they are across a very fast network, preferably dedicated. My professor (I may be an old fart, but I'm back in school) was experimenting with a distributed file system on Linux (don't remember which FS), but he saw no speedups when adding more disks or hosts. But then he was running on gigabit Ethernet.

I used to design chips at Compaq's Enterprise Storage Division in Colorado Springs (formerly DEC's facility) until Compaq was absorbed by HP (2002). We did rack-sized (and larger) SANs over Fibre Channel. A SAN differs from a distributed file system in that a SAN is a stand-alone virtualization of storage: from the SAN admin UI you set how many virtual disks, and of what sizes, a particular client sees. The client (typically a server) then formats those disks. It would be insane if the client then used ZFS on top of those blocks. Our SAN (and other competitive brands at the time) used redundant controllers, redundant networks, and redundant disk drives (even the disk drives had redundant interfaces), and offered many of the features now found in ZFS, such as snapshots, copy-on-write, and checksums. A SAN done this way costs about as much as a house.

A distributed file system seems to want to do a similar thing, but in a completely different way. Each participant is both client and server: "I'll let you use some of my disk if you let me use some of your disk." And the overlying FS software is responsible for some level of redundancy, so that if a member goes down the other members don't lose data. It might make sense to put ZFS under that (at the member-server level), maybe. It depends on how deeply the distributed FS software wants to get its roots intertwingled into each client/server.

A quick peek at BeaST gives me the impression that it's a method to use multiple HBAs and disk drives to achieve improved fail-safe operation on a single machine using ZFS.


Thinking out loud here:

Disk drives fail, but then so do other components: fans, power supplies, network links (including the ports at either end), memory glitches (or fails outright), and, very rarely, other bits of hardware too. It doesn't make sense to have redundant disks on a single server unless they are hot-swap (or a repair time of minutes is acceptable), but hot-swap means another layer of connectors, and connectors wear out and fail. All this redundancy adds cost. Yes, it's a whole lot more convenient for the clients if a server never seems to vanish or lose data, even in the middle of a transaction. But to achieve that you need a machine from Tandem (now owned by HP) or something from NEC's FT series, which makes a "fully redundant" server from Dell or Supermicro look downright cheap. NEC's FT ran (back when I looked at them) redundant motherboards, with each core of the CPU lock-stepped with a core from its twin on the other mobo. That's crazy! (but in a fun way) Of course everything else, except the case and passive mid-plane, was redundant as well.

There is no ideal solution. Anything we do has weaknesses. For example, on my LAN I want to try duplicating the whole server. It'd be insanely great if, when a file is modified (create/change/delete), the file were immediately modified on the other server. Asynchronous means the change is queued, but the server's client is told the deed's done (it should happen within a few seconds). Synchronous means the file must be changed on both before the client's told the deed's done. Synchronous is great, and pretty much a requirement for an active/active pair. However, it must fall back to asynchronous when one of the servers goes down. For this application a file-copy scheme that runs periodically (e.g. once per hour) is wholly inadequate.

Active/active pairs are a problem in themselves. How do the servers' clients pick which machine to use at any given moment? Okay, a load-balancing router, but what if that fails? Active/standby is a heck of a lot easier to use. But on bootup who is active, the first server to come online? And if the active server dies, how does the standby promote itself? And if the active server glitches and the standby promotes itself to active (and takes over the former active server's network identity), then what? When the former active server comes back from its momentary brain fart it must play nice when it discovers its twin is now active. And of course how does each server reliably determine whether its twin is alive or dead, or even active or standby? And when a machine comes back from the dead it must reconnect with its twin, become standby, and then catch up on all those queued asynchronous file mods. I'd love to find existing software for FreeBSD that does this.
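(From what I've read so far, FreeBSD's CARP seems to cover at least the "who owns the shared IP" part of this, though not the queued file mods. A sketch of what I think the config looks like, with made-up addresses and assuming em0; the node with the lower advskew claims the shared address, and devd can apparently run a script on MASTER/BACKUP transitions:

    # /boot/loader.conf on both nodes
    carp_load="YES"

    # /etc/rc.conf on the preferred node
    ifconfig_em0="inet 192.0.2.11/24"
    ifconfig_em0_alias0="inet vhid 1 advskew 0 pass s3cret alias 192.0.2.10/32"

    # /etc/rc.conf on the standby node
    ifconfig_em0="inet 192.0.2.12/24"
    ifconfig_em0_alias0="inet vhid 1 advskew 100 pass s3cret alias 192.0.2.10/32"

Corrections welcome if I've got that wrong.)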

For my personal situation it'd be easier to go with mirrored disks under ZFS, and then keep an extra (identical) machine so I have spare parts. If I'm around at the time of failure I can fix it within a few minutes, and keeping the standby machine turned off will save electricity. But what fun is that?


Conclusion:

Of course bullet-proof hardware is great, but it does nothing for human error (or maliciousness). Still need backups.
 
Nice elaboration, Scott. Gigabit Ethernet is rather slow for a distributed file system. One can aggregate more Ethernet NICs for more speed, but how about InfiniBand (aggregated InfiniBand)? Normal PCIe cards, relatively cheap, with 40 Gbps speeds + low latency + a better protocol (no collisions?), right? Better?

An example with 2 hosts, 2 disks in each host: mirrored ZFS zvols built from the two disks on each host (hot-swappable), then HAST across the 2 hosts, and, as the last layer, a normal mirrored ZFS filesystem over /dev/hasta and /dev/hastb? (A rough sketch follows after the two points below.)

1. If one disk fails on some host, it's OK: ZFS tells you, you can hot-swap and rebuild the mirror inside the host, no downtime, no data loss.
2. If one host/server dies completely (motherboard, something else), it will also be OK: the last ZFS layer tells you, you can set up a new server, swap it in, and rebuild the top-layer ZFS mirror, no data loss, no downtime.

Possible in practice?
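Roughly what I mean, with made-up pool/host names and simplified to a single HAST resource backed by a zvol on each host:

    # on each host: local redundancy first
    zpool create localpool mirror /dev/ada1 /dev/ada2
    zfs create -V 500G localpool/hastvol          # the volume HAST replicates

    # /etc/hast.conf, the same on both hosts
    resource hasta {
            on host1 {
                    local /dev/zvol/localpool/hastvol
                    remote host2
            }
            on host2 {
                    local /dev/zvol/localpool/hastvol
                    remote host1
            }
    }

    # on both hosts: hastctl create hasta; service hastd onestart
    # on the current primary:
    hastctl role primary hasta
    zpool create tank /dev/hast/hasta             # or a mirror over hasta and hastb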
 
I'm an old hardware guy with some experience in old-school Windows NT Server. Just getting my feet wet with BSD and working on my CS degree.

I've heard of Infiniband (of course) but never priced out cards or cables until just now. 10 Gbps Infiniband is about 30% cheaper than 10 Gbps Ethernet (a pair of cards and a cable). I found this: http://serverfault.com/questions/678532/connect-two-infiniband-cards-to-each-other-without-a-switch I wouldn't even mess with trying to bond a bunch of 1 Gbps Ethernet ports.

Umm, so I get the ZFS layer, and that you're doing something with HAST, then you lost me. It'd help if I read up on HAST. :)
 
> 1. If one disk fails on some host, it's OK: ZFS tells you, you can hot-swap and rebuild the mirror inside the host, no downtime, no data loss.

If one disk fails, ZFS will report the pool as optimal, not degraded. The reason is that HAST is a network mirror of the individual disks in the two hosts. ZFS runs on top of HAST, so HAST will silently use the mirrored disk from the other host and ZFS has no idea of the failure.

> 2. If one host/server dies completely (motherboard, something else), it will also be OK: the last ZFS layer tells you, you can set up a new server, swap it in, and rebuild the top-layer ZFS mirror, no data loss, no downtime.

If one host dies then it is up to HAST again to report the problem.
 
> A quick peek at BeaST gives me the impression that it's a method to use multiple HBAs and disk drives to achieve improved fail-safe operation on a single machine using ZFS.


Let me correct you. With the BeaST I'm trying to implement a dual-controller, fully active-active (or at least ALUA) storage system, with or without ZFS.
 