ZFS SLOG Mirror split across HAST and local device

So, I have been experimenting with methods for providing "clustering" on ZFS for a while now, and I recently purchased a Supermicro 6036ST-6L Shared Storage Bus system. It provides an interesting opportunity: both nodes in the chassis can share the storage via SAS, and the SAS topology appears to be somewhat configurable for attachment to either host.

I have previously used HAST to provide resiliency for zvol-backed iSCSI targets. On this latest build, I'm moving to an active-passive model where one node owns the pool and uses CARP to manage the service IPs terminating NFS/CIFS access. I've also moved to power-protected NVMe SLOG devices. Unfortunately, the SLOGs are local to each host and not accessible via the shared SAS midplane (as far as I can tell). I would like to keep the SLOG warm on the secondary node by using HAST to present a device from the secondary host to the master and include it in the SLOG mirror. The chassis includes two 10G Intel (ix) interfaces on the midplane that interconnect the two nodes, which should be sufficient for limiting latency and maximizing throughput; access to the cluster from remote hosts will be served by a Chelsio T520-CR card in each node.
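For concreteness, here is a rough sketch of how I imagine wiring that up. All names are placeholders: GPT labels slog-local (the plain local log partition) and slog-hast (the HAST-backed partition, which HAST requires on each node as its local provider), a HAST resource called slog0, and a pool called tank.

  # /etc/hast.conf on both nodes -- hostnames, addresses, and device paths
  # are placeholders; HAST needs a local provider on each node, so the
  # replicated half of the log mirror gets its own partition per node.
  resource slog0 {
          on node-1 {
                  local /dev/gpt/slog-hast
                  remote 10.10.10.2
          }
          on node-2 {
                  local /dev/gpt/slog-hast
                  remote 10.10.10.1
          }
  }

  # On the current master, once hastd is running on both nodes:
  hastctl role primary slog0               # /dev/hast/slog0 appears here
  ssh node-2 hastctl role secondary slog0
  zpool add tank log mirror /dev/gpt/slog-local /dev/hast/slog0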

The interesting bit is that I'd like to run pairs of service VMs in bhyve on both hosts (vm-1 on node-1 and vm-2 on node-2). These are just FreeBSD instances that access storage via NFS on the underlying nodes and serve up low-overhead functions as individual jails (KDC, DHCP, DNS, TFTP/PXE boot, and a management web interface). So I'll be getting I/O from a local VM, the peer VM, and the real LAN.
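Inside each service VM the storage side should be nothing more than an NFS mount of the CARP-managed service IP, roughly like this (the address and dataset path are placeholders):

  # /etc/fstab inside a service VM -- 10.0.0.10 stands in for the CARP
  # service IP; "late" defers the mount until networking is up
  10.0.0.10:/tank/services  /data  nfs  rw,nfsv4,late  0  0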

That being said, I'm a little unsure of the dynamics of how a write lands on the SLOG when it is mirrored, and at what point the sync request is acknowledged to the filesystem. My worry is ending up in a situation where the local SLOG is not in sync with the remote SLOG during a scheduled failover. My intent is to modify the rc scripts I used previously to manage this behavior and provide a mechanism to tell the master to release control of the array and defer to the other node.
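My current understanding (please correct me if I'm wrong) is that a ZFS mirror vdev only completes a write once all of its children have, so the HAST half of the log mirror shouldn't fall behind the local half while the resource is connected; what actually hits stable storage on the peer also depends on HAST's replication mode (memsync, the default, vs. fullsync). Before a planned failover I'd still verify the state with something like this (resource and pool names are placeholders):

  # No dirty extents on the HAST resource and a clean pool should mean
  # both halves of the log mirror are identical.
  hastctl list slog0        # check the "dirty:" counter is 0
  zpool status -v tank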

Tentatively, I'm thinking the process will require the following (a rough script sketch follows both lists):
  1. Pause local VMs and peer VMs.
  2. Stop NFS/CIFS services.
  3. Sync disks.
  4. Validate writes are all flushed.
  5. zpool export on primary
  6. Trigger the failover event with CARP by demoting or shutting down the local CARP interface.
On the secondary, the CARP failover event would then need to perform the following:
  1. zpool import on secondary
  2. start NFS/CIFS services locally
  3. resume local VMs and peer VMs.
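Here is the rough shape of the script I have in mind for the master side, plus the devd hook I'd use on the peer. Pool, resource, VM, and interface names are all placeholders, and stopping the VMs via vm-bhyve stands in for whatever pause/suspend mechanism I end up using:

  #!/bin/sh
  # release-master.sh -- sketch of the planned hand-off, run on the current
  # master. tank, slog0, vm-1/vm-2, ix2, and vhid 1 are placeholders.

  POOL=tank
  HASTRES=slog0
  CARP_IF=ix2
  VHID=1

  # 1. Quiesce the local and peer service VMs.
  vm stop vm-1
  ssh node-2 vm stop vm-2

  # 2. Stop the file services so nothing new lands on the service IP.
  service nfsd stop
  service samba_server stop

  # 3/4. Flush outstanding writes and force the current txg out.
  sync
  zpool sync "$POOL"

  # 5. Export the pool, then hand the HAST resource to the peer.
  zpool export "$POOL"
  hastctl role secondary "$HASTRES"
  ssh node-2 hastctl role primary "$HASTRES"

  # 6. Demote CARP so the peer takes over the service IP
  #    (assumes net.inet.carp.preempt=1 on the peer).
  ifconfig "$CARP_IF" vhid "$VHID" advskew 240

On the secondary, the takeover would hang off a devd notification for the CARP state change:

  # /etc/devd/carp-failover.conf on the peer: run the takeover script when
  # it becomes CARP MASTER for the service VHID.
  notify 0 {
          match "system"    "CARP";
          match "subsystem" "1@ix2";
          match "type"      "MASTER";
          action "/usr/local/sbin/takeover.sh";
  };

  # takeover.sh (sketch): the mirror image of the steps above.
  #   hastctl role primary slog0
  #   zpool import tank            # may need -f after an unclean failover
  #   service nfsd start && service samba_server start
  #   vm start vm-2 && ssh node-1 vm start vm-1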
I'll be updating this thread as I go through the implementation, but I was curious whether anyone has done something similar and if there are any gotchas I should be aware of before treading down this path.

Thank you everyone!
 
I don't share the SLOG: in my case it would be redundant and would require another fast communication channel (beyond the 10 Gb/s backbone via LAGG).
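For reference, the backbone here is just an LACP lagg of the two 10 Gb/s ports, roughly like this in rc.conf (interface names and the address are placeholders for my actual setup):

  # /etc/rc.conf sketch of the 10 Gb/s interconnect
  ifconfig_ix0="up"
  ifconfig_ix1="up"
  cloned_interfaces="lagg0"
  ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 10.10.10.1/24"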
 