Solved: Using ZFS send/recv to migrate services; continuity of snapshot history

I have an application where we run many services in jails. Each service is isolated from the others and keeps its local data in a ZFS dataset that is passed into the jail. We're working on increasing redundancy in the system. Right now we cover DR needs with ZFS snapshots sent to a backup server. If we need to recover, we spin up a new VM, restore everything from the backup server to the new VM, and start things up again.

I want to break the services and hosts apart, and have been considering a multi-active-node setup. Our total data volume is relatively small (and *incredibly* compressible with ZFS compression), so I don't mind having multiple copies of it across multiple servers.

Let's assume for the use case I am describing that we have very good tracking and orchestration in place.

Is it feasible to change the "live head" of a dataset from server to server with ZFS send and recv? I'll sketch out the sort of thing I would expect to do:

3 servers: VM1, VM2, VM3
3 zpools: pool1, pool2, pool3 (hosted on same-numbered VMs)
1 service which moves around.

  1. Service starts on VM1.
  2. zfs snapshot pool1/service@snap1
  3. zfs send to each of VM2 and VM3
  4. Stop service on VM1
  5. (open question) bring snap1 up on VM2 as pool2/service, via zfs clone, zfs rollback, or zfs promote?
  6. Start service on VM2
  7. zfs snapshot pool2/service@snap2
  8. zfs send to each of VM1 and VM3
  9. Stop service on VM2
  10. repeat steps 5-9, migrating the service to VM3 and snapshotting to snap3
  11. eventually land the service back on VM1, and start using snap3 as the basis for the service
Assume I'd have other services moving around by the same mechanism at the same time.
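
For concreteness, this is roughly what I'd expect steps 2-3 to boil down to (assuming the hosts can reach each other over ssh and the target datasets don't exist yet; snapshot names are just placeholders):

  zfs snapshot pool1/service@snap1
  # full initial send to seed the copies on the other hosts
  zfs send pool1/service@snap1 | ssh vm2 zfs recv pool2/service
  zfs send pool1/service@snap1 | ssh vm3 zfs recv pool3/service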

Is such a setup viable? Are there major gotchas to doing something like this? Are there issues from having different pools, but datasets otherwise named the same? What mechanism would be best for bringing up the snapshot on another VM: clone, rollback, promote?

Am I completely missing anything?
 
Yes, you can change which replica/copy is "in the lead". I would set the "follower" datasets to readonly=on (you can still receive updates via send/recv into a "readonly" filesystem -- readonly applies to changes via the filesystem layer). Only the "active leader" is allowed to write to the (replicated) dataset. (You could further leave them unmounted, but readonly is sufficient to keep them from drifting from the received snapshot.)
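
For example (dataset and snapshot names just follow your sketch and are illustrative):

  # on VM2 (and likewise VM3): keep the follower copy read-only
  zfs set readonly=on pool2/service
  # on VM1: incremental updates still apply fine to a readonly dataset,
  # since readonly only blocks writes through the mounted filesystem
  zfs send -i pool1/service@snap0 pool1/service@snap1 | ssh vm2 zfs recv pool2/service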

So you'll have your (single, active) running service on VM1, periodically snapshotting and updating the followers. To pivot cleanly from VM1 to VM2 you would (rough commands after the steps):
  1. Shutdown the service on VM1
  2. Change VM1's dataset to readonly
  3. Final snapshot and send|recv to update VM2,3
    • At this point any of the copies (on VM1, 2, or 3) can become the "leader" by being made writable and running the service.
  4. Change VM2's dataset to writable
  5. Start service on VM2.
  6. Periodically take new snapshots on VM2, and send to VM1,3.
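Roughly, in commands (pool/dataset names from your sketch; the @cutover snapshot name, the previous @snap1, and the ssh plumbing are just illustrative):

  # steps 1-2, on VM1, after stopping the service
  zfs set readonly=on pool1/service
  # step 3: final snapshot, then incremental update of the followers
  zfs snapshot pool1/service@cutover
  zfs send -i pool1/service@snap1 pool1/service@cutover | ssh vm2 zfs recv pool2/service
  zfs send -i pool1/service@snap1 pool1/service@cutover | ssh vm3 zfs recv pool3/service
  # step 4, on VM2: make its copy the writable leader
  zfs set readonly=off pool2/service
  # steps 5-6: start the service on VM2 and keep snapshotting/sending from there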
If VM1 goes down unexpectedly (and you want to pivot to VM2, and can accept losing any data written after the last snapshot VM1 distributed), rough commands after the list:
  1. Change VM2's dataset to writable
  2. Start service on VM2
  3. Once VM1 is ready to be "recovered"
    • Rollback VM1's dataset to the last snapshot that had been distributed to VM2/3
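A sketch of that recovery once VM1 is back (snapshot names illustrative; @lastcommon stands for the last snapshot all three copies share):

  # on VM1: discard anything written after the last distributed snapshot
  # (-r also destroys any newer, never-distributed snapshots)
  zfs rollback -r pool1/service@lastcommon
  zfs set readonly=on pool1/service
  # on VM2, now the leader: resume incremental sends back to VM1
  zfs send -i pool2/service@lastcommon pool2/service@snap2 | ssh vm1 zfs recv pool1/service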
So there should be no need to clone/promote things here. The main thing is to make sure only one copy has new transactions applied to the "tip" of the dataset, and never keeping more than one copy writable helps enforce this. You also need to make sure you don't change which copy is writable without synchronizing state first. If you have an unplanned exit, you'll need to roll that system back to the last shared state once it is recoverable. (You will lose any work between the last snapshot and the unplanned exit.)
 
Thanks for the very thorough reply! You also covered some of the things about the cutover process I'd left hand-wavy so far. Thank you specifically for the advice on keeping the non-tip copies readonly.

One additional question based on your comments regarding an unclean shutdown of VM1:
Does rolling back VM1's dataset matter so much? Could I just start sending from VM2 to VM1, even though VM1's version of the dataset is "dirty" (has writes later than the most recent snapshot)?
Is this more a question of good hygiene, or a hard requirement that I clean up VM1 and roll it back?

Would you (or anyone else visiting this thread) mind giving some more thematic feedback on the ideas below?

I think I have over-indexed my understanding on the idea of a dataset *belonging* to a pool. In reality, the pool a dataset was originally created in has no fundamental ownership over the dataset through its lifecycle. The name of the dataset matters, because that will have to be globally unique among all my hosts, but it doesn't really matter what I name the pools or which pool any dataset originated on.

In general I am okay with a bit of data loss in the short term. We are much more concerned with long-term data retention. We're mostly capturing and consolidating data. I can easily re-fetch within a few minutes if there's an unplanned cutover and data loss. But I can't lose anything more than a few hours old, because I can't necessarily get that data back. So far we've been lucky / I've been paranoid in building the one-node solution, but it's definitely time to get better failover into place.
 
> One additional question based on your comments regarding an unclean shutdown of VM1:
> Does rolling back VM1's dataset matter so much? Could I just start sending from VM2 to VM1, even though VM1's version of the dataset is "dirty" (has writes later than the most recent snapshot)?
> Is this more a question of good hygiene, or a hard requirement that I clean up VM1 and roll it back?

If your shared state is

vm1:vol/ds@last
vm2:vol/ds2@last

and both vm1 and vm2 have modified the dataset past @last, you can't take a new vm2:vol/ds2@new snapshot and send it to update vm1:vol/ds without first rolling vm1:vol/ds back to @last (the last shared state).
ZFS doesn't provide a way to "merge" changes between diverged states like this. It makes sense when you think of a ZFS dataset as a series of ordered transactions, each building on the previous state.
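
In command form (names from the example above, snapshot names illustrative; receive's -F flag forces a rollback of the target to its most recent snapshot before applying the stream):

  # option 1: explicit rollback on vm1, then a normal incremental receive
  # (on vm1)
  zfs rollback -r vol/ds@last
  zfs set readonly=on vol/ds   # keep it from drifting again before the receive
  # (on vm2)
  zfs send -i vol/ds2@last vol/ds2@new | ssh vm1 zfs recv vol/ds

  # option 2: let the receive do the rollback; -F discards vm1's
  # un-snapshotted changes since @last before applying the stream
  # (on vm2, assuming @last is still the newest snapshot on vm1)
  zfs send -i vol/ds2@last vol/ds2@new | ssh vm1 zfs recv -F vol/ds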

The datasets don't have to have unique filesystem names; you could certainly have:

vm1:vol/service1
vm2:vol/service1
vm3:vol/service1
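
For example, an update between two of those would just be (snapshot names illustrative):

  # the stream doesn't carry any requirement about pool or dataset names;
  # you name the target explicitly on the receiving side
  zfs send -i vol/service1@a vol/service1@b | ssh vm2 zfs recv vol/service1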

For critical data, be sure you have some other form of backup. If something goes south with your data and you replicate it to all three, you've got three copies of bad data. (Although keeping some snapshot retention will give you some ability to revert.)
 
Thanks!

I wasn't thinking of any sort of merging, but wondering if I could clobber VM1's dirty data since @lastcommon by forcing something from VM2's @newsnap. Rolling back isn't a problem here, though.

Backup and archival are separate from this. My focus with these questions is just about the approach we're considering to increase redundancy for HA purposes.
 