[Solved] Making simultaneous ZFS snapshots across multiple computers (distributed ZFS snapshots)

We will be using a distributed database that spreads its data across multiple nodes (computers). Each node holds only a small part of the data, so a consistent snapshot of the database would require taking a snapshot at the exact same point in time on every computer. In practice, I believe the database system would be resilient to snapshots taken a few seconds or minutes apart (less than 5 minutes). However, I would prefer not to rely on that assumption and make the distributed snapshot as atomic as possible, since there are already enough things to worry about by the time one actually needs a backup.

I was thinking that one option could be to have one central storage server for all database instances, with each node using a different dataset on that server, since ZFS makes it easy to snapshot multiple datasets simultaneously on the same machine. However, this would largely defeat the scalability benefits of sharding, since the storage server would become an I/O bottleneck. I was thinking that maybe assigning one hard drive per dataset would address the problem, but that feels like it goes against the intended use of ZFS, which relies on the idea that hard drives are abstracted into pools. From experience, generally speaking, I fear that going against the intended usage could turn into a rabbit hole where life becomes increasingly painful as we tweak the system to use it in a way it wasn't meant to be used. And even if creating one zpool per hard drive turned out to be perfectly fine and supported, other aspects of the single storage server could still become a bottleneck (RAM, CPU, etc.).
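(For reference, that single-machine atomic snapshot is a one-liner: -r snapshots a dataset and all of its descendants in a single transaction. The pool name below is a placeholder.)

    # Atomically snapshot every dataset under the pool 'tank' (placeholder
    # name) in one transaction:
    zfs snapshot -r tank@backup-$(date -u +%Y%m%d)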

And maybe more importantly, restricting ourselves to storing the data of all sharding nodes in the same data center would cancel many of the benefits of using this database system, as it has seamless support for advanced replication scenarios such as data locality (replicating specific parts of the data across geographically distributed data centers so that data is stored close to the user, reducing latency). So the best option for me is to figure out how to create or coordinate a ZFS snapshot on multiple machines with the highest possible degree of atomicity, while keeping the management of the system reasonably convenient. We need to treat every node as an independent commodity computer.

We are talking about multi-terabyte databases serving heavy workloads (including storing and streaming media files to the web, which this database system is optimized for) with high-availability requirements; hence the concerns about I/O and single points of failure in general. Initially we are talking about 3 nodes only, but this will rapidly grow to a couple dozen or hundreds of nodes across several independent deployments and hundreds of terabytes of data. So we need a scalable approach that we will be able to manage with shell scripts as our needs grow.

The database system itself offers pretty good data redundancy and replication features, so ZFS's equivalents would not even be used. The reason I am considering ZFS here is mainly for backups via the snapshot feature, since replication does not help against application bugs that corrupt data, or against a ransomware attack.
 
Absolute synchronization of the snapshots won't be possible without turning ZFS itself into a distributed system. And that would be prohibitively difficult, since you would have to go into the ZFS source code and add hooks to distributed protocols and state machines everywhere. I've worked on a cluster file system that performed globally consistent operations (such as snapshots) under load without significant interruption, but that was a file system designed from the get-go for distributed operation.

Since the perfect path is not available, let's see what we can do instead. The sloppy solution is to issue a command from a single central console machine and fan it out, for example by running dozens of ssh instances in parallel from a shell with "&" at the end. The reason this is sloppy is that each command sees different network delays and different processing times on the receiving host; I would expect variability of at least dozens of milliseconds, perhaps approaching a second.
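A minimal sketch of that fan-out, assuming passwordless ssh and a dataset named tank/db on every node (both are placeholders for your environment):

    # Naive fan-out: fire the snapshot command at every node in parallel.
    # Each ssh lands at a slightly different time, so the snapshots can be
    # tens of milliseconds to around a second apart.
    STAMP=$(date -u +%Y%m%d-%H%M%S)
    for host in node1 node2 node3; do
        ssh "$host" "zfs snapshot tank/db@backup-$STAMP" &
    done
    wait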

Here's a suggestion for getting much closer: your machines are certainly all running ntp, so their system clocks are probably synchronized to within a fraction of a millisecond. The simple solution is to implement a small daemon on each machine that can, on command, pre-schedule a ZFS snapshot. You then notify all the daemons a few seconds ahead of time (saying, for example, "at exactly 10:00:00 perform a snapshot"), and it gets done to within the accuracy of the system clocks.
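A sketch of the scheduling idea in shell, assuming NTP-synced clocks and GNU date (for the sub-second %N format); snapsched.sh, the node names, and the dataset name are all placeholders:

    # snapsched.sh -- runs on each node; waits until the given UNIX time on
    # the local (NTP-disciplined) clock, then takes the snapshot.
    TARGET=$1
    NOW=$(date +%s.%N)
    # coarse sleep to just before the deadline, then spin for precision
    sleep "$(echo "$TARGET - $NOW - 0.05" | bc)" 2>/dev/null
    while [ "$(echo "$(date +%s.%N) < $TARGET" | bc)" -eq 1 ]; do :; done
    zfs snapshot "tank/db@sync-$TARGET"

    # Coordinator: pick a deadline 5 seconds out and notify every node.
    TARGET=$(( $(date +%s) + 5 ))
    for host in node1 node2 node3; do
        ssh "$host" "/usr/local/bin/snapsched.sh $TARGET" &
    done
    wait

Note the two-stage wait: the coarse sleep keeps the CPU idle for most of the interval, and the short spin loop at the end takes care of the last few milliseconds.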

Is there any way you can quiesce the workload without breaking something? If you can afford a sub-second outage, you could even do the following: 200 ms before the snapshot time, the daemon orders all database processes to go to sleep. Those 200 ms should be sufficient to drain in-flight I/Os (that's roughly 20 pending requests per disk at typical latencies). Then perform the snapshot at the pre-agreed time, and a few dozen ms later, re-enable the workload. This is probably the closest you can get to synchronization without re-engineering ZFS.
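One crude, generic way to "order the processes to sleep" is SIGSTOP/SIGCONT; if your database engine offers a proper flush-and-lock command, prefer that. In this sketch 'mongod' and the dataset name are placeholders, and GNU date is assumed as above:

    # Quiesce, snapshot, resume -- run on each node with the pre-agreed
    # epoch time as $1.
    TARGET=$1
    spin_until() { while [ "$(echo "$(date +%s.%N) < $1" | bc)" -eq 1 ]; do :; done; }

    spin_until "$(echo "$TARGET - 0.2" | bc)"   # wake up 200 ms early
    pkill -STOP -x mongod                       # freeze the database processes
    spin_until "$TARGET"                        # let in-flight I/O drain
    zfs snapshot "tank/db@quiesced-$TARGET"
    pkill -CONT -x mongod                       # resume the workload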

All this is relatively simple when all nodes are up. If you have to handle it in the presence of outages (network partitions, down nodes), it gets really hard.
 
It depends on the importance of the data. For the really significant data you can't rely on filesystem-level snapshots; you need the database's own backup mechanism. For MySQL/MariaDB, for example, that means a dump made in a single transaction. That way you get a truly consistent backup.
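With mysqldump that looks like this (the transaction only covers transactional engines such as InnoDB, not MyISAM):

    # Logical dump taken inside a single transaction, so the backup is
    # consistent without locking the server for the duration of the dump:
    mysqldump --single-transaction --all-databases > backup.sql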
 