Using ZFS to safeguard my data

I've been using ZFS to safeguard my data and this is my topology. I have a "work" drive on which I make my changes, take a snapshot, then push the snapshot to several backups. I have 2 single-drive backups (single vdev) and 1 mirror backup. All of my backups are offline except at the point in time I sync, rotating off-site ...

Whenever I make a change or update to the data, the sync happens fairly quickly, so it is fairly unlikely that I would lose work, and if I did, the loss would be capped to one snapshot's worth. That said, my backup and "work" drives are all single vdevs with no redundancy, so if there are any faults, I wouldn't be able to recover. Is that 100% true? If ZFS detects an error with the data on any one of those drives, could I use my mirror backup to correct the data (without creating a new copy, just fixing the corrupt data element)? Or, since it isn't part of the same zpool, would it not quite work that way?
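
For reference, the sync itself is just snapshot plus incremental send/receive; a minimal sketch of the kind of commands I run (pool, dataset, and snapshot names here are made up, and it assumes the backup pool already holds the previous snapshot):

    # on the work pool: take a new snapshot
    zfs snapshot work/data@2024-06-01
    # send only the changes since the snapshot the backup already has
    zfs send -i work/data@2024-05-01 work/data@2024-06-01 | zfs receive backup1/data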
 

You could certainly recover from your mirror backup, but it may entail sending an entire dataset, depending on where the error occurs. You can’t repair just the blocks that are bad in this arrangement.
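
To illustrate, a recovery from the mirror backup would look roughly like this (illustrative names, assuming the backup pool is imported on the same machine):

    # replace the damaged work dataset wholesale from the most recent
    # snapshot held on the mirror backup
    zfs send -R backupmirror/data@2024-06-01 | zfs receive -F work/data

That restores the whole dataset from the backup; there is no way to ask a separate backup pool to heal individual bad blocks in the work pool.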
 
Ah, so if I understand what you're saying, then rather than syncing these devices by sending snapshots, I should instead bring a specific device online to sync it, then take it offline when I'm done? Are there any potential risks if I do this?

The work drive I use is online 100% of the time; each of the backups is only online when I need to sync, less than 30 minutes a week. Will that be a problem?
 
The mirror devices are usually in the same physical machine; I don't believe ZFS allows mirror devices over the network? So if, e.g., your machine burns down, your mirror drive burns down with it, while your backup drive in another room survives. Your current setup is the sounder solution IMHO. But I'd suggest backing up at least daily.
 
Understood, that is a technical implementation detail, but it got me thinking ...

Instead of relying on USB, if I wanted, I could also use iSCSI (which I have used before), but it seems like a lot of work if I'm not using it for very long ... The device would then appear locally, and ZFS shouldn't really care whether the actual device is on the network or not.

I could rotate the drives in the pool to sync them. Right now, I have 2 USB enclosures that I unlock and then push ZFS snapshots to, and a separate backup machine altogether with 2 drives (mirrored) that I push to over the network. To simplify everything, why not just add the drives one by one to the pool to sync, then remove them one by one when done? I could eliminate the unnecessary backup PC and just have disks.

So, instead of doing zfs send, I would run whatever the appropriate command(s) are to add the device to the pool. I recall having done this a very long time ago, back when Solaris was a thing, and my recollection is that a ZFS resilver is a background task, whereas zfs send runs in the foreground and completes fairly quickly. The benefit of doing this as a pool is that I could minimize the amount of work the drives have to do to sync in the event I have bad data on any one of the non-mirrored pools.
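
If I go that route, the rotation would presumably look something like this (device names are placeholders; I'd want to verify the exact procedure first):

    zpool attach work da0 da1   # add the backup drive as a mirror of the existing work drive
    zpool status work           # wait for the resilver to finish
    zpool detach work da1       # then split the drive back out before putting it away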
 
What many people apparently forget is to regularly do a recovery manoeuvre. Let's say, once a year. With a mirrored drive, you should be able to get your system back up & running ASAP. With a backup, you'll need a minimal running system and then zfs receive. If you have surrounding scripts, make sure they are in sh(1) syntax, so they can be run even if you have no network access.
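
A bare-bones version of that drill, assuming a freshly installed minimal system and the backup pool on an attached disk (pool, dataset, and snapshot names are placeholders):

    zpool import -o readonly=on backup1                     # import the backup pool read-only, just to be safe
    zfs send backup1/data@latest | zfs receive -F tank/data  # recreate the local dataset from the latest backup snapshot

If that takes you more than an evening, or the scripts you rely on turn out not to be there, the drill has done its job.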
 
Cool.

Yes, I have prepared those scripts over time to make the synchronization less painful. Presently, they manage taking snapshots and sending and receiving them. Granted, ZFS is almost painless; it just needs a little configuration.
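
The core of it is nothing fancy; stripped to the essentials it's roughly this (dataset names and the date format are specific to my setup, so treat it as a sketch):

    #!/bin/sh
    # snapshot the work dataset, then send the increment since the
    # previous snapshot to one of the backup pools
    prev=$(zfs list -H -t snapshot -o name -s creation -d 1 work/data | tail -1)
    now="work/data@$(date +%Y-%m-%d)"
    zfs snapshot "$now"
    zfs send -i "$prev" "$now" | zfs receive -F backup1/data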

I am still playing around with it some more. Apparently I can upgrade the pool to a mirror at any time by attaching a device. I can also offline a drive at any time and then bring it back online whenever I'm ready.

Additionally, I think that if I offline a drive from the pool and my main drive dies, I could bring the offlined drive back online in that original pool.
That was my main point of contention: how do I inspect the drives I'm using as backups if the main one dies? Right now, they're entirely separate pools and I can import them completely separately. That works great, but if offlining drives lets me achieve the same, it seems like a more efficient way to do it.

My approach might be different since most of my equipment is offline. I only bring it online to do the backup.

I will experiment more, and if it works out well, I think I will do the online/offline approach. Instead of sending snapshots, I will merely bring the drive online; when it says it is resilvered, I'll take it back offline, and repeat ...
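
Roughly, each rotation would be (assuming the drive has already been attached as a mirror member once, and shows up as the same device node each time, which I'd want to confirm):

    zpool online work da1    # reconnect the rotated-in drive
    zpool status work        # wait until it reports the resilver completed with no errors
    zpool offline work da1   # then park it again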
 
Additionally, I think that if I offline a drive from the pool and my main drive dies, I could bring the offlined drive back online in that original pool.
I never tried it on FreeBSD, nor on Solaris, where I had a 3-way mirror for the OS & RAID-Z for data. You'll definitely have to verify that first. The main advantage of a real mirror is the reduced time to get your system back up & running. The common guideline is that a mirror is no substitute for regular backups! Estimate the risk of a fire (in one room), a heavy flood or storm (-> no internet for approx. a week), and data corruption, e.g. from a virus. Thus, I'd recommend having 3 months of backups or snapshots stored in a separate room. If your disks are very large (>8TB), a 2-way mirror or RAID-Z1 is no longer considered safe, for statistical reasons.
 
Ok, so it seems like it'd work the way I want.

I presently have 3 copies of my media. I will take the mirror that I have (making 4 copies of the data), then add that to the "active" pool in a mirrored configuration. Once I get it resilvered, I will repeat for the other drive in the original mirror. I will then have a 3-way mirror and only 1 active drive in the mirror.

Whenever I sync the drives, I will run a scrub, which should fix any errors found.
The other 2 drives that I mirror the data to by sending snapshots will stay in their current configuration, in case there is a technical error. After about a year or so of the online/offline mirror configuration, I'll determine whether I can convert the other drives to that paradigm as well.
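
In command form, the plan is roughly this (pool and device names are placeholders; I'll confirm against zpool status at each step):

    zpool attach active da0 da1   # first extra drive; wait for the resilver
    zpool attach active da0 da2   # second one, giving a 3-way mirror
    zpool scrub active            # with the members online, a scrub can repair bad blocks from a good copy
    zpool offline active da1      # then park the copies I store elsewhere
    zpool offline active da2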

I consider all of the drives backups since I store them separately and they're not in a computer until needed.
 
I should mention one immediate pain point: the "active" pool I'm using consists of 2 drives (to bring the space up to 1 TB). When I tried to attach another drive to it to mirror the entire thing, that is when it got a bit complicated. Instead of mirroring the whole thing, it mirrored just 1 drive in the pool.

So, I'm able to do what I want to do, but I'm now using one of my backups as my "active" work drive in the pool, then resilvering the others against that. I will keep the 2 original drives as a backup in case something bad happens ...
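
For anyone following along, what tripped me up is that attach works per vdev, not per pool; with two striped drives, something like this only turns one of them into a mirror (made-up device names):

    # pool was created as a stripe of two drives:
    #   zpool create active da0 da1
    zpool attach active da0 da2   # da2 mirrors only the vdev containing da0; da1 stays unprotected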
 
Do you bundle/concatenate two disks to gain more space? Don't! That's far too dangerous. If one disk breaks, the other one is unusable... In effect, you're roughly doubling the probability of failure. Your disks seem to be old anyway, since they're < 1TB, so chances are they'll start to fail in the near future.
 
Generally no, but ... I migrated from Linux to FreeBSD and wanted to have a copy of my media on FreeBSD without losing any existing copies of my data. I had 2 500GB disks lying around, and my other media drives are 1 TB or greater.

Yes, you're right, I don't normally do that and I am getting away from that as soon as possible whilst also not buying new disks.

I plan to upgrade my disks as they fail, to stagger drive age as well as cost. Since I have 3 copies of my media (and 1 bad copy made of 2 disks ...), I should be fairly safe. I think it'd be quite rare to lose all disks at the same precise time.
 
I plan to upgrade my disks as they fail, to stagger drive age as well as cost. Since I have 3 copies of my media (and 1 bad copy made of 2 disks ...), I should be fairly safe. I think it'd be quite rare to lose all disks at the same precise time.
That seems pretty dangerous to me. Here's why: the #1 risk to data durability is humans. Not disks, not networks, not computers, but humans. Old joke: how do you administer a computer? You hire a man and a dog. The man is there to feed the dog. The dog's job is to bite the man if he tries to touch the computer.

So you're waiting until a disk fails. Then you will replace it and restore the data from your various backups. Have you actually tried and practiced that operation? Here's my suggestion (don't actually do this): take one of the main disks, unplug it, put in a spare, and then perform the restore. I know of so many cases where people did that, only to discover things like: their backups were actually all blank, the backups were unreadable without software whose only copy was on the failed disk, without the failed disk their machine wouldn't even boot, and so on. Even if it is actually feasible, do you have a "playbook" for it, a set of simple instructions you could follow?

My fear is: when you actually have to perform the restore under pressure, you'll make a small mistake (mistype a disk name, for example; ada3 looks pretty similar to ada8 when your glasses are dirty and it is 3 in the morning), and suddenly you have nothing.

Recommendation: Replace the disk before it fails. And practice doing restores.
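
For what it's worth, the replacement itself is a single command; the point is to have actually walked through it once (pool and device names are examples only):

    zpool replace tank ada3 ada9   # swap the suspect disk for the new one and resilver onto it
    zpool status tank              # confirm the resilver completed with zero errors

The restore-from-backup drill is the part people skip, and it's the part worth rehearsing.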
 
I appreciate the concern.

I presently have 5 copies of my data. 1 is in-use in the machine and the other 4 are either on-site or off-site. Of those 4 backups, 2 are part of the mirror, but offlined, and the other 2 are synced via ZFS snapshots. I label my drives and inventory them as to what is on them. Before I format any drive, I consult my inventory tracking database to be safe. I don't expect to lose 5 drives all at once or take so long to make a new copy that my other 4 would also die. Lastly, I have a formal, typed up document for how I do backups or synchronize (in case my script fails and I have to do it by hand). That document also includes restores.

The only thing I changed recently was to migrate from ZFS snapshots for synchronization to relying on an n-way mirror so that the drives are block-for-block identical. I still use ZFS snapshots.
 
While RAID (or a mirror) is not backup, they both play an important role in protecting data you value.

I would suggest that having consistent redundancy (an always-present mirror/RAIDn) on your live system is more beneficial than a third or fourth backup.
 
What makes it a "backup"?

Why would it be more beneficial than having a 3rd or 4th backup?

I'm not sure if I stated it earlier, but my dataset changes very infrequently, about once a month. If I were using this setup for a production server where data was highly volatile, this wouldn't work (well), because I take the drives offline, and by the time I synced there would be a huge delta, meaning all of those changes would be at risk.
 
“Backup” : A copy of your data that is (typically) inaccessible from your main system, preferably in a separate location, and at least on a separate machine / power supply. (My off-the-cuff definition.)

If the wrong bits get corrupted on your main drive (or if it dies completely), you have a much different recovery task (and loss of recent, although perhaps infrequent, data) when you run your main pool without redundancy. Much better to have built-in hardware resiliency (mirror/RAID) on your active pool, and ZFS send/recv (which works really well) to your backup pools. With redundancy, you just take out the bad drive, put in a new one, and resilver with no interruptions or data loss.

You can certainly do your additional redundancy mirror disconnect/reconnect approach, but using send/receive lets you have different retention policies or compression levels for your backup and main system, for example.

Third and fourth backups have their place for extremely valuable data, but I fear at this scale it's additional complexity, which can lead to human mistakes (or laziness, as it requires continual human care and feeding) ... at some level the complexity of the system becomes a liability. Much better to allow the system to maintain its own resiliency against hardware failures, and add a backup to protect against disaster.
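
As a concrete illustration of that flexibility, a backup pool can keep its own settings and snapshot history independent of the live pool, which an identical mirror member cannot (dataset names and property values below are only examples; zstd needs a reasonably recent OpenZFS):

    zfs set compression=zstd backup1/data               # heavier compression on the backup pool only
    zfs send -i work/data@prev work/data@now | zfs receive backup1/data
    zfs destroy work/data@prev                          # prune aggressively on the live pool while the backup keeps its history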
 
It's because if you only have a mirror - which is fine on its own, but... - and you lose data through human mistake (or, e.g., a virus), the mirror drive will be affected as well. Thus a backup is vital in addition to a mirror.
 
What makes it a "backup"?
Let me answer that by summarizing it into a superficial comparison.

Your data is at risk from three main factors:
  1. Device failure, like sector error on a disk (perhaps unrecognized), or complete failure of a disk.
    You protect against those using RAID (more than one copy of the data on multiple disks at once), and by using checksums (which today are built into the better file systems). Typically, protection against device failure is done "online" or "synchronously", with every write to disk immediately reflected on multiple disks. We need to argue how many disks you need for that, but that's a complex and separate discussion.
  2. Site failure, like your computer having all its electronics (including the two mirrored disk drives) destroyed due to a lightning strike, or you having a small fire in the trashcan underneath your desk which melts the computer.
    You guard against that by moving copies of the data to different sites or physical locations. Since this is typically too expensive to do synchronously, it typically has delays; those delays can range from milliseconds (for metro replication via dedicated network infrastructure) to weeks (when shipping of media is involved). You need to make sure your sites are separated enough so they are in a different "failure domain" ... in IT, there is a sad (but true) story of a large corporation that had its backup data center in the other tower of the World Trade Center.
  3. Human failure, like "rm -rf /", or zillions of variations of that: I shouldn't have overwritten that file, who knows which directory Fred used for the presentation from last week's meeting, and so on.
    This is where backup comes in. It preserves a state of the system at certain points in the past.
Now, one can build systems that take care of multiple of these concerns at once. For example, a second disk that once a month receives a complete copy of the file system, and is then stored in a different place, can to some extent handle all these threats, although with significant shortcomings. Like if you suffer site or hardware failure, you may lose 3 weeks worth of work, if the copy is 4 weeks old. Or if you don't notice a human error for 6 weeks, the backup copy is also destroyed. That's OK: engineering is the art of the compromise, so think about what's important to you, and what cost you are willing to bear. My point in listing these three categories above is to help you think about threats, and design systems to address them with reasonable risk, at reasonable cost.
 
Human failure, like "rm -rf /", or zillions of variations of that: I shouldn't have overwritten that file, who knows which directory Fred used for the presentation from last week's meeting, and so on.
This is where backup comes in. It preserves a state of the system at certain points in the past.

You can mitigate human failure with zfs snapshot ... Agreed that a snapshot does not replace a backup, but it still gives some extra security, especially on a mirror or RAID.
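
A minimal illustration of that safety net (dataset and path names are hypothetical):

    zfs snapshot tank/home@before-cleanup
    rm -rf ~/projects                        # the human mistake
    zfs rollback tank/home@before-cleanup    # everything comes back, mirror or no mirror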
 
I asked the question about backup because pulling drives out of my mirror and using them as an off-site backup is a bit unconventional, or at least seems to deviate quite a bit from how most people use mirrors, or even from what I would consider a mirror to be. So far, everything is working normally ...

I think another thing I should do is to restart capturing metrics such as a file count, list of filenames, etc. and see what changes over time across the drives. That might help remove any doubt.
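
Something as simple as this, run after each sync and kept alongside the inventory, would probably do (pool names and paths are made up):

    zfs list -H -o name,used,creation -t snapshot -r backup1 > /var/log/backup1-snapshots.txt
    find /backup1/data -type f | sort > /var/log/backup1-files.txt
    wc -l < /var/log/backup1-files.txt    # file count to compare across drives over time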

I must say that using zfs snapshot + zfs send / receive feels more intuitive than bringing my drives back online one by one and resilvering them. It certainly works, but I always feel a bit uneasy about it. For that reason, I have documented my backup process to avoid any confusion.

I have no issues to report, yet ...
 