Ensuring/verifying original data - backup data integrity

Greetings all,

I have been wondering how to ensure that data transferred to and stored on a backup server in a network comprising heterogeneous OSs/file systems, e.g., from OpenBSD/FFS to FreeBSD/ZFS, is identical to the original data.

Please note that I am not asking about ensuring the integrity of the backed-up data for long-term storage, just ensuring that the transfer to and storage on the backup server did not introduce any corruption.

Edit: After writing the above, I started to wonder whether I have been too strict in separating transfer/integrity verification from long-term storage, i.e., whether the two should be integrated: prepare the original data for long-term storage (PAR, a non-solid archive with redundancy), then transfer that data and verify its integrity.

Any ideas are appreciated.

Kindest regards,

M
 
Answer first, then philosophical discussion about the complication:

When reading the data (as part of copying it to the backup), perform a checksum/hash/signature/CRC operation on the data. There is a variety of algorithms and programs available. I'm personally fond of SHA-256 and SHA-512, since they are still relatively fast to calculate, yet have a very low probability of corruption going undetected because of a hash collision (birthday paradox). Then, for each file you back up, store that checksum somewhere. When restoring from backup, re-read the restored data, recalculate the checksum, and validate.
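A minimal sketch of that workflow with FreeBSD's sha512(1), using invented paths (the same idea works with sha256):

  cd /data/important
  find . -type f -exec sha512 -r {} + | sort -k 2 > /var/db/important.sha512

  # after a restore, recompute on the restored tree and compare
  cd /restore/important
  find . -type f -exec sha512 -r {} + | sort -k 2 > /tmp/restored.sha512
  diff /var/db/important.sha512 /tmp/restored.sha512 && echo "restore verified"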

Obviously, this is not easy. Where are you going to store these checksums, and in such a fashion that they are easy to access? That depends on what you use to do your backups. If the backup simply copies whole directory trees (for example with rsync), then tracking checksums for each file or directory has to be added manually, and that's a lot of extra work.
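One simple way of bolting this onto an rsync-style backup is a per-tree manifest that travels with the data; a rough sketch, with an invented host name and paths:

  cd /data/important
  find . -type f ! -name MANIFEST.sha512 -exec sha512 -r {} + | sort -k 2 > MANIFEST.sha512
  rsync -a /data/important/ backuphost:/backups/important/

  # on the backup server, recompute and compare against the shipped manifest
  cd /backups/important
  find . -type f ! -name MANIFEST.sha512 -exec sha512 -r {} + | sort -k 2 | diff - MANIFEST.sha512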

Now the philosophical discussion. You are worried about the integrity of data during backup. Why? Your backups are being transported over networks, which all have error detection (read about the checksum algorithms in Ethernet and in TCP/IP sometime). They are then stored on disks or tapes, all of which have extensive error correction codes. Why do you not trust your backups, while trusting your local file system? Perhaps that could be ameliorated by having checksums built into the file system, which go all the way to the disk and are stored there separately from the data (so as not to be vulnerable to torn writes and undetected read errors). It turns out ZFS has such checksums, although I don't know how they are implemented, so I don't know how reliable they are; other file systems have them too, and in some cases I understand the implementation well enough to know that they are basically invulnerable.
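If the backup pool happens to be ZFS, those built-in checksums can at least be exercised on demand with a scrub; the pool name here is a made-up example:

  zpool scrub backuppool       # re-reads every allocated block and verifies its checksum
  zpool status -v backuppool   # reports scrub progress and any checksum errors found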

Actually, there is a sensible reason to trust your backups less: The data is copied to/from them using programs such as rsync, and those programs might theoretically have bugs (extremely unlikely), and they copy the data through the OS buffer cache and user space buffers, which can be corrupted. Even checksums in the file system will not protect you from that risk.
 
Hi ralphbsz,

thank you for the answer.

Let me first address the philosophical question. The basic premise of a backup is an implied expectation that the original and backed-up data are identical. I am not familiar enough with the concepts of protection in TCP/IP networks and storage systems to discuss them meaningfully, but it has happened to me on more than one occasion that data received from third parties, which is a similar model to backups, i.e., transport and storage, was corrupted. Based on that experience and my admittedly high paranoia, I want to achieve as high a probability as possible that the critical data are recoverable from a backup.

As to practical implementations, I have found RAR, which has a non-solid archive option with error correction, and PAR. Since you implied that "[t]here is a variety of algorithms and programs available", could you recommend some that are implemented on different OS platforms?

Kindest regards,

M
 
I'm sorry, I was not thinking about archiving tools such as tar, rar, par, and so on. I just looked at the documentation for rar and found that you can turn on BLAKE2 checksums, which are roughly equivalent to sha256 and sha512. I was thinking about tools that specifically do only one thing, namely take a "checksum" of a file, so one can verify that the file is still the same after copying it (to a backup, or restoring it from a backup). And for that, on FreeBSD there is a whole family of checksumming tools, called md5, sha{1,256,512,...}. If you look at their man page, you'll find that they are all the same program.
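To illustrate both (the names are made-up examples, and rar's switches are worth double-checking against its own help):

  rar a -htb -rr5p -s- backup.rar /data/important   # -htb BLAKE2 checksums, -rr5p 5% recovery record, -s- non-solid
  rar t backup.rar                                  # test the archive against its stored checksums

  sha512 /data/report.pdf                           # the standalone FreeBSD tool: prints the file's digest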

Personally, I recommend sha512 for this application, and sha256 if you are completely out of CPU power. Why? Because on modern machines, the CPU demands of checksumming a file are so small that they pale in comparison to the time required to copy the file to/from disk. But from a birthday paradox point of view, sha256 is actually already sufficient (the probability of a coincidence causing a checksum match is tiny). In theory, sha1 and md5 are vulnerable to attack (it is possible to create a fake file that will pass a hash verification test). In practice, this is not a problem for backup/restore, since you are not worried about an intelligent adversary deliberately returning invalid content, but about inadvertent corruption.

There is one problem with using any of these tools as part of a backup suite: You have to run them on your files *before* you back them up, which requires reading the files twice, first to calculate their checksum, then to copy them to the actual backup. In practice, for reasonably sized files, this doesn't make a lot of difference, since the second time around the file will still be in cache. Still, this is why such checksumming is better integrated into archiving tools (the way rar has BLAKE2) than run as separate programs. My home-written backup program uses a home-written copy program that simultaneously reads a file, calculates the sha512 checksum, and writes a copy (such a program is very easy to write: take the existing source code for sha512, either the Berkeley or Linux version, and add an option to write the file).
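For what it's worth, a rough approximation of that read-once behaviour is possible with stock tools too; the file names below are invented:

  # read the source once: tee writes the backup copy while sha512 hashes the same stream
  tee /backup/report.pdf < /data/report.pdf | sha512 -q > /backup/report.pdf.sha512
  # optionally re-read the written copy to catch corruption on the write path
  [ "$(sha512 -q /backup/report.pdf)" = "$(cat /backup/report.pdf.sha512)" ] || echo "MISMATCH"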
 
As standard practice, once a month we pick a random server and try to restore the data (onto a different server). Regardless of how much you trust your backup software and/or procedures, you won't be the first to have been diligently backing up servers on a daily basis only to find out those backups are actually worthless when you need them. Test your backup procedures, and make sure you test your restore procedures too.

(thread moved to storage, discussions about backups make more sense here)
 
Hi ralphbsz,

once again, thank you for replying. I started looking at archiving tools after I realized that backup is one thing and long-term storage is another. As much as I would prefer to use an almost OS-independent open-source program, e.g., TAR, which can be used without compression, from my reading so far it appears to produce a solid archive, which rules it out. PAR versions 1 and 2 are abandoned, and version 3 appears to be closed source and not universally adopted. That leaves RAR and 7z. As best as I could determine so far, RAR is closed source, but it can be used without compression and may add redundancy in its non-solid mode. 7z is open source, can be used without compression, and may be used in non-solid mode, but offers no redundancy.

So, I am still looking.

Kindest regards,

M
 
Hi SirDice,

so what do you do if the restore fails? It seems that a month of backups was lost.

Kindest regards,

M
 
so what do you do if the restore fails? It seems that a month of backups was lost.
If that happens, yes. That's why we test them regularly: if a backup fails the test, we need to find a solution. The whole point of the exercise is to ensure the backups are good, so we know they won't fail when we actually need them. Our backups are really only needed when disaster strikes. Most of the time we can get things working again by simply restoring a snapshot of the VM (all our servers are virtual these days). We have snapshots at various levels, even down to the storage itself. The 'traditional' backups are rarely used.
 
Hi SirDice,

so that I can understand your terminology - what is the difference between a snapshot and 'traditional' backup?

Kindest regards,

M
 
what is the difference between a snapshot and 'traditional' backup?
Snapshots are done at the VMware level; it basically creates a point-in-time copy of the running disks. By traditional backup I mean something like Bacula, for example: a client that runs on the server and regularly sends backups to central storage. Or even something as basic as a script that pulls in some files and copies them to a central server.
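For completeness, by "basic script" I mean nothing fancier than, say, a cron job along these lines (the host name and paths are invented):

  rsync -a --delete web01:/usr/local/etc/ /backups/web01/etc/
  rsync -a --delete web01:/var/www/       /backups/web01/www/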
 
so the snapshots are kept on the same machine, i.e., same location?
Sort of. They're on the same NetApp storage, yes. But the NetApps are synced to other NetApps at other locations (these are the storage-level snapshots/backups).
 