UFS Backup considerations -- please add!

Hi,

Some time ago I migrated from Linux to FreeBSD, and now I am re-organizing my backups. I have one mirror (a 1-to-1 copy) of my hard disk; these considerations are about my incremental backup and archive, where I keep everything I have ever made on my computers. Not all of that is necessarily on my day-to-day box.

Watching the verbose output, I noticed that there are a lot of duplicate files within and between backups. These mostly come from reorganizing my hard drives -- moving directories to a better location (in or out of a parent directory) and renaming them for easier accessibility (e.g. using all-lower-case names instead of names with a capitalized first character).

I don't have exact numbers, but I suspect there are a lot of duplicates. Some, however, are useful to me, e.g. the copies of all versions of my old websites, where the pictures are duplicated many times over.

So before I rsync my old backups into the one backup that will stay, I did some manual de-duplicating. textproc/meld seems too slow for the bazillion files I have; working manually from the output of sysutils/fdupes seems to be a better way. Since I still haven't figured out the best structure for my backup, I don't use the fdupes -r -d options (yet).
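
In case it is useful to someone else reading along, the non-destructive way I run it looks roughly like this (the paths are just placeholders, not my real layout):

  # List duplicate groups without touching anything; review the list by hand.
  fdupes -r /backup/archive /backup/current > ~/dupes.txt

  # Quick summary: how many duplicates, in how many sets, using how much space.
  fdupes -r -m /backup/archive /backup/current

Only once the structure is settled would I let fdupes -r -d loose, which prompts per set of duplicates for which copy to keep.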

I don't want to mess around too much with my backup, but some de-duplication seems worth the effort for the disk space it frees up. So:

* before making a backup, de-duplicate files and directories;

* empty the trash, unless you also want to back that up;

* rename directories to the names you want to keep (rsync old_name new_name);

* rsync with your backup volume (a rough sketch of this sequence in commands follows below);
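
Roughly, in commands (every path here is a placeholder, and the trash location depends on your desktop environment):

  # 1. de-duplicate first (see the fdupes example above)
  # 2. empty the trash (XDG-style location; adjust for your desktop)
  rm -rf ~/.local/share/Trash/files/* ~/.local/share/Trash/info/*
  # 3. rename directories to the names you want to keep (plain mv, or rsync + delete)
  mv ~/Websites ~/websites
  # 4. sync to the backup volume; run with -n first as a dry run
  rsync -aHv ~/ /backup/home/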

My question is whether any of you de-duplicate and reorganize your backup volumes, and what strategy you use...

TIA,
 
Don't start messing around with backups. Or else they won't be backups.

Better to get more storage, and organise only what is current.
 
You just found the #1 problem of backup software: de-duplication of files, and the false duplication caused by moving files. To understand why this is a problem, we need to understand what backups are for. Are they for handling disk or equipment failure? No, today we usually use RAID or similar forms of redundancy for that. Backups are really protection against human and software errors (inadvertent deletion), and they act as an archive of old state. And with this "historical archive" function, the problem of moved or renamed files comes in: the backup software needs to record, over very long periods, that one file used to be called X and is now called Y, or that it used to be in directory /A and is now in directory /B. The important thing is to recognize that it is the same content.

Dedup in storage systems is a big and complex field; there are lots of research papers about it, and lots of implementations. Some storage systems have built-in dedup; for example, on FreeBSD the ZFS file system does (but with nasty memory usage and performance issues if not administered carefully). If a backup system runs on top of such a file system, it will be quite efficient for duplicated files.
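
If you want to experiment with that anyway, it is a per-dataset switch; keep it limited to a dataset that only holds the backups, and keep an eye on the dedup table. Pool and dataset names below are made up:

  # Made-up pool/dataset names; dedup only applies to data written after it is enabled.
  zfs create tank/backup
  zfs set dedup=on tank/backup

  # Later: how much is it actually saving, and how large is the dedup table (DDT)?
  zpool list -o name,size,alloc,dedupratio tank
  zdb -DD tank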

I know that some commercial backup software handles this correctly internally, without relying on the underlying storage layer. For example, one particular product (I didn't work on it, but many colleagues in my development group did) does the following: when ingesting files into the backup, it takes a checksum of each file, and if a new file is exactly identical to an already backed-up one, it doesn't store the new file; instead it stores a reference to the old one and increments the old one's reference count (so the old backup isn't deleted too early). It also takes diffs when a file is modified, and then decides whether storing the new file or the diff is more efficient (that's called delta compression).
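
To make the delta-compression part concrete: the decision boils down to "is the binary diff smaller than the new file?". A toy version of that check, using bsdiff(1) from the FreeBSD base system (file names are placeholders):

  # Binary diff between the previously backed-up version and the new one.
  bsdiff old_version new_version new_version.patch

  # Store whichever is smaller: the full new file or the patch against the old one.
  if [ "$(stat -f %z new_version.patch)" -lt "$(stat -f %z new_version)" ]; then
      echo "store the patch"      # delta compression wins
  else
      echo "store the full file"
  fi

Restoring from a patch then needs the old version plus bspatch(1), which is exactly why the reference counting matters: the old file must not be expired while something still depends on it.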

At home, I use self-written backup software which works exactly like that: when it finds a new file, it takes a SHA-512 fingerprint of the file, and if that fingerprint matches an already stored file, it only stores a pointer to the old one. But my backup software is very hacky, not usable by other people (barely usable by me), and its restore implementation is pretty awful (you need to manually run scripts and edit lists to account for renamed files).
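
A deliberately stripped-down illustration of the idea (not my actual tool; all paths are placeholders, and file names containing newlines are not handled):

  #!/bin/sh
  # Store each unique file once under its SHA-512 fingerprint, and write a
  # per-run index that maps original paths to fingerprints.
  SRC=$HOME                       # tree to back up (placeholder)
  STORE=/backup/objects           # one file per unique fingerprint
  INDEX=/backup/index.$(date +%Y%m%d-%H%M)

  mkdir -p "$STORE"
  find "$SRC" -type f | while read -r f; do
      sum=$(sha512 -q "$f")                     # FreeBSD's sha512(1); -q prints just the hash
      if [ ! -e "$STORE/$sum" ]; then
          cp -p "$f" "$STORE/$sum"              # new content: store it once
      fi
      printf '%s %s\n' "$sum" "$f" >> "$INDEX"  # known or new: record the pointer
  done

Restore and cleanup then work off the index files: a blob in the store can only be removed once no index references its fingerprint anymore, which is the poor man's version of the reference counting mentioned above.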

Personally, given today's low cost of large disks, I would propose that you follow Geezer's advice: Just don't worry about it, and make extra copies.
 
First, Geezer and ralphbsz, thanks for sharing your considerations! I learned a lot from what seemed like the simple task of having a good backup. BTW, I mainly keep my backup as protection against hardware failure, and sometimes to find something from the distant past. I usually don't delete things by accident.

The biggest lesson I learned is that whatever tree you set up, you'd better think it through so it doesn't need to be changed later. Thoughtlessly carrying over the Linux structure I had, with the capitalized directory names in my home folder, is something I'd do differently now.

The solutions from ralphbsz are great, but exceed my skill level -- there is always something to learn in the future ;-)

For the rest, I must confess that I did some de-duplication on my backup drives -- first synchronizing, then deleting the duplicates after a check. Maybe not how it should be done, but practical for me. The best thing I did was combining the three backup and archive drives I had into one, which makes making backups a lot easier now.

Thanks,

[SOLVED]
 