What is the best compression method for backups?

The thread title pretty much says it all: What is the best compression method for backups?

Due to the...spectacular...hardware failure that my FreeBSD box suffered last month, I've been making plans to perform automated system backups. In addition to the USB ports on the mainboard, this machine also has a PCI Firewire/USB combo card which has an INTERNAL USB port. I purchased a 128GB USB memory stick and plugged it into the internal port and formatted the drive.

I want to do backups using tar(1) and there is a list of compression algorithms to choose from. I know that the fastest would be no compression, but that also takes up the most space. So I am looking for a balance between time and space. Some of the compression algorithms I have not heard of before. The list as given in the man page is as follows:

  • xz
  • gzip
  • bzip2
  • lrzip
  • lz4
  • lzma
  • lzop
  • compress
I know what gzip, bzip2, lzma, and compress are. I haven't heard of xz, lrzip, lz4, or lzop.
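
I suppose I can get a feel for the time/space trade-off by compressing the same sample directory with each method and comparing. Something like this (paths are examples; -z is gzip, -j is bzip2, -J is xz):

Code:
# compare wall-clock time and resulting size for one sample directory
/usr/bin/time -h tar -czf /tmp/sample.tgz /etc
/usr/bin/time -h tar -cjf /tmp/sample.tbz /etc
/usr/bin/time -h tar -cJf /tmp/sample.txz /etc
ls -lh /tmp/sample.*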

What do you recommend?
 
Whatever compression you use, make sure it works in a way that is error-tolerant in case the compressed file has errors. For example, tape dropouts can easily cause parts of files to get corrupted.

Because of bad experiences in the past, I usually do not compress backups at all. If I cannot extract an archive because its checksum is invalid, all of its contents are lost. Some archivers can skip over such files or uncompress only the parts that are not damaged; others then refuse to unpack anything.

Another sweet method of backing up, when using a ZFS mirrored configuration, is to simply remove a drive and store it in a safe place, then put in another drive and let it resilver. With ZFS compression enabled, another layer of compression probably does not make much sense.
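
As a rough sketch (pool and device names here are just examples), the rotation is only a few commands:

Code:
# pool "tank" mirrored on ada1/ada2; names are examples only
zpool offline tank ada2        # take the disk that goes into the safe offline
# ...physically swap in the fresh disk, then:
zpool replace tank ada2 ada3   # resilver onto the new disk
zpool status tank              # watch the resilver progress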

By the way, what was that spectacular kind of hardware failure? *curious*
 
Whatever compression you use, make sure it works in a way that is error-tolerant in case the compressed file has errors. For example, tape dropouts can easily cause parts of files to get corrupted.

Because of bad experiences in the past, I usually do not compress backups at all. If I cannot extract an archive because its checksum is invalid, all of its contents are lost. Some archivers can skip over such files or uncompress only the parts that are not damaged; others then refuse to unpack anything.

Another sweet method of backing up, when using a ZFS mirrored configuration, is to simply remove a drive and store it in a safe place, then put in another drive and let it resilver. With ZFS compression enabled, another layer of compression probably does not make much sense.

I do not use ZFS. Besides, I'm backing up to a very spacious thumb drive. I doubt there will be any errors on it. I've decided to use geheimnisse's suggestion of xz. I will test to make sure it does work as intended though. The thumb drive is plugged into an I/O expansion card on the INSIDE of the case, so it's a permanent mount, like a hard disk partition.
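
The test itself is simple enough: just let xz and tar read the whole archive back (the file name here is only an example):

Code:
# any corruption shows up as an error; file name is just an example
xz -t backup.txz
tar -tvf backup.txz > /dev/null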

By the way, what was that spectacular kind of hardware failure? *curious*

Well, you could search through my posts...but I'll be nice.

What happened was that I had a hard disk failure and it wiped out the /usr directory. So when the system went to single-user mode, I had no /usr directory. That's because, when I first learned about FreeBSD many years ago, I followed the recommended setup that was in the printed version of the Handbook. That was version 3.4, and I have stayed with that layout ever since.

Some more reading:

https://forums.freebsd.org/threads/63763/
https://forums.freebsd.org/threads/63815/
https://forums.freebsd.org/threads/63830/
 
Ahh, sorry... I actually thought I had missed something, because I do not consider disk failures "spectacular" things.
In my perception they are just occasional and inevitable nuisances requiring drive swapping and resilvering, due to my habit of using cheap old used 15k SAS drives.

The worst thing I experienced in that direction was a system using dual Deathstar drives, of which one failed and the other followed half an hour later. It was the generation that first used glass platters, which were highly sensitive to physical separation of the magnetic layer. I suppose the cause was that the computer was not fully acclimatized and some condensation may have occurred.

I admit I had hoped for a spectacular story like the one with a friend's computer, whose power supply suddenly started emitting a shower of sparks out of the airflow exhaust and then died.
When I examined the thing, it was nicely burnt inside, and the mainboard was fried as well. The hard disk miraculously survived, though.
 
That's spectacular. It was probably the 3.3V line that fried. If it had been the 5V or 12V line, the HD would have been toast.
 
Here's a little piece of software that I wrote in sh. It backs up certain aspects of the system. Use it as you see fit. Needless to say, you have to be root to run this. Note that this is tailored to my system, so you may need to make some modifications before you deploy it. In the future, I may have it delete old backups by counting how many are in the backup directory and doing some math with head or something; a rough sketch of that idea follows the script.

Code:
#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

DATE=`date -j "+%Y%m%d"`
OPTIONS=-cPJvf
BKPATH=/usr/backup/$DATE
EXFILE=.sujournal

# Removes the existing file/directory
# if it exists
remove_exist()
  {
    if [ -e $BKPATH ] ; then
      rm -Rf $BKPATH
    fi
  }

# Creates a new directory, if needed
create_dir()
  {
    if [ ! -e $BKPATH ]; then
        mkdir $BKPATH
        chmod 0700 $BKPATH
      elif [ ! -d $BKPATH ] ; then
        rm -Rf $BKPATH
        mkdir $BKPATH
        chmod 0700 $BKPATH
    fi
  }

# Prints instructions
usage()
  {
    echo ''
    echo 'usage: backup [ all | home | etc | src |'
    echo '  obj | doc | ports | local ]'
    echo ''
    echo 'Any of the above combinations will'
    echo 'be recognized.'
    echo ''
    echo 'The options above will back up the'
    echo 'following components:'
    echo ''
    echo '  all:   Everything'
    echo '  home:  Home directories'
    echo '  etc:   System configuration'
    echo '  src:   System source code'
    echo '  obj:   System object code'
    echo '  doc:   System documentation'
    echo '  ports: The ports tree'
    echo '  local: Local software configuration'
    echo ''
    echo 'If no options are given, then'
    echo 'all is assumed.'
    echo ''
  }

# Archives everything
tar_everything()
  {
    remove_exist
    create_dir
    tar_home
    tar_i386conf
    tar_etc
    tar_usrobj
    tar_usrsrc
    tar_usrdoc
    tar_usrports
    tar_usrlocaletc
  }

# Archives all the home directories
tar_home()
  {
    local exclude
    local target
    target=/home
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/home.txz $exclude $target
  }

# Archives the kernel config
tar_i386conf ()
  {
    local target
    target=/usr/src/sys/i386/conf
    tar $OPTIONS $BKPATH/usr.src.i386conf.txz $target
  }

# Archives the system config
tar_etc()
  {
    local target
    target=/etc
    tar $OPTIONS $BKPATH/etc.txz $target
  }

# Archives the compiled base system
tar_usrobj()
  {
    local exclude
    local target
    target=/usr/obj
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.obj.txz $exclude $target
  }

# Archives the base system source code
tar_usrsrc()
  {
    local exclude
    local target
    target=/usr/src
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.src.txz $exclude $target
  }

# Archives the documentation
tar_usrdoc()
  {
    local exclude
    local target
    target=/usr/doc
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.doc.txz $exclude $target
  }

# Archives the ports tree
tar_usrports()
  {
    local exclude
    local target
    target=/usr/ports
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.ports.txz $exclude $target
  }

# Archives the installed software
tar_usrlocaletc()
  {
    local target
    target=/usr/local/etc
    tar $OPTIONS $BKPATH/usr.local.etc.txz $target
  }

# Begins processing
process()
  {
    create_dir
    if [ -n "$1" ]; then
        for loopvar in "$@"
          do
            case "$loopvar" in
              [Hh][Oo][Mm][Ee])
                tar_home
              ;;
              [Ee][Tt][Cc])
                tar_etc
              ;;
              [Ss][Rr][Cc])
                tar_usrsrc
              ;;
              [Oo][Bb][Jj])
                tar_usrobj
              ;;
              [Dd][Oo][Cc])
                tar_usrdoc
              ;;
              [Pp][Oo][Rr][Tt][Ss])
                tar_usrports
              ;;
              [Ll][Oo][Cc][Aa][Ll])
                tar_usrlocaletc
              ;;
              [Aa][Ll][Ll])
                tar_everything
              ;;
              *)
                usage
              ;;
            esac
          done
      else
        tar_everything
    fi
  }


# Entry Point
process "$@"
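
Something like this is what I have in mind for the pruning, as an untested sketch only (the number of backups to keep is arbitrary):

Code:
#!/bin/sh
# Untested sketch: keep only the newest $KEEP dated directories
# under /usr/backup and remove the rest.
KEEP=5
cd /usr/backup || exit 1
ls -1d [0-9]* 2>/dev/null | sort -r | tail -n "+$((KEEP + 1))" |
  while read -r dir ; do
    rm -Rf "$dir"
  done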
 
RAR, see archivers/rar.

When I want to compress something while also keeping my data safe I always rely on RAR. It's a commercial archiver, usable free of charge (though they obviously want you to get a license, just like ARJ, PKZIP and others in the past) but it has some very interesting features to keep your data safe.

And the features which can do miracles here are:

Code:
  rr[N]         Add data recovery record
  rv[N]         Create recovery volumes
  s[<N>,v[-],e] Create solid archive
A solid archive means that the data gets added in a specifically sorted and compact way, which decreases the archive size even before any compression is applied. But the cool parts are the data recovery records. These are comparable to the PAR CRC methods often used within Usenet; see also this article on PArchive.

Edit: I was mixing up my facts here. This has nothing to do with UUEncode but rather with PAR2. It's embarrassing, but I had forgotten the name of the CRC method; I have updated my post accordingly.

Another thing is that RAR often manages to achieve better compression than other archivers, and the extra space that gains me is often spent on the recovery records, or in some cases on recovery volumes (these are external CRC files). When I have data which is really important to me, I usually keep the archives on one storage medium and the recovery volumes on another.

The cool part is that when I have a multi-volume archive (for example to cater to the 2GB file size limit on some filesystems), the recovery volumes can protect at least one volume completely. Meaning: if one archive volume becomes corrupted, I can fully recreate it using my recovery volumes. And not just a predetermined volume: any random volume can go b0rk and it will be easily recreated.
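
As a rough illustration (paths and sizes are just examples; check rar(1) for the exact switch syntax of your version):

Code:
# solid archive split into 2 GB volumes, with a recovery record
# and one recovery volume (example paths and sizes)
rar a -s -v2g -rr -rv1 /backup/home.rar /home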

Which makes RAR the best archiver for me. But I fully agree with Snurg up there: sometimes using no compression is the better choice.
 
Of course, another option would be ZFS. You can turn on gzip compression, and even set copies=2 if you want some additional protection from bad blocks on the devices. (Counteracts savings from compression, of course.) In addition, you can do your backup with rsync and have all your files available without going through another tool. Need versioning of backups? Use a snapshot!

I know you “do not use ZFS”. But imagine if you did... (if you were using it for your source, you could even use send/recv goodness.) Alas.
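
Just to illustrate how little is involved (pool and dataset names are made up, and the snapshot date is arbitrary):

Code:
# hypothetical backup dataset on a pool called "tank"
zfs create tank/backup
zfs set compression=gzip tank/backup
zfs set copies=2 tank/backup             # extra protection against bad blocks
rsync -a /home/ /tank/backup/home/       # plain files, restorable without extra tools
zfs snapshot tank/backup@2018-03-01      # versioned backups for free
# and if the source lived on ZFS as well:
# zfs send tank/src@snap | zfs recv tank/backup/src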
 
If xz is that bad, then why did the developers include it in the tar options? I can always change the options to tar.

Because it compresses really well. So long as you can avoid errors, any issues with error handling are unseen. For distribution (downloads) decreased download size leads to saved time/money for everyone. Checking checksums is sufficient since you can redownload if something went wrong.

If it’s an archive, or a backup — something you don’t use unless the original is gone and you really need it to work — then resilience to/handling of errors becomes more important. So don’t equate “it is a popular format for distributing software” with “it is a good way to backup my important data.”

Also, quoting the article:
It is said that given enough eyeballs, all bugs are shallow. But the adoption of xz by several GNU/Linux distributions shows that if those eyeballs lack the required experience, it may take too long for them to find the bugs.
 
The article Eric A. Borisch linked to is really excellent. One thing that I found particularly interesting is the aspect of undetected/undetectable errors in archives. I have experienced such things a few times, and I now understand better what happened with some archives that for some reason resulted in damaged, and thus unusable, unpacked files.

And, albeit a bit off-topic, I really like the introductory quotes of the article.
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
-- C.A.R. Hoare

Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away.
-- Antoine de Saint-Exupery
The microcode update stuff in cpucontrol (sysutils/devcpu-data) is a good example.
The code is so contrived and complicated that it is really hard not to use words like "crappy".
It is code of the latter type by Hoare's definition above, and one of the worst pieces of code I have ever touched.

Big chunky functions with many, many gotos, and a good number of bugs that are hard to see because of this. I am starting to understand why the devs withdrew the microcode update and keep quiet about it.
Because I do not want to wait for 11.2, I am currently breaking the code down into smaller functional pieces that are easy to understand and make mistakes easy to spot.
And there are many bit operations, some of them contaminated with hard-to-see bugs. Bit-field structures and unions exist for good reason: they avoid this difficult-to-understand and error-prone ANDing, ORing, and shifting. Using them, it is easy to recode the whole thing in a way that is easy to understand and makes errors easy to spot. I love the KISS principle.

I have had so many failed disk drives, backup tapes, and USB sticks that I am just happy to be able to use ZFS mirrors, stash away a $20 HDD, and put in another one.
In addition, I regularly back up my text-based data onto DVDs to have write-protected backups.

If the computer blows up, I can just get another one, put in a backup drive, boot, and start using it, without all the disruption you experienced.
This is my way of backing up. Cheaper, safer, and far more durable than tapes, flash, and the like.

So what I am having difficulty understanding is why you are considering such a complicated "backup" method with so many possibilities of getting hosed again.
I mean, what makes you not want to use ZFS?
 
Snurg,

It sounds to me like someone was inexperienced in writing software. Either that, or the code was contributed by a paid programmer who was looking out for job security. Granted, schools teach that you should never use goto because of the danger of generating spaghetti code.

With that being said, I *DO* use goto when doing error handling. Since I write system-level software and some very low-level stuff, there are cases where you have no choice. A goto is an unconditional jump to another part of the code, and sometimes that is useful. Consider the following:

Code:
#include <errno.h>
#include <stdlib.h>

/* thing1/thing2/thing3 are placeholder types; the real code
   allocates whatever structures it actually needs */
struct thing1 { int a; };
struct thing2 { int b; };
struct thing3 { int c; };

int
do_work(void)
{
  struct thing1 *ptr1;
  struct thing2 *ptr2;
  struct thing3 *ptr3;
  int errcd;
  int result = 0;

  ptr1 = malloc(sizeof(struct thing1));
  if (ptr1 == NULL)
    {
      errcd = errno;
      goto error1;
    }
  ptr2 = malloc(sizeof(struct thing2));
  if (ptr2 == NULL)
    {
      errcd = errno;
      goto error2;
    }
  ptr3 = malloc(sizeof(struct thing3));
  if (ptr3 == NULL)
    {
      errcd = errno;
      goto error3;
    }

  /* ... do something useful with ptr1..ptr3 ... */

  free(ptr3);
  free(ptr2);
  free(ptr1);
  return (result);

  /* unwind only what was successfully allocated before the failure */
error3:
  free(ptr2);
error2:
  free(ptr1);
error1:
  return (errcd);
}

To me, the error handling just makes sense. You are freeing memory that was allocated before the error occurred.


Anyways, I'm doing some benchmark testing based on space. So far, this is the order I have, from least amount of space occupied:

  1. xz
  2. bzip2
  3. gzip
I'll try some of the others to see how they work.
 
In trying out the other compression methods, I ran into a problem...

It seems that tar does not recognize the -- for long options. I get an error for --lrzip, --lz4, and --lzop. I even installed the libs for those from ports and it still will not work. What's interesting, though, is that it will not recognize --bzip or --bzip2 even though they're in the man page, but it will recognize -j, which selects the bzip2 compression algorithm.

Can someone else check this to make sure that it's not me?

This is on 11.1.
 
Because it compresses really well. So long as you can avoid errors, any issues with error handling are unseen. For distribution (downloads) decreased download size leads to saved time/money for everyone. Checking checksums is sufficient since you can redownload if something went wrong.

If it’s an archive, or a backup — something you don’t use unless the original is gone and you really need it to work — then resilience to/handling of errors becomes more important. So don’t equate “it is a popular format for distributing software” with “it is a good way to backup my important data.”

Also, quoting the article:

++

Also, xz seems to take a longer time to compress, especially when doing a backup on a system without much memory.
 
The thread title pretty much says it all: What is the best compression method for backups?

I know by direct experiment that pbzip2 -1 is the best balance of speed and small result size.
I tried many of these when I was doing MySQL backups.
lzma might be worth retesting... but parallel bzip2 (pbzip2) beats parallel gzip (pigz), and using the highest compression level actually loses you a lot of time for little gain.

Of course, if you care nothing for time, then compress to the max with lzma.
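
For what it's worth, the parallel compressors are in ports (archivers/pbzip2, archivers/pigz) and combine with tar through a simple pipe; paths here are examples:

Code:
# pipe an uncompressed tar stream through a parallel compressor
tar -cf - /home | pbzip2 -1 > /backup/home.tar.bz2
tar -cf - /home | pigz > /backup/home.tar.gz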
 
What are the actual commands you are running?

tar -cPv --bzip2 -f <tar filename> --exclude <exclude filename> <target dir for archiving>

The above command does not work... However, the following command does:

tar -cPvjf <tar filename> --exclude <exclude filename> <target dir for archiving>
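
Based on that, I'm sticking with the short flags for now, which at least exist for the algorithms I care about (-z for gzip, -j for bzip2, -J for xz). For xz, for example:

Code:
tar -cPvJf <tar filename> --exclude <exclude filename> <target dir for archiving>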
 
Have you considered borg? I use its lz4 option when I do backups with it. I have used 550GB of storage on my external drive, and the original size of the disk that gets backed up is 502GB. I have 19 backups on the drive right now. It does deduplication of the data too. I have found it to be reliable and have restored data from it many times. It's also neat because it's designed to make encryption stupidly easy (and it is), though both encryption and compression are optional.
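
In case you want to try it (the port should be archivers/py-borgbackup, if I remember right), the basic flow looks roughly like this; the repository path and archive names are just examples:

Code:
# one-time repository setup (asks for a passphrase)
borg init --encryption=repokey /mnt/external/borg
# compressed, deduplicated backup of /home; {now} expands to a timestamp
borg create --compression lz4 --stats /mnt/external/borg::home-{now} /home
# list archives and restore from one of them
borg list /mnt/external/borg
borg extract /mnt/external/borg::home-2018-03-01T12:00:00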
 