Remote backups server using FreeBSD, ZFS, and Rsync

jyavenard said:
One note however, the iozone benchmarks are useless here, especially the read speed.
All it is showing is that the data is in RAM or CPU cache...

It's not useless, considering the bulk of our data will be in the ARC, and it's mostly reads to compare the data to what's on the remote servers. Plus, the transfer from one backup server to the other is all reads. Also, since the servers are on UPSes, and the RAID controllers have batteries, all the caches are configured as write-back, so as soon as data hits one of the caches, it's considered "written to disk".

A more valid test would be:
[cmd=]iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>[/cmd]

I'll see if I can run the above, for comparison. However, I'm off on holidays starting tomorrow, so it won't be until after the 13th that I'll be able to try this.
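For reference, a filled-in invocation might look like this (the size cap and paths here are just examples):

Code:
iozone -R -a -i 0 -i 1 -i 2 -g 4g -f /storage/iozone.tmp -b iozone-results.xls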
 
We wanted to maximise the use of all 24 drive bays for data storage.

We didn't want to have to partition one of the drives to make room for the OS, we didn't want to dedicate an entire 500 GB drive to the OS, and we didn't have any extra internal drive bays that could be used.

Thus, we used small (2 GB and 4 GB) CompactFlash drives for the OS install (uses less than 2 GB for / and /usr), and used all 24 drives for data storage. These were small enough that they could be attached to the inside of the case.
 
rsnapshot uses hardlinks and directories on standard filesystems. We looked into doing this originally, but managing the hardlinks and directories and what-not was not fun.

ZFS snapshots are internal to the filesystem. They are accessible at any time via the /<path>/.zfs/snapshot/<snap name>/ directory. And you get all the added bonuses of ZFS (compression, pooled storage, easy admin, etc).
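For example, restoring a single file is just a copy out of the snapshot directory (the pool/filesystem and snapshot names below are made up):

Code:
# take a snapshot of the backups filesystem
zfs snapshot storage/backups@2009-06-28

# every file in that snapshot is then browsable, read-only
ls /storage/backups/.zfs/snapshot/2009-06-28/
cp /storage/backups/.zfs/snapshot/2009-06-28/etc/fstab /tmp/fstab.restored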

We looked at a lot of different remote backup tools, especially ones that use rsync, and even tried coming up with some custom stuff using hardlinks, squashfs, other compressed filesystems, LVM, etc, and just could not find a storage stack that was usable and simple. :)

Then ZFS was imported into FreeBSD (we're a Debian Linux shop, but we use FreeBSD on the firewalls, so getting a FreeBSD storage box was not a hard sell). And the rest is history.
 
jyavenard said:
A more valid test would be:
[cmd=]iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>[/cmd]

Test is still running, but here are some preliminary results (all values in KB/s):
Code:
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
              64      64  781660  985384  1772821  3758700 3559345 2210854                                                             
             128     128  810472 2515594  3043191  4280683 4116568 2775716                                                             
             256     256 1019294 1741907  2327052  3762009 3610222 2051407                                                             
             512     512 1062365 1166212  1816181   530057 2188145 1236732                                                             
            1024    1024  812607 1173034  1218977  1190593 1190593 1022996                                                             
            2048    2048  423683  977971  1390538  1392341 1360799 1142823                                                             
            4096    4096  888472 1130223  1386544  1382416 1382861 1098427                                                             
            8192    8192  688925 1068884  1152028  1176646 1178219 1027706                                                             
           16384   16384  872934 1051746  1160754  1109570 1164057  993998
           32768   16384  663842 1018810  1254801  1227216 1249020  976341                                                             
           65536   16384  926499 1079457  1099078  1214551 1287518 1057221
          131072   16384  568620 1043002   829028  1156366  771242 1012178
          262144   16384  893503 1046252  1225213  1180366 1139746 1023720
          524288   16384  173176  268374  1126224  1030217 1101103  259177
         1048576   16384  163071  231045   279858    37752  266438  200727

So you can see that it's ranging between roughly 200 MB/s and 1 GB/s for writes, and between 300 MB/s and 3 GB/s for reads.

Once the test completes, I'll post the full results.
 
phoenix said:
Test is still running, but here are some preliminary results (all values in KB/s):
Code:
[iozone results table snipped -- see the previous post]

So you can see that it's ranging between roughly 200 MB/s and 1 GB/s for writes, and between 300 MB/s and 3 GB/s for reads.

Once the test completes, I'll post the full results.

The 200 MB/s and 300 MB/s figures are the only values that actually mean something in your setup. Given the kind of data you are writing (mirroring external machines), the cache effect is irrelevant.

It's surprising that you are only achieving 200 MB/s writes, given the number of disks you are using. I get the same speeds with only 6 disks.

But don't quote that you get 3 GB/s reads. It's nonsense when performing disk benchmarks.
 
Yeah, there are a lot of different methods to run rsync on the Windows machine (I personally prefer the rsync.net backup agent, which supports SSH). However, they are all client solutions. I've yet to find a server-hosted solution for this. We'd prefer to keep all the backup configuration on the server; that makes it easier to schedule and manage the network/disk load.

There are SSH daemons for Windows, and there are rsync apps for Windows. But I have yet to find a pair that will allow:
  • the server to connect to the client via SSH
  • the server to initiate the rsync process on the client
  • the client to connect back through the SSH tunnel to push the data to the server

If we could find that, then we could back up everything via a single set of config files on the backup server.
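In effect, all three points above are just a normal server-side rsync pull over SSH; a sketch of what the server would run, with a hypothetical hostname and paths:

Code:
# run on the backup server; rsync starts its peer on the client over
# SSH, and the file data flows back through that same SSH connection
rsync -a --delete-during -e ssh root@winclient:/cygdrive/c/ /backups/winclient/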
 
Cygwin

phoenix said:
There are SSH daemons for Windows, and there are rsync apps for Windows. But I have yet to find a pair that will allow:
  • the server to connect to the client via SSH
  • the server to initiate the rsync process on the client
  • the client to connect back through the SSH tunnel to push the data to the server

Install Cygwin with its openssh and rsync packages, then run 'ssh-host-config'. It should set up everything needed to make sshd a Windows service.

Once you have Cygwin installed, you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.
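For the archives, a minimal sketch of that setup from a Cygwin shell (the exact prompts vary by Cygwin version):

Code:
# install sshd as a Windows service; answer the interactive prompts
ssh-host-config

# start the service (or use the Windows services panel)
cygrunsrv -S sshd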
 
A few questions

Thank you very much for posting all of this information. I've been planning on doing something very similar and it's great to see someone else accomplishing it at a much larger scale than I'm planning.

Are you backing up any databases? If so how are you doing it?

To hazard a guess: if you have MySQL databases, I'm thinking you are either locking all the tables and directly copying the contents of '/var/mysql' (or wherever the databases are stored on the filesystem), or dumping the contents of the databases to flat files before doing the rsync.

What is your retention policy? I see that you were hoping for 13 months, but was that based on an SLA?

When you do hit the ~500 day mark, I'm guessing you will be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, so that you can still have access to data that was backed up more than 500 days prior and still have space for future backups? The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.

Did you consider using the larger Chenbro chassis (50 bay) instead?

And just to satiate the geek in me, do you have any pictures of the servers?
 
gene said:
Are you backing up any databases? If so how are you doing it?

Yes, MySQL databases. We dump the databases to text files, and then both the dumps and the db directory are picked up as part of the rsync run. We've done recoveries using both the dumps and the binary files.
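A minimal sketch of the dump step (the exact flags and paths here are assumptions, not our actual script):

Code:
# dump everything to a flat file that the nightly rsync will pick up
mysqldump --all-databases --single-transaction > /var/backups/mysql-all.sql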

What is your retention policy? I see that you were hoping for 13 months, but was that based on an SLA?

We're aiming for 13 months. It looks like we'll have to move to larger hard drives before the first year is out, though: using 500 GB drives, we only have 2 TB of disk space left. 1 TB drives are coming down in price, and the issues with them appear to be solved.

When you do hit the ~500 day mark, I'm guessing you will be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, so that you can still have access to data that was backed up more than 500 days prior and still have space for future backups? The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.

Yes, that is one possibility we are looking at: keeping the backups from the 7th, 14th, 21st, and 28th of each month, starting in the 14th month, and then keeping those for an extra year.
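The thinning itself would be trivial; something like this, with made-up filesystem and snapshot names:

Code:
# list snapshots oldest-first, then destroy the ones outside the window
zfs list -t snapshot -o name -s creation
zfs destroy storage/backups@2008-06-07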

Did you consider using the larger Chenbro chassis (50 bay) instead?

We didn't know about the 50-bay cases until after we had things installed and working.

We re-purposed servers for this. The 5U mega-servers were originally purchased to act as Xen/KVM/VMWare hosts. Then we realised that CPU and RAM are more important for VM hosts than disk space. So these became the backup servers. And the other 5U servers will become storage servers for the VM hosts (which will probably be net-booted 1U or 2U systems with gobs of CPU and RAM).

And just to satiate the geek in me, do you have any pictures of the servers?

Not currently, no.
 
No. We use 12 Seagate drives and 12 Western Digital drives, bought in four batches of 6 drives each, to try to minimise the "all from the same manufacturing batch" issue (it would really suck if they all died at the same time). A pair of the drives have been replaced already with newer WD drives.
 
gene said:
Install Cygwin with its openssh and rsync packages, then run 'ssh-host-config'. It should set up everything needed to make sshd a Windows service.

Once you have Cygwin installed, you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.

Have you given this a try? I've done it successfully with a 2003 server, and just over the weekend did it with an XP box as well.
 
I'm in the process of testing it.

It's going to require making some (possibly massive) changes to our backup script. For example, there's no sudo in Cygwin.

I've got it working manually. Now to figure out how to automate it, and to test a system restore. And to figure out what needs to be added to the exclude file. :)
 
Quick question for the author... I'm using your set of scripts as a starting point because it all seems pretty sane. Once I'm happy with it, I'll probably change things up a bit.

I'm having one bizarre issue that I can't track down, though. My backups box has much more storage than all the machines it's backing up, so I have not been paying much attention to the space used over the last few weeks. As I was copying some things off, I realized that there's more data than I'd expect on the backups server. After poking around a bit, I found that rsync is simply not deleting files. I see the "--delete-during" option in the script, and I also tried plain old "--delete" with the same result.

Any ideas? I see people with similar problems when they are working from a file list or with wildcards, but the only wildcards I've got are in my exclude lists...

I'm totally stumped by this one.
 
Nope... It grew some more tonight with "--delete-after" as well. It seems like a common problem with rsync; I'll have to figure out how to step through what the script is doing, but scale things down enough so I can see what's happening.
 
Almost there... Since the boxes are active while they are being backed up, rsync throws errors here and there about files disappearing and the like, which is fairly normal.

What I did not know is that rsync skips ALL file deletion operations if it encounters ANY errors. There's an "--ignore-errors" flag, but it's a bit blunt: it ignores all errors, which could be problematic. I have a query out about this on the rsync list.

So if you're using this script, or a similar method, you might want to look for this line in your backup logs:

2010/02/18 01:42:30 [75398] IO error encountered -- skipping file deletion

That does not refer to a single file; it means NO files were deleted in the entire run.
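A quick way to check for it (log location is just an example):

Code:
grep 'skipping file deletion' /var/log/backups/*.log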
 
Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?
 
phoenix said:
Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?

Yep, --delete-during was the initial option I used. The number of errors is small, and they all give a "bad file descriptor (9)" error, which I think rsync feels is a "really bad" error compared to the normal "file disappeared" type errors. Googling around on the "bad file descriptor" error gives me lots of hits on problems with smbfs mounts, but not much else (and I have no smbfs mounts).

I'll try a run with "--ignore-errors" tonight and see what happens. Not an optimal solution, but a good stopgap.
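For anyone else hitting this, the stopgap is just the extra flag; hostname and paths below are illustrative:

Code:
rsync -a --delete-during --ignore-errors -e ssh root@client:/ /backups/client/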
 
phoenix said:
Code:
/sys/*
/proc/*
*mozilla/firefox/*/Cache/**
/var/lib/vservers/vs1/home/*
*/.googleearth/Cache/**
*/.googleearth/Cache/temp/**
/var/spool/squid/**
/backup/*
/var/spool/cups/**
/var/log/**.gz
*/cache/apt/archives/**
/var/lib/vservers/vs1/var/tmp/**
/home/programs/tmp/**
/home/programs/vmware/**
/home/**/.thumbnails/**
/home/**/.java*/deployment/cache/**
/home/**/profile/**
/home/**/.local/Trash/**
/home/**/.macromedia/**

I'm wondering why both "*" and "**" are used at the ends of patterns. Is there any particular reason, since "*" seems to be sufficient?
 
This file just grew organically, with three of us adding to it, so some things have one *, and others have two. No real reason beyond that, I don't think.

I believe the ** in the middle of a path is important, though.

The globbing/regex stuff in rsync is confusing, to say the least.
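For reference, the rule from rsync(1) is that a single "*" matches any part of a path but stops at slashes, while "**" matches anything including slashes. So, for example:

Code:
/var/log/*.gz     # matches /var/log/messages.1.gz, but NOT /var/log/apache/old.gz
/var/log/**.gz    # matches .gz files at any depth under /var/log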
 