View Full Version : Remote backups server using FreeBSD, ZFS, and Rsync


phoenix
May 1st, 2009, 00:20
Updates
2010-04-04:
We've rolled out version 3 of our rsbackup system. See this thread (http://forums.freebsd.org/showthread.php?p=71162) for more information on it. (Version 2 never really saw the light of day, but was used as a stepping stone to the refactored version 3.)

Intro
A co-worker and I developed a centralised backup solution using FreeBSD, ZFS, and Rsync. The following set of posts describes how we did it.

Note: this is fairly long, and includes code dumps from all the scripts and config files used.

Server Hardware
Our central backup server uses the following hardware:

Chenbro 5U rackmount case, with 24 hot-swappable drive bays, and a 4-way redundant PSU
Tyan h2000M motherboard
2x dual-core Opteron 2200-series CPUs at 2.2 GHz
8 GB ECC DDR2-SDRAM
3Ware 9550SXU PCI-X RAID controller in a 64-bit/133 MHz PCI-X slot
3Ware 9650SE PCIe RAID controller in an 8x PCIe slot
Intel PRO/1000MT 4-port gigabit PCI-X NIC
24x 500 GB SATA hard drives
2x 2 GB CompactFlash cards in CF-to-IDE adapters


OS Configuration
We're currently running the 64-bit amd64 version of FreeBSD 7.1. We'll be upgrading to 7.2 once it's released. And we are anxiously awaiting the release of 8.0 with ZFSv13 support.

Two of the gigabit NIC ports are combined using lagg and connected to one gigabit switch. We're considering adding the other two ports to the lagg interface, but we're waiting for a new managed switch that supports LACP before we do.

The 2 CF cards are configured as gm0 using gmirror. / and /usr are installed on gm0.

The 3Ware RAID controllers are configured basically as glorified SATA controllers. Each drive is configured as a "SingleDrive" array, and appears to the OS as a separate drive. Using SingleDrive instead of JBOD allows the RAID controller to use the onboard cache, and allows us to use the 3dm2 monitoring software. Each drive is also named after the slot/port it is connected to (disk01 through disk24).

The 24 harddrives are also labelled using glabel, according to the slot they are in, using the same names as the RAID controller uses (disk01 through disk24).
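The labelling can be scripted. A minimal sketch, assuming the drives show up as da0 through da23 in slot order (the device names are an assumption, so check yours before labelling anything). It only prints the glabel commands so they can be reviewed first:

```sh
#!/bin/sh
# Print the label used for slot N: disk01 .. disk24.
disk_label() {
    printf 'disk%02d\n' "$1"
}

# Print (not run) one glabel command per slot.
# ASSUMPTION: slot N maps to /dev/da(N-1); adjust to your controller's naming.
print_glabel_cmds() {
    slot=1
    while [ "$slot" -le 24 ]; do
        echo "glabel label $(disk_label "$slot") /dev/da$((slot - 1))"
        slot=$((slot + 1))
    done
}

print_glabel_cmds
```

Once the printed list looks right, it can be piped to sh as root.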

The drives are added to a ZFS pool as 3 separate 8-drive raidz2 vdevs, as follows:
# zpool create storage raidz2 label/disk01 label/disk02 label/disk03 label/disk04 label/disk05 label/disk06 label/disk07 label/disk08
# zpool add storage raidz2 label/disk09 label/disk10 label/disk11 label/disk12 label/disk13 label/disk14 label/disk15 label/disk16
# zpool add storage raidz2 label/disk17 label/disk18 label/disk19 label/disk20 label/disk21 label/disk22 label/disk23 label/disk24

This creates a "RAID0" stripe across the three "RAID6" arrays. The total storage pool size is just under 11 TB.

# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        storage           ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0     0
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0
            label/disk12  ONLINE       0     0     0
            label/disk13  ONLINE       0     0     0
            label/disk14  ONLINE       0     0     0
            label/disk15  ONLINE       0     0     0
            label/disk16  ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk17  ONLINE       0     0     0
            label/disk18  ONLINE       0     0     0
            label/disk19  ONLINE       0     0     0
            label/disk20  ONLINE       0     0     0
            label/disk21  ONLINE       0     0     0
            label/disk22  ONLINE       0     0     0
            label/disk23  ONLINE       0     0     0
            label/disk24  ONLINE       0     0     0
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
storage 10.9T 5.11T 5.76T 47% ONLINE -
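
A note on reading that number: as far as I know, zpool list counts parity space for raidz vdevs, so the 10.9T figure is raw capacity rather than space available for data. A quick sketch of the arithmetic, assuming "500 GB" means decimal gigabytes and ignoring metadata overhead (the tib helper is just for illustration):

```sh
#!/bin/sh
# Convert DRIVES x SIZE_GB (decimal gigabytes) into binary terabytes (TiB).
tib() {
    awk -v n="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", n * gb * 1e9 / (2 ^ 40) }'
}

# All 24 drives, parity included: matches the SIZE column of zpool list.
echo "raw, all 24 drives:       $(tib 24 500) TiB"

# Each 8-drive raidz2 vdev gives up 2 drives to parity: 3 x 6 data drives.
echo "data drives only (3 x 6): $(tib 18 500) TiB"
```

zfs list on the pool's filesystems is the better place to check how much room is really left for backups.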

We then created ZFS filesystems for basically everything except / and /usr:

/home
/tmp
/usr/local
/usr/obj
/usr/ports
/usr/ports/distfiles
/usr/src
/var
/storage/backup


We enabled lzjb compression on /usr/ports and /usr/src, and disabled it on /usr/ports/distfiles. And we enabled gzip-9 compression on /storage/backup. We also disabled atime updates on everything except /var.
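
For reference, those settings map onto zfs set roughly as below. This is a sketch that only prints the commands; the dataset names assume each filesystem is called storage/<mountpoint>, and only storage/backup is confirmed by the scripts later in the thread.

```sh
#!/bin/sh
# Print (not run) zfs commands matching the settings described above.
# ASSUMPTION: datasets are named storage/<mountpoint>.
print_zfs_tuning() {
    echo "zfs set compression=lzjb storage/usr/ports"
    echo "zfs set compression=off storage/usr/ports/distfiles"
    echo "zfs set compression=lzjb storage/usr/src"
    echo "zfs set compression=gzip-9 storage/backup"
    echo "zfs set atime=off storage/backup"
}

print_zfs_tuning
```

Child filesystems inherit compression from their parent, which is why distfiles (already-compressed tarballs) needs an explicit compression=off.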

phoenix
May 1st, 2009, 00:21
RSBackup
We developed a "simple" set of shell scripts that perform remote backups of Linux and FreeBSD systems using rsync and ZFS snapshots. The scripts run a sequential series of rsync connections for all servers at a remote site, while also doing multiple sites in parallel. It uses SSH (as user rsbackup, with a password-less RSA key) to connect to the remote server, then uses rsync to send data back through the SSH connection. Backups are stored on a ZFS filesystem (/storage/backup/), with a separate directory for each site, and separate sub-directories for each server. Before each nightly backup run, a ZFS snapshot is taken of the /storage/backup filesystem, named using the current date, in YYYY-MM-DD format.

We called our solution rsbackup.

rsbackup is configured to run every night starting at 7 pm, via root's crontab. The crontab looks like this:

SHELL=/bin/sh
MAILTO=root
PATH=/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin

#min hour day month weekday command
*/15 * * * * /root/scripts/check-fs.sh

# Take a snapshot of the backups filesystem
50 18 * * * /root/rsb/rsb-snapshot

# Run the rsbackup script
0 19 * * mon-fri /root/rsb/rsb-wrapper force
0 19 * * sat-sun /root/rsb/rsb-wrapper start
50 6 * * mon-fri /root/rsb/rsb-wrapper stop

The crontab above shows the helper scripts that are used:

check-fs
rsb-snapshot
rsb-wrapper
rsb-one


check-fs checks the status of the gmirror and the zpool to make sure there are no checksum errors, dying drives, missing drives, degraded vdevs, and so on. If there are, then an e-mail is delivered with the details of the issues.

rsb-snapshot pulls in the rsbackup config file to determine which filesystem to snapshot, then creates a snapshot using the current date as the snapshot name (YYYY-MM-DD format).

rsb-wrapper pulls in the rsbackup config file, then checks if any other rsbackup processes are running. If there are any, a warning is displayed and the wrapper exits. If there are none, then the backup process is started. rsb-wrapper is also run just prior to 7 am, to check if any rsync processes are still running, and to kill them if they are (we didn't want backups running during the day, as they would hog all the upload bandwidth for the remote sites). Error and warning messages from all the log files are then sent via e-mail to the address listed in the crontab.

rsb-one can be used from the command-line to do a manual backup of a single server at a single site. It uses the same config file as the rest of the scripts. Command syntax is:
rsb-one -s sitename -h hostname

The gist of the backup process is this:

every night, a ZFS snapshot is created of the /storage/backup filesystem. This becomes the historical backup for everything, as one can navigate through all the snapshots via /storage/backup/.zfs/snapshot/<snapname>/<sitename>/<server>/.
every night, a full rsync is done of virtually every file on the remote systems against a local directory for that server.


We are currently backing up 102 remote servers. The backups start at 7pm, the rsync for the last server starts around 2am, and everything is finished by 4am.

The size of the snapshots fluctuates daily, but the average is under 10 GB. The base storage required for those 102 servers is ~ 4 TB, which gives us over 500 days of daily backups, well over the 13 months we were hoping for.
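
To put the retention target in numbers, here is a rough sketch that treats every snapshot as the 10 GB upper bound quoted above (so it overestimates); snapshot_space_gb is a throwaway helper:

```sh
#!/bin/sh
# GB of snapshot space needed to keep DAYS daily snapshots of DAILY_GB each.
snapshot_space_gb() {
    echo $(( $1 * $2 ))
}

# 13 months is taken as 365 + 31 = 396 days.
echo "13 months (396 days): $(snapshot_space_gb 396 10) GB"
echo "500 days:             $(snapshot_space_gb 500 10) GB"
```

So the 13-month goal needs about 4 TB on top of the base copy, and 500 days about 5 TB.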

phoenix
May 1st, 2009, 00:28
Remote Server Config
The rsbackup system requires a bit of setup on both the central backup server and the remote server(s). The following shows how to configure a Debian Linux host for backups.

On the remote host


install rsync (preferably 3.0.x, as it has much reduced CPU and RAM usage, and it starts sending file changes while generating the file list)
create backups group
addgroup --system backups
create rsbackup user
adduser rsbackup
manually set the password to * in /etc/shadow to prevent console logins; the shell can be set to /bin/sh, as there are no interactive logins
add rsbackup to group backups
adduser rsbackup backups
edit sudo config to allow backups group to run rsync with no password
visudo
Cmnd_Alias RSYNC = /usr/bin/rsync
%backups ALL=(ALL) NOPASSWD: RSYNC
create .ssh/ directory in ~rsbackup/
mkdir ~rsbackup/.ssh
create blank authorized_keys file
touch ~rsbackup/.ssh/authorized_keys
set correct permissions on .ssh/ directory and .ssh/authorized_keys file
chmod 700 ~rsbackup/.ssh
chmod 600 ~rsbackup/.ssh/authorized_keys
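
sshd silently ignores the key if these permissions are too loose, so it is worth double-checking them. A small sketch that should behave the same on Linux and FreeBSD (perms_ok is a made-up helper, not part of rsbackup):

```sh
#!/bin/sh
# Succeeds when PATH has exactly the octal permissions MODE.
perms_ok() {
    [ -n "$(find "$1" -prune -perm "$2" 2>/dev/null)" ]
}

# Example, run as root on the remote host:
# perms_ok ~rsbackup/.ssh 700 && perms_ok ~rsbackup/.ssh/authorized_keys 600 \
#     && echo "permissions look right"
```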


On the central backup server

copy public SSH key for rsbackup to remote server
scp /root/rsb/conf/rsbackup.rsa.pub remoteserver:
on the remote server, move the rsbackup.rsa.pub file to ~rsbackup/.ssh/authorized_keys
test SSH logins using the key (must be done as root)
ssh -l rsbackup -i /root/rsb/conf/rsbackup.rsa -p <portnum> <server>
test that rsbackup can run rsync via sudo without passwords, but cannot run any other commands via sudo
sudo /bin/ls (should fail)
sudo rsync --version (should work)

phoenix
May 1st, 2009, 00:31
Central Backup Server Config
All rsbackup-related stuff is (currently) stored under /root/rsb/ (ideally, it should be stored under /usr/local/ to follow hier).

The example below shows the configuration steps used for testserver.


If this is the first server added for a site, create a site directory, using the DNS name for the site, under /root/rsb/sites/
mkdir sites/testsite
Create/edit the site_defaults file
cp /root/rsb/conf/site_defaults /root/rsb/sites/testsite/
ee sites/testsite/site_defaults
Create a config file for the server.
ee sites/testsite/testserver
Add/edit at least the following:
RSYNC_SERVER=testserver.hostname
SERV_DIR=testserver
Add any overrides for items in the global defaults (mainly SSH port to use)
If there are special excludes for this server, add the following
RSYNC_EX_SERVER=$SITE_CONF/exclude.testserver
Add/edit the exclude file listed above
ee sites/testsite/exclude.testserver
Connect via ssh to add the host to the known_hosts file
ssh -l rsbackup -i /root/rsb/conf/rsbackup.rsa testserver.hostname
Add the site to the global sites list
ee conf/sites.lst
Rename the server config file to end in .cfg (only server config files ending in .cfg are processed)
mv sites/testsite/testserver sites/testsite/testserver.cfg
The site and server(s) will be picked up in the next run of rsbackup via cron
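
The "create a config file" step can be scripted. A sketch only: new_server_cfg is a hypothetical helper, and it sets SERV_DIR to the server name, as the example server config file later in the thread does. The file is written without the .cfg suffix, so rsbackup ignores it until it is renamed.

```sh
#!/bin/sh
# Write a minimal server config, without the .cfg suffix.
# Usage: new_server_cfg <site-config-dir> <server-name> <server-fqdn>
new_server_cfg() {
    cfg="$1/$2"
    # Refuse to overwrite a pending or an active config.
    if [ -e "$cfg" ] || [ -e "$cfg.cfg" ]; then
        echo "Error: a config for $2 already exists in $1" >&2
        return 1
    fi
    {
        echo "RSYNC_SERVER=$3"
        echo "SERV_DIR=$2"
    } > "$cfg"
}

# Example:
# new_server_cfg /root/rsb/sites/testsite testserver testserver.hostname
```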


Try to only add 1 or 2 new servers per day. The initial rsync run takes a long time, as it has to copy over every file in the system. Any still-running rsync processes will be killed at 7 am weekdays, so the initial sync may be spread across multiple days. Adding servers on Friday is best, as the rsync processes will run until complete or Monday at 7 am, whichever comes first.

phoenix
May 1st, 2009, 00:32
Restoring From Backups
Every snapshot that is created can be navigated via the hidden .zfs/snapshot/<snapshotname>/ directory hierarchy. The .zfs directory is placed in the root of the ZFS filesystem. As you navigate through the snapshot hierarchy, ZFS automatically mounts the snapshot as a read-only filesystem. You can also manually mount the snapshot as read-only using mount -t zfs. In this way, you can restore files from either the most recent backup (the normal filesystem hierarchy) or from any previous backup (the snapshot hierarchy).

To manually mount a snapshot (as root):
mount -t zfs -r /storage/backup@2008-09-12 /mnt

You can clean up the output of mount by periodically running (as root):
mount | grep 'backup@' | awk '{ print $3 }' | xargs -n 1 umount

Individual Files/Folders

SSH to the central backups server
Switch to root
cd into the /storage/backup/.zfs/snapshot/ directory
Do an ls to see all the available snapshot dates
cd into the desired snapshot directory
cd into the <site>/<server>/ directory
find the file/folder you need and scp it back to the server in question
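
The directory you end up in always has the same shape, so it can be built from the date, site, and server. A tiny sketch (snapshot_path is just an illustrative name; the site and server below are the test names used earlier):

```sh
#!/bin/sh
# Build the path to one server's files inside one dated snapshot.
# Usage: snapshot_path <YYYY-MM-DD> <site> <server>
snapshot_path() {
    echo "/storage/backup/.zfs/snapshot/$1/$2/$3/"
}

snapshot_path 2008-09-12 testsite testserver
```

The output can be handed straight to ls, scp, or rsync on the backup server.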


Complete System Restore - Linux
In order for this to work correctly, the username you use in the rsync command will need to be part of the sudoers users/groups that can run rsync on the central backup server.

Boot replacement server off a Linux LiveCD (Knoppix/Kanotix/etc).
Partition the drive(s) as needed using cfdisk (see fstab in the server's etc directory on the central backup server).
Format the partitions as needed (see fstab in the server's etc directory on the central backup server).
mkfs -t ext3 /dev/sda1
mkfs -t xfs /dev/sda5
mkfs -t xfs /dev/sda6
and so on
Mount the partitions under /mnt.
mount -t xfs /dev/sda5 /mnt
mkdir /mnt/boot /mnt/usr /mnt/home /mnt/var
mount -t ext3 /dev/sda1 /mnt/boot
mount -t xfs /dev/sda6 /mnt/usr
and so on
cd to /mnt (not really needed, but a good safety-net, just in case).
Run rsync to copy everything from the central backup server to the local server
Note 1: --numeric-ids is *very* important, do not forget this option, or things will fail in spectacular ways!
Note 2: -H is needed to restore hardlinks to various files. Without this, the restore will be significantly larger.
# rsync -vaH --partial --stats --numeric-ids --rsh=ssh --rsync-path="sudo rsync" username@backupserver:/storage/backup/<site>/<server>/ /mnt/
Grab a coffee as it does the transfer. Time depends on the size of the dataset being restored.
Install GRUB into the boot sector of the harddrive.
grub-install --no-floppy --recheck /dev/sda
grub-install --no-floppy /dev/sda
Reboot the server to make sure everything comes up correctly.


For the rsync step, you can use a ZFS snapshot directory to restore the server to any day. Instead of /storage/backup/<site>/<server>/ you can use /storage/backup/.zfs/snapshot/<snapshotdate>/<site>/<server>/

phoenix
May 1st, 2009, 00:32
Complete System Restore - FreeBSD
In order for this to work correctly, you will need to be part of the sudoers users/groups on the central backup server that can run rsync without requiring a password.

First, do a minimal install of FreeBSD, to make the drives bootable:

Boot replacement server using the FreeBSD install CD.
Select Canada as the country.
Select USA ISO as the keymap.
Select Standard install.
Select OK on the warning message.
Delete all existing partitions. Press A to create a single partition for FreeBSD. Mark it as Bootable. Press Q to save the changes.
Select Standard MBR (no boot manager).
Select OK on the warning message.
Create the partitions needed (see the fstab under /storage/backup/<site>/<server>/etc/). Press Q to save the changes.
Select Minimal install.
Select FTP Passive for the installation media (or CD/DVD if using the full CD1).
Select Main Site.
Select the correct network device (xl0 on my test server).
Select No for IPv6.
Select Yes for DHCP.
Enter the correct hostname.
Select Yes on the warning message.
Wait as it does the minimal install.
Select OK on the completion message.
Select No for "function as a network gateway".
Select No for "configure inetd".
Select No for "enable SSH login".
Select No for "anonymous FTP".
Select No for "NFS server".
Select No for "NFS client".
Select No for "customize system console".
Select Yes for "set this machine's time zone".
Select No for "Is this machine's CMOS clock set to UTC".
Select America -- North and South for region.
Select Canada for country.
Select Pacific Time - west British Columbia for timezone.
Select Yes for "PDT".
Select No for "Linux compatibility".
Select No for "mouse".
Select No for "browse the package collection".
Select Yes for "add any initial user accounts".
Select User for "User and group management".
Fill in the blanks. The exact contents don't matter, as the rsync restore will wipe this out. This is just for testing during the initial boot.
Select Exit for "User and group management".
Select OK on the warning message.
Type root's password twice.
Select No on the warning message.
Press Tab key to get to "Exit Install". Press enter.
Select Yes on the warning message to exit the installer and reboot the system.


Test that the new install boots correctly, and that you can login from the console.

Then follow the steps below to restore the data from the backups server.


Boot replacement server off a FreeBSD LiveCD that includes rsync (Frenzy/FreeSBIE/etc). Frenzy 1.1 seems to work best.
Type nohdmnt at the boot menu, to prevent the existing filesystems from being mounted automatically.
Enable modifying of drives while the system is running.
sysctl -w kern.geom.debugflags=16
Create a directory to use for the mount point of the harddrive partitions.
mkdir /root/media
Mount the partitions under /root/media
mount /dev/ad4s1a /root/media
mkdir /root/media/usr /root/media/var /root/media/home
mount /dev/ad4s1d /root/media/usr
mount /dev/ad4s1e /root/media/var
mount /dev/ad4s1f /root/media/home
and so on
Change to /root/media (not really needed, but a good safety-net, just in case).
Run rsync to copy everything from the central backup server to the local server.
Note 1: --numeric-ids is *very* important, do not forget this option, or things will fail in spectacular ways!
Note 2: -H is needed to restore hardlinks to various files. Without this, the restore will be huge, and will fail. FreeBSD uses hardlinks a lot!
rsync -vaH --partial --inplace --stats --numeric-ids --rsh="ssh" --rsync-path="sudo rsync" username@backupserver:/storage/backup/<site>/<server>/ /root/media/
Grab a coffee as it does the transfer. Length of the restore depends on the size of the dataset being restored.
Reboot the server, without any CDs in the drive, to make sure everything comes up correctly. Test that you can login from the console.

For the rsync step, you can use a ZFS snapshot directory to restore the server to any day. Instead of /storage/backup/<site>/<server>/ you can use /storage/backup/.zfs/snapshot/<snapshotdate>/<site>/<server>/

phoenix
May 1st, 2009, 00:32
The rsbackup Script
This is version two of our prototype rsbackup. It's still a little rough around the edges, and spread across too many separate files. It works well for us, but it's not as pretty as it should be. :) There are a couple of different coding styles, and some options may not be in use anymore. We're hoping to clean it up over the summer, when school is not in session (we don't want to disrupt the backups during the school year). We'd also like to amalgamate rsbackup, rsb-one, and rsb-snapshot together.

#!/bin/sh

Defaults="rsbackup.conf"
. $Defaults

# Functions used in this script
do_rsync()
{
    SITE_CONF="${SERVERS_DIR}/${1}"

    # Find each .cfg file in the passed dir, load defaults, site_defaults, server_defaults, and run the rsync
    for I in $( find $SITE_CONF -type f -name "*.cfg" ); do

        # Load standard defaults
        . $Defaults

        # Load site-wide defaults
        if [ -f $SITE_CONF/site_defaults ]; then
            . $SITE_CONF/site_defaults
        fi

        # Load server-specific options
        if [ ! -z "$I" ]; then
            if [ -f "$I" ]; then
                . $I
            fi
        fi

        # Make sure the site directory exists
        if [ ! -e $BACKUP_DIR/$SITE_DIR ]; then
            mkdir $BACKUP_DIR/$SITE_DIR
        fi

        # Just to make typing easier
        S_DIR="$BACKUP_DIR/$SITE_DIR/$SERV_DIR"

        # Make sure the directory for the server itself exists
        if [ ! -e $S_DIR ]; then
            mkdir $S_DIR
        fi

        echo ""
        echo "====>> $( date "+%b %d %Y: %H:%M" ) Starting rsync for $RSYNC_SERVER" >> $logfile
        echo ""

        # The actual rsync command
        rsync $RSYNC_OPTIONS $RSYNC_SITE_OPTIONS $RSYNC_SRV_OPTIONS \
            --exclude-from=$RSYNC_EX_DEF $RSYNC_EXTRA_EXCLUDE \
            --rsync-path="$RSYNC_EXEC" --rsh="$RSYNC_SSHCMD -p $RSYNC_PORT -i $RSYNC_SSH_KEY" \
            --log-file=/var/log/rsbackup/$RSYNC_SERVER.log \
            $RSYNC_USER@$RSYNC_SERVER:/ $S_DIR

        echo ""
        echo "====>> $( date "+%b %d %Y: %H:%M" ) Ending rsync run for $RSYNC_SERVER" >> $logfile

    done
}


# Run the rsync for each site listed in sites.lst.
# Each site runs in the background, so sites are done in parallel,
# while the servers within a site are done one after another.
for site in $( cat ${CONF_DIR}/sites.lst ); do
    echo ""
    echo "****>> $( date "+%b %d %Y: %H:%M" ) Starting sequential run for servers at ${site}" >> $logfile
    do_rsync ${site} &
    sleep $SLEEPTIME
done

phoenix
May 1st, 2009, 00:33
rsbackup.conf
This is the main configuration file that all the scripts use. It lists where the log files should be stored, how long to wait between sites, the default options used for the rsync command, and so on.

RS_DIR="/root/rsb"
SERVERS_DIR="$RS_DIR/sites"
CONF_DIR="$RS_DIR/conf"
logfile="/var/log/rsbackup/rsbackup.log"

# Where all the backups are stored
BACKUP_DIR="/storage/backup"

#Default options for rsync
# RSYNC_OPTIONS are the defaults for the rsbackup system
# RSYNC_SITE_OPTIONS are the overrides that apply to all systems at one site (set in sites/<site>/site_defaults file)
# RSYNC_SRV_OPTIONS are the overrides that apply to one specific server (set in sites/<site>/<server>.cfg file)

RSYNC_OPTIONS="--archive --stats --numeric-ids --delete-during --partial --inplace --hard-links"
RSYNC_SITE_OPTIONS="--compress --compress-level=9"
#RSYNC_SITE_OPTIONS=""
RSYNC_USER="rsbackup"
RSYNC_PORT="55556"
RSYNC_EX_DEF="$CONF_DIR/exclude.default.linux"
RSYNC_SSH_KEY="$CONF_DIR/rsbackup.rsa"
RSYNC_EXEC="sudo rsync"
RSYNC_EX_MEDIA="$CONF_DIR/exclude.pass1"
RSYNC_EX_SERVER=""
RSYNC_SSHCMD="/usr/local/bin/ssh"

SLEEPTIME=250
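
The override order matters: rsbackup sources this file first, then the site's site_defaults, then the server's .cfg, so the last assignment wins. A self-contained sketch of that layering, using throw-away files and made-up values:

```sh
#!/bin/sh
# Demonstrate the override order with three temporary files.
# The values here are invented for the demonstration.
tmp=$(mktemp -d)
echo 'RSYNC_PORT="55556"; RSYNC_SITE_OPTIONS="--compress"' > "$tmp/rsbackup.conf"
echo 'RSYNC_SITE_OPTIONS=""' > "$tmp/site_defaults"
echo 'RSYNC_PORT="22"' > "$tmp/server.cfg"

. "$tmp/rsbackup.conf"   # global defaults first
. "$tmp/site_defaults"   # then the site, overriding the defaults
. "$tmp/server.cfg"      # then the server, overriding both

echo "port=$RSYNC_PORT site_options='$RSYNC_SITE_OPTIONS'"
rm -rf "$tmp"
```

This is also why do_rsync reloads the global defaults for every server: it stops one server's overrides from leaking into the next.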

phoenix
May 1st, 2009, 00:33
rsb-wrapper
This is the wrapper script that is run via cron.

When called with the parameter force, it will start rsbackup, no questions asked.

When called with the parameter start, it will check for other running rsbackup processes. If there are any, then it outputs a warning message and exits without starting rsbackup. If there are no running rsbackup processes, then it starts one.

When called with the parameter stop, it will unquestionably kill any running rsync and rsbackup processes. It will then tail and grep all the log files for warnings and errors, and echo them so cron can send them as an e-mail.

#!/bin/sh

# Set custom PATH
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# Set the default exit value
exitval=0

# Grab the PID of the current script
pid=$$

# Get info on how we were called
curdir=$( /usr/bin/dirname ${0} )

# Pull in the config file for rsbackup, which should be in the same directory we are called from
if [ -e ${curdir}/rsbackup.conf ]; then
    . ${curdir}/rsbackup.conf
    cd $RS_DIR
else
    echo "Error: unable to load the config file."
    exit 1
fi

# Functions used in this script
check_logs()
{
    local word="${1}"

    cd /var/log/rsbackup

    # Only the last line of each log is checked
    for log in *.log; do
        msg_head="${log}: "
        msg_body="$( tail -1 "${log}" 2>/dev/null | grep "${word}" )"
        if [ "${msg_body}x" != "x" ]; then
            echo ${msg_head}
            echo ${msg_body}
            echo ""
        fi
    done
}


# Main script
case "$1" in
    [Ff][Oo][Rr][Cc][Ee])
        echo "Forcing rsbackup to start"
        ./rsbackup > /dev/null 2>&1 &
        ;;
    [Ss][Tt][Aa][Rr][Tt])
        # Check if any rsync/rsbackup processes are already running, and abort if there are.
        # grep -c prints a bare count, so wc's column padding does not need trimming.
        numrunning=$( pgrep -lf rsbackup | grep -c rsync )

        if [ ${numrunning} -eq 0 ]; then
            echo "Starting rsbackup"
            cd ${RS_DIR}
            ./rsbackup > /dev/null 2>&1 &
        else
            echo "Warning: other rsbackup processes are running. Not starting."
            exitval=2
        fi
        ;;
    [Ss][Tt][Oo][Pp])
        # Check if there are any running rsync/rsbackup processes, and skip the kill if there aren't
        numrunning=$( pgrep -lf rsbackup | grep rsync | grep -vc rsb-wrapper )

        if [ ${numrunning} -gt 0 ]; then
            echo -n "Attempting to forcibly stop rsbackup ... "

            pkill -9 -f rsbackup

            # Give the processes a moment to exit before counting again
            sleep 3
            numrunning=$( pgrep -lf rsbackup | grep rsync | grep -vc rsb-wrapper )
            if [ ${numrunning} -gt 0 ]; then
                echo "ERROR!"
                echo "Unable to stop all processes."
                exitval=1
            else
                echo "done."
                echo ""
                exitval=0
            fi
        else
            echo "No running rsbackup processes. Nothing to stop."
            exitval=0
        fi


        echo "Checking logs for warnings"
        echo "----------------------------------"
        check_logs "warning"

        echo ""

        echo "Checking logs for errors"
        echo "----------------------------------"
        check_logs "error"
        ;;
esac

exit $exitval
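
One limitation worth knowing: check_logs only looks at the final line of each log (tail -1), so a warning followed by any other output is missed. A sketch of a variant that scans the last few lines instead; scan_log and its default of 20 lines are illustrative and not part of rsbackup:

```sh
#!/bin/sh
# Print lines containing WORD (case-insensitive) from the last N lines of LOG.
# Usage: scan_log <logfile> <word> [lines]   (lines defaults to 20)
scan_log() {
    tail -n "${3:-20}" "$1" | grep -i -- "$2"
}

# Example:
# scan_log /var/log/rsbackup/testserver.hostname.log error 50
```

The exit status is grep's, so it can be used directly in an if to decide whether a log is worth reporting.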

phoenix
May 1st, 2009, 00:33
rsb-snapshot
This script just pulls in the central config file, figures out which ZFS filesystem is being used, and creates a snapshot of it. The snapshot is named after the current date, using YYYY-MM-DD as the format.

#!/bin/sh

export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# Get info on how we were called
curdir=$( /usr/bin/dirname ${0} )

# Pull in the config file for rsbackup, which should be in the same directory we are called from
if [ -e ${curdir}/rsbackup.conf ]; then
    . ${curdir}/rsbackup.conf
else
    echo "Error: unable to load the config file."
    exit 1
fi

# Get today's date, formatted as YYYY-MM-DD
today=$( date "+%Y-%m-%d" )

# Remove any leading slashes from storage directory
if [ $( echo $BACKUP_DIR | /usr/bin/cut -c 1 ) = "/" ]; then
    backupdir=$( echo $BACKUP_DIR | /usr/bin/cut -c 2- )
else
    backupdir=$BACKUP_DIR
fi

# Create a snapshot using the date in the name
/sbin/zfs snapshot ${backupdir}@${today}

if [ $? -ne 0 ]; then
    echo "Error: unable to create the snapshot (${backupdir}@${today})."
    exit 1
fi

exit 0
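
The leading-slash handling can also be done with plain parameter expansion instead of echo and cut. A sketch of just that part (the function names are made up); like the script, it assumes the dataset name is the mount path minus the leading slash:

```sh
#!/bin/sh
# Turn a mount path into a dataset name by dropping one leading slash.
# ASSUMPTION: the filesystem is mounted at /<dataset name>.
dataset_from_path() {
    echo "${1#/}"
}

# Build today's snapshot name, e.g. storage/backup@YYYY-MM-DD.
snapshot_name() {
    echo "$(dataset_from_path "$1")@$(date "+%Y-%m-%d")"
}

snapshot_name /storage/backup
```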


rsb-one
This script can be used to do a manual backup of a single server at a single site. Mainly used for testing, but has also come in handy on a couple of occasions when the automatic backup failed.

This pulls in the central config file, but duplicates the rsync command, so one has to keep this file and the rsbackup file in sync. We're planning on amalgamating this into the main rsbackup script.

#!/bin/sh

which="/usr/bin/which"
basename=$( ${which} basename )
dirname=$( ${which} dirname )
scriptname=$( ${basename} ${0} )
scriptdir=$( ${dirname} ${0} )
scriptversion=1.0
sshcmd="/usr/local/bin/ssh "

# Pull in the defaults file
defaults="rsbackup.conf"
if [ -r ${scriptdir}/${defaults} ]; then
    . ${scriptdir}/${defaults}
else
    echo "Error! Main config file doesn't exist."
    exit 1
fi

# Arguments passed to this script are:
# -s sitename this tells the script where to find the site settings
# -h hostname this tells the script which host config file to grab
# Note: -s must come before -h, as the host config is looked up under the site directory
if [ $# -gt 0 ]; then
    while getopts "s:h:" OPTION; do
        case "${OPTION}" in
            "s")
                # Check if site config file exists, and read it in
                if [ -r ${RS_DIR}/sites/${OPTARG}/site_defaults ]; then
                    sitedir=${RS_DIR}/sites/${OPTARG}
                    . ${sitedir}/site_defaults
                else
                    echo "Error! Site directory doesn't exist."
                    exit 1
                fi
                ;;
            "h")
                # Check if host config file exists, and read it in
                if [ -r ${sitedir}/${OPTARG}.cfg ]; then
                    hostconf=${sitedir}/${OPTARG}.cfg
                    . ${hostconf}
                else
                    echo "Error! Host conf file doesn't exist."
                    exit 1
                fi
                ;;
            *)
                # Unknown option: show usage and stop instead of carrying on
                echo "Usage: ${0} -s sitename -h hostname"
                exit 1
                ;;
        esac
    done
else
    echo "No arguments given. Nothing to do."
    echo ""
    echo "Usage: ${0} -s sitename -h hostname"
    exit 1
fi

# Check whether there's a server-specific exclude file needed
if [ -z "$RSYNC_EX_SERVER" ]; then
    RSYNC_EXTRA_EXCLUDE=""
else
    RSYNC_EXTRA_EXCLUDE="--exclude-from=${sitedir}/$RSYNC_EX_SERVER"
fi

# Make sure that the backup directory exists (-p also creates the site directory if needed)
if [ ! -e "$BACKUP_DIR/$SITE_DIR/$SERV_DIR" ]; then
    mkdir -p "$BACKUP_DIR/$SITE_DIR/$SERV_DIR"
fi

# Do the rsync
rsync $RSYNC_OPTIONS $RSYNC_SITE_OPTIONS $RSYNC_SRV_OPTIONS \
    --exclude-from=$RSYNC_EX_DEF $RSYNC_EXTRA_EXCLUDE \
    --rsync-path="$RSYNC_EXEC" --rsh="$sshcmd -p $RSYNC_PORT -i $RSYNC_SSH_KEY" \
    --log-file=/var/log/rsbackup/$RSYNC_SERVER.log \
    $RSYNC_USER@$RSYNC_SERVER:/ $BACKUP_DIR/$SITE_DIR/$SERV_DIR

phoenix
May 1st, 2009, 00:34
Example site_default file
This is the config file that lists defaults for all servers at a specific site, as well as the main directory to use for the backups for all the servers at that site.

#Site wide options
#required
SITE_DIR=site

Example server config file
This is the config file that each remote server would have. It lists any server-specific exclude files to use, the hostname of the server, and the name of the directory to store the backup under (usually named after the server).

# adding an additional exclude file
RSYNC_EX_SERVER=exclude.server

# These 2 are required, and specific to each server
RSYNC_SERVER=server.hostname
SERV_DIR=server

Default exclude file for Linux servers
This is an example of the default exclude file used for all Linux servers.

/sys/*
/proc/*
*mozilla/firefox/*/Cache/**
/var/lib/vservers/vs1/home/*
*/.googleearth/Cache/**
*/.googleearth/Cache/temp/**
/var/spool/squid/**
/backup/*
/var/spool/cups/**
/var/log/**.gz
*/cache/apt/archives/**
/var/lib/vservers/vs1/var/tmp/**
/home/programs/tmp/**
/home/programs/vmware/**
/home/**/.thumbnails/**
/home/**/.java*/deployment/cache/**
/home/**/profile/**
/home/**/.local/Trash/**
/home/**/.macromedia/**

phoenix
May 1st, 2009, 00:41
check-fs
This script monitors the health of the gmirror and the zpool. It runs via cron. If any anomalies are detected, an e-mail is sent with all the details. It's based on the zpool check script run via periodic.

#!/bin/sh

send=0

# Check zpool status
status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
    zpoolmsg="Problems with ZFS: ${status}"
    send=1
fi

# Check gmirror status
status=$( gmirror status )

# Reuse the output captured above instead of running gmirror a second time
if echo "${status}" | grep -q DEGRADED; then
    gmirrormsg="Problems with gmirror: ${status}"
    send=1
fi

# Send status e-mail if needed
if [ "${send}" -eq 1 ]; then
    echo "${zpoolmsg} ${gmirrormsg}" | mail -s "Filesystem Issues on backup server" someone@somewhere.com
fi

exit 0
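
The two checks can be separated from the commands that feed them, which makes the logic easy to try out without a pool or a mirror. A sketch (build_report is a hypothetical helper; the strings it matches are the same ones check-fs looks for):

```sh
#!/bin/sh
# Print one line per problem; print nothing when both checks are clean.
# $1 = output of "zpool status -x", $2 = output of "gmirror status"
build_report() {
    if [ "$1" != "all pools are healthy" ]; then
        echo "Problems with ZFS: $1"
    fi
    case "$2" in
        *DEGRADED*) echo "Problems with gmirror: $2" ;;
    esac
}

# Example wiring (commented out so the sketch is safe to paste anywhere):
# report=$( build_report "$(zpool status -x)" "$(gmirror status)" )
# [ -n "$report" ] && echo "$report" | mail -s "Filesystem Issues on backup server" root
```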

DutchDaemon
May 1st, 2009, 01:11
Could you post some details?

;)

phoenix
May 1st, 2009, 01:27
Filesystem Layout
And finally, here's the directory structure used, to show where the different files go, where the backups go, etc.

/root/rsb/
conf/
rsb-one
rsb-snapshot
rsb-wrapper
rsbackup
rsbackup.conf
sites/

/root/rsb/conf/
exclude.default.bsd
exclude.default.linux
rsbackup.rsa
rsbackup.rsa.pub
server.rs.example
site_default.example
sites.lst

/root/rsb/sites/
site1/
site2/
site3/
site4/

/root/rsb/sites/site1/
exclude.host1
exclude.host2
exclude.host3
host1.cfg
host2.cfg
host3.cfg
host4.cfg
site_defaults

FBSDin20Steps
May 1st, 2009, 02:22
Spam alert!!! ;)

ArtemD
May 1st, 2009, 05:47
Thank you for the great howto (*very* informative btw :)). I was wondering, though, how stable do you find ZFS on FreeBSD? Did you have any issues? How about performance under heavy load?

phoenix
May 1st, 2009, 07:43
The original server setup, using a single 24-drive raidz2 vdev in the storage pool, was not very good. We learnt the hard way that the IOps performance of a raidz vdev is equivalent to that of a single drive. IOW, a 24-drive raidz2 is no faster than a single SATA drive!!

Plus, when you have to replace a drive in the vdev, as we had to, it will thrash all the drives in the raidz vdev ... and thrashing 24 drives 24-hours a day *really* slows things down, usually leading to re-starts of the resilver process. After a week of that, we rebuilt the box using the 3x raidz2 vdevs using 8-drives each. Performance went through the roof after that.

Turns out, the official recommendation from Sun is to use <=10 drives per raidz vdev, preferably 6-8.

The original setup would complete ~60 server backups between 7pm and 7am. We really fiddled with the sleep times between starting the parallel rsyncs, and with the ordering of the sites, but we couldn't really get it much better than ~60 servers in one run.

Moving to the 3x raidz setup, we can complete 102 server backups within 5 hours, leaving plenty of time for extra servers.

We did have to do some manual tuning of various sysctls, and loader tunables. And we switched to using OpenSSH from the ports tree, with the HPN patches (went from ~30 Mbits/sec max network throughput to over 90 Mbits/sec, per SSH connection).

We monitor the server using SNMP, MRTG, and Routers2. Even though we can only poll the 32-bit disk counters every 60 seconds, we average 80 MBytes/sec disk I/O during the backup run, with the odd peak at 120 MBytes/sec. The system is still very responsive to SSH connections, log tailing, and other interactive duties.

We also push the contents of the /storage/backup directory out to a second, identical system, at an off-site location. Takes a little under 4 hours for that. Using a slightly modified rsync script (basically just a for loop through the directories under /storage/backup, with a separate rsync per sub-directory).
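
The shape of that loop, as a sketch that only prints the commands. The remote host name is a placeholder and the option list is an assumption that mirrors part of RSYNC_OPTIONS from rsbackup.conf; it is not the actual script used for the off-site push.

```sh
#!/bin/sh
# Print (not run) one rsync command per sub-directory of BACKUP_DIR.
# ASSUMPTIONS: remote host and rsync options are placeholders.
# Usage: print_offsite_cmds <backup-dir> <remote-host>
print_offsite_cmds() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        echo "rsync --archive --hard-links --numeric-ids --delete-during $dir $2:$1/$name/"
    done
}

# Example:
# print_offsite_cmds /storage/backup offsite.example.org
```

One rsync per sub-directory keeps each file list small and lets a failed site be retried on its own.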

The kicker to all this: ~$10,000 CDN for each storage server!! And we're working on a method to automate backups for the few Windows stations we still have (also using ssh and rsync).

Another school district in the province spent over $250,000 CDN for their backup setup, with less storage space, a lot more administrative overhead, and more physical servers. Without off-site redundancy. :) Sometimes, I really like working with FreeBSD and Linux systems!!

vivek
May 2nd, 2009, 02:32
Nice. I've also built a server but without ZFS. I run rsnapshot and shell scripts to back up 3 MySQL servers and 5 webservers. I'm using 4x 1 TB hard disks in RAID 10. We make a full backup to tape.

Your setup is awesome. Were you able to run any disk I/O tests? If so, could you paste your results?

TIA.

phoenix
May 2nd, 2009, 06:05
I did, way back when we first started, but didn't keep them (had nothing to compare them to). Just simple dd runs, so nothing really useful.

Any suggestions on disk benchmarks to run?

vivek
May 2nd, 2009, 08:36
benchmarks/bonnie++/

OR
benchmarks/iozone/

You can create graphs from the data later, too.

phoenix
June 9th, 2009, 19:04
I ran some iozone benchmarks on one of the servers. Created a new ZFS filesystem, with all the default settings (noatime off, compression off).

The iozone commands used:
iozone -M -e -+u -T -t <threads> -r 128k -s 40960 -i 0 -i 1 -i 2 -i 8 -+p 70 -C
I ran the command using 32, 64, 128, and 256 for <threads>

Write speeds range from 236 MBytes/sec to 582 MBytes/sec for sequential; and from 242 MBytes/sec to 550 MBytes/sec for random.

Read speeds range from 3.3 GBytes/sec to 5.5 GBytes/sec for sequential; and from 1.8 GBytes/sec to 5.5 GBytes/sec for random.

All the gory details are below.


32-threads: Children see ... 32 initial writers = 582468.13 KB/sec
32-threads: Parent sees ... 32 initial writers = 108808.46 KB/sec
64-threads: Children see ... 64 initial writers = 236144.47 KB/sec
64-threads: Parent sees ... 64 initial writers = 86942.94 KB/sec
128-threads: Children see ... 128 initial writers = 284706.68 KB/sec
128-threads: Parent sees ... 128 initial writers = 10850.40 KB/sec
256-threads: Children see ... 256 initial writers = 258260.59 KB/sec
256-threads: Parent sees ... 256 initial writers = 9882.16 KB/sec

32-threads: Children see ... 32 rewriters = 545347.52 KB/sec
32-threads: Parent sees ... 32 rewriters = 339308.08 KB/sec
64-threads: Children see ... 64 rewriters = 419838.51 KB/sec
64-threads: Parent sees ... 64 rewriters = 335620.45 KB/sec
128-threads: Children see ... 128 rewriters = 350668.51 KB/sec
128-threads: Parent sees ... 128 rewriters = 319452.97 KB/sec
256-threads: Children see ... 256 rewriters = 317751.52 KB/sec
256-threads: Parent sees ... 256 rewriters = 295579.66 KB/sec

32-threads: Children see ... 32 random writers = 379256.37 KB/sec
32-threads: Parent sees ... 32 random writers = 95298.44 KB/sec
64-threads: Children see ... 64 random writers = 551767.68 KB/sec
64-threads: Parent sees ... 64 random writers = 113397.95 KB/sec
128-threads: Children see ... 128 random writers = 241980.60 KB/sec
128-threads: Parent sees ... 128 random writers = 74584.01 KB/sec
256-threads: Children see ... 256 random writers = 398427.84 KB/sec
256-threads: Parent sees ... 256 random writers = 20219.56 KB/sec

32-threads: Children see ... 32 readers = 5023742.86 KB/sec
32-threads: Parent sees ... 32 readers = 4661309.72 KB/sec
64-threads: Children see ... 64 readers = 5516460.71 KB/sec
64-threads: Parent sees ... 64 readers = 3949337.61 KB/sec
128-threads: Children see ... 128 readers = 4748635.74 KB/sec
128-threads: Parent sees ... 128 readers = 3208982.03 KB/sec
256-threads: Children see ... 256 readers = 4358453.38 KB/sec
256-threads: Parent sees ... 256 readers = 2741593.08 KB/sec

32-threads: Children see ... 32 re-readers = 5502926.62 KB/sec
32-threads: Parent sees ... 32 re-readers = 4650327.75 KB/sec
64-threads: Children see ... 64 re-readers = 5509400.02 KB/sec
64-threads: Parent sees ... 64 re-readers = 4526444.40 KB/sec
128-threads: Children see ... 128 re-readers = 4072363.55 KB/sec
128-threads: Parent sees ... 128 re-readers = 2840317.47 KB/sec
256-threads: Children see ... 256 re-readers = 3329375.95 KB/sec
256-threads: Parent sees ... 256 re-readers = 2183894.33 KB/sec

32-threads: Children see ... 32 random readers = 5555090.45 KB/sec
32-threads: Parent sees ... 32 random readers = 4602383.62 KB/sec
64-threads: Children see ... 64 random readers = 4402270.77 KB/sec
64-threads: Parent sees ... 64 random readers = 2059081.52 KB/sec
128-threads: Children see ... 128 random readers = 3070466.93 KB/sec
128-threads: Parent sees ... 128 random readers = 525076.11 KB/sec
256-threads: Children see ... 256 random readers = 1888676.12 KB/sec
256-threads: Parent sees ... 256 random readers = 293304.53 KB/sec

32-threads: Children see ... 32 mixed workload = 3130000.18 KB/sec
32-threads: Parent sees ... 32 mixed workload = 123281.78 KB/sec
64-threads: Children see ... 64 mixed workload = 1587053.33 KB/sec
64-threads: Parent sees ... 64 mixed workload = 294586.82 KB/sec
128-threads: Children see ... 128 mixed workload = 807349.95 KB/sec
128-threads: Parent sees ... 128 mixed workload = 98998.77 KB/sec
256-threads: Children see ... 256 mixed workload = 393469.55 KB/sec
256-threads: Parent sees ... 256 mixed workload = 112394.90 KB/sec

vivek
June 11th, 2009, 22:46
Thanks. You got impressive disk I/O :e

jyavenard
August 1st, 2009, 02:52
Hi

Fantastic posts, I wish I had found this thread earlier. I created a similar setting, though using higher capacity drives.

One note however, the iozone benchmarks are useless here, especially the read speed.
All it is showing is that the data is in RAM or CPU cache...

A more valid test would be:
iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>

size needs to be at least twice the amount of RAM, rounded up to a power of 2 (for more accuracy). E.g. with 6 GB of RAM, use size = 16g

testfile is the path to the file on the disk to test...
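
The sizing rule above (at least twice the RAM, rounded up to a power of 2) is simple enough to sketch. The function name is made up for illustration.

```python
def iozone_max_size_gb(ram_gb):
    """Smallest power of 2 (in GB) that is at least twice the RAM size."""
    target = 2 * ram_gb
    size = 1
    while size < target:
        size *= 2
    return size


# The example from the post: 6 GB of RAM gives -g 16g
print(f"-g {iozone_max_size_gb(6)}g")
```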

There's physically no way a RAID array with 24 disks, each having a physical limit of around 100MB/s, could achieve over 5GB/s reading. Heck, that's more than what the PCI-e slot can carry!
The 3Ware 9650 is a PCI-e 1.0 card with 8 lanes; a PCI-e 1.0 8X port can carry 2GB/s maximum...
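
The arithmetic behind that argument, as a quick sketch. The 250 MB/s per lane figure is the usual PCI-e 1.0 number (2.5 GT/s with 8b/10b encoding, per direction); the 100 MB/s per disk is the estimate from the post, and the "reported" value is the highest "Children see" read figure from the earlier iozone run, treated loosely as decimal KB.

```python
PCIE1_LANE_MB_S = 250      # PCI-e 1.0, per lane, per direction
LANES = 8
DISKS = 24
DISK_MB_S = 100            # per-disk estimate from the post

pcie_x8_mb_s = LANES * PCIE1_LANE_MB_S       # ceiling of the 8X slot
disks_mb_s = DISKS * DISK_MB_S               # ceiling of all 24 spindles together
reported_mb_s = 5555090.45 / 1000            # best read result, KB/sec -> MB/sec

print("PCI-e 8X ceiling:", pcie_x8_mb_s, "MB/s")
print("24-disk ceiling: ", disks_mb_s, "MB/s")
print("iozone reported: ", round(reported_mb_s), "MB/s")
print("served from cache:", reported_mb_s > max(pcie_x8_mb_s, disks_mb_s))
```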

I tested a RAIDZ setup with 6 x 2TB (Western Digital RE4 drives) and achieved 280MB/s write and 320MB/s read. Which is OK (faster than what the dual-NIC could output), but not exceptionally great.

A Linux box with an E8200 (2.6GHz) dual core, 2GB of RAM, and 5 x 1.5TB consumer-level drives achieved 270MB/s write but 455MB/s read with md ...

User23
August 4th, 2009, 16:08
Nice Setup & Howto :)

I have a question about your settings on the 3Ware controller.

Did you have the WriteCache enabled in the 3dm2 web interface?

I got very poor throughput without WriteCache enabled on 7.2-RELEASE amd64
on different machines with different 3ware controllers. And it looks like I'm not the only person who hit that bug(?).

twa, 3ware performance issue (http://forums.freebsd.org/showthread.php?t=3743)

thx & best regards

phoenix
August 4th, 2009, 20:39
Yes, we have the write cache enabled on the controller, and use the performance profile for each of the disks. This gives us a nice, fast, "2nd level" cache (disk cache -> controller cache -> ZFS ARC), and it allows the controller to re-order writes to the drives, as needed.

Makes it a bit more intelligent than a plain JBOD setup, where the controller would be just a dumb SATA controller.

phoenix
August 4th, 2009, 20:42
One note however, the iozone benchmarks are useless here, especially the read speed.
All it is showing is that the data is in RAM or CPU cache...

It's not useless, considering the bulk of our data will be in the ARC, and it's mostly reads to compare the data to what's on the remote servers. Plus, the transfer from one backup server to the other is all reads. Also, since the servers are on UPSes, and the RAID controllers have batteries, all the caches are configured as write-back, so as soon as data hits one of the caches, it's considered "written to disk".

A more valid test would be:
iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>

I'll see if I can run the above, for comparison. However, I'm off on holidays starting tomorrow, so it won't be until after the 13th that I'll be able to try this.

SuperMiguel
August 11th, 2009, 17:32
Why did you install your system on 2 CF cards instead of on a small HD? Speed?

phoenix
August 13th, 2009, 17:53
We wanted to maximise the use of all 24 drive bays for data storage.

We didn't want to have to partition one of the drives to make room for the OS, we didn't want to dedicate an entire 500 GB drive to the OS, and we didn't have any extra internal drive bays that could be used.

Thus, we used small (2 GB and 4 GB) CompactFlash drives for the OS install (uses less than 2 GB for / and /usr), and used all 24 drives for data storage. These were small enough that they could be attached to the inside of the case.

saxon3049
August 20th, 2009, 02:39
Very useful, I already linked to it on another forum and it's provided me with a new solution to try out.

confusion
August 31st, 2009, 14:26
This is an interesting write-up. Is there an advantage of using zfs snapshots vs. rsnapshot?

phoenix
August 31st, 2009, 20:29
rsnapshot uses hardlinks and directories on standard filesystems. We looked into doing this originally, but managing the hardlinks and directories and what-not was not fun.

ZFS snapshots are internal to the filesystem. They are accessible at any time via the /<path>/.zfs/snapshot/<snap name>/ directory. And you get all the added bonuses of ZFS (compression, pooled storage, easy admin, etc).

We looked at a lot of different remote backup tools, especially ones that use rsync, and even tried coming up with some custom stuff using hardlinks, squashfs, other compressed filesystems, LVM, etc and just could not find a storage stack that was usable and simple. :)

Then ZFS was imported into FreeBSD (we're a Debian Linux shop, but we use FreeBSD on the firewalls, so getting a FreeBSD storage box was not a hard sell). And the rest is history.

phoenix
September 11th, 2009, 22:35
A more valid test would be:
iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>

Test is still running, but here's some preliminary results:
                                                       random   random
     KB   reclen    write  rewrite     read   reread     read    write
     64       64   781660   985384  1772821  3758700  3559345  2210854
    128      128   810472  2515594  3043191  4280683  4116568  2775716
    256      256  1019294  1741907  2327052  3762009  3610222  2051407
    512      512  1062365  1166212  1816181   530057  2188145  1236732
   1024     1024   812607  1173034  1218977  1190593  1190593  1022996
   2048     2048   423683   977971  1390538  1392341  1360799  1142823
   4096     4096   888472  1130223  1386544  1382416  1382861  1098427
   8192     8192   688925  1068884  1152028  1176646  1178219  1027706
  16384    16384   872934  1051746  1160754  1109570  1164057   993998
  32768    16384   663842  1018810  1254801  1227216  1249020   976341
  65536    16384   926499  1079457  1099078  1214551  1287518  1057221
 131072    16384   568620  1043002   829028  1156366   771242  1012178
 262144    16384   893503  1046252  1225213  1180366  1139746  1023720
 524288    16384   173176   268374  1126224  1030217  1101103   259177
1048576    16384   163071   231045   279858    37752   266438   200727


So you can see that it's ranging between ~200 MBps and 1 GBps for writes and between 300 MBps and 3 GBps for reads.

Once the test completes, I'll post the full results.
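
For anyone wanting to check the quoted ranges against the preliminary numbers, here is a small sketch that pulls the min and max from the sequential write and read columns. The values are copied from the table above; iozone reports KB/sec, and dividing by 1024 to get MB/sec is a convention choice on my part.

```python
# Sequential "write" and "read" columns from the preliminary iozone table, in KB/sec.
WRITE_KB_S = [781660, 810472, 1019294, 1062365, 812607, 423683, 888472, 688925,
              872934, 663842, 926499, 568620, 893503, 173176, 163071]
READ_KB_S = [1772821, 3043191, 2327052, 1816181, 1218977, 1390538, 1386544, 1152028,
             1160754, 1254801, 1099078, 829028, 1225213, 1126224, 279858]


def range_mb_s(values_kb_s):
    """Return (lowest, highest) throughput in MB/sec for a column of KB/sec values."""
    return min(values_kb_s) / 1024, max(values_kb_s) / 1024


for label, column in (("write", WRITE_KB_S), ("read", READ_KB_S)):
    low, high = range_mb_s(column)
    print(f"{label}: {low:.0f} to {high:.0f} MB/sec")
```

The low end of each range comes from the largest file sizes, where caching helps least.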

jyavenard
September 12th, 2009, 22:37
Test is still running, but here's some preliminary results:
                                                       random   random
     KB   reclen    write  rewrite     read   reread     read    write
     64       64   781660   985384  1772821  3758700  3559345  2210854
    128      128   810472  2515594  3043191  4280683  4116568  2775716
    256      256  1019294  1741907  2327052  3762009  3610222  2051407
    512      512  1062365  1166212  1816181   530057  2188145  1236732
   1024     1024   812607  1173034  1218977  1190593  1190593  1022996
   2048     2048   423683   977971  1390538  1392341  1360799  1142823
   4096     4096   888472  1130223  1386544  1382416  1382861  1098427
   8192     8192   688925  1068884  1152028  1176646  1178219  1027706
  16384    16384   872934  1051746  1160754  1109570  1164057   993998
  32768    16384   663842  1018810  1254801  1227216  1249020   976341
  65536    16384   926499  1079457  1099078  1214551  1287518  1057221
 131072    16384   568620  1043002   829028  1156366   771242  1012178
 262144    16384   893503  1046252  1225213  1180366  1139746  1023720
 524288    16384   173176   268374  1126224  1030217  1101103   259177
1048576    16384   163071   231045   279858    37752   266438   200727


So you can see that it's ranging between ~200 MBps and 1 GBps for writes and between 300 MBps and 3 GBps for reads.

Once the test completes, I'll post the full results.

200MB/s and 300MB/s are the only values that actually mean something in your setup. Given the kind of data you are writing (mirroring external machines), the cache effect is irrelevant.

It's surprising that you are only achieving 200MB/s write given the number of disks you are using. I get the same speeds with only 6 disks.

But don't quote that you get 3GB/s read. It's nonsense when performing disk benchmarks.

ttsiod
September 18th, 2009, 13:14
I am backing up Windows machines using rsync to our OpenSolaris/ZFS server. Here is how I do it:

http://users.softlab.ntua.gr/~ttsiod/win32backup.html

phoenix
September 18th, 2009, 17:19
Yeah, there are a lot of different methods to run rsync on the Windows machine (I personally prefer the rsync.net backup agent, which supports SSH). However, they are all client solutions. I've yet to find a server-hosted solution for this. We'd prefer to keep all the backup configuration on the server. Makes it easier to schedule and manage the network/disk load.

There are SSH daemons for Windows, and there are rsync apps for Windows. But I have yet to find a pair that will allow:
server to connect via SSH
server to initiate the rsync program on the client
client connect back through the SSH tunnel to push the data to the server

If we could find that, then we could backup everything, via a single set of config files on the backup server.

gene
September 19th, 2009, 06:12
There are SSH daemons for Windows, and there are rsync apps for Windows. But I have yet to find a pair that will allow:
server to connect via SSH
server to initiate the rsync program on the client
client connect back through the SSH tunnel to push the data to the server

Install cygwin with its openssh and rsync packages then run 'ssh-host-config'. It should set up everything needed to make sshd a windows service.

Once you have cygwin installed you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.

gene
September 19th, 2009, 10:16
Thank you very much for posting all of this information. I've been planning on doing something very similar and it's great to see someone else accomplishing it at a much larger scale than I'm planning.

Are you backing up any databases? If so how are you doing it?

To hazard a guess, if you have MySQL databases, I'm thinking you are either locking all the tables and then directly copying the contents of '/var/mysql' (or wherever the databases are stored on the file system), or dumping the contents of the databases to flat files before doing the rsync.

What is your retention policy? I see that you were hoping for 13 months, but was that based off an SLA?

When you do hit the ~500 day mark, I'm guessing you will be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, so that you still have access to data that was backed up more than 500 days prior and still have space for future backups? The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.

Did you consider using the larger Chenbro chassis (50 bay) instead?

And just to satiate the geek in me, do you have any pictures of the servers?

phoenix
September 21st, 2009, 08:22
Are you backing up any databases? If so how are you doing it?

Yes. MySQL databases. We dump the databases to text files, and then rsync both those and the db directory as part of the rsync process. We've done recoveries using both the dumps and the binary files.

What is your retention policy? I see that you were hoping for 13 months, but was that based off an SLA?

We're aiming for 13 months. It looks like we'll have to move to larger harddrives before the first year is out, though. Using 500 GB drives, we only have 2 TB of disk space left. 1 TB drives are coming down in price, though, and the issues with them appear to be solved.

When you do hit the ~500 day mark, I'm guessing you will be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, so that you still have access to data that was backed up more than 500 days prior and still have space for future backups? The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.

Yes, that is one possibility we are looking at. Keeping the backups from the 7th, 14th, 21st, and 28th of each month, starting on the 14th month. And then keeping those for an extra year.
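
One way to read that thinning policy, as a rough sketch. This is only an interpretation: everything up to 13 months old is kept, from the 14th month on only snapshots taken on the 7th, 14th, 21st, and 28th survive, and those go away after one more year. The function names are made up for illustration.

```python
from datetime import date

KEEP_DAYS = (7, 14, 21, 28)        # days of the month named in the post
FULL_RETENTION_MONTHS = 13         # keep every daily snapshot this long
EXTENDED_RETENTION_MONTHS = 25     # 13 months plus "an extra year" (my reading)


def months_between(older, newer):
    """Whole calendar months from older to newer, ignoring the day of month."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def keep_snapshot(snap_date, today):
    """Decide whether a snapshot taken on snap_date should still exist today."""
    age = months_between(snap_date, today)
    if age <= FULL_RETENTION_MONTHS:
        return True
    if age <= EXTENDED_RETENTION_MONTHS:
        return snap_date.day in KEEP_DAYS
    return False


today = date(2010, 9, 21)
for snap in (date(2010, 1, 3), date(2009, 6, 14), date(2009, 6, 15), date(2008, 1, 7)):
    print(snap, "keep" if keep_snapshot(snap, today) else "destroy")
```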

Did you consider using the larger Chenbro chassis (50 bay) instead?

We didn't know about the 50-bay cases until after we had things installed and working.

We re-purposed servers for this. The 5U mega-servers were originally purchased to act as Xen/KVM/VMWare hosts. Then we realised that CPU and RAM are more important for VM hosts than disk space. So these became the backup servers. And the other 5U servers will become storage servers for the VM hosts (which will probably be net-booted 1U or 2U systems with gobs of CPU and RAM).

And just to satiate the geek in me, do you have any pictures of the servers?

Not currently, no.

gene
September 21st, 2009, 17:00
Are all of your drives the same make and model?

phoenix
September 21st, 2009, 19:13
No. We use 12 Seagate drives, and 12 Western Digital drives. Bought in four batches of 6 drives each, to try and minimise the "all from the same manufacturing batch" issue (would really suck if they all died at the same time). A pair of the drives have been replaced already with newer WD drives.

gene
September 28th, 2009, 21:41
Install cygwin with its openssh and rsync packages then run 'ssh-host-config'. It should set up everything needed to make sshd a windows service.

Once you have cygwin installed you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.

Have you given this a try? I've done it with 2003 server successfully, and just over the weekend did it with an XP box with success.

phoenix
September 28th, 2009, 23:25
I'm in the process of testing it.

It's going to require making some (possibly massive) changes to our backup script. For example, there's no sudo in cygwin.

I've got it working manually. Now to figure out how to automate it, and to test a system restore. And to figure out what needs to be added to the exclude file. :)

spork
February 6th, 2010, 08:10
Quick question for the author... I'm using your set of scripts as a starting point because it all seems pretty sane. Once I'm happy with it, I'll probably change things up a bit.

I'm having one bizarre issue that I can't track down though... My backups box has much more storage than all the machines it's backing up, so I have not been paying much attention to the space used over the last few weeks. As I was copying some things off, I realized that there's more data than I'd expect on the backups server. After poking around a bit I found that rsync is simply not deleting files. I see the "--delete-during" option in the script, also tried plain old "--delete" with the same result.

Any ideas? I see people with similar problems when they are working from a file list or with wildcards, but the only wildcards I've got are in my exclude lists...

I'm totally stumped by this one.

jb_fvwm2
February 6th, 2010, 14:09
--delete-after ?? That fixed something in rsync here, maybe it would fix it...

spork
February 7th, 2010, 09:53
Nope... It grew some more tonight with "--delete-after" as well. Seems like a common problem with rsync, I'll have to figure out how to step through what the script is doing but scale things down enough so I can see what's happening.

spork
February 18th, 2010, 21:47
Almost there... Since the boxes are active while they are being backed-up, rsync throws errors here and there about files disappearing and the like, which is fairly normal.

What I did not know is that rsync skips ALL file deletion operations if it encounters ANY errors. There's an "--ignore-errors" flag, but it's a bit blunt - it ignores any errors, which could be problematic. I have a query out about this on the rsync list.

So if you're using this script, or a similar method, you might want to look for this line in your backup logs:

2010/02/18 01:42:30 [75398] IO error encountered -- skipping file deletion

That does not refer to a single file, that means NO files were deleted in the entire run.
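
A quick way to check logs for this is to scan for that message. A minimal sketch, assuming the logs are plain text with one rsync message per line; the log file path in the usage note is hypothetical.

```python
MARKER = "IO error encountered -- skipping file deletion"


def skipped_deletion_lines(lines):
    """Return every log line showing that rsync skipped its deletion pass."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


sample = [
    "2010/02/18 01:40:02 [75398] building file list\n",
    "2010/02/18 01:42:30 [75398] IO error encountered -- skipping file deletion\n",
]
for hit in skipped_deletion_lines(sample):
    print(hit)
```

Against a real log, something like `with open("/var/log/rsbackup/somehost.log") as f: skipped_deletion_lines(f)` would do the same thing.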

phoenix
February 19th, 2010, 00:04
Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?

spork
February 19th, 2010, 00:58
Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?

Yep, --delete-during was the initial option I used. The number of errors is small, and they all give a "bad file descriptor (9)" error, which I think rsync feels is a "really bad" error compared to the normal "file disappeared" type errors. Googling around on the "bad file descriptor" error gives me lots of hits on problems with smbfs mounts, but not much else (and I have no smbfs mounts).

I'll try a run with "--ignore-errors" tonight and see what happens. Not an optimal solution, but a good stopgap.

dennylin93
June 13th, 2010, 12:42
/sys/*
/proc/*
*mozilla/firefox/*/Cache/**
/var/lib/vservers/vs1/home/*
*/.googleearth/Cache/**
*/.googleearth/Cache/temp/**
/var/spool/squid/**
/backup/*
/var/spool/cups/**
/var/log/**.gz
*/cache/apt/archives/**
/var/lib/vservers/vs1/var/tmp/**
/home/programs/tmp/**
/home/programs/vmware/**
/home/**/.thumbnails/**
/home/**/.java*/deployment/cache/**
/home/**/profile/**
/home/**/.local/Trash/**
/home/**/.macromedia/**


I'm wondering why both "*" and "**" are used at the end. Is there any particular reason, since "*" seems to be sufficient?

phoenix
June 16th, 2010, 04:06
This file just grew organically, with three of us adding to it, so some things have one *, and others have two. No real reason beyond that, I don't think.

I believe the ** in the middle of a path is important, though.

The globbing/regex stuff in rsync is confusing, to say the least.
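
To illustrate the difference between the two wildcards, here is a simplified sketch and not rsync's real matcher. It only handles patterns anchored with a leading "/", and only the "*", "**" and "?" wildcards (no character classes, no trailing-slash rules). In rsync, "*" stops at slashes while "**" crosses them.

```python
import re


def rsync_glob_to_regex(pattern):
    """Translate a simplified, '/'-anchored rsync pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")          # '**' matches anything, slashes included
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")       # '*' stops at a slash
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")        # '?' is one character other than a slash
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def matches(pattern, path):
    return rsync_glob_to_regex(pattern).match(path) is not None


print(matches("/var/log/*.gz", "/var/log/apache/access.0.gz"))    # '*' does not cross '/'
print(matches("/var/log/**.gz", "/var/log/apache/access.0.gz"))   # '**' does
print(matches("/sys/*", "/sys/kernel"))
```

For a trailing wildcard such as /sys/*, a single "*" should be enough in practice, because rsync does not descend into a directory once that directory itself has been excluded. In the middle of a path, or in front of a suffix like .gz, the two behave differently.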

dennylin93
June 16th, 2010, 07:28
The globbing/regex stuff in rsync is confusing, to say the least.


I can't agree more :). I actually had to run test cases to understand the man page better.