Other Splitting up directory trees into fixed-size parts?

cracauer@

Developer
This is probably a very common problem, but I've never seen an easy solution.

Let's say I have many TB of primary storage and I want it backed up in fixed-sized chunks dictated by media size. Let's say 800 GB for a tape or 3 TB for a USB harddisk.

Traditionally people let a single tar "run over" and continue on the next medium, within the same tarfile. Obviously this has some disadvantages:
  • If one medium breaks then you can't access the subsequent ones anymore
  • It takes a huge amount of time, and any interruption to the tar process (reboot etc.) makes you start over
  • Even if everything goes right - the moment you want to access a file on the backup you have to wade through the previous media

There must be software out there that splits a big file tree into lists of files, with the total size of the files in each list fitting the medium. But I don't see any such software.

I'm just short of creating many more ZFS filesystems, each the size of the backup medium. But that's stupid as the size of the medium can change and you can't automate what happens when one filesystem grows.
 
With tar you should be able to recover data from the second and later media (not if compressed).
You will lose the file that crosses the media boundary, but that's it.
 
Note that Plan 9's venti (also available in devel/plan9port) allows you to specify arenas of fixed size. See also http://doc.cat-v.org/plan_9/4th_edition/papers/venti/ although they are talking about incremental backups. Since all data is content addressable, only one copy of a block is stored. Not sure if this is what you want, but another option is to try something like misc/perkeep. If interested, check out Brad Fitzpatrick & Mathieu Lonjaret's talk here:
View: https://www.youtube.com/watch?v=PlAU_da_U4s


Another option may be to simply take the output of "find ." and split the files into groups of roughly similar total size, then back each group up. You can make it smarter by keeping a sha256 checksum of all files and backing up only the changed files.
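The checksum part of this idea can be sketched roughly as follows. This is only an illustration: the function names `make_manifest` and `changed_files` are made up, it assumes either `sha256sum` or FreeBSD's `sha256 -r` is installed, and it does not handle file names containing newlines.

```shell
#!/bin/sh
# make_manifest DIR: print one "hash path" line per regular file under DIR.
make_manifest() {
    if command -v sha256sum >/dev/null 2>&1; then
        find "$1" -type f -exec sha256sum {} +
    else
        # Assumption: FreeBSD's sha256(1), whose -r flag prints "hash path".
        find "$1" -type f -exec sha256 -r {} +
    fi
}

# changed_files OLD NEW: print the paths whose manifest line in NEW does not
# appear in OLD, i.e. files that are new or whose content changed.
changed_files() {
    awk '
        FILENAME == ARGV[1] { seen[$0] = 1; next }
        !($0 in seen)       { sub(/^[0-9a-f]+ [ *]?/, ""); print }
    ' "$1" "$2"
}
```

On the first run there is no old manifest yet, so `/dev/null` can stand in for OLD and every file is reported. The resulting list could then be fed to tar through `--files-from`.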
 
I know about port sysutils/fusefs-mhddfs but I have never used it.
Code:
mhddfs - Multi HDD [FUSE] File System

File system for unifying several mount points into one
This FUSE-based file system allows mount points (or directories) to be
combined, simulating a single big volume which can merge several hard
drives or remote file systems. It is like unionfs, but can choose the
drive with the most free space to create new files on, and can move
data transparently between drives.

If it is possible to have all the USB HDDs connected at the same time:
Combine the filesystems on the USB HDDs into a single one using mhddfs, and back up onto that.
mhddfs should distribute your files across the underlying filesystems.
But it may not work if your source files are too large.

Another idea is:
Create a few files of the required sizes. Attach them using mdconfig and make a filesystem on each md device.
Combine all the md filesystems into one using mhddfs.
Sync your files to the resulting mhddfs filesystem.
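A dry-run sketch of the file-backed variant, assuming FreeBSD's truncate(1), mdconfig(8) and newfs(8) plus the mhddfs mount syntax `mhddfs dir1,dir2 mountpoint`. The function name `plan_md_pool`, the `chunkN` names and the choice of md unit numbers starting at 0 are made up for illustration. It only prints the commands so they can be reviewed before anything is run as root.

```shell
#!/bin/sh
# plan_md_pool SIZE COUNT DIR
# Dry run: prints the commands for COUNT file-backed md devices of SIZE each,
# joined into one mhddfs mount. Nothing is created or mounted.
plan_md_pool() {
    size=$1 count=$2 dir=$3
    members=""
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "truncate -s $size $dir/chunk$i.img"
        # Assumption: md unit $i is free; drop -u to let mdconfig pick one.
        echo "mdconfig -a -t vnode -f $dir/chunk$i.img -u $i"
        echo "newfs -U /dev/md$i"
        echo "mkdir -p $dir/chunk$i && mount /dev/md$i $dir/chunk$i"
        members="${members:+$members,}$dir/chunk$i"
        i=$((i + 1))
    done
    echo "mkdir -p $dir/pool && mhddfs $members $dir/pool"
}
```

For example, `plan_md_pool 3T 2 /backup` prints the steps for two 3 TB image files. Each image file could later be copied to one medium, but whether mhddfs splits the data the way you want is untested here.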
 
It's a tricky problem because any directory may contain files which will overflow the capacity of the media. Hence, in the most general case:
  • for sufficiently large directories, you have to be able to write the contents of a directory over multiple media sets; and
  • for sufficiently large files, you have to be able to write the contents of a single file over multiple media sets.
Given these fundamental requirements, traditional solutions just create a single data stream and break it up onto multiple pieces of media.

It's not really worthwhile trying to solve less general problems, as there is every chance of failure.

Restore, tar, and cpio got better over time at dealing with lost/corrupted media by storing extra metadata on each piece of media.
 
I'm willing to restrict myself to:
- no files larger than media size
- wasting space at the end of a medium rather than writing a partial file, as long as that file appears in full on the next one
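Under those two restrictions the split reduces to a simple next-fit pass over a size-annotated file list. A minimal sketch, assuming input lines of the form `size<TAB>path` (GNU find can produce them with `-printf '%s\t%p\n'`; FreeBSD's `stat -f '%z%t%N'` should give the same shape). The function name `split_lists` and the `PREFIX.N` naming are made up for illustration, and file names containing newlines are not handled.

```shell
#!/bin/sh
# split_lists CAPACITY_BYTES PREFIX
# Reads "size<TAB>path" lines on stdin and writes PREFIX.1, PREFIX.2, ...
# Next-fit: a file that does not fit in the remaining space starts the next
# list, so the tail of the current medium is simply left unused.
split_lists() {
    awk -F '\t' -v cap="$1" -v prefix="$2" '
        BEGIN { n = 1; used = 0 }
        {
            size = $1 + 0
            path = substr($0, length($1) + 2)   # everything after the first tab
            if (size > cap) {
                # Restriction from the post above: no file exceeds the medium.
                print "skipped (larger than medium): " path > "/dev/stderr"
                next
            }
            if (used + size > cap) { close(prefix "." n); n++; used = 0 }
            print path > (prefix "." n)
            used += size
        }
    '
}
```

Each list can then be archived on its own, for instance with `tar -cf /dev/sa0 -T list.1`, so every medium holds an independent archive. The sketch counts file sizes only, so the capacity passed in should leave some headroom for archive headers and padding. Stale list files from an earlier run with more media are not cleaned up.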
 
Does mhddfs (sysutils/fusefs-mhddfs, suggested above) later allow retrieving files if only one of the hard drives is available?
 
this will reach the goal but it has an efficiency problem
find /place/to/backup/ -type f|tee /dev/stdout|sort |tar --files-from - -cf something.tar :)

You can hack something together with GNU tar, which can run a command before starting the next volume, but it does not look very robust.
Code:
titus@ubuntu:~$ rm files_to_redo.txt
titus@ubuntu:~$ rm *.tar
titus@ubuntu:~$ tar -M -L 50m -F ./b.sh -cvf aaa-tar1.tar /usr/bin/ >u
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
titus@ubuntu:~$ ls -l *tar
-rw-rw-r-- 1 titus titus 52428800 Mar 21 16:43 aaa-tar1.tar
-rw-rw-r-- 1 titus titus 52428800 Mar 21 16:43 aaa-tar2.tar
-rw-rw-r-- 1 titus titus 13578240 Mar 21 16:43 aaa-tar3.tar
titus@ubuntu:~$ cat files_to_redo.txt
/usr/bin/x86_64-linux-gnu-as
/usr/bin/mc


titus@ubuntu:~$ cat b.sh
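#!/bin/bash
# Volume-change script, run by GNU tar (-F) each time a volume fills up.

# The last line of tar's verbose output (redirected to ./u) is taken to be
# the file that straddles the volume boundary; remember it for a redo.
REDO=$(tail -n 1 u)
echo "$REDO" >>files_to_redo.txt
# Tell tar the name of the next volume through the descriptor in TAR_FD.
echo "aaa-tar$TAR_VOLUME.tar" >&$TAR_FD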
 
I have not used it myself, but I read a lot of articles about it while looking for a solution to a similar task.
As for retrieving files when only one of the hard drives is available: mhddfs uses ordinary filesystems as storage, so you should be able to read from a single detached drive any files that are physically stored on that drive.
 