Solved: Frequent incremental ZFS backup problems

Same here, I run ZFS and love it, using it for boot environments and data snapshots. However, I'm also using unison (rsync) to copy
all important data to an additional UFS disk, and rsync both to a different computer and to an external disk that is not stored at
my house. I feel quite comfortable with that; I think all hell would have to break loose for me to lose my important data.

Of course I understand that my solution is far away from the requirements of, say, a government tax department.
It really makes sense.
You are in my "line of procedure".
Eh, I think it’s really splitting hairs to say snapshots aren’t backups.

It all comes down to what failure modes you can recover from.

“Oops, I deleted the file” -> restore from local snapshot
“My hard drive failed” -> restore pool from local backup server
“My house burned down” -> restore pool from remote backup server
“ZFS failed me” -> restore from a machine running a different file system.
All of them: restore from a zpaq backup
or hb folder :)
 
Some people have billions of $ to spend on backups, and teams of hundreds of software engineers designing / building / maintaining backup systems. They also really understand what the needs for backups are. They do detailed risk assessment (how likely is a lightning strike in Madagascar happening at the same time as a flood in Timbuktu).
I do exactly this kind of work, but with Italy, France and Germany; I cannot send data outside the EU for GDPR reasons (before, I used Singapore server farms).

Other people have one main disk drive, and occasionally copy a few important files to a floppy and toss it in the desk drawer, unlabeled. Both systems may very well be the optimal tradeoff between needs, wants and haves.
I agree. In this case 7z is very, very good
zpaq even better :)
 
I'll try to recap:
ZFS is all good, snapshots are all good, replicas are good, scripts are good.
I've been making these since the days when ZFS existed only on Solaris, with no ZFS on BSD at all.

Make copies, call them backups, whatever you want; I have nothing to gain, no money, no "glory".

In the open-source spirit of the forums, and of BSD in particular, where there is often wonderful software that very few know about, I only suggest that you... try.
There are many other excellent tools, such as borg, but I like simple things.

hb (HashBackup) and zpaq (don't use zpaqfranz, my fork is bad and will steal your bitcoin wallet, as well as your candy bars).

THEN, after using that software, you can say something.

Good or bad, of course

Saying "my superduperscriptzfs is perfect, wonderful, does anything" should be mean as "does everything YOU know"

Peace & love!
 
In my experience, this form of regard for other humans is what drives engineers to do their absolute best. In such cases, if a data loss occurs, the engineer or admin responsible for it will probably not even be fired: they will resign voluntarily, because they have failed. Failed at their job, and failed humanity.
In fact it can become much worse:
you have to sell your house to pay for the lawsuit.
:)
I also have anecdotes where last-resort backups are done... on paper, on continuous-feed paper printed by a couple of Epson DFX 5000s (!), as protection against hacking. No "Mr. Robot" :) :)
 
Ahem, no
No, because you FIRST need to receive the files.
Please elaborate. Again, patmaddox explicitly said:
If you send to a pool, it's either already mounted, or you can mount it when you need it. No big deal. Your pippo.txt restore is now scp remotebackup:/zbackup/.zfs/snapshot-2022-12-02-1351/path/to/pippo.txt ..

Sending to a pool implies (and is obvious from the remainder of their statement) receiving it into the secondary (I'll avoid the word backup) pool. So what, exactly, do you think he needs to do, when the file is accessible (on the secondary pool/system) at (his example) /zbackup/.zfs/snapshot-2022-12-02-1351/path/to/pippo.txt? Why do you go on to assert he doesn't have a pool when he clearly says he does?

If your argument is instead "suppose you don't have a pool" (to receive into), then I agree that the snapshot send streams lose 99% of their ease of use, because, as you say, they need to be received. But (a) it hasn't been clear (if that is the case) that that's the crux of your argument, and (b) he's describing the situation where he explicitly has a secondary pool somewhere to receive into, which is also how (I'd wager) the bulk of the people using ZFS send/recv choose to use it.

Please stop acting like the most prevalent / recommended approach is anything other than using send and recv to replicate data between pools, and that more work ("you FIRST need to receive the files") needs to be done before you can recover any data.
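
(For reference, a minimal sketch of that send/recv replication between pools; the pool, dataset, and host names are invented for illustration:)
Code:
# first time: full replication of the dataset to the secondary pool
zfs snapshot tank/home@2022-12-05
zfs send tank/home@2022-12-05 | ssh backuphost zfs receive -u zbackup/home
# afterwards only the changes travel; -u keeps the received dataset unmounted
zfs snapshot tank/home@2022-12-06
zfs send -i tank/home@2022-12-05 tank/home@2022-12-06 | ssh backuphost zfs receive -u zbackup/home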

The focus on purging in your posts is also confusing me and perhaps others. With any system that periodically saves newly created (not compressible to 0 or dedup-able) data onto finite media, something has to give eventually. With ZFS, you can pick and choose which snapshots in time to purge. Note that purging a snapshot does not imply losing access to all files referenced by the snapshot, only losing access to those point-in-time versions of files. As best I can tell, your argument is that if you can delete it, it isn't a backup, but I'm not sure. I suppose I can see that point pedantically, but not practically. What is your suggestion for the action to take (and how is it more desirable than purging no-longer-relevant point-in-time snapshots) when you run out of room on the device storing a zpaq* backup?
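
(To make the purging point concrete, a hypothetical prune of a no-longer-relevant point-in-time snapshot; pool, dataset and snapshot names are made up:)
Code:
# list the snapshots of one dataset, oldest first
zfs list -H -o name -s creation -t snapshot tank/home
# drop one point-in-time version; the current files and all other snapshots stay intact
zfs destroy tank/home@2022-01-01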

We can argue till we're blue in the face about what a backup is. Wikipedia describes it as "In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event." For many people, a separate pool in a separate system in a separate location meets those requirements to their satisfaction (especially with snapshots, perhaps with holds placed on them). Some may choose to additionally copy to some other filesystem to "not put all their eggs in one basket" -- I do that, for example, from my replicated pool (which, thanks to ZFS's checksums and having to trust something, I have confidence matches the source; and yes, I have verified that via rsync --checksum in the past for significantly sized data, tens of TB) to tape backup.
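
(On holds, a minimal sketch; the tag and snapshot names are just examples:)
Code:
# place a hold so the snapshot cannot be destroyed accidentally
zfs hold keep-forever zbackup/home@2022-12-05
zfs holds zbackup/home@2022-12-05
# zfs destroy now refuses ("dataset is busy") until the hold is released
zfs release keep-forever zbackup/home@2022-12-05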
 
I asked the question multiple times:
who has tried hb?
Who has tried zpaq?

On purging:
it is really simple.
Typically there is the data, on a machine, on a pool, say on an NVMe drive.
Then the backup, usually on a spinning drive, USB, NAS, whatever.
You have (say) a 1 TB NVMe volume, with snapshots, with (say) 300 GB of "real" space (the rest for the snapshots).
Just an example.

Then an 8 TB cheap spinning drive with the backup (a SOHO situation).
Asymmetry: the backup is big and slow and cheap (more than one, BTW).
Replicas are symmetric:
if you have 500 GB in your snapshots, you MUST use 500 GB on the replica.
There is no simple way to enlarge the available space.
Yes, it can be done of course,
but it is complex, fragile, risky.
Need more backup space?
Change the backup path from /disk1/backup to /disk2/backup in the command.

That's all that is required.

Do you want to keep 2 or 3 or X different backups on different media? Just copy the file.
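
(Roughly what that looks like with zpaq, as far as I understand it; the archive path and source directory are just examples:)
Code:
# each run appends only the changed data to the single archive file
zpaq add /disk1/backup/home.zpaq /usr/home
# need more space, or a copy on other media? the archive is an ordinary file
cp /disk1/backup/home.zpaq /disk2/backup/home.zpaq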

Good luck with zfs.
Yes, you can; I do it all the time.
Very complex, very fragile, a pain in the ass.

Can you replicate ZFS to a Blu-ray disc?
Onto an LTO tape?


Short version: do as you like, I am not here to convince you.

But the same question arises:

when did you last try hb or borg?
 
After debugging I found the root cause of my problem.
In my fcron scripts the order was wrong, and the "incremental backup" was sometimes run before the "full backup". This caused a continuous failure.
 
Script fragility is a well-known "something"; it happens.
 
Another source of error is how I list the snapshots:
Code:
zfs list -t snapshot
Then I parse the output with awk & cut, which is also fragile.
Code:
| grep ${src}@ | awk 'END{print}'| cut -d " " -f1`
This still needs some fine-tuning.
For an incremental send you need the before-last snapshot & the last snapshot as parameters.
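
(A less fragile way to get those two names, assuming the dataset used later in this thread, ZT/usr/home:)
Code:
# -H: no header, tab-separated; -o name: only the name column; -s creation: oldest first
zfs list -H -o name -s creation -t snapshot ZT/usr/home | tail -n 2
# first output line = before-last snapshot, second line = last snapshot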
 
... well.. you can use...
Code:
zpaqfranz zfslist "*" "whatever-you-want"
OK, just kidding (yep, zpaqfranz does this, too :) )
 
For scripting:
zfs list -H -t snapshot
See the -H switch.
Thanks, I had some bad characters/parsing; I'll try with "-H" and see if they go away.
Currently it gives me:
Code:
$'ZT/usr/home@2022_12_05__12_43_35\t3.68M\t-\t20.0G\t-'
I need to cut on a tab (I did a cut on a space). I'll try
Code:
awk 'END{print $1}'
This gives me the first word of the last line.
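
(For what it's worth, cut splits on tabs by default, so an equivalent sketch without awk, using the dataset from the output above, would be:)
Code:
# tail -n 1 takes the last line; cut -f1 takes the first tab-separated field
zfs list -H -t snapshot ZT/usr/home | tail -n 1 | cut -f1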
 
I found another stupid bug in my scripts.
They ran from command-line but not from cron.
It seemed I had to write the zfs command with its full path ...
 
Another problem was "dataset busy" with incremental backups.
I fixed it with a couple of sleep commands.
The script, with a lot of logging:
cat increment_usr_home
Code:
#!/usr/local/bin/zsh -x
/bin/sleep 10
export src="ZT/usr/home"
export dst="ZHD/backup_usr_home"
export mydate=`/bin/date "+%Y_%m_%d__%H_%M_%S"`
export srcdate=${src}@${mydate}
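# name of the most recent existing snapshot of ${src} (last line of the zfs list output)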
export last=`/sbin/zfs list -H -t snapshot ${src} | awk 'END{print $1}'`
export snapshots_src=`/sbin/zfs list -H -t snapshot ${src} | awk '{print $1}'`
export snapshots_dst=`/sbin/zfs list -H -t snapshot ${dst} | awk '{print $1}'`
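# create the new snapshot that will be the target of the incremental send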
/sbin/zfs snapshot ${srcdate}
/bin/sleep 10
echo "INCREMENTAL BACKUP, SNAPSHOT LIST:" >> /var/log/messages
echo ${snapshots_src}         >> /var/log/messages
echo ${snapshots_dst}         >> /var/log/messages
echo "LAST:" ${last}      >> /var/log/messages
echo " SRC:" ${srcdate}   >> /var/log/messages
echo " DST:" ${dst}       >> /var/log/messages
( /sbin/zfs send -i ${last} ${srcdate} 2>>/var/log/messages |  \
  /sbin/zfs receive -o readonly=on -o snapdir=hidden -o checksum=skein -o compression=lz4 -o atime=off -o relatime=off -o canmount=off -v ${receiveoptions} ${dst} >>/var/log/messages 2>&1 ) || \
/usr/bin/logger "zfs-send-receive failed" ${last} ${srcdate} ${dst}
/bin/sleep 10
 
I fixed it with a couple of sleep commands.
Without even looking, that's just wrong. You never "fix" a race condition by just waiting "some time". That's a fragile band-aid repair.

That being said, I'd be curious what the original problem was. zfs snapshot won't return unless the snapshot is actually created... :-/
 
The error message given was "device busy",
even when previous zfs commands had succeeded and finished.
I must emphasize that there is parallel load on the "destination disk".

A second reason could have been a race.
I take snapshots with one-second time resolution. If an incremental snapshot finishes within one second and another one is taken in that same second, the system might have become "confused", as the increment would refer to the same second...
 
Your intuition was right, zirias. The sleep commands did not solve the problem.
The reason, from the log:
Code:
LAST: ZT/usr/home@2022_12_05__17_26_59
 SRC: ZT/usr/home@2022_12_05__17_28_29
 DST: ZHD/backup_usr_home
receiving incremental stream of ZT/usr/home@2022_12_05__17_28_29 into ZHD/backup_usr_home@2022_12_05__17_28_29
cannot receive incremental stream: dataset is busy
I think I hit a bug and should add correct error handling.
My first thought is to reduce disk activity on the destination disk when doing zfs backups.
 
Code:
cannot receive incremental stream: dataset is busy
That's clearly a problem on the receiver side. Shot in the dark, is it a problem when the target dataset is mounted? I always import my backup pool with zpool import -N, so no datasets are mounted...
I think I hit a bug and should add correct error handling.
My first thought is to reduce disk activity on the destination disk when doing zfs backups.
That won't help. "Busy" in such a context doesn't mean "too much work", it means some exclusive resource is already allocated by something else. Hence my idea that a mounted dataset might be the problem, but it could very well be something else as well...
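
(In case it helps, a minimal sketch of that unmounted-receiver setup, reusing the pool and dataset names from the script above:)
Code:
# import the backup pool without mounting any of its datasets
zpool import -N ZHD
# -u keeps the received dataset unmounted as well
zfs send -i ${last} ${srcdate} | zfs receive -u -v ZHD/backup_usr_home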
 
OK, it's fixed. The error "dataset is busy" was caused by zfs receive not getting enough IO time on the receiving disk.
By rescheduling the backup to "low activity" moments on the receiving disk, the problem went away.
In fcrontab I now have the line
Code:
@first(22),volatile(true)  30 /usr/home/x/Root/backup/increment_usr_home       -u root
Which means: take an incremental snapshot & backup 22 minutes after boot, and after that every 30 minutes.
 
I don't really know what to say.

MAYBE it's a problem if a dataset is in "active use" at the same time you attempt to receive a snapshot. But then, fiddling with scheduling is another band-aid repair that won't be reliable.

The question I have in my mind is: Why in the world would you have any other activity on a dataset that's used for receiving snapshots for backup purposes? 🤷‍♂️
 
No, there was no other dataset activity, but other disk activity was scheduled.
For example, I was running "/usr/local/bin/clone" and meanwhile an incremental zfs backup.
My guess is that zfs did not receive enough IO time because of the "clone" I was doing.
Note, it's a spinning disk, and different processes were fighting for IO time.
 