Daily panics dumping journaled FS

This last weekend, I converted my /home partition to a journaled filesystem. I did this by dumping it, using gjournal label -f on the partition, then newfs -J and restore.

So far as I can tell, the result is correct and functional, except for one thing.

Every day, I run a script that does a dump on all of the filesystems at varying levels, but always with -L and -C. /home is the only journaled filesystem, and twice now in the last 5 days, the system has panicked at some point during the process (I can tell this because /etc/dumpdates shows dumps for all the filesystems except this one).
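
The /etc/dumpdates check described above can be sketched as a small shell function. This is only an illustration: the function name not_dumped_on is made up, and it assumes each dumpdates record is laid out as device, level, then a ctime-style date (weekday, month, day, time, year).

```sh
#!/bin/sh
# Print each DEVICE that has no record in FILE dated MONTH DAY YEAR.
# Assumed record layout: device level weekday month day time year
# Usage: not_dumped_on FILE MONTH DAY YEAR DEVICE...
not_dumped_on() {
    file=$1 month=$2 day=$3 year=$4
    shift 4
    for dev in "$@"; do
        # awk exits 0 only if a matching record for this device was seen.
        if ! awk -v d="$dev" -v m="$month" -v dd="$day" -v y="$year" '
            $1 == d && $4 == m && $5 == dd && $7 == y { found = 1 }
            END { exit !found }' "$file"
        then
            echo "$dev"
        fi
    done
}
```

Called with /etc/dumpdates, a date, and the devices the backup script covers, it prints the ones that were skipped, which after a reboot would point at the filesystem being dumped when the machine went down.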

This is a 7.1-RELEASE-p1 system. geom-journal is being loaded as a module.

Despite the fact that I have dumpon set up, I get nothing at all in /var/crash. The first panic was because of "snapacct_ufs2: bad block", but I have no information at all about the panic last night. I only know that I didn't get the cron e-mail I usually get, the uptime of the machine goes back to when /home would have been dumped, and /etc/dumpdates shows that it didn't happen.

I suspect that creating the snapshot is causing the trouble. mksnap_ffs takes a good 10 minutes to complete. The filesystem has about 8 GB of data in it and is well over 200 GB in total size (so over 95% empty).

The machine has 1 GB of total RAM and I used the default 1 GB journal size. I did so because all of the tutorials suggest that the workload on this machine (mail server for 2 people, web) shouldn't require anything extraordinary. If this is being caused by the journal being overrun, would a 'gjournal sync' immediately before the dump help?
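
For concreteness, the sync-before-dump idea could look like the sketch below. The helper names run and dump_after_sync and the DRYRUN variable are made up for illustration, the dump flags are copied from the manual test further down in this post, and whether the extra sync actually avoids the panic is untested.

```sh
#!/bin/sh
# Sketch: force a journal switch, then take the snapshot-based dump.
# Set DRYRUN=1 to print the commands instead of running them.
run() {
    if [ "${DRYRUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: dump_after_sync LEVEL FILESYSTEM OUTPUT
dump_after_sync() {
    level=$1 fs=$2 out=$3
    # Skip the dump if the journal sync itself fails.
    run gjournal sync || return 1
    run dump "${level}LCf" 20 "$out" "$fs"
}
```

With DRYRUN=1, "dump_after_sync 0 /home /dev/null" just prints the two commands, which is a safe way to check the flag string before wiring it into cron.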

The good news is that despite the panics, it appears that the journal is doing its job. 'fsck -n' on the live filesystem doesn't turn up anything unexpected (one unref file and the free block count wrong), and there are no messages in the boot dmesg other than the note that the journal is consistent.
 
Well, sort of.

Making a snapshot always takes a good long time, and that, I believe, is a known issue. During that time the system is largely unresponsive.

That's not something that really bothers me. Having the machine panic (and, thus, fail to complete the backups) does bother me.

I've tried to make snapshots manually. Apart from taking forever and a day to finish, there has been no negative side effect - certainly no panics. I've even done things like

dump 0LCf 20 /dev/null /home

and they've run to completion (after the initial delay caused by the L flag making a snapshot).
 
I had rsnapshot backing up to an IDE drive. Running stuff in another tty
panicked the box, and the MBR was hosed (unless I can recover it).
(The MBR of the target, not the source drive.)
.........
In the meantime I installed a PCI-to-SATA card and
revised the rsnap.zsh (not its name) backup script, which
runs rsnapshot in succession over the filesystems, to use
gnice --adjustment=10 rsnapshot.........(stuff) > (log)
and it completed without problems, in about the same time,
now onto the SATA drive. One tty went ncurses-wild, but a few
"stty sane" fixed that.
So maybe the "gnice --adjustment=10" prefix on rsnapshot
would preclude panics during backup, if they would otherwise occur.
..........
During the rsnapshot runs I was able to email, browse the web, etc.
.............
 
Keep in mind that rsnapshot is not the same thing as the snapshots created by dump and mksnap_ffs.
 
right. I've used BootIt, dump, and rsnapshot
as well as single-fs tar, cpio, pax etc. So I have
never gotten around to reading much about
snapshots. (unless of course they are created
by "dump").
.............
okay. So dumps==snapshots. Good to know.
...............
just to muddy up the post, /star-devel/, I have
its manual printed from 2004 or 2005...
 
Well, this morning the machine didn't reboot, but was just stuck and unresponsive. Snapshots on journaled UFS filesystems appear to be really problematic.
 
So this is a continuing issue. About every 3rd day or so either the machine reboots or hangs attempting to make a snapshot on /home, which is the only journaled filesystem I've set up on this machine.
 
I decided to try something different. Before, I had gjournal set up within the same partition, and had set aside 1GB for the journal.

I decided to repartition the disk to have a separate partition for the journal, so that the journal could go before the beginning of the partition to improve the performance a bit. I also made the journal 10GB instead of 1.
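
A rough outline of that layout, as a function that only prints the commands rather than running them. The function name journal_plan is made up, the provider names ad0p5 (data) and ad0p7 (journal) and the label "home" are taken from the console output quoted below, and it assumes the data partition is being recreated from scratch, since newfs destroys whatever is on it.

```sh
#!/bin/sh
# Print (do not run) the commands for a journal on its own provider.
# Usage: journal_plan DATA_PROVIDER JOURNAL_PROVIDER VOLUME_LABEL
journal_plan() {
    data=$1 journal=$2 label=$3
    # Two providers: the first holds the data, the second holds the journal.
    echo "gjournal label $data $journal"
    # -J marks the new filesystem as journaled, -L sets the UFS volume label.
    echo "newfs -J -L $label /dev/${data}.journal"
}
```

Running "journal_plan ad0p5 ad0p7 home" prints the gjournal label and newfs lines for review. The journal size comes from the size of the journal partition itself, 10GB in this case.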

Now the issue is that despite the filesystem having a label, it doesn't show up at the right time in /dev/ufs. If I don't use the label and instead use /dev/ad0p5.journal (the disk is GPT labeled), here's what the console says:

SMP: AP CPU #1 Launched!
GEOM_LABEL: Label for provider ad0p2 is ufs/root.
GEOM_LABEL: Label for provider ad0p3 is ufs/var.
GEOM_LABEL: Label for provider ad0p4 is ufs/usr.
GEOM_LABEL: Label for provider ad0p5 is ufs/home.
GEOM_JOURNAL: Journal 3505265168: ad0p5 contains data.
GEOM_JOURNAL: Journal 3505265168: ad0p7 contains journal.
GEOM_LABEL: Label ufs/home removed.
Trying to mount root from ufs:/dev/ufs/root
WARNING: / was not properly dismounted
GEOM_JOURNAL: Journal ad0p5 consistent.
GEOM_LABEL: Label for provider ad0p5.journal is ufs/home.

If I attempt to use /dev/ufs/home, then it fails, because the geom label module saw the home label on ad0p5 at first, but it seems that geom gjournal sees that there's a journal, so it removes ad0p5, making the label go away. But when it creates ad0p5.journal, it would appear that that label doesn't get created in /dev/ufs like it should.

fsck, so far as I can tell, makes the /dev/ufs/home label come back, but then mounting /dev/ad0p5.journal makes it go away, of course.

With this partitioning, it *appears* that I can make and remove snapshots and have the machine survive. I'll know more in a few days. The snapshots take forever and a day, but that doesn't really offend me nearly as much as if the machine panics or hangs every other day.

But it annoys me somewhat not to be able to use /dev/ufs/home as the device for the mount.
 
I was able to solve the labeling problem (see PR kern/132273), but I still see either a spontaneous reboot (no panic dump in /var/crash) or a hang attempting to use dump with snapshots. The value of running with journaling no longer exceeds the cost of dealing with this problem, so I have disabled journaling. Maybe it won't suck so hard in 7.2.
 