dump freezes every time I try to run it.

LSDave · Aug 19, 2011

Hello,

Running FreeBSD 9.0 BETA1 (AMD64) on a Dell PowerEdge SC430.

I am mostly a Windows guy turned FreeBSD newb, but have had success updating the source, building world, customizing, compiling and installing the kernel (to remove debugger stuff and devices that don't apply to my system), installing world, importing ZFS raidz pool from previous FreeBSD system, installing Samba, Xorg, KDE4, and TightVNC. In fact, I just got everything working the way I wanted, was about to do a portupgrade -a, but thought to take a backup while everything was working so great... :\

So I started to do a dump of the system slices--backing them up one at a time onto my large ZFS raidz. I only got as far as the root fs, using the following command:

Code:

# dump -0aLuf /mnt/ZFSstore/more.paths/full-root /

This first attempt was actually working--I saw corresponding hard drive activity on my raidz hard drive enclosure (i.e., the destination drives). But then I stupidly tried to interrupt the process, hitting Ctrl+C repeatedly to try to get it to stop, and when the process did not terminate and the system seemingly froze, I gave up and hard reset the machine.

On reboot, fsck detected errors in the filesystems, and appeared to correct them. I've also since booted to single user mode and run fsck on all the system partitions. fsck reported that it fixed the errors.

The problem is that each time since that first forced restart, running the dump command has failed. Absolutely no machine activity follows from the command, and the computer becomes largely unresponsive, though strictly speaking it's not "frozen". For example, my Ctrl+C entries appear on the screen, but have no effect--the process does not terminate and I am not returned to the command line and cursor. If I press Alt+F2 I can indeed get to another console, but upon typing "root" at the login prompt the username very sluggishly appears one character at a time, and then that console becomes largely unresponsive as well--the password prompt never appears.

This unresponsiveness does not end prettily--as each time I am obliged to hard reset the machine again, after which fsck reports and repairs more filesystem errors on boot up. This has gone on 4 or 5 times like this with the same result, and now I come here for your assistance.

Any guidance would be appreciated, and please let me know if you require further details. Thanks in advance.

SirDice · Aug 19, 2011

It sounds like the drive may be up the creek. I've had a similar issue once before, turned out the drive had bad sectors which caused things to hang indefinitely.

wblock@ · Aug 19, 2011

-L when dumping a live filesystem makes a snapshot first; see mksnap_ffs(8). That can take a couple of minutes where it seems to be doing nothing, although there will be drive activity and mksnap_ffs will show on top(1). Running anything else will be slow due to head contention. Once the snapshot completes, it'll be back to normal and the dump will start.
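
One way to tell a slow snapshot from a real hang is to look for mksnap_ffs in the process list from a second console. A minimal sketch, assuming the process name appears exactly as mksnap_ffs; the function name is made up for illustration:

```shell
# Hypothetical helper: reads a list of command names on stdin and reports
# whether mksnap_ffs (the snapshot step behind dump -L) is among them.
snapshot_running() {
    if grep -qx 'mksnap_ffs'; then
        echo "mksnap_ffs is running: the snapshot is still being built"
    else
        echo "mksnap_ffs is not running"
        return 1
    fi
}

# Usage on the live system, from a second console:
#   ps -axo comm | snapshot_running
```

If mksnap_ffs is listed and the drive is busy, waiting is probably the right call; if it stays at the top of top(1) for a very long time, something else is going on.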

LSDave · Aug 30, 2011

SirDice said:
It sounds like the drive may be up the creek. I've had a similar issue once before, turned out the drive had bad sectors which caused things to hang indefinitely.

I have progressed without backing up, and have seen no issues or other symptoms suggesting a faulty drive. No complaints from fsck on boot up, for example. I really don't think this is the problem, but any suggestions on a program I can run to rule it out completely?

wblock said:
-L when dumping a live filesystem makes a snapshot first; see mksnap_ffs(8). That can take a couple of minutes where it seems to be doing nothing, although there will be drive activity and mksnap_ffs will show on top(1). Running anything else will be slow due to head contention. Once the snapshot completes, it'll be back to normal and the dump will start.

Yes, there was certainly a delay of a couple of minutes the first time, when it worked and I foolishly interrupted it. Since that time, however, I have left it for 10, 15, and even 20 minutes and there is ZERO activity. So I don't think it's a matter of patience, but thanks for this info.

Q1: How can I fix this: Rebuild world? Rebuild kernel? Both?

Q2: After many many attempts installing and frustration, I've really just got FreeBSD right where I want it: customized Kernel, Samba + ZFS NAS server playing nice with my Windows network, KDE working and autoloaded, VNC autoloaded, and VirtualBox running. I would consider a rebuild discussed above at Q1, but I really want a back up first.

I have read that while dump/restore is best, you can also back up your system with tar. Any thoughts or advice on me running the following command to effect a full backup:

Code:

tar -cvpzf  /mnt/path.to.ZFSstore/fullbackup.tar --exclude=/proc --exclude=/sys --exclude=/mnt --exclude=/media --exclude=/dev /

And restored if (and hopefully not!) necessary with:

Code:

tar -xvpzf /mnt/path.to.ZFSstore/fullbackup.tar -C /

Will these commands do the trick? Have I missed some typical exclusions? Any other concerns?

Thanks for your time and any comments.

wblock@ · Aug 30, 2011

Wait--if the disk being backed up is ZFS, don't use dump, which is for UFS. net/rsync is a better candidate than tar(1) if you want to go that way, but there should be something native for ZFS.

LSDave · Aug 30, 2011

wblock said:
Wait--if the disk being backed up is ZFS, don't use dump, which is for UFS. net/rsync is a better candidate than tar(1) if you want to go that way, but there should be something native for ZFS.

I was not clear. The /, /var, /tmp, and /usr filesystems are not ZFS, but rather the standard FreeBSD filesystem (UFS) that is the only option readily available during the Boot-DVD's installation process. The destination folder for the backup (and incidentally the storage shared by the Samba Server) is a ZFS raidz pool, which is mounted at /mnt.

Against the background of those clarifications, is using tar as I described above acceptable?

wblock@ · Aug 30, 2011

LSDave said:
Against the background of those clarifications, is using tar as I described above acceptable?

Yes, but on restore it may Do Things with hard links and other filesystem weirdness. Might be options to handle that like with rsync(1) (-H).
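
The hard-link concern can be checked on a throwaway directory before trusting tar with a full backup. A minimal sketch, assuming a tar that accepts -c, -x, -p, -f and -C (both bsdtar and GNU tar do); the function name and file names are made up for illustration:

```shell
# Hypothetical check: archive two hard-linked files, extract them elsewhere,
# and report whether they still share an inode after the round trip.
check_tar_hardlinks() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/dst"
    echo data > "$work/src/a"
    ln "$work/src/a" "$work/src/b"          # b is a hard link to a

    if ! tar -cpf "$work/t.tar" -C "$work/src" . ||
       ! tar -xpf "$work/t.tar" -C "$work/dst"; then
        rm -rf "$work"
        return 1
    fi

    # -ef is true when both names refer to the same file (same inode)
    if [ "$work/dst/a" -ef "$work/dst/b" ]; then
        result="hard links preserved"
    else
        result="hard links NOT preserved"
    fi
    rm -rf "$work"
    echo "$result"
}

check_tar_hardlinks
```

This only covers hard links inside a single archive; other filesystem oddities (flags, ACLs, sparse files) would need their own checks.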

Trying dump on a small filesystem like / or /var without -L, and maybe with -C8 or -C16, could help to figure out the problem. It really ought to work, and is kind of concerning.
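
For reference, the test suggested here might look like the following. This is a sketch only: the helper prints the command instead of running it, the output path is a placeholder, and -u is left out so /etc/dumpdates is not touched by a test run.

```shell
# Hypothetical helper: prints (does not run) a level-0 dump command with no -L
# and an explicit cache size in MB.
dump_no_snapshot_cmd() {
    fs=$1            # filesystem to dump, e.g. /var
    out=$2           # output file; any path used here is an assumption
    cache=${3:-16}   # -C value suggested above (8 or 16)
    printf 'dump -0a -C %s -f %s %s\n' "$cache" "$out" "$fs"
}

# Example: show the command for /var with the default cache size
dump_no_snapshot_cmd /var /mnt/path.to.ZFSstore/var-test.dump
```

Without -L there is no snapshot, so on a live filesystem the result is a diagnostic aid rather than a backup to rely on.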

dalescott · Jan 18, 2012

I'm also getting a non-responsive system after issuing dump on a relatively fresh install of FreeBSD-9.0-RELEASE. The system is a VirtualBox 4.1.2 vm, with a 20GB GPT system drive and a 20GB MBR backup drive. Since it was created, the ports tree has been updated and apache22, mysql55-server and python/django installed (and a number of minor utilities). Everything seems to work ok, except for dump.

Code:

# mount
/dev/ada0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/ad1as1d on /backup (ufs, local, soft-updates)
#
# cd /backup
# dump -0aLf 20120118.dump /

There is no output after hitting <enter> and the system becomes generally unresponsive. A command typed into either the VirtualBox server console or an ssh terminal is echoed (e.g., whoami), but that's all. I had started "top" in a separate ssh terminal before issuing the dump command and it shows mksnap_ffs with 98%-100% WCPU for about 55 minutes, at which point the "top" session stops updating. After 70 minutes I give up and yank the virtual power cord.

I then created a new FBSD-9.0 VM (basic install only from DVD), and also created, partitioned, labeled, and mounted a new virtual backup drive. On this one, dump works as expected! I've moved the virtual backup drive between systems, and also repartitioned/relabeled the backup drive on the problem system, with no effect. FWIW, on both VMs (the problem VM and the one where dump works), fdisk reports that the system drive chunks do not start on track boundaries (no idea why, I used auto for the install).

Does any of this make any sense to anyone? Is it possible to correct it somehow? I could spend the time to re-configure the second VM like the first, but not knowing what went wrong concerns me (as does whether it's going to happen again). The overall intent is to get version 2 of my live server working as a VM, then restore the dump from the VM onto the live server.

Thanks in advance for any and all assistance,
Dale

wblock@ · Jan 19, 2012

Are the two VMs identical in hardware settings, particularly in controller? dump(8) forks several processes. If the VM was really limited on RAM, it might start swapping and really slow down.

kpa · Jan 19, 2012

Your problem is not likely related to VirtualBox but to a problem with snapshots on a UFS filesystem when using an SU+J setup.

There's a thread on the freebsd-current mailing list about this very issue:

http://lists.freebsd.org/pipermail/freebsd-current/2012-January/030937.html

dalescott · Jan 19, 2012

Thanks kpa for your link. The problem was caused by UFS SU+J (with journalling). After reverting the file system to SU only (no +J), dump executed correctly.

For the benefit of others, the thread linked to by kpa is very informative and worth reading, but the short version for me was to boot the live CD from a FreeBSD-9.0-RELEASE image (dvd1, but disc1 would work too), then:

Code:

# fsck -f (as recommended by Mr. McKusick)
# tunefs -j disable /dev/ada0p2

and then reboot normally from the system drive and delete /.sujournal (to reclaim space).
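
To confirm the change took, the journaling state can be read back with tunefs -p. A minimal sketch, assuming the report format shown elsewhere in this thread for 9.x; the function name is made up for illustration:

```shell
# Hypothetical helper: reads a tunefs -p report on stdin and prints the state
# of soft update journaling ("enabled" or "disabled").
suj_state() {
    sed -n 's/^tunefs: soft update journaling: (-j)[[:space:]]*//p'
}

# Usage (the "tunefs:" prefix suggests the report goes to stderr, hence 2>&1):
#   tunefs -p /dev/ada0p2 2>&1 | suj_state
```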

Hopefully smarter minds than mine will solve the issue with journalling.

P.S. Thanks also wblock@, the two VMs were in fact identical, both with the default 128M of memory. Before disabling journalling, I tried increasing memory to 256M, and then to 512M, but without effect.

kpa · Feb 7, 2012

Take note:

http://lists.freebsd.org/pipermail/freebsd-fs/2012-February/013646.html

wblock@ · Feb 7, 2012

Direct link to the PR: PR kern/161674

Tomse · Oct 1, 2012

Does anyone besides me still have this problem?

Running 9.0 p3 amd64.

According to the PR, the bug should be fixed by now.

swa · Oct 1, 2012

Last week I had freezes using Amanda dump. Only on one partition (643G) with 24G in use. I couldn't locate the problem; instead I thought it would be the combination of big partition + active/encrypted jails causing problems. I managed to reproduce it and the system just hangs, finally locking me out of ssh, and I couldn't do anything but hard reboot. Dumps of the smaller /, /usr and /var are fine. Using 8.3-RELEASE-p3 with soft-updates but without journaling.

ctengel · Aug 10, 2013

Hi all,

Sorry to resurrect a long dead thread, but I think I just hit this bug and am trying to determine the best route forward. I am running FreeBSD 9.0-RELEASE-p7 (which I know is no longer supported, but was just taking a backup with dump(8) prior to upgrading to 9.1!). Something to the effect of the following happened:

Code:

# cd / && dump -0 -a -L  -n -f - /dev/mirror/gm0 | gzip > /zpool/rootdumps/pre91upgr.gz
  DUMP: Date of this level 0 dump: Sat Aug 10 15:00:38 2013
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping snapshot of /dev/mirror/gm0 (/) to standard output
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 885750 tape blocks.
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]

I noticed after it was sitting on that step for a while that my output file was sitting at 5.9 MB for a long time. I have seen where gzip(1) does a good job compressing stuff, so figured that may be why...but nope... gunzip into wc -c indicates that the inner content is a steady 18219008 bytes.
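
The check described here can be repeated a minute or so apart to see whether the dump is still producing data. A minimal sketch, with a made-up function name; it assumes gzip still emits whatever it could decode from a file that is still being written, which is the usual behavior:

```shell
# Hypothetical helper: prints the number of bytes that can currently be
# decompressed from a (possibly still growing) gzip file.
uncompressed_bytes() {
    # errors about a truncated stream are expected here and are discarded
    bytes=$(gzip -dc < "$1" 2>/dev/null | wc -c | tr -d ' ')
    echo "$bytes"
}

# Usage: compare two readings taken some time apart
#   uncompressed_bytes /zpool/rootdumps/pre91upgr.gz
```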

top seems to indicate I have two dump processes running, each about a size of 10 MB, and each taking up about an entire CPU core it seems. (I have 3, which is probably why my system does not appear to be fully hung as with the OP.)

Notably, there are no mksnap_ffs processes running.

tunefs(8) does appear to indicate SU+J is enabled:

Code:

# tunefs -p /dev/mirror/gm0 
tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       enabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)

My first instinct is to simply kill/interrupt the dump, but based on the OP's experience I am sort of afraid to do that. Don't want to cause data corruption issues. So my first question is: what should I do now to get out of this situation?

Also, going forward, I guess I should disable journalling? Or soft updates? Or both? Or just not take backups this way? (Or is this bug fixed in 9.1?)

Thanks!
Chris
PS: I did also come across in my Googling about this that journalling is not recommended on SSDs. I am using a gmirror of two CF cards as my root device, so I imagine that maybe I should just disable journalling for that reason alone?

wblock@ · Aug 10, 2013

It's not interrupting the dump that causes filesystem problems, it's a hard reset. So if you can kill the dump(8) process, it will be fine. Reboot in single user, turn off SUJ with tunefs -j disable on all UFS filesystems that will ever be backed up with dump(8).

SUJ's whole reason for existence is to reduce the time a full fsck(8) takes. On an SSD, fsck(8) goes fast anyway, so there is little reason to use SUJ. CF cards are not really SSDs by modern standards. They are usually not very fast. Regardless, if you want to use dump(8), turn off SUJ.
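
To find which filesystems this applies to, the mount(8) output can be filtered for UFS entries that list journaled soft-updates. A minimal sketch, assuming the mount output format shown earlier in this thread; the function name is made up, and the helper only prints the commands, it does not run them:

```shell
# Hypothetical helper: reads mount(8) output on stdin and prints (does not run)
# a tunefs command for every UFS filesystem mounted with journaled soft-updates.
suj_disable_cmds() {
    awk '/\(ufs,.*journaled soft-updates/ { print "tunefs -j disable " $1 }'
}

# Usage: review the list, then run the commands from single-user mode
#   mount | suj_disable_cmds
```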

ctengel · Aug 11, 2013

Thanks; I was able to simply control C it. Just for kicks I tried doing a dump without -L, and came back with many errors from the mirror... so I had further underlying issues. After doing some over-zealous fsck(8) and breaking, I was able to more or less recover everything to workable from an older dump, fix underlying issues, and then proceed to complete 9.1 upgrade. It was a long day, but worth it!
