FreeBSD 7.3 amd64 rsnapshot woes (libc issue?)

fadolf · Jul 7, 2010

After upgrading several boxes from 7.1 and 7.2 to 7.3-RELEASE-p1 (all amd64) we were experiencing some weird behaviour with rsnapshot:

After a few days to several weeks (seemingly at random), rsnapshot reports that it can't start because its lockfile and process are still present. On such boxes, either a zombie rm or find process (presumably launched by rsnapshot) can be found. If the backup was done to a separate partition (physical disks or RAIDs), any access (ls, stat, fsck, etc.) to the partition would kill the current terminal session, creating a new zombie of the process one had just started. Unmounting the affected partition led to the same result.
The machines wouldn't even shut down completely but hung somewhere after syncing buffers; only a hardware reset worked. After the reboot, those partitions were unmounted and fscked, after which the backups would work again until the next occurrence.
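
For anyone trying to tell this condition apart from a merely stale lockfile, a check along the following lines may help. It is a minimal Python sketch under assumptions: the function name lockfile_status is made up for illustration, and it assumes the lockfile contains nothing but the PID of the rsnapshot run. The path in the usage note is also an assumption, so check the lockfile setting in rsnapshot.conf.

```python
import os
from pathlib import Path


def lockfile_status(path):
    """Classify a PID lockfile as 'absent', 'invalid', 'stale' or 'held'.

    Hypothetical helper: assumes the file holds a single decimal PID.
    """
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        return "absent"
    try:
        pid = int(text)
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return "stale"  # lockfile left behind, no such process
    except PermissionError:
        return "held"  # process exists but belongs to another user
    return "held"  # process exists (possibly hung or a zombie)
```

Calling lockfile_status("/var/run/rsnapshot.pid") and getting "held" for a run that started days ago would match what is described above; "stale" would point to a leftover lockfile instead.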

The affected boxes are far too diverse in function, installed ports and hardware for one to easily blame it on a common third-party component, and the effect occurred too often to dismiss it as "bad luck".

The Errata Page for 7.3 only mentions a problem with libc on FreeBSD/amd64 in the entry of [20100330] Late-Breaking News and Corrections.

This fix is already included in 8.0-RELEASE-p3 and boxes which were upgraded to this version haven't exhibited this behaviour ever since the update.

Has anyone experienced anything similar with 7.3-RELEASE-p1 on amd64?

fadolf · Jul 26, 2010

Small Update:
The fix for include/dirent.h wasn't included in 7.3-RELEASE-p2.
So after checking out the new source, we patched it on our own, yet the aforementioned behaviour still occurred despite a fresh buildworld and buildkernel.
I guess it's safe to assume that the behaviour we observe is not related to libc after all; it remains to be seen what's causing it.
The remaining affected boxes have been upgraded to 8.1-RELEASE with a binary update and haven't shown this behaviour since Friday.

Any thoughts on what the underlying culprit could have been?

fadolf · Sep 23, 2010

Another update in this mystery:

A previously unaffected box exhibited this behaviour, luckily this one was a non-production test box, so we built a debug kernel with:

Code:

makeoptions DEBUG=-g # gdb(1) debug symbols
options KDB # Enable kernel debugger support.
options DDB # Support DDB.
options GDB # Support remote GDB.
#options DEADLKRES # Enable the deadlock resolver (unfortunately not supported in 7.3)
options INVARIANTS # Enable calls of extra sanity checking
options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
options WITNESS # Enable checks to detect deadlocks and cycles
options WITNESS_SKIPSPIN # Don't run witness on spinlocks

A few weeks after installing said kernel, the server encountered the problem again, but it didn't panic.

procstat -ak

showed these processes of interest

Code:

55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
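
Both stacks end up sleeping in _lockmgr via _vn_lock, i.e. waiting for a vnode lock. To pick such threads out of a long procstat -ak listing, something like the following could be used. This is only a rough Python sketch: the function name and the choice of marker frames are assumptions made for illustration, and it expects each thread on a single line in the form "PID TID COMM TDNAME frames...", with no whitespace inside COMM or TDNAME.

```python
# Frames that, taken together, suggest a thread sleeping on a vnode lock.
# The choice of these two markers is an assumption based on the traces above.
VNODE_LOCK_FRAMES = ("_lockmgr", "_vn_lock")


def blocked_on_vnode_lock(lines):
    """Yield (pid, tid, comm, frames) for threads whose kernel stack
    contains every frame listed in VNODE_LOCK_FRAMES."""
    for line in lines:
        fields = line.split()
        # Skip the header, blank lines and anything not starting with PID TID.
        if len(fields) < 5:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, tid, comm = int(fields[0]), int(fields[1]), fields[2]
        frames = fields[4:]  # fields[3] is the thread name column
        if all(marker in frames for marker in VNODE_LOCK_FRAMES):
            yield pid, tid, comm, frames
```

Passing it the lines of saved procstat -ak output (for example an open file object) yields one tuple per matching thread. Wrapped continuation lines, as in the paste above, are skipped by this sketch because they do not start with a PID.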

unfortunately we didn't enable

Code:

options     DEBUG_LOCKS
options     DEBUG_VFS_LOCKS

so we'll have to wait another few days or weeks until we get more info.

Although the server didn't panic, unmounting the affected partition rendered the system completely unresponsive, and nothing short of a hardware reset yielded any results.

Everything seems to hint at something causing a deadlock. It's a bit unnerving that this only happens with 7.3 and not with 7.1, 7.2 or 8.x, seeing as there are no changes in the source of ciss(4) that should have such effects.
