Apologies for the long post, but I have a really weird issue. I think I may have worked around or resolved it (for now), but I'm curious whether anyone can shed some light on it.
I have a server which I've updated via buildworld/buildkernel around 10 times now without issue. The last time, I went from 9.1-RC1 to RC2, again with no issue. Then yesterday I tried to go from RC2 to RC3. At completely random points during the buildworld the server kernel panicked and rebooted. I enabled crash dumps and found this:
Code:
panic: ufs_dirbad: /: bad dir ino 16774028 at offset 29696: mangled entry
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80580a76 at kdb_backtrace+0x66
#1 0xffffffff8054b73e at panic+0x1ce
#2 0xffffffff80799faf at ufs_dirbad+0x4f
#3 0xffffffff8079b6a9 at ufs_lookup_ino+0x6a9
#4 0xffffffff805cd088 at vfs_cache_lookup+0xf8
#5 0xffffffff80831910 at VOP_LOOKUP_APV+0x40
#6 0xffffffff805d4724 at lookup+0x464
#7 0xffffffff805d5839 at namei+0x4e9
#8 0xffffffff805ef91b at vn_open_cred+0x3cb
#9 0xffffffff805ee999 at kern_openat+0x1f9
#10 0xffffffff807e8586 at amd64_syscall+0x546
#11 0xffffffff807d3ee7 at Xfast_syscall+0xf7
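For anyone wanting to capture the same information, here is a minimal sketch of the usual way to enable crash dumps on FreeBSD. It assumes the stock rc.conf mechanism and the default dump directory. The post does not say exactly how dumps were enabled on this server, so treat the values below as assumptions.

```shell
# /etc/rc.conf fragment (assumed settings, not taken from the post)
dumpdev="AUTO"        # use a configured swap device as the kernel dump device
dumpdir="/var/crash"  # where savecore(8) stores the dump at next boot (the default)

# After a panic and reboot, the saved dump can be opened against the kernel
# that was running at the time, for example:
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0
# (vmcore.0 is an assumed name; the suffix depends on how many dumps exist)
```

The swap device needs to be at least large enough to hold the dump, otherwise nothing is saved.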
I have been using SU+J journalling on this filesystem, which is relatively untested in the wild at the moment, and I've seen a few people on various forums and mailing lists saying it has caused them disk corruption and similar issues. I figured it would be worth disabling it and running a full fsck on the filesystem, so I did. This showed lots of errors, which were successfully fixed. Thinking this might have solved the problem, I then tried another buildworld. This time it crashed with this panic:
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x801466c30
fault code = supervisor read instruction, protection violation
instruction pointer = 0x20:0x801466c30
stack pointer = 0x28:0x8012ce600
frame pointer = 0x28:0x801314210
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 95300 (cc1)
trap number = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
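For reference, the SU+J disable and full fsck mentioned before this panic would typically look something like the sketch below. The device name /dev/ada0p2 is a placeholder (the post does not give the real one), and the commands assume the system was booted into single-user mode so the root filesystem is mounted read-only.

```shell
# Run from single-user mode; /dev/ada0p2 is a hypothetical device name.

# Turn off soft updates journalling (plain soft updates stay enabled).
tunefs -j disable /dev/ada0p2

# Force a full check even if the filesystem is marked clean,
# answering "yes" to every repair prompt.
fsck -fy /dev/ada0p2
```

With the journal disabled, fsck walks the whole filesystem instead of only replaying the journal, which is why a check like this can surface errors that earlier journal-based checks did not report.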
If need be I can post the output of "backtrace" from these kgdb sessions, but I've left it out to keep this post from getting too long.
Because I now had two different errors, I figured there might be something wrong related to the RC2 source. My server otherwise seemed to be working perfectly well: it had been up for several weeks and had been through many portmaster upgrades during that time. Out of curiosity, knowing that I had been able to do a buildworld of RC2 whilst running RC1, I decided to try downgrading. I changed my compiler to clang and ran another buildworld to see what would happen. The buildworld/buildkernel completed without issue, so I installed it and rebooted into RC1. Then comes the interesting part: I reverted to gcc and ran buildworld/buildkernel again, and this also completed with no problems at all.
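The compiler switch described above is normally made in /etc/make.conf. Below is a hedged sketch of the commonly documented settings for building 9.x with clang. The post does not show the actual file, so these lines are an assumption rather than a copy of what was used here.

```shell
# /etc/make.conf fragment (assumed; not copied from the server in question)
CC=clang
CXX=clang++
CPP=clang-cpp
```

Removing or commenting out these three lines returns the build to the base system gcc.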
So I'm thinking there may actually have been a problem with the binaries I had installed from the RC2 source. Is it possible that something was broken in a way that could cause UFS SU+J corruption and/or assorted kernel panics under load when building with gcc but not with clang, and that this has now been fixed by my reinstalling all those binaries from a fresh build? Or is it possible that there is a serious issue with RC2, and possibly RC3? I'm a bit scared now to try updating to RC3 or 9.1-RELEASE, although I have kept a copy of the built object tree, which I can easily reinstall.
I had a quick look through the svn log entries for the differences and can't see anything that jumps out at me as a major issue, other than possibly some changes made to some timers from 10ms to 10us.
Any ideas? Would the output from backtrace help? I should probably also mention that I use ccache, although it was still crashing with ccache disabled. The gcc etc. binaries I had installed would originally have been built through ccache, though. I also normally use make -j4, but I tried without that as well and it still crashed.