Crash issue with 9.1-RC2

Apologies for the long post, but I have a really weird issue which I think I may have worked around or resolved (for now). I'm curious whether anyone can shed some light on it.

I have a server which I've updated with a buildworld/kernel around ten times now without issue. The last time I went from 9.1-RC1 to RC2, again with no issue. Then yesterday I tried to go from RC2 to RC3. At completely random points during the buildworld the server kernel panicked and rebooted. I enabled crash dumps and found this:

Code:
panic: ufs_dirbad: /: bad dir ino 16774028 at offset 29696: mangled entry
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80580a76 at kdb_backtrace+0x66
#1 0xffffffff8054b73e at panic+0x1ce
#2 0xffffffff80799faf at ufs_dirbad+0x4f
#3 0xffffffff8079b6a9 at ufs_lookup_ino+0x6a9
#4 0xffffffff805cd088 at vfs_cache_lookup+0xf8
#5 0xffffffff80831910 at VOP_LOOKUP_APV+0x40
#6 0xffffffff805d4724 at lookup+0x464
#7 0xffffffff805d5839 at namei+0x4e9
#8 0xffffffff805ef91b at vn_open_cred+0x3cb
#9 0xffffffff805ee999 at kern_openat+0x1f9
#10 0xffffffff807e8586 at amd64_syscall+0x546
#11 0xffffffff807d3ee7 at Xfast_syscall+0xf7
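
For reference, enabling the crash dumps was just the standard rc.conf setup. A minimal sketch (the swap device name below is an example and will differ per machine):

Code:
# /etc/rc.conf
dumpdev="AUTO"        # write a kernel dump to the swap device on panic
dumpdir="/var/crash"  # savecore(8) extracts the vmcore here at next boot

# or enable on a running system without a reboot:
dumpon /dev/ada0s1b   # example swap device; adjust to your layout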

I have been using SU+J journalling on this filesystem, which is relatively untested in the wild at the moment, and I've seen a few people on various forums and mailing lists saying it has caused them issues with disk corruption and the like. I figured it would be worth disabling it and fsck'ing the filesystem properly, so I did (the procedure I used is sketched a little further down). The fsck showed lots of errors, which were successfully fixed. Thinking this might have solved the problem, I then tried another buildworld. This time it crashed with this panic:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x801466c30
fault code              = supervisor read instruction, protection violation
instruction pointer     = 0x20:0x801466c30
stack pointer           = 0x28:0x8012ce600
frame pointer           = 0x28:0x801314210
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 95300 (cc1)
trap number             = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
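
As promised above, disabling the journal and doing the full fsck was roughly this. A sketch only; the device name is an example, and it needs to be done from single-user mode or with the filesystem otherwise quiesced:

Code:
tunefs -j disable /dev/ada0s1a   # turn off SU+J journalling (example device)
fsck -f -y /                     # force a full foreground check, ignoring the clean flag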

If need be I can paste the output of "backtrace" from these kgdb sessions, but to keep this post from getting too long I've left it out.
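
For what it's worth, getting at those backtraces is just the usual kgdb invocation against the dump, roughly this (assuming the default paths):

Code:
kgdb /boot/kernel/kernel /var/crash/vmcore.0   # kernel image + most recent dump
(kgdb) backtrace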

Because I now had two different panics, I figured there might be something wrong with the RC2 source itself. My server otherwise seemed to be working perfectly well; it had been up for several weeks and had been through many portmaster upgrades during that time. Knowing that I had been able to do a buildworld of RC2 whilst running RC1, I decided out of curiosity to try downgrading it. After changing my compiler to clang I ran another buildworld to see what would happen. The buildworld/kernel completed without issue, so I installed it and rebooted into RC1. Then comes the interesting part: I reverted back to gcc and tried another buildworld/kernel. It completed successfully with no problems at all.
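
In case anyone wants to do the same, switching the compiler was just a couple of lines in /etc/make.conf. A sketch of what I used on 9.x:

Code:
# /etc/make.conf
CC=clang
CXX=clang++
CPP=clang-cpp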

So I'm thinking there was actually a possible issue with the binaries I had installed from the RC2 source. Is it possible something was broken in a way that could cause UFS SU+J corruption and/or various kernel panics under load whilst using gcc but not clang, and which has now been fixed by me copying over all those binaries again from a fresh build? Or is it possible there is actually a larger issue with RC2 and possibly RC3? I'm a bit scared now to try updating it to 9.1-RELEASE or RC3, although I have kept a copy of the built object tree which I can easily reinstall.

I had a quick look through the svn log entries for the differences and I can't really see anything in particular which jumps out at me as a major issue, other than possibly some changes made to some timers from 10ms to 10us.

Any ideas? Would the output from backtrace help? I should probably also mention that I use ccache, although it was still crashing with ccache disabled; the gcc etc. binaries installed would have come from ccache originally though. I also would have used make -j4, although I tried without that too and it still crashed.
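
To be clear, "ccache disabled" just meant overriding the compiler back to the base one on the make command line rather than touching the ccache setup, something like:

Code:
cd /usr/src && make -j1 CC=cc CXX=c++ buildworld   # bypasses any ccache wrappers set in make.conf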
 
When I see weird things like this, I tend to think there are RAM-related issues.

For example, on my router I also used to run SU+J, and would get random segfaults while building world. Thinking this was SU+J related, I removed it, only to find I still had segfaults at random points. I switched to clang/llvm, and everything worked. Puzzled, I remembered that clang/llvm uses much less RAM than gcc, and this prompted me to suspect some faulty memory. Sure enough, after running memtest I saw that large blocks of memory around the ~1036MB mark were really bad.

An RMA later and everything is back to normal. I'm still using clang/llvm, but I have to thank gcc for being such a memory hog; without it, I probably never would have noticed that I had a bad DIMM. Even though I test all memory when I receive it, it has a habit of going bad over time.
 
Orum said:
When I see weird things like this, I tend to think there are RAM-related issues.
Either that or bad sectors on the hard drive. You can get really weird issues if they happen to be inside the swap partition x(

I don't know if that's the case here, but if this machine is overclocked that might result in weird problems too, especially if the memory timings are off.
 
I did actually think the same and tried switching the two DIMMs around so they were reseated and in different slots, but it still crashed. What gets me though is that now it's back on RC1 it's working perfectly well; I've rebuilt world using gcc twice now and it hasn't crashed. The hardware is actually pretty new, bought in August. The hard drive has smartmontools running on it, which shows no issues, and it's not overclocked either. I could try a memtest; I've just downloaded memtest86+ onto my USB key and will try it when I get a moment.

It'll be interesting to see what happens if I up it to RC3 again too.
 
OK, I just ran memtest86+ on it. After one pass it reports zero errors. Unfortunately I have 4GB of RAM and usually run the amd64 arch, so I guess it wouldn't have tested the whole amount running from a 32-bit USB boot image. At some point I should probably try removing one stick of RAM and testing it with just 2GB, then switch it for the other 2GB stick. But from what I've seen so far I think the hardware is probably OK: smartmontools says the HDD is fine, and memtest, albeit probably not testing the entire 4GB, shows no issues. And as I mentioned, the thing seems perfectly OK running the RC1 code.

Going forward I think I should test each 2GB stick individually just to make 100% sure. Then, when 9.1-RELEASE is actually out (I don't feel like trying RC3 now), I'll upgrade to that and see whether building a new world with gcc at that point causes a crash.
 
xtaz said:
Unfortunately I have 4GB of RAM and usually run the amd64 arch, so I guess it wouldn't have tested the whole amount running from a 32-bit USB boot image.

That's a wrong assumption. memtest86 can test up to 64GB of RAM utilizing a 16-core CPU, and the server version can test up to 8TB utilizing 32 cores. It works directly with the hardware chipsets, so it isn't an issue that it starts as a 32-bit app.
 
AlexJ said:
That's a wrong assumption. memtest86 can test up to 64GB of RAM utilizing a 16-core CPU, and the server version can test up to 8TB utilizing 32 cores. It works directly with the hardware chipsets, so it isn't an issue that it starts as a 32-bit app.

In that case the whole memory was tested fine then! Thanks.

I also ran an extended offline self-test of the HDD using smartmontools last night and it passed without any issues as well, so I'm going to conclude that the hardware is fine. I still think there was something odd about the binaries I happened to have installed. This is wild speculation obviously, but I'm thinking along the lines of an SU+J corruption which caused the initial under-load UFS panics, and which was resolved by disabling journalling and performing a full fsck. The second panic is then possibly down to corrupted gcc binaries or some shared library damaged by the UFS corruption?
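
For anyone following along, the self-test was the standard smartmontools one, roughly:

Code:
smartctl -t long /dev/ada0      # start the extended offline self-test
smartctl -l selftest /dev/ada0  # check the results once it has finished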

My other speculative possibility was that something was screwed up by using ccache and/or make -j4 when building the RC2 world.

Either possibility would have been resolved by building a new world via make -j1 and using clang. Although, since I'm now running the RC1 source, it could also possibly be something in the RC2 source itself.

The way forward is clearly to try going back to RC2 or RC3 and see what happens. I'll give this a go later on, and if I get the same issues I might take it up to stable instead and see what happens there. I've kept a copy of the /usr/obj from RC1 so I can easily install it again if I get further crashes after upgrading.
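
Rolling back from the saved tree would just be the normal install steps pointed at it. A sketch, assuming the saved /usr/obj is restored and the matching RC1 sources are in /usr/src:

Code:
cd /usr/src
make installkernel   # installs the previously built kernel from /usr/obj
make installworld    # installs the matching world
shutdown -r now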
 
As strange as this sounds, I think what I said in the last post might not be too far from the truth. Throughout today I tried running buildworld three times whilst on RC1 using gcc, with no issues. I then decided to actually install the RC3 code that I had compiled, to see what would happen. After doing this I ran buildworld again twice, and both times it compiled fully with no issues. I tried it with and without ccache, and with -j1 and -j4, all without problems.

So, as weird as it seems, it does appear that my issues have been resolved by disabling SU+J journalling, fscking the filesystem, compiling world/kernel using clang, and installing the result. Most odd.
 
For added information: the very first time my server crashed it was in the middle of a buildworld using ccache and make -j4 while downloading two torrent files using rtorrent, so the disk would have been under heavy access at the time. But after that, just running buildworld with -j1 and ccache disabled was enough to crash it within minutes. When I disabled SU+J and ran fsck manually, the directory entries it claimed to be fixing were all ccache cache directories.

I've seen several posts on forums and mailing lists from people claiming their filesystems had become corrupted under SU+J, and that when they ran fsck manually without using the journal it fixed several issues, meaning the filesystem clearly wasn't clean even though the system was happy to use it. Because of this, and my own experience described in this post, I'm not going to trust it and will keep it disabled until I'm convinced otherwise. I have run FreeBSD systems since version 4.1 with softupdates only and never had a single problem; as soon as I enabled SU+J I got problems like this within a month.

Considering SU+J is enabled by default in 9.1, I'll be keeping an eye on the forums and mailing lists to see what other people's experiences are.
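
If anyone wants to check what their own filesystems have enabled, tunefs can print the current flags, e.g.:

Code:
tunefs -p /dev/ada0s1a   # example device; look for the "soft update journaling" line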
 
I should make another update on this. At the time of all these posts the output from smartctl showed no issues at all with the hard drive: all parameters were at zero and extended self-tests were passing without problems. This week, however, I got an email saying I have errors on the drive. The affected parameters are:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2

And:

Code:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      2711         270556838

And syslog is chucking out this:

Code:
Nov 30 14:39:54 tao smartd[76168]: Device: /dev/ada0, 1 Currently unreadable (pending) sectors
Nov 30 14:39:54 tao smartd[76168]: Device: /dev/ada0, 1 Offline uncorrectable sectors

So it does look like I have hard disk issues after all; it's just that they only appeared a few weeks after all my problems. Annoyingly this drive is only about four months old, and I have to send it back before I get a new one on warranty. So I've bought myself a new drive (from a different manufacturer!) which I'll be installing over the weekend, followed by a reinstall of my server. Then I'll do the warranty claim and just end up with a new drive which I'll keep as a spare, I guess.

Apparently you can force the drive to reallocate the bad sector by writing over it at that LBA with dd, but if the drive has even a single error I'd rather replace it than try to paint over the cracks.
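
For anyone tempted, the idea is to zero-write that one sector so the drive's firmware remaps it. A sketch only; it destroys whatever data is at that spot, assumes 512-byte sectors with the LBA taken from the self-test log above, and should only be done with the disk out of use:

Code:
# DANGER: overwrites the failing sector reported by the self-test
dd if=/dev/zero of=/dev/ada0 bs=512 count=1 seek=270556838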

Oh well, live and learn! Now I have to think about whether I want to go with SU+J again or be conservative and just go with what I've used for years in the past.
 
Orum said:
When I see weird things like this, I tend to think there are RAM-related issues.

I also had the same issue; RAM was my problem too. If your system crashes when compiling the kernel and world, run a test on the RAM. Although some memory-testing apps might not find defective memory, so it may be better to simply replace the RAM and try to recompile.

Regarding your hard drive issue, I wonder whether it would help in your case to use ZFS instead (two drives in a ZFS mirror).
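
Something like this would give you redundancy against a single failing disk (device names here are just examples):

Code:
zpool create tank mirror ada0 ada1   # two-disk ZFS mirror pool named "tank"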
 
overmind said:
Regarding your hard drive issue, I wonder whether it would help in your case to use ZFS instead (two drives in a ZFS mirror).

I have a Zotac zbox, which is a tiny PC based on an Atom D525. It only has space on-board for a single 2.5" laptop drive, so unfortunately I have to accept that if the drive dies I have to reinstall from scratch. I do however have very good backups, so it's not too bad!
 