Kernel Panic at 3:00 am

Greetings. I have a problem that has already been discussed here, except that the solution there was to replace the motherboard, RAM, and processor, which I did, and it did not help.

I am getting this every night:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x326d78
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8081fb76
stack pointer           = 0x28:0xffffff80c52602e0
frame pointer           = 0x28:0xffffff80c5260310
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 11460 (find)
trap number             = 12
panic: page fault
cpuid = 0
It can happen any time between 3 and 4 am. The trap is not always 12; it can be 9 or other numbers, but the current process is always 'find'.

I have a

Code:
CPU: AMD Athlon(tm) 64 Processor 3000+ (1809.31-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x20ff0  Family = f  Model = 2f  Stepping = 0
real memory  = 4294967296 (4096 MB)
avail memory = 2824007680 (2693 MB)
(it's 4 DDR-1 1GB memory modules)

This ASUS motherboard has 8 SATA ports, they are all used.

Code:
ada0: <ST31000340NS SN04> ATA-6 SATA 1.x device
ada1: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada2: <WDC WD800AAJS-00PSA0 05.06H05> ATA-7 SATA 2.x device
ada3: <WDC WD800AAJS-00PSA0 05.06H05> ATA-7 SATA 2.x device
ada4: <ST3500320NS SN05> ATA-8 SATA 1.x device
ada5: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada6: <ST3500320NS SN04> ATA-6 SATA 1.x device
ada7: <ST3500630NS 3.AEK> ATA-7 SATA 1.x device

Code:
samba# gmirror status 
          Name    Status  Components
mirror/system0  COMPLETE  ada2 (ACTIVE)
                          ada3 (ACTIVE)
samba# zpool status
  pool: data
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors

It all worked perfectly for years on the version-8 FreeBSDs (from 8.0 to 8.2) until I decided to upgrade. I installed a Socket 775 ASUS motherboard and a Core 2 processor, plus 4 GB of DDR3 RAM (2 modules, 2 GB each). And that's where it started: it worked normally during the day and began crashing at night. First I updated to 9.0-RELEASE; the problem remained. Then I downgraded to what I had before (motherboard, CPU, memory), but the problem remained. I don't get it anymore. Not believing that the disks could cause it, I still checked SMART on all of them; all are fine. Furthermore, this machine receives a dump/dd over SSH from another machine, writing a file which is more than 80 GB by now; it takes 3 hours but works perfectly fine. I also tried moving this file from my terabyte disk to the zpool, causing heavy writes on all the other HDDs, and still nothing: everything works fine under heavy load until night, when it comes to 'find'. I don't get it; what should I replace now?
 
This may not be much help, but I remember an occasion where I had a lost+found directory somewhere after a severe crash, and whenever I tried doing anything with it, even entering it, let alone moving or copying it, the system rebooted immediately. I don't remember if it caused a panic before the reboot, but it was one touchy directory... And find is typically one of those utilities that will touch every file and directory on the system, so you may have one of those. Maybe try running a manual find from the root of each filesystem with a print statement, to get a clue about exactly where and when it causes a panic? At least you'll know where to start looking.
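For what it's worth, here is a sketch of that manual find. The mount points in the list are placeholders for illustration; substitute the ones from your own mount output:

```shell
#!/bin/sh
# Run find separately on each mounted filesystem, staying within one
# filesystem (-x) and logging every path visited; if the box panics,
# the last line in the log points at the offending directory.
for fs in / /usr /var /data; do        # example mount points
    log="/var/tmp/find$(echo "$fs" | tr / _).log"
    echo "=== scanning $fs (log: $log) ==="
    find -x "$fs" -print > "$log" 2>&1
done
```

Writing each filesystem to its own log file means the last log written (and its last line) survive the reboot.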
 
periodic(8) starts its daily run at 3:01 every night.

Perhaps you can narrow down the problem by running the scripts in /etc/periodic/daily/ one by one.

If you don't find the problem there, check also your crontabs for other jobs that may run around the same time.
 
There have been other threads with the same problem with ZFS, and some type of workaround... which I can't recall.
 
OK, thanks everyone for the answers; I checked everything you suggested:
  • I have no lost+found folders,
  • my only ZFS dataset 'data' has snapdir property set to 'hidden',
  • all scripts in /etc/periodic/daily/ ran normally.

And today I also got a kernel panic with the 'smbd' process as the "guilty" one. I have also dd'ed all my disks from /dev/adaX to /dev/null successfully, with no errors.

I'll keep searching.
 
No, wait! All scripts from /etc/periodic/daily/ exited on their own, with zero or some other status, except for 450.status-security: it took too long, so I Ctrl-C'ed it after a couple of minutes.

But now that I've given it some time, I got a kernel panic. I'll try to reproduce it when the server comes back up.
 
450.status-security doesn't do anything directly.
It will just invoke all the scripts in /etc/periodic/security, so you should try running those.
 
That script only does a simple find(1) operation, nothing out of the ordinary. I would boot the machine in single-user mode and run fsck(8) on the UFS file systems, followed by a zpool scrub on the ZFS pool(s).
 
I have an issue where ZFS consumes all available memory as a result of that same periodic 100.chksetuid script running. It normally grinds the machine to a halt but I think I've also had a panic.

Could you set up a log of the zfs sysctls like:
Code:
export ZFSLOG=/tmp/vfs_zfs.log
while true; do
    date >> $ZFSLOG
    sysctl vfs.zfs >> $ZFSLOG
    sleep 30
done
and see what the vfs.zfs.arc_meta_used is before it crashes?
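Once that loop has run overnight, something like this awk one-liner (a sketch, assuming the log format the loop produces: a date line followed by a block of sysctl output) pulls out each arc_meta_used sample with the timestamp of its sampling round, so you can see whether it climbs steadily toward the crash:

```shell
# Print each arc_meta_used sample, prefixed with the timestamp of the
# sampling round it belongs to (date lines start with a weekday name).
awk '/^[A-Z][a-z][a-z] /{ ts = $0 }
     /vfs\.zfs\.arc_meta_used/{ print ts ": " $2 }' /tmp/vfs_zfs.log
```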
 
The find processes run by a couple of the periodic(8) scripts will run a ZFS system out of RAM. I've had to disable the following on all my large ZFS systems (from /etc/periodic.conf):
Code:
daily_status_security_chksetuid_enable="NO"

Search the forums for "zfs periodic panic" or even just "zfs periodic" for more info.
 
Yes, zpool scrub seems to be causing a kernel panic, and now it falls into it constantly. I did some ARC tuning a couple of years ago, so it should not consume more than 1.5 GB. I used to have panics connected with that, but the kernel panic message said so clearly, and now it does not. Maybe there really is just something wrong with my pool? Should I try fixing it under Solaris, maybe? As I've already said, I had a panic yesterday that was caused by smbd, not find, and it happened during the day. I'll still try that logging idea anyway.
 
OK, so it seems I have completely lost all my 1.5 TB of valuable data. As I've already said, after the zpool scrub command the system panicked immediately, and every reboot after that ended in a panic even before the mounts. So I decided to check what Solaris 11 would say about it, and guess what: it panics just the same way. Furthermore, I have a raidz2 (RAID6) pool with five disks in it, so in theory I can remove any two and still be able to access the pool. But no: even with one disk removed, both FreeBSD and Solaris say there are not enough replicas for the pool to continue functioning, while showing that four of five devices are online and that it is a raidz2 array. Is there anything else left to try?
 
Ask on the freebsd-fs mailing list. Hopefully it's not too late for you, but RAID is never a substitute for proper backups; if you manage to recover your data, the first thing you should do is back up everything valuable.
 
Shot in the dark here -- have you tested your HDDs to see if one is bad? On several occasions I've had a bad hard drive cause kernel panics at specific times or when certain programs are run.
 
I'm not sure whether this is related, but my FreeBSD 10 OS is panic dumping at 3:00 am. It's happened once for sure, but maybe even twice. I will have to watch for this to be sure.
Code:
Jul 23 03:03:56 BSD001-NAS savecore: reboot after panic: page fault
Jul 23 03:03:56 BSD001-NAS savecore: writing core to /var/crash/vmcore.0
I'll have to log in locally as root to view the crash files.
 
The same advice should apply. The usual suspects would be periodic scripts that are heavy on disk I/O, if there are underlying disk problems. Start by running them one by one until you find the culprit: cd /etc/periodic/security; env PERIODIC=security ./100.chksetuid, and so forth. There is a bug in 10.0, so you must specify the PERIODIC variable for all the security scripts.
 
Okay I will think that through.

But I seem to be missing something. I thought this issue was identified back in release 9, and thus I had assumed it would have been patched/fixed in 10. Is there some sort of patch file I am supposed to download for release 10, or is this issue still unresolved? Given that some of the busiest servers (and thus heaviest disk users) in the world are FreeBSD based, I find it hard to believe that this issue would still be outstanding.

I am not running anything on this box other than Bittorrent SYNC, and its only used for syncing family pictures, and some other files. Not a busy box at all; typically CPU usage is 10% or less.
 
The point is that on a box like yours that is not busy, the burst of disk I/O done during periodic triggers a panic because of some latent hardware issue. There is most likely nothing to fix in the OS. On the busiest servers in the world, as your example mentions, hardware failures likely present themselves much sooner.
 
junovitch said:
..... because of some latent hardware issue. There is most likely nothing to fix in the OS.

So you mean a hardware malfunction issue, or are you thinking hardware driver (software) issue? I'll have to see if I can find some sort of 'disk check' command. Waiting for Christmas for some books. - :)

I intend on building a new machine software identical to the one I have now, but the hardware will be all different.

Thanks again.
 
PacketMan said:
So you mean a hardware malfunction issue, or are you thinking hardware driver (software) issue? I'll have to see if I can find some sort of 'disk check' command. Waiting for Christmas for some books. - :)

I intend on building a new machine software identical to the one I have now, but the hardware will be all different.

Thanks again.

Search for a memtest86 live CD to see if it's a memory issue. Try running the periodic scripts individually to see if one triggers it consistently. If it is a script that is heavy on disk I/O, take a look at sysutils/smartmontools to query the drive and run a full check on it. Check for loose SATA cables or other cables in the actual machine. It could be any number of things, but those would probably be the usual culprits.
 
Sorry, I haven't had time to read the whole thread, but I had a server that rebooted quite often at exactly 3:00 am. Basically, it was a daily periodic script that checked mount points. On this particular server I had an sshfs remote filesystem mounted. The security script, run at 3:00 am daily, would cause a panic (I guess).

Thanks to my /etc/periodic.conf (below), it no longer reboots at 3:00 am.

Code:
$ cat /etc/periodic.conf
# 200.chkmounts
security_status_chkmounts_enable="NO"

# 310.locate
weekly_locate_enable="NO"                              # Update locate weekly
 