savecore found no dumps upon reboot after panic

_martin · Jan 17, 2011

Hello @,

My 8.1-STABLE amd64 server crashed today. Syslog output confirms that dump was done:

Code:

Jan 17 15:12:07 foxi kernel: panic: kmem_malloc(131072): kmem_map too small: 1997344768 total allocated
Jan 17 15:12:07 foxi kernel: cpuid = 0
Jan 17 15:12:07 foxi kernel: Uptime: 56d17h14m33s
Jan 17 15:12:07 foxi kernel: Physical memory: 6053 MB
Jan 17 15:12:07 foxi kernel: Dumping 4082 MB: 4067 4051 4035 4019 4003 3987 3971 3955 3939 3923 3907 3891 3875 3859 3843 3827 3811 3795 3779 
3763 3747 3731 3715 3699 3683 3667 3651 3635 3619 3603 3587 3571 3555 3539 3523 3507 3491 3475 3459interrupt                   total
Jan 17 15:12:07 foxi kernel: irq4: uart0                          393
Jan 17 15:12:07 foxi kernel: irq19: atapci1+                 38539647
Jan 17 15:12:07 foxi kernel: irq21: uhci1++                  15863989
Jan 17 15:12:07 foxi kernel: cpu0: timer                   9816520734
Jan 17 15:12:07 foxi kernel: irq256: em0                    219654258
Jan 17 15:12:07 foxi kernel: cpu1: timer                   9816520632
Jan 17 15:12:07 foxi kernel: Total                 19907099793

Problem is that there is no crash available, savecore didn't even log anything to dmesg. When i run [cmd=]/etc/rc.d/savecore start[/cmd] manually, I'm stuck with

Code:

No core dumps found.

I do use gmirror, but swap is not part of it.

[CMD=]gpart show[/cmd]

Code:

=>       34  312581741  ad6  GPT  (149G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-ufs  (2.0G)
    4194466   16777216    3  freebsd-swap  (8.0G)
   20971682    8388608    4  freebsd-ufs  (4.0G)
   29360290    8388608    5  freebsd-ufs  (4.0G)
   37748898   41943040    6  freebsd-ufs  (20G)
   79691938  232889837    7  freebsd-zfs  (111G)

=>       34  488397101  ad8  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-ufs  (2.0G)
    4194466   16777216    3  freebsd-swap  (8.0G)
   20971682    8388608    4  freebsd-ufs  (4.0G)
   29360290    8388608    5  freebsd-ufs  (4.0G)
   37748898   41943040    6  freebsd-ufs  (20G)
   79691938  232889837    7  freebsd-zfs  (111G)
  312581775  175815360       - free -  (84G)

[CMD=]gmirror status[/cmd]

Code:

        Name    Status  Components
mirror/groot  COMPLETE  ad6p2
                        ad8p2
 mirror/gtmp  COMPLETE  ad6p4
                        ad8p4
 mirror/gvar  COMPLETE  ad6p5
                        ad8p5
 mirror/gusr  COMPLETE  ad6p6
                        ad8p6

[CMD=]swapinfo[/cmd]

Code:

Device          1K-blocks     Used    Avail Capacity
/dev/ad6p3        8388608        0  8388608     0%
/dev/ad8p3        8388608        0  8388608     0%
Total            16777216        0 16777216     0%

Following is my configuration:
/etc/rc.conf:

Code:

dumpdev="AUTO"
crashdir="/var/crash"

[CMD=]dmesg |grep -i phys[/cmd]

Code:

Physical memory: 6053 MB

[CMD=]zfs list local/crash[/cmd]

Code:

NAME          USED  AVAIL  REFER  MOUNTPOINT
local/crash    21K  16.0G    21K  /var/crash

[CMD=]ll /dev/dumpdev[/cmd]

Code:

lrwxr-xr-x  1 root  wheel  10 Jan 17 20:19 /dev/dumpdev -> /dev/ad6p3

I did a test and set /var/crash to be on UFS (with ~3.5GB free of space) with [cmd=]reboot -d[/cmd] -> same result: no crash, savecore didn't log anything anywhere.

Any hints are deeply appreciated.
I do suspect that initial crash was due to ZFS known issues, but not having a crash did put a bug in my head - specially when I don't know why.

thanks,
martin

_martin · Jan 18, 2011

I think I might be on right track to get what happened there yesterday. There are two scenarios I've described above - one where /var/crash was on ZFS and one where /var/crash was on UFS, though only ~3.5GB free of space.

I don't know why it was not able to savecore on ZFS (early run, ZFS not yet active?), but I think I know why it didn't do a dump on UFS.

Crash summary on UFS:

Code:

Jan 17 15:12:07 foxi kernel: [color="Red"]Dumping 4082 MB: [/color]4067 4051 4035 4019 4003 3987 3971 3955 3939 3923 3907 3891 3875 3859 3843 3827 3811 3795 3779 
3763 3747 3731 3715 3699 3683 3667 3651 3635 3619 3603 3587 3571 3555 3539 3523 3507 3491 3475 3459interrupt                   total

[CMD=]bdf /var/crash[/cmd]

Code:

Filesystem       1M-blocks Used Avail Capacity  Mounted on
/dev/mirror/gvar      3962  200  [color="Red"]3445[/color]     5%    /var/crash

It seems that savecore didn't have enough free space to store the dump. Question is why it didn't log it to /var/log/messages or to dmesg? I would expect to have a warning at least.

I did a test and created /var/crash on UFS with 16GB free of space, savecore found the dump and saved it as expected.

SirDice · Jan 18, 2011

matoatlantis said:
It seems that savecore didn't have enough free space to store the dump.

Yep.

Question is why it didn't log it to /var/log/messages or to dmesg? I would expect to have a warning at least.

From savecore(8):

The savecore utility also checks the available disk space before attempting to make the copies. If there is insufficient disk space in the file system containing directory, or if the file directory/minfree exists and the number of free kilobytes (for non-superusers) in the file system after the copies were made would be less than the number in the first line of this file, the copies are not attempted.

No mentioning of any messages. It also doesn't set any return codes so the script /etc/rc.d/savecore doesn't know if the command succeeded or not.

_martin · Jan 18, 2011

No mentioning of any messages. It also doesn't set any return codes so the script /etc/rc.d/savecore doesn't know if the command succeeded or not.

I guess I have it mixed up with savecrash(1M) from HP-UX.

I know, first assumption in investigation is not to assume..

SirDice · Jan 18, 2011

I do agree though, it would be better if it did return error codes. Then the script could catch those and report on them.

I also see no reason why it didn't work on ZFS. As far as I understood it's executed somewhere at the end of the rc scripts. By that time ZFS should be available.

_martin · Jan 19, 2011

I did a lab test - I used amd64 8.1-RELEASE in VMware to simulate the crash/dump.
Virtual machine had 512MB RAM with 512MB + 8192MB swap.

I used following system settings:

[CMD=]grep zfs /boot/loader.conf[/cmd]

Code:

zfs_load="YES"

[CMD=]grep -E 'zfs|dump|crash' /etc/rc.conf[/cmd]

Code:

dumpdev="/dev/da2p1"
crashdir="/var/crash"
zfs_enable="YES"

[CMD=]zfs list cpool/crash[/cmd]

Code:

NAME          USED  AVAIL  REFER  MOUNTPOINT
cpool/crash    18K  7.81G    18K  /var/crash

[CMD=]df -m /var/crash[/cmd]

Code:

Filesystem  1M-blocks Used Avail Capacity  Mounted on
cpool/crash      8000    0  8000     0%    /var/crash

[CMD=]swapinfo[/cmd]

Code:

Device          1K-blocks     Used    Avail Capacity
/dev/da0s1b        524288        0   524288     0%
/dev/da2p1        8388572        0  8388572     0%
Total             8912860        0  8912860     0%

ZFS is indeed activated before savecore is called:

[CMD=]rcorder /etc/rc.d/* | sed = | sed 'N;s/\n/ /'[/cmd]

Code:

..
16 /etc/rc.d/zfs
..
60 /etc/rc.d/syslogd
61 /etc/rc.d/savecore
..

I used [cmd=]reboot -d[/cmd] to simulate the crash again. 1072MB was dumped followed by the successful savecore.

I've got a feeling that last crash on my physical server was not successful due to space issues (6GB RAM + 16GB swap with 16GB free of space on /var/crash ZFS FS). Unfortunately KVM was not connected to the server at the time and I had to reboot the server using remote PDU management - all console output was lost.

Anyway, logging to syslog/dmesg during [man=]savecore[/man] would be really nice. Even if I add extra space to /var/crash afterwards, savecore(8) finds no available dumps and stops (I would expect to have a dump still on dumpdev if the swap is not heavily used).

yayix · Mar 4, 2011

rc.conf(5) did not list a crashdir variable. Use dumpdir instead.

_martin · Mar 7, 2011

@yayix: thanks! This is indeed strange why I used crashdir instead of dumpdir. I had to make a typo during tests, because I used to have crash in /var/adm/crash for years.

As I'm using the default /var/crash now it didn't matter anyway.

savecore found no dumps upon reboot after panic

_martin

_martin

SirDice

Administrator

_martin

SirDice

Administrator

_martin

yayix

_martin