Solved Panic: general protection fault, smartd, new drives

I have 2 ZFS pools each with 6 drives currently, plus a boot drive. I'm adding 6 more drives, which will be added as a new vdev to an existing pool. When I plug in the new drives and reboot, the system panics when smartd starts during the boot process.

I took a video of the boot process to actually see the panic which is only on-screen for a fraction of a second. I rebooted multiple times and it always crashes.

This is my attempt to type out the screen capture of the panic (which is also attached):

Code:
 Processor eflags        = interrupt enabled, resume, IOPL = 0
 Current process        = 738 (smartd)
 Trap number             = 9
 Panic: general protection fault
 cpuid = 3
 KDB: stack backtrace:
 #0 0xffffffff8098e3e0 at kdb_backtrace+0x60
 #1 0xffffffff809510b6 at vpanic+0x126
 #2 0xffffffff80950f83 at panic+0x43
 #3 0xffffffff80d55f8b at trap_fatal+0x36b
 #4 0xffffffff80d55c0d at trap+0x77d
 #5 0xffffffff80d3b8d2 at calltrap+0x8
 #6 0xffffffff802ee93d at xptedtbusfunc+0x24d
 #7 0xffffffff802ed509 at xptbustraverse+0xa9
 #8 0xffffffff802e7b9b at xpt_action_default+0x
 #9 0xffffffff802f075c at xptdoioctl+0x86c
 #10 ffffffff802efeb2 at xptioctl+0x22
 #11 ffffffff80835dc9 at devfs_ioctl_f+0x13
 #12 ffffffff809a8c85 at kern_ioctl+0x255
 #13 ffffffff809a8980 at sys_ioctl+0x140
 #14 ffffffff80d5695f at amd64_syscall+0x40f
 #15 ffffffff80d3bbbb at Xfast_syscall+0xfb
 Uptime: 2m34s

Some other information about the system:
Code:
# uname -a
FreeBSD myserver 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24 18:49:24 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
# freebsd-version -uk
10.3-RELEASE-p11
10.3-RELEASE-p15

I'm looking for any tips on how to debug/fix. If I disconnect the new drives, the system boots fine.
 

Attachments

  • IMG_7643.jpg
I disabled smartd on boot, and the server now boots. However, simply running camcontrol devlist leads to a panic. Here's the panic that was saved:

Code:
Myserver dumped core - see /var/crash/vmcore

Thu Dec 29 10:19:51 CET 2016

FreeBSD Myserver 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24 18:49:24 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

panic: general protection fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer    = 0x20:0xffffffff80a0c440
stack pointer           = 0x28:0xfffffe085c29f180
frame pointer           = 0x28:0xfffffe085c29f190
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 8773 (camcontrol)
trap number        = 9
panic: general protection fault
cpuid = 5
KDB: stack backtrace:
#0 0xffffffff8098e3e0 at kdb_backtrace+0x60
#1 0xffffffff809510b6 at vpanic+0x126
#2 0xffffffff80950f83 at panic+0x43
#3 0xffffffff80d55f8b at trap_fatal+0x36b
#4 0xffffffff80d55c0d at trap+0x77d
#5 0xffffffff80d3b8d2 at calltrap+0x8
#6 0xffffffff802ee93d at xptedtbusfunc+0x24d
#7 0xffffffff802ed509 at xptbustraverse+0xa9
#8 0xffffffff802e7b9b at xpt_action_default+0x36b
#9 0xffffffff802f075c at xptdoioctl+0x86c
#10 0xffffffff802efeb2 at xptioctl+0x22
#11 0xffffffff80835dc9 at devfs_ioctl_f+0x139
#12 0xffffffff809a8c85 at kern_ioctl+0x255
#13 0xffffffff809a8980 at sys_ioctl+0x140
#14 0xffffffff80d5695f at amd64_syscall+0x40f
#15 0xffffffff80d3bbbb at Xfast_syscall+0xfb
Uptime: 4m34s
Dumping 1330 out of 32640 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/if_tap.ko.symbols...done.
Loaded symbols for /boot/kernel/if_tap.ko.symbols
Reading symbols from /boot/kernel/if_bridge.ko.symbols...done.
Loaded symbols for /boot/kernel/if_bridge.ko.symbols
Reading symbols from /boot/kernel/bridgestp.ko.symbols...done.
Loaded symbols for /boot/kernel/bridgestp.ko.symbols
Reading symbols from /boot/kernel/vmm.ko.symbols...done.
Loaded symbols for /boot/kernel/vmm.ko.symbols
Reading symbols from /boot/kernel/nmdm.ko.symbols...done.
Loaded symbols for /boot/kernel/nmdm.ko.symbols
Reading symbols from /boot/kernel/zfs.ko.symbols...done.
Loaded symbols for /boot/kernel/zfs.ko.symbols
Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
Loaded symbols for /boot/kernel/opensolaris.ko.symbols
Reading symbols from /boot/kernel/uhid.ko.symbols...done.
Loaded symbols for /boot/kernel/uhid.ko.symbols
Reading symbols from /boot/kernel/nullfs.ko.symbols...done.
Loaded symbols for /boot/kernel/nullfs.ko.symbols
Reading symbols from /boot/kernel/fdescfs.ko.symbols...done.
Loaded symbols for /boot/kernel/fdescfs.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219    pcpu.h: No such file or directory.
    in pcpu.h
(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80950d12 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff809510f5 in vpanic (fmt=<value optimized out>, 
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80950f83 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff80d55f8b in trap_fatal (frame=<value optimized out>, 
    eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:858
#5  0xffffffff80d55c0d in trap (frame=<value optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:203
#6  0xffffffff80d3b8d2 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:236
#7  0xffffffff80a0c440 in strncpy (dst=0xfffffe083002d5b0 "", 
    src=<value optimized out>, n=<value optimized out>)
    at /usr/src/sys/libkern/strncpy.c:44
#8  0xffffffff802ee93d in xptedtbusfunc (bus=0xfffff80008716c80, 
    arg=0xfffff800087ad000) at /usr/src/sys/cam/cam_xpt.c:1588
#9  0xffffffff802ed509 in xptbustraverse (start_bus=<value optimized out>, 
    tr_func=0xffffffff802ee6f0 <xptedtbusfunc>, arg=0xfffff800087ad000)
    at /usr/src/sys/cam/cam_xpt.c:2085
#10 0xffffffff802e7b9b in xpt_action_default (start_ccb=0xfffff800087ad000)
    at /usr/src/sys/cam/cam_xpt.c:1882
#11 0xffffffff802f075c in xptdoioctl (dev=<value optimized out>, 
    cmd=<value optimized out>, addr=0xfffff800087ad000 "", flag=805486592, 
    td=0xfffffe083002c000) at /usr/src/sys/cam/cam_xpt.c:2460
#12 0xffffffff802efeb2 in xptioctl (dev=0xfffff80015f96800, cmd=3302496258, 
    addr=0xfffff800087ad000 "", flag=3, td=0xfffff80598f0a960)
    at /usr/src/sys/cam/cam_xpt.c:393
#13 0xffffffff80835dc9 in devfs_ioctl_f (fp=0xfffff80017353a50, 
    com=3302496258, data=0xfffff800087ad000, cred=<value optimized out>, 
    td=0xfffff80598f0a960) at /usr/src/sys/fs/devfs/devfs_vnops.c:786
#14 0xffffffff809a8c85 in kern_ioctl (td=0xfffff80598f0a960, 
    fd=<value optimized out>, com=562949953421312) at file.h:321
#15 0xffffffff809a8980 in sys_ioctl (td=0xfffff80598f0a960, 
    uap=0xfffffe085c29fa40) at /usr/src/sys/kern/sys_generic.c:718
#16 0xffffffff80d5695f in amd64_syscall (td=0xfffff80598f0a960, traced=0)
    at subr_syscall.c:141
#17 0xffffffff80d3bbbb in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#18 0x0000000800fcbf7a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
 
I'm not sure if it's the cause but I'd check various hardware for errors. Panics typically occur when there's a real issue. I would start by checking if your memory is still good.
 
Thanks. I've run memtest86+ for a couple of hours now without it detecting any errors. Are there other specific tests you'd recommend I should run?
 
Judging by the xpt* calls, it looks like the disk controller is involved, and the panic seems to be triggered by any tool that tries to access it. What disk controller does the system have? And which driver is used?
 
These new drives are being connected through an inexpensive disk controller using the Marvell 9705 Chipset (sorry, not in English, but here's where I bought it: https://www.amazon.it/gp/product/B0167N4QCQ ).

The controller has 8 ports. Moving cables around, I find that the panic only happens if I use 4 specific ports on one half of the controller; it doesn't happen if I use the other 4 ports on the other half.

I have 2 similar cards installed, one of which only has 4 ports and is working fine. It looks like both are using the ahci driver:

Code:
ahci0@pci0:2:0:0:       class=0x010601 card=0x92351b4b chip=0x92351b4b rev=0x11 hdr=0x00
    vendor     = 'Marvell Technology Group Ltd.'
    device     = '88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller'
    class      = mass storage
    subclass   = SATA
ahci1@pci0:5:0:0:       class=0x010601 card=0x92151b4b chip=0x92151b4b rev=0x11 hdr=0x00
    vendor     = 'Marvell Technology Group Ltd.'
    class      = mass storage
    subclass   = SATA

During boot, it seems the system is only recognizing 4 ports for each of these cards, even the card that actually has 8 ports:

Code:
ahci0: <Marvell 88SE9235 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0xf7e10000-0xf7e107ff irq 17 at device 0.0 on pci2
ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahci1: <Marvell 88SE9215 AHCI SATA controller> port 0xc050-0xc057,0xc040-0xc043,0xc030-0xc037,0xc020-0xc023,0xc000-0xc01f mem 0xf7c10000-0xf7c107ff irq 16 at device 0.0 on pci5
ahci1: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich4: <AHCI channel> at channel 0 on ahci1
ahcich5: <AHCI channel> at channel 1 on ahci1
ahcich6: <AHCI channel> at channel 2 on ahci1
ahcich7: <AHCI channel> at channel 3 on ahci1

Does this suggest the card is bad? Or perhaps a driver issue? Is there more debugging I can do to perhaps get the other 4 ports properly working?
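
For anyone who wants to cross-check boot messages like these, here is a minimal Python sketch (not from the original posts) that compares the port count each controller announces with the number of channels that attach. The function name `summarize_ahci` and the regular expressions are assumptions based only on the lines quoted above; other ahci(4) message variants may need different patterns.

```python
import re

# Relevant lines copied from the boot messages quoted above.
DMESG = """\
ahci0: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahci1: AHCI v1.00 with 4 6Gbps ports, Port Multiplier supported with FBS
ahcich4: <AHCI channel> at channel 0 on ahci1
ahcich5: <AHCI channel> at channel 1 on ahci1
ahcich6: <AHCI channel> at channel 2 on ahci1
ahcich7: <AHCI channel> at channel 3 on ahci1
"""

# Assumed layout of the capability line and the per-channel attach line.
CAPS_RE = re.compile(r"^(ahci\d+): AHCI v\S+ with (\d+) \S+ ports, Port Multiplier (.*)$")
CHAN_RE = re.compile(r"^ahcich\d+: <AHCI channel> at channel \d+ on (ahci\d+)$")


def summarize_ahci(text):
    """Return announced ports, attached channels and PM support per controller."""
    summary = {}
    for line in text.splitlines():
        caps = CAPS_RE.match(line)
        if caps:
            name, ports, pm = caps.groups()
            summary[name] = {
                "ports": int(ports),
                "channels": 0,
                "port_multiplier": pm.startswith("supported"),
            }
            continue
        chan = CHAN_RE.match(line)
        if chan and chan.group(1) in summary:
            summary[chan.group(1)]["channels"] += 1
    return summary


for name, info in summarize_ahci(DMESG).items():
    print(name, info)
```

On the messages above this reports 4 announced ports and 4 attached channels for both controllers, so the driver is attaching everything the chip advertises; any additional physical connectors would have to be reached some other way than as separate AHCI ports.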
 
Let's see if mentioning mav@ sends him an alert to this thread. He should either be able to fix it with a quick patch or at least know what additional information is needed.
 
A panic is never good, but the card is probably designed as a Marvell 9235/9215 host controller with 3 SATA ports wired directly to disks and the 4th port chained to a 9705/9715 port multiplier. This port multiplier is the culprit. The manufacturer makes excuses for a 30-second boot delay on the product website: http://www.sybausa.com/index.php?route=product/product&product_id=160. Seemingly it works if only 1 disk is attached. Think again about whether you really want 5 disks competing on a single SATA channel. Even if you get this to work, expect very uneven performance.
 
I upgraded my server to 11.0-RELEASE-p2 and then plugged the drive back into the 8 port card. It worked this time without any problems, and through multiple reboots.

As suggested above, one of the ports clearly is a multiplier: the final drive, which I plugged into the 5th port, shows up as a second target on the same scbus:

Code:
<ST2000LM007-1R8174 SBK2>          at scbus1 target 0 lun 0 (ada1,pass1)
<ST2000LM007-1R8174 SBK2>          at scbus2 target 0 lun 0 (ada2,pass2)
<ST2000LM007-1R8174 SBK2>          at scbus3 target 0 lun 0 (ada3,pass3)
<ST2000LM007-1R8174 SBK2>          at scbus3 target 1 lun 0 (ada4,pass4)
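
As a rough illustration (not part of the original post), a listing in this format can be scanned for buses that carry more than one target, which is how a port multiplier showed up here. This is a minimal Python sketch; the function name `shared_buses` and the regular expression are assumptions based only on the four lines above.

```python
import re
from collections import defaultdict

# Lines copied from the device listing above.
DEVLIST = """\
<ST2000LM007-1R8174 SBK2>          at scbus1 target 0 lun 0 (ada1,pass1)
<ST2000LM007-1R8174 SBK2>          at scbus2 target 0 lun 0 (ada2,pass2)
<ST2000LM007-1R8174 SBK2>          at scbus3 target 0 lun 0 (ada3,pass3)
<ST2000LM007-1R8174 SBK2>          at scbus3 target 1 lun 0 (ada4,pass4)
"""

# Assumed line layout: <model> at scbusN target T lun L (periph,periph)
LINE_RE = re.compile(
    r"<[^>]*>\s+at scbus(?P<bus>\d+) target \d+ lun \d+ \((?P<periphs>[^)]*)\)"
)


def shared_buses(text):
    """Map each scbus number with more than one device to its device names."""
    by_bus = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        periphs = m.group("periphs").split(",")
        # Prefer the disk name (e.g. ada3) over the passthrough device (passN).
        name = next((p for p in periphs if not p.startswith("pass")), periphs[0])
        by_bus[int(m.group("bus"))].append(name)
    return {bus: devs for bus, devs in by_bus.items() if len(devs) > 1}


print(shared_buses(DEVLIST))
```

For the listing above this prints `{3: ['ada3', 'ada4']}`, i.e. ada3 and ada4 share scbus3.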
 
Perhaps this was indeed a hardware fault. A couple of days later, this was the result:

Code:
Jan  3 21:09:00 albert devd: Processing event '!system=CAM subsystem=periph type=timeout device=ada9 serial="...." cam_status="0x4b" timeout=30000 ACB="25 00 60 d0 af 40 9d 00 00 00 08 00" '
Jan  3 21:09:31 albert kernel: (ada9:ata2:0:0:0): READ_DMA48. ACB: 25 00 60 d0 af 40 9d 00 00 00 08 00
Jan  3 21:09:31 albert kernel: ada9 at ata2 bus 0 scbus8 target 0 lun 0
Jan  3 21:09:31 albert kernel: ada9: <ST2000LM007-1R8174 SBK2> s/n .... detached
Jan  3 21:09:31 albert kernel: ada10 at ata2 bus 0 scbus8 target 1 lun 0
Jan  3 21:09:31 albert kernel: ada10: <ST2000LM003 HN-M201RAD 2BE10001> s/n .... detached
Jan  3 21:09:31 albert devd: Processing event '!system=DEVFS subsystem=CDEV type=DESTROY cdev=ada9'
Jan  3 21:09:31 albert devd: Processing event '!system=DEVFS subsystem=CDEV type=DESTROY cdev=ada10'
Jan  3 21:09:31 albert devd: Processing event '!system=GEOM subsystem=DEV type=DESTROY cdev=ada9'
Jan  3 21:09:31 albert devd: Processing event '!system=GEOM subsystem=DEV type=DESTROY cdev=ada10'
Jan  3 21:09:31 albert kernel: (ada10:ata2:0:1:0): Periph destroyed
Jan  3 21:09:31 albert kernel: (ada9:ata2:0:0:0): Periph destroyed
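
To make the pattern in this log easier to see, here is a small Python sketch (an illustration only, not from the original post) that groups the detached disks by the scbus they were attached to. The function name `detached_by_bus` and the regular expressions are assumptions derived from the kernel lines quoted above.

```python
import re
from collections import defaultdict

# Kernel lines copied from the log above (serial numbers were already elided).
LOG = """\
Jan  3 21:09:31 albert kernel: ada9 at ata2 bus 0 scbus8 target 0 lun 0
Jan  3 21:09:31 albert kernel: ada9: <ST2000LM007-1R8174 SBK2> s/n .... detached
Jan  3 21:09:31 albert kernel: ada10 at ata2 bus 0 scbus8 target 1 lun 0
Jan  3 21:09:31 albert kernel: ada10: <ST2000LM003 HN-M201RAD 2BE10001> s/n .... detached
"""

# Assumed layouts of the "attached at" line and the "detached" line.
AT_RE = re.compile(r"kernel: (\w+) at \S+ bus \d+ scbus(\d+) target \d+ lun \d+")
DETACH_RE = re.compile(r"kernel: (\w+): <[^>]*> .* detached$")


def detached_by_bus(text):
    """Group detached device names by the scbus reported for them."""
    bus_of = {}
    grouped = defaultdict(list)
    for line in text.splitlines():
        at = AT_RE.search(line)
        if at:
            bus_of[at.group(1)] = int(at.group(2))
            continue
        det = DETACH_RE.search(line)
        if det and det.group(1) in bus_of:
            grouped[bus_of[det.group(1)]].append(det.group(1))
    return dict(grouped)


print(detached_by_bus(LOG))
```

For the log above this prints `{8: ['ada9', 'ada10']}`: both disks that dropped out sat on scbus8, as targets 0 and 1.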
 