Howdy,
I'm new to FreeBSD, so apologies in advance if I'm incorrectly posting. If this is something that's better for one of the mailing lists, please let me know.
I've got a new system that I've been building, under test. I'm able to reliably panic the kernel by spinning down the hard drives and then issuing commands to the drives / file system.
The summary is: I use the spindown command, wait 10 minutes for the drives to be spun down, and then issue a sg_requests -H daX command. It varies from drive to drive, but after a small number of requests (sometimes a single one), the command will hang (and I believe the bus as well). Issuing a reboot command after this results in a panic.
After a few panics I was able to get a kernel dump. I'm hoping someone can help me out and have a look at it. I've got data in /var/crash, but it is 126M, so not suitable for posting, I assume.
Thanks in advance!
Here are more details:
Hardware:
Supermicro X8DTE-F motherboard, 12 GB RAM, 2 x Xeon X5570
Supermicro SAS-846A drive backplane
2 x Samsung SSD 850 EVO connected to the motherboard as ZFS boot
3 x Dell H200 HBA cards, crossflashed to LSI 9211-8i:
Code:
Num  Ctlr         FW Ver       NVDATA       x86-BIOS     PCI Addr
----------------------------------------------------------------------------
 0   SAS2008(B2)  20.00.04.00  14.01.00.08  07.39.00.00  00:09:00:00
 1   SAS2008(B2)  20.00.04.00  14.01.00.08  07.39.00.00  00:06:00:00
 2   SAS2008(B1)  20.00.04.00  14.01.00.08  07.39.00.00  00:05:00:00
Each card has 2 x SFF-8087 (iPass) to SFF-8087 cables that connect to the backplane.
There are 12 x Hitachi HUA72303 drives in the system. They are in alternating slots in the case (slot 0, 2, 4, etc.).
Here's the slot to daX mapping:
Code:
# glabel status
Name          Status  Components
label/slot6   N/A     da0
label/slot4   N/A     da1
label/slot2   N/A     da2
label/slot0   N/A     da3
label/slot18  N/A     da4
label/slot16  N/A     da5
label/slot20  N/A     da6
label/slot22  N/A     da7
label/slot14  N/A     da8
label/slot12  N/A     da9
label/slot10  N/A     da10
label/slot8   N/A     da11
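(In case it helps correlate the slot labels with the daX names while reading the logs below, the glabel output folds into a mapping easily; a throwaway sketch of my own, not part of the repro:)

```python
# Throwaway sketch: build a slot -> device mapping from the glabel output above.
glabel_output = """\
label/slot6  N/A  da0
label/slot4  N/A  da1
label/slot2  N/A  da2
label/slot0  N/A  da3
label/slot18 N/A  da4
label/slot16 N/A  da5
label/slot20 N/A  da6
label/slot22 N/A  da7
label/slot14 N/A  da8
label/slot12 N/A  da9
label/slot10 N/A  da10
label/slot8  N/A  da11
"""

slot_to_dev = {}
for line in glabel_output.splitlines():
    name, _status, dev = line.split()          # columns: Name, Status, Components
    slot_to_dev[name.removeprefix("label/")] = dev

print(slot_to_dev["slot0"])   # da3
```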
The 12 drives are in a zpool:
Code:
# zpool status
  pool: d1
 state: ONLINE
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        d1               ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            label/slot0  ONLINE       0     0     0
            label/slot2  ONLINE       0     0     0
            label/slot4  ONLINE       0     0     0
            label/slot6  ONLINE       0     0     0
            label/slot8  ONLINE       0     0     0
            label/slot10 ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            label/slot12 ONLINE       0     0     0
            label/slot14 ONLINE       0     0     0
            label/slot16 ONLINE       0     0     0
            label/slot18 ONLINE       0     0     0
            label/slot20 ONLINE       0     0     0
            label/slot22 ONLINE       0     0     0

errors: No known data errors
I'm running 10.2-RELEASE:
Code:
# uname -a
FreeBSD baybee 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r286666: Wed Aug 12 15:26:37 UTC 2015 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
The mps(4) driver and firmware of the cards match:
Code:
root@baybee:/var/crash # dmesg |grep fbsd
mps0: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
mps1: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
mps2: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
To reproduce the panic, I run the spindown command:
Code:
# spindown -D -d /dev/da0 -d /dev/da1 -d /dev/da2 -d /dev/da3 -d /dev/da4 -d /dev/da5 -d /dev/da6 -d /dev/da7 -d /dev/da8 -d /dev/da9 -d /dev/da10 -d /dev/da11
Then I wait until the drives spin down.
I initially noticed that I could touch the filesystem on d1 without the drives spinning back up, which I thought odd.
For example:
Code:
# cd /d1
# touch fred
# ls -l fred
# rm -r fred
I was also experimenting with the sg3_utils commands to see if I could query drives to see if they were spinning or not. This resulted in my first panic:
Code:
root@baybee:/var/log # sg_requests -H da0 <== spinning
00 70 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 00
10 00 00
Stopping device da0
Unit stopped successfully
device: da0, max_idle_time: 600, rd: 58409, wr: 250, frees: 0, other: 52, idle time: 600, state: Not spinning
root@baybee:~ # sg_requests -H da0 <== not spinning
00 70 00 00 00 00 00 00 0a 00 00 00 00 04 02 00 00
10 00 00
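For what it's worth, I decoded those sense bytes by hand. A quick sketch of my own (offsets assume the fixed-format sense layout from SPC; nothing FreeBSD-specific): the spun-down response carries ASC/ASCQ 04h/02h, which SPC lists as "Logical unit not ready, initializing command required (start unit)" — consistent with the drive being stopped, though oddly the sense key field reads 0.

```python
# Sketch: decode the fixed-format SCSI sense data printed by sg_requests -H above.

def decode_fixed_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed-format sense data."""
    if sense[0] & 0x7F not in (0x70, 0x71):       # 70h/71h = fixed format
        raise ValueError("not fixed-format sense data")
    sense_key = sense[2] & 0x0F                   # low nibble of byte 2
    asc, ascq = sense[12], sense[13]              # additional sense code / qualifier
    return sense_key, asc, ascq

# The response from the spun-down drive, retyped from the hex dump above:
spun_down = bytes.fromhex("700000000000000a00000000040200000000")
print(decode_fixed_sense(spun_down))   # (0, 4, 2)
```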
It's now in a state where sg_requests -H da0 is hanging. After issuing a reboot in another shell:
Code:
panic: I/O to pool 'd1' appears to be hung on vdev guid 10286893083817872745 at '/dev/label/slot0'.
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80984e30 at kdb_backtrace+0x60
#1 0xffffffff809489e6 at vpanic+0x126
#2 0xffffffff809488b3 at panic+0x43
#3 0xffffffff81a1bef3 at vdev_deadman+0x123
#4 0xffffffff81a1be00 at vdev_deadman+0x30
#5 0xffffffff81a1be00 at vdev_deadman+0x30
#6 0xffffffff81a109d5 at spa_deadman+0x85
#7 0xffffffff8095e44b at softclock_call_cc+0x17b
#8 0xffffffff8095e874 at softclock+0x94
#9 0xffffffff8091482b at intr_event_execute_handlers+0xab
#10 0xffffffff80914c76 at ithread_loop+0x96
#11 0xffffffff8091244a at fork_exit+0x9a
#12 0xffffffff80d30d2e at fork_trampoline+0xe
Uptime: 1d20h45m36s
At this point I power cycled to get back to a clean state and tried again. Again, a similar pattern of commands caused a panic.
I then tried issuing the sg_requests command against different devices, to see if perhaps it was a drive, controller, or cabling issue. It seems to vary which device panics the system on each try. For example, on the 2nd time through, sg_requests -H /dev/da8 was the one that panicked.