Hi,
We have a few boxes running FreeBSD 10.3 with RAIDZ1 pools. When a drive fails, all I/O to the pool freezes and a reboot is needed to bring the pool back up. I'm able to log in and use the system, except for zpool and zfs commands that try to access the degraded pool. The root file system is on a different pool (and controller), which is probably why I'm still able to log in.
All our systems provide iSCSI devices backed by zvols to our VMware infrastructure.
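For context, this is roughly how the zvols are created and exported through ctld; the dataset name, size and IQN below are only illustrative, not our real configuration:
Code:
# create a zvol to back a VMware datastore (name and size are examples)
zfs create -V 2T tank/vmware/ds01

# /etc/ctl.conf (excerpt, illustrative)
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 0.0.0.0
}

target iqn.2017-08.org.example:ds01 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/vmware/ds01
        }
}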
We have tried a number of things to prevent the system from freezing, all without any luck. In no particular order:
1. Upgraded the LSI Firmware to the latest version
2. Upgraded the mpr driver to the latest version from Avago
3. Upgraded the firmware on the disks to the latest version. We have disks from different vendors, all showing the same issue. Every pool runs identical disks (no mixing within a pool), but different systems use different disk models (version-check commands are sketched right after this list).
4. Upgraded to the latest release of 10.3. We started with 10.3-RELEASE, now on the latest version - same issue when a drive fails.
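For reference, this is roughly how we check the controller firmware, mpr driver and disk firmware versions on these boxes (a sketch; the sysctl node and device numbers may differ per system):
Code:
# controller and driver versions as reported by mpr(4)
sysctl dev.mpr.0.firmware_version dev.mpr.0.driver_version
dmesg | grep -i mpr

# firmware revision of an individual disk (da22 as an example)
camcontrol inquiry da22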
As I mentioned earlier, we are seeing this failure on all our dedicated ZFS servers, but the ones running pools with mirror vdevs seem to be more stable, even when a drive fails. The servers with RAIDZ1 always freeze when a drive dies.
Any help would be greatly appreciated. I have tried to provide the necessary debug information below; please let me know if anything else is needed.
This is the output from dmesg when the disk fails:
Code:
ses2: phy 0: parent 500304801eaf8a3f addr 500304801eaf8a0b
(noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001603350
mpr0: Sending reset from mprsas_send_abort for target ID 49
mpr0: mprsas_prepare_remove: Sending reset for target ID 49
da22 at mpr0 bus 0 scbus12 target 49 lun 0
da22: <ATA ST4000NM0033-9ZM SN04> s/n S1Z1S69Q detached
(da22:mpr0:0:49:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
mpr0: (da22:mpr0:0:49:0): CAM status: Command timeout
IOCStatus = 0x4b while resetting device 0x1d
(da22:mpr0: mpr0:0:Unfreezing devq for target ID 49
49:0): Error 5, Periph was invalidated
mpr0: Unfreezing devq for target ID 49
(da22:mpr0:0:49:0): Periph destroyed
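The next time a drive drops out, we can also grab the state of the stuck zpool/zfs processes and the CAM layer before rebooting; something along these lines (a sketch, run from the still-working root pool):
Code:
# kernel wait channels of the hung zpool/zfs commands
procstat -kk $(pgrep 'zpool|zfs')

# which devices CAM still sees after the failed disk detached
camcontrol devlist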
We use the LSI SAS3008 controller with SATA disks.
Code:
mpr0: <Avago Technologies (LSI) SAS3008> port 0xe000-0xe0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 on pci130
Output from zpool status:
Code:
zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 1.50T in 20h34m with 0 errors on Sat Aug 19 12:19:39 2017
config:

        NAME                          STATE     READ WRITE CKSUM
        tank                          ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            gpt/Z1Z9QYR9              ONLINE       0     0     0
            gpt/Z1Z9QXYJ              ONLINE       0     0     0
            gpt/Z1Z9QYJD              ONLINE       0     0     0
            gpt/Z1Z9QEZS              ONLINE       0     0     0
            gpt/Z1Z9R2K4              ONLINE       0     0     0
            gpt/Z1Z9PJLZ              ONLINE       0     0     0
          raidz1-1                    ONLINE       0     0     0
            gpt/Z1Z9QHHD              ONLINE       0     0     0
            gpt/Z1Z9PYTA              ONLINE       0     0     0
            diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20Z1Z9QE2Cp1  ONLINE       0     0     0
            gpt/Z1Z9PYSP              ONLINE       0     0     0
            gpt/S1Z2H26P              ONLINE       0     0     0
            diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20Z1Z9QZ6Cp1  ONLINE       0     0     0
          raidz1-2                    ONLINE       0     0     0
            gpt/Z1Z9QYQR              ONLINE       0     0     0
            gpt/Z1Z9QXYQ              ONLINE       0     0     0
            gpt/Z1Z9QG3S              ONLINE       0     0     0
            gpt/Z1Z95A68              ONLINE       0     0     0
            gpt/Z1Z9QY8X              ONLINE       0     0     0
            gpt/Z1Z9QK6D              ONLINE       0     0     0
          raidz1-3                    ONLINE       0     0     0
            gpt/Z1Z951DK              ONLINE       0     0     0
            gpt/Z1Z9QX6J              ONLINE       0     0     0
            gpt/Z1Z9QZNA              ONLINE       0     0     0
            gpt/Z1Z9QYHN              ONLINE       0     0     0
            gpt/Z1Z9QY4Z              ONLINE       0     0     0
            gpt/Z1Z9QY5T              ONLINE       0     0     0
          raidz1-4                    ONLINE       0     0     0
            gpt/Z1Z9PYTM              ONLINE       0     0     0
            gpt/Z1Z9R0ZZ              ONLINE       0     0     0
            gpt/Z1Z9PYS7              ONLINE       0     0     0
            gpt/Z1Z9PL0W              ONLINE       0     0     0
            gpt/Z1ZAX999              ONLINE       0     0     0
            gpt/Z1Z9QYRB              ONLINE       0     0     0
        logs
          mirror-5                    ONLINE       0     0     0
            gpt/BTTV5356011L100FGN    ONLINE       0     0     0
            gpt/BTTV5356010K100FGN    ONLINE       0     0     0
        cache
          diskid/DISK-S2HRNXAH200981%20%20%20%20%20%20p1  ONLINE       0     0     0
          diskid/DISK-S2HRNXAH200681%20%20%20%20%20%20p1  ONLINE       0     0     0
        spares
          diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20S1Z2YLJFp1  AVAIL

errors: No known data errors
/boot/loader.conf:
Code:
kern.geom.label.gptid.enable="0"
mpr_load="YES"
ahci_load="YES"
zfs_load="YES"