ZFS zpool freezing when drive fails

Hi,

We have a few boxes running FreeBSD 10.3 with RAIDZ1 pools. When a drive fails, all I/O to the pool freezes and a reboot is needed to bring the pool back up. I can still log in and use the system, except for zpool and zfs commands that try to access the degraded pool. The root file system lives on a different pool (and controller), which is probably why I can still log in.

All our systems provide iSCSI devices backed by zvols to our VMware infrastructure.
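
For context, the exports look roughly like this - an illustrative sketch only; the portal settings, target name and zvol path below are made up, not our real config:
Code:
# /etc/ctl.conf (illustrative example only)
portal-group pg0 {
    discovery-auth-group no-authentication
    listen 0.0.0.0
}

target iqn.2017-08.org.example:vmware0 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        # backing zvol created with e.g. "zfs create -V 2T tank/vmware-lun0"
        path /dev/zvol/tank/vmware-lun0
        blocksize 512
    }
}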
We have tried a number of different things to prevent the system from freezing, all without any luck. So far we have tried the following, in no particular order:
1. Upgraded the LSI firmware to the latest version
2. Upgraded the mpr driver to the latest version from Avago (a quick way to verify the running versions is sketched after this list)
3. Upgraded the firmware on the disks to the latest version. We have disks from different vendors, all with the same issue. All pools run identical disks, so no mixing, but we have different disks on different systems.
4. Upgraded to the latest release of 10.3. We started with 10.3-RELEASE and are now on the latest patch level - same issue when a drive fails.
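
For reference, this is roughly how we double-check that the new firmware and driver actually took effect. I'm assuming the controller is unit 0 and the sysctl names documented in mpr(4) - adjust for your system:
Code:
# firmware/driver versions as reported by the loaded mpr(4) driver
sysctl dev.mpr.0.firmware_version dev.mpr.0.driver_version

# the boot messages also print them
dmesg | grep -i 'mpr0.*firmware'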

As I mentioned earlier, we are seeing this failure on all our dedicated ZFS servers, but the ones running pools with mirror vdevs seem to be more stable, even when a drive fails. The servers with RAIDZ1 always freeze when a drive dies.

Any help would be greatly appreciated. I have tried to provide the necessary debug information below - please let me know if anything else is needed.


This is the output from dmesg when the disk fails:
Code:
ses2:  phy 0: parent 500304801eaf8a3f addr 500304801eaf8a0b
    (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001603350
mpr0: Sending reset from mprsas_send_abort for target ID 49
mpr0: mprsas_prepare_remove: Sending reset for target ID 49
da22 at mpr0 bus 0 scbus12 target 49 lun 0
da22: <ATA ST4000NM0033-9ZM SN04> s/n             S1Z1S69Q detached
(da22:mpr0:0:49:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
mpr0: (da22:mpr0:0:49:0): CAM status: Command timeout
IOCStatus = 0x4b while resetting device 0x1d
(da22:mpr0: mpr0:0:Unfreezing devq for target ID 49
49:0): Error 5, Periph was invalidated
mpr0: Unfreezing devq for target ID 49
(da22:mpr0:0:49:0): Periph destroyed

We use the LSI SAS3008 controller with SATA disks.
Code:
mpr0: <Avago Technologies (LSI) SAS3008> port 0xe000-0xe0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 on pci130

Output from zpool status:
Code:
zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 1.50T in 20h34m with 0 errors on Sat Aug 19 12:19:39 2017
config:

    NAME                                                            STATE     READ WRITE CKSUM
    tank                                                            ONLINE       0     0     0
     raidz1-0                                                      ONLINE       0     0     0
       gpt/Z1Z9QYR9                                                ONLINE       0     0     0
       gpt/Z1Z9QXYJ                                                ONLINE       0     0     0
       gpt/Z1Z9QYJD                                                ONLINE       0     0     0
       gpt/Z1Z9QEZS                                                ONLINE       0     0     0
       gpt/Z1Z9R2K4                                                ONLINE       0     0     0
       gpt/Z1Z9PJLZ                                                ONLINE       0     0     0
     raidz1-1                                                      ONLINE       0     0     0
       gpt/Z1Z9QHHD                                                ONLINE       0     0     0
       gpt/Z1Z9PYTA                                                ONLINE       0     0     0
       diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20Z1Z9QE2Cp1  ONLINE       0     0     0
       gpt/Z1Z9PYSP                                                ONLINE       0     0     0
       gpt/S1Z2H26P                                                ONLINE       0     0     0
       diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20Z1Z9QZ6Cp1  ONLINE       0     0     0
     raidz1-2                                                      ONLINE       0     0     0
       gpt/Z1Z9QYQR                                                ONLINE       0     0     0
       gpt/Z1Z9QXYQ                                                ONLINE       0     0     0
       gpt/Z1Z9QG3S                                                ONLINE       0     0     0
       gpt/Z1Z95A68                                                ONLINE       0     0     0
       gpt/Z1Z9QY8X                                                ONLINE       0     0     0
       gpt/Z1Z9QK6D                                                ONLINE       0     0     0
     raidz1-3                                                      ONLINE       0     0     0
       gpt/Z1Z951DK                                                ONLINE       0     0     0
       gpt/Z1Z9QX6J                                                ONLINE       0     0     0
       gpt/Z1Z9QZNA                                                ONLINE       0     0     0
       gpt/Z1Z9QYHN                                                ONLINE       0     0     0
       gpt/Z1Z9QY4Z                                                ONLINE       0     0     0
       gpt/Z1Z9QY5T                                                ONLINE       0     0     0
     raidz1-4                                                      ONLINE       0     0     0
       gpt/Z1Z9PYTM                                                ONLINE       0     0     0
       gpt/Z1Z9R0ZZ                                                ONLINE       0     0     0
       gpt/Z1Z9PYS7                                                ONLINE       0     0     0
       gpt/Z1Z9PL0W                                                ONLINE       0     0     0
       gpt/Z1ZAX999                                                ONLINE       0     0     0
       gpt/Z1Z9QYRB                                                ONLINE       0     0     0
    logs
     mirror-5                                                      ONLINE       0     0     0
       gpt/BTTV5356011L100FGN                                      ONLINE       0     0     0
       gpt/BTTV5356010K100FGN                                      ONLINE       0     0     0
    cache
     diskid/DISK-S2HRNXAH200981%20%20%20%20%20%20p1                ONLINE       0     0     0
     diskid/DISK-S2HRNXAH200681%20%20%20%20%20%20p1                ONLINE       0     0     0
    spares
     diskid/DISK-%20%20%20%20%20%20%20%20%20%20%20%20S1Z2YLJFp1    AVAIL  

errors: No known data errors

/boot/loader.conf
Code:
kern.geom.label.gptid.enable="0"
mpr_load="YES"
ahci_load="YES"
zfs_load="YES"
 


Looking at the log output, it looks like the drive fails in such a way that it hangs up the bus completely. That's probably the cause of the stalls. Offline the faulty disk (or just pull it out) so it can no longer interfere with the bus.
 
Hm, the servers are remote, so pulling the disk is not an option. How do I offline the disk? With ZFS or with camcontrol?

I have upgraded most of the servers to 11.1-RELEASE now, in the hope that this will help...

/A
 
Try zpool offline first; you can do that remotely. Not sure if it's going to help in your case but it's worth a shot.

Code:
     zpool offline [-t] pool device ...

         Takes the specified physical device offline. While the device is
         offline, no attempt is made to read or write to the device.

         -t      Temporary. Upon reboot, the specified physical device reverts
                 to its previous state.
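
For example, using one of the GPT labels from the zpool status output above - substitute the label of the member that actually failed:
Code:
# take the failed member offline so it can no longer stall the bus
zpool offline tank gpt/Z1Z9QYR9

# confirm the pool is now DEGRADED with that member shown as OFFLINE
zpool status -v tank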
 