FreeBSD ZFS server automatically rebooted due to hard disk failure

Hi,
I found that two of my FreeBSD ZFS servers rebooted because one of the hard disks in a RAID1 mirror had failed. To reproduce the problem, I set up an identical server at the office and found that failing the HDD causes a FreeBSD kernel panic and eventually reboots the server automatically.

Both servers are Supermicro 1U boxes with 4 disk bays, but otherwise their hardware is completely different (one has a Xeon 1230v2, the other a Xeon 5405; both have 8 GB RAM, 4x 2 TB SATA HDDs and AHCI with hot-swap enabled).

The server is running ZFS RAID10 (I'm currently replacing the failed HDD; see the sketch after the status output):
Code:
        vol                         DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            gpt/data-disk0          ONLINE       0     0     0
            gpt/data-disk1          ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14892520269246031058  UNAVAIL      0     0     0  was /dev/gpt/data-disk2
              gpt/data_disk2        ONLINE       0     0     0  (resilvering)
            gpt/data-disk3          ONLINE       0     0     0
        logs
          gpt/slog-disk0            ONLINE       0     0     0
        cache
          gpt/l2arc-disk0           ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: resilvered 2.01G in 0h4m with 0 errors on Wed Oct 16 10:49:53 2013
config:

        NAME              STATE     READ WRITE CKSUM
        zroot             ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            gpt/os-disk0  ONLINE       0     0     0
            gpt/os-disk1  ONLINE       0     0     0
          mirror-1        ONLINE       0     0     0
            gpt/os_disk2  ONLINE       0     0     0
            gpt/os-disk3  ONLINE       0     0     0
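For reference, here is a minimal sketch of the replacement procedure that appears to be in progress on "vol" above. The device name ada2 is only an assumption; the label matches the gpt/data_disk2 shown resilvering.
Code:
# Partition and label the new disk, then replace the failed member by its GUID
# (ada2 is a hypothetical device name; adjust to wherever the new disk shows up):
gpart create -s gpt ada2
gpart add -t freebsd-zfs -a 1m -l data_disk2 ada2
zpool replace vol 14892520269246031058 gpt/data_disk2
zpool status vol    # watch the resilver progress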

Below are the messages I extracted from /var/log/messages just before the crash.
Code:
Oct 13 00:20:10 storage kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Oct 13 00:21:58 storage kernel: ahcich2: Timeout on slot 27 port 0
Oct 13 00:21:58 storage kernel: ahcich2: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 00004017
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): CAM status: Unconditionally Re-queue Request
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): Error 5, Retry was blocked
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
Oct 13 00:21:58 storage kernel: ahcich2: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Oct 13 00:21:58 storage kernel: ahcich2: Poll timeout on slot 27 port 15
Oct 13 00:21:58 storage kernel: ahcich2: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0000c017
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): CAM status: Command timeout
Oct 13 00:21:58 storage kernel: (aprobe1:ahcich2:0:15:0): Error 5, Retries exhausted
Oct 13 00:21:58 storage kernel: ahcich2: Timeout on slot 27 port 0
Oct 13 00:21:58 storage kernel: ahcich2: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0000c017
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): CAM status: Command timeout
Oct 13 00:21:58 storage kernel: (aprobe0:ahcich2:0:0:0): Error 5, Retry was blocked
Oct 13 00:21:58 storage kernel: (pass3:(ada2:ahcich2:0:ahcich2:0:0:0:0): passdevgonecb: devfs entry is gone
Oct 13 00:21:58 storage kernel: 0): lost device
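Since the box panics and reboots, it would also help to capture a kernel crash dump so the actual panic backtrace survives the reboot. A minimal sketch, assuming there is a swap partition large enough to hold a minidump (the device name below is hypothetical):
Code:
# Enable crash dumps at boot and activate the dump device right away:
echo 'dumpdev="AUTO"' >> /etc/rc.conf
dumpon /dev/ada0p3            # hypothetical swap partition; use your actual one
# After the next panic, savecore(8) writes the dump to /var/crash during boot,
# and crashinfo(8) turns it into a readable backtrace summary:
crashinfo -d /var/crash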
 
@belon_cfy

It's a built-in SATA controller on the mobo, right? Do you know which one it is? I know SuperMicro throws in built-in SAS2008 controllers on newer mobos, so this is perhaps an older model? Have you tried to reproduce the problem on a SAS2008 controller with the mps driver?

/Sebulon
 
Hi Sebulon,
One of the servers is using the onboard SATA ports with an SAS815T backplane; no LSI or other JBOD card is installed. Below is the motherboard detail:
http://www.supermicro.com/products/motherboard/xeon/c202_c204/x9scl-f.cfm


I found that the problem can be reproduced easily when GELI encryption is enabled on all the disks. Pulling out a HDD during a heavy write that lasts more than 20 minutes triggers a kernel panic and automatically reboots the server.
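For anyone who wants to reproduce this, a rough sketch of the setup described above (the pool name, labels and dd target are hypothetical; geli(8) prompts for a passphrase on init/attach):
Code:
# GELI-encrypt the mirror members, build a pool on the .eli providers,
# then pull a disk while a long sequential write is running:
geli init -s 4096 /dev/gpt/test-disk0        # repeat for each disk
geli attach /dev/gpt/test-disk0
zpool create testvol mirror gpt/test-disk0.eli gpt/test-disk1.eli
dd if=/dev/zero of=/testvol/bigfile bs=1m count=200000 &
# ...then physically pull one of the mirror members ~20 minutes into the write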
 
@belon_cfy

OK, one of the servers is using the onboard SATA. What about the other one?

/Sebulon
 
Hi Sebulon,
The other one is using onboard SATA as well, though I don't know whether it is connected to the 815T backplane or not. This server does not have any GELI-encrypted drives, yet it was also automatically rebooted when one of the SATA disks failed.
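A quick way to check what those disks are actually sitting behind (and which driver is handling them) would be something like:
Code:
pciconf -lv | grep -B3 -i 'mass storage'   # identifies the SATA/SAS controller and its driver
camcontrol devlist -v                      # shows each disk and the ahcichN/mpsN bus it hangs off
dmesg | grep -E 'ahci|mps'                 # driver attach messages from the last boot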
 
@belon_cfy, do you have access to some SAS2008 controllers so you can test the procedure with the mps driver instead? Like a SM USAS2, LSI 9211-8i or IBM ServeRAID (all flashed IT), just to test the difference in hardware.

/Sebulon
 
Sebulon said:
@belon_cfy, do you have access to some SAS2008 controllers so you can test the procedure with the mps driver instead? Like a SM USAS2, LSI 9211-8i or IBM ServeRAID (all flashed IT), just to test the difference in hardware.

/Sebulon

Hi @Sebulon, the server doesn't come with any LSI controller, and there's no onboard RAID either.
 