ZFS why in bsd "Disk goes standby" is equal to "disk removed"

nerozero · Jul 7, 2021

Hello,

I know a lots of BSD users experiencing the disk fall off and come back in a seconds as a result of internal disk power management timers which cause of ZFS pools to be degraded and sometimes make system completely unresponsive.
The same issue with removable usb backup drives.

here is the sample log entries, time between disk foes offline and back online ~15 sec:

Code:

08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): WRITE(10). CDB: 2a 00 05 49 48 d8 00 00 10 00
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): CAM status: SCSI Status Error
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): SCSI status: OK
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): Invalidating pack
08:11:27 vmhost kernel: da1 at mrsas0 bus 1 scbus17 target 1 lun 0
08:11:27 vmhost kernel: da1: <ATA KINGSTON SA400S3 B1E2>  s/n 50026B7683B6A18D  detached
08:11:27 vmhost ZFS[22210]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$270336 size=$8192 error=$6 
08:11:27 vmhost ZFS[22211]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120032862208 size=$8192 error=$6
08:11:27 vmhost ZFS[22212]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120033124352 size=$8192 error=$6
08:11:27 vmhost ZFS[22213]: vdev probe failure, zpool=$zhost path=$/dev/da1p2
08:11:27 vmhost kernel: mrsas0: System PD deleted target ID: 0x1 
08:11:27 vmhost ZFS[22214]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost ZFS[22215]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$270336 size=$8192 error=$6 
08:11:27 vmhost ZFS[22216]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120032862208 size=$8192 error=$6
08:11:27 vmhost ZFS[22217]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120033124352 size=$8192 error=$6
08:11:27 vmhost ZFS[22218]: vdev probe failure, zpool=$zhost path=$/dev/da1p2
08:11:27 vmhost ZFS[22219]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): Periph destroyed
08:11:27 vmhost ZFS[22220]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost ZFS[22221]: vdev is removed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320 
08:11:41 vmhost kernel: mrsas0: System PD created target ID: 0x1 
08:11:41 vmhost kernel: da1 at mrsas0 bus 1 scbus17 target 1 lun 0
08:11:41 vmhost kernel: da1: <ATA KINGSTON SA400S3 B1E2> Fixed Direct Access SPC-4 SCSI device 
08:11:41 vmhost kernel: da1: Serial Number 50026B7683B6A18D
08:11:41 vmhost kernel: da1: 150.000MB/s transfers
08:11:41 vmhost kernel: da1: 114473MB (234441648 512 byte sectors)
08:11:41 vmhost kernel: ses2: pass3,da1 in 'Drive Slot 1', SAS Slot: 2 phys at slot 1
08:11:41 vmhost kernel: ses2:  phy 0: SATA device
08:11:41 vmhost kernel: ses2:  phy 0: parent 500056b36d81e5ff addr 500056b36d81e5c1
08:11:41 vmhost kernel: ses2:  phy 1: SAS device type 0 phy 0
08:11:41 vmhost kernel: ses2:  phy 1: parent 0 addr 0

In the most cases, disabling/modifying disk internal APM (advanced power management) and/or EPC (extended power conditions) does solves the issue. Sometimes....
My question, why disk idle cause so big impact on the system while on other systems this is only a minor os freeze caused by disk wakeup?

Is there a general solution for this issue ?

ralphbsz · Jul 7, 2021

nerozero said:
My question, why disk idle cause so big impact on the system while on other systems this is only a minor os freeze caused by disk wakeup?

It's a tuning question. How long is the OS willing to wait until declaring a slow IO to be failed? Years ago, when I was doing low-level IO for a living, I knew the answer for one particular Linux distribution we were using, and it was 300 seconds (5 minutes) by default, but adjustable using parameters of kernel calls (since my code was in the kernel, I was able to specify the timeout directly). I don't know how big that timeout limit is in FreeBSD, but I bet there are sysctl parameters for it. Try "sysctl -a | grep timeout", and perhaps decorate it with "fgrep .ada" or "fgrep .da" to see just disk-related timeouts. Try adjusting those (once you understand which sysctl does what).

I have another question about your log file: You seem to be using the mrsas driver. That's the driver for high-end LSI/Avago SAS disk controller cards. Yet you say you are seeing this problem on USB backup disks. That seems inconsistent: How do you connect a USB disk to an LSI card?

nerozero · Jul 7, 2021

ralphbsz said:
I have another question about your log file: You seem to be using the mrsas driver. That's the driver for high-end LSI/Avago SAS disk controller cards. Yet you say you are seeing this problem on USB backup disks. That seems inconsistent: How do you connect a USB disk to an LSI card?

The log is the sample. Same happening with USB drives, general sata drives on all types of machines.
here he a USB removable backup drive which I didn't managed to fix with APM/EPC options, have tries everything... Note that Smart Long Test shows no errors on drives. No performance or speed issue detected. Just random events like this ...

Code:

kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 4f 0d 4c f0 00 01 00 00
kernel: (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
kernel: (da0:umass-sim0:0:0:0): SCSI status: Check Condition
kernel: (da0:umass-sim0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
kernel: (da0:umass-sim0:0:0:0): Retrying command (per sense data)
kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 4f 7b b3 b8 00 01 00 00
kernel: (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
kernel: (da0:umass-sim0:0:0:0): SCSI status: Check Condition
kernel: (da0:umass-sim0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
kernel: (da0:umass-sim0:0:0:0): Retrying command (per sense data)

and yet another issue - camcontrol most time is useless for mrsas devices .... For those who trying make changes in apm - smartctl most times does the job, most times ...

olli@ · Jul 7, 2021

“CRC error detected” clearly does not sound like it’s just a timeout or power management issue.

drhowarddrfine · Jul 7, 2021

nerozero said:
I know a lots of BSD users experiencing

Lots? This is the first I've heard of it.

nerozero · Jul 7, 2021

drhowarddrfine said:
Lots? This is the first I've heard of it.

including 2 my ones at the bottom.

Solved - CAM status: SCSI Status Error

I noticed these errors this morning. Any ideas? Two such drives had similar messages, but on msp2:0 (i.e. the other drive was (da19:mps2:0:12:0)). More info at https://gist.github.com/dlangille/88eac25349577aaca22a401ac08e9d1b Aug 25 06:10:32 knew kernel: (da18:mps2:0:11:0): READ(10). CDB...

forums.freebsd.org

How to make hard drive standby (spin down) on a timer?

I'd like to have my hard drives spin down when idle for, let's say 20 minutes. Any number of minutes is fine, as this NAS is rarely used. What I've tried and didn't work: ataidle -S 20 /dev/ada0 s...

serverfault.com

CAM status: SCSI Status Error

I have a VPS running FreeBSD 10.1-RELEASE-p41 on amd64 and see numerous instances of the following errors when running dmesg (da0:mpt0:0:0:0): CAM status: SCSI Status Error (da0:mpt0:0:0:0): SCSI status: Busy (da0:mpt0:0:0:0): Retrying command (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 01 fc b9...

forums.freebsd.org

Seagate Hard disk randomly keeps turning on and off

Hello, I dont't know is this right place to discus... I have 2 computers built on two different Asus motherboards doing same with new (Green)Seagate Barakuda hard drives (Model Number: ST1000DM010 / Family: BARRACUDA35) - hard drives randomly disconnects and reconnects . This starts after...

forums.freebsd.org

Solved - HDD deattached randomly, no errors in the logs

Hello, Hard drive de-attached/re-attached randomly. No performance issues while attached detected. Log below. This is one of the disks in the ZFS Mirror, the first disk has no issues. No power lost/issues.... Jul 23 11:57:50 BSD01 kernel: ada1 at ahcich2 bus 0 scbus2 target 0 lun 0 Jul 23...

forums.freebsd.org

try to ddg/google : freebsd disk deattached
most of the issues related to disk idle time.

On the new hard drives / ssd with advanced power management system drives this issue become more prevalent. almost ALL Seagate "Green drives" just don't work without disabling APM and modifying EPC on freebsd/zfs raid/mirrors

~~~ edited ~~~
and some more, usb related:

USB disks hang system

I have a Toshiba 4TB external USB hard disk that I'm trying to use with FreeBSD. I can read from and write to the disk just after the system boots (or just after I plug it in). However, if I leave the disk unused for ~5 minutes, any subsequent reads or writes hang the system (to the extent that...

forums.freebsd.org

This one contains very good explanation of APM/EPC applying to BSD

https://serverfault.com/questions/1...ard-disk-spin-down-or-head-parking-in-freebsd

nerozero · Jul 18, 2021

Looks like this is issue of ZFS and not FreeBSD. A friend of mine had same issue on openzfs on linux. Hard drive fall off with same messages when become idle...