ZFS: why in FreeBSD is "disk goes standby" equal to "disk removed"?

Hello,

I know a lot of BSD users experience the disk dropping off and coming back within seconds, as a result of the disk's internal power-management timers, which causes ZFS pools to become degraded and sometimes makes the system completely unresponsive.
The same issue occurs with removable USB backup drives.

Here are sample log entries; the time between the disk going offline and coming back online is ~15 seconds:
Code:
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): WRITE(10). CDB: 2a 00 05 49 48 d8 00 00 10 00
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): CAM status: SCSI Status Error
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): SCSI status: OK
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): Invalidating pack
08:11:27 vmhost kernel: da1 at mrsas0 bus 1 scbus17 target 1 lun 0
08:11:27 vmhost kernel: da1: <ATA KINGSTON SA400S3 B1E2>  s/n 50026B7683B6A18D  detached
08:11:27 vmhost ZFS[22210]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$270336 size=$8192 error=$6 
08:11:27 vmhost ZFS[22211]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120032862208 size=$8192 error=$6
08:11:27 vmhost ZFS[22212]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120033124352 size=$8192 error=$6
08:11:27 vmhost ZFS[22213]: vdev probe failure, zpool=$zhost path=$/dev/da1p2
08:11:27 vmhost kernel: mrsas0: System PD deleted target ID: 0x1 
08:11:27 vmhost ZFS[22214]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost ZFS[22215]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$270336 size=$8192 error=$6 
08:11:27 vmhost ZFS[22216]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120032862208 size=$8192 error=$6
08:11:27 vmhost ZFS[22217]: vdev I/O failure, zpool=$zhost path=$/dev/da1p2 offset=$120033124352 size=$8192 error=$6
08:11:27 vmhost ZFS[22218]: vdev probe failure, zpool=$zhost path=$/dev/da1p2
08:11:27 vmhost ZFS[22219]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost kernel: (da1:mrsas0:1:1:0): Periph destroyed
08:11:27 vmhost ZFS[22220]: vdev state changed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320
08:11:27 vmhost ZFS[22221]: vdev is removed, pool_guid=$8743077180665994084 vdev_guid=$3959867686622359320 
08:11:41 vmhost kernel: mrsas0: System PD created target ID: 0x1 
08:11:41 vmhost kernel: da1 at mrsas0 bus 1 scbus17 target 1 lun 0
08:11:41 vmhost kernel: da1: <ATA KINGSTON SA400S3 B1E2> Fixed Direct Access SPC-4 SCSI device 
08:11:41 vmhost kernel: da1: Serial Number 50026B7683B6A18D
08:11:41 vmhost kernel: da1: 150.000MB/s transfers
08:11:41 vmhost kernel: da1: 114473MB (234441648 512 byte sectors)
08:11:41 vmhost kernel: ses2: pass3,da1 in 'Drive Slot 1', SAS Slot: 2 phys at slot 1
08:11:41 vmhost kernel: ses2:  phy 0: SATA device
08:11:41 vmhost kernel: ses2:  phy 0: parent 500056b36d81e5ff addr 500056b36d81e5c1
08:11:41 vmhost kernel: ses2:  phy 1: SAS device type 0 phy 0
08:11:41 vmhost kernel: ses2:  phy 1: parent 0 addr 0

In most cases, disabling or modifying the disk's internal APM (Advanced Power Management) and/or EPC (Extended Power Conditions) settings does solve the issue. Sometimes...
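For reference, this is roughly what I do on directly attached ada/da disks. It is only a sketch; the camcontrol apm/epc options and their defaults may differ between FreeBSD versions, so check camcontrol(8) before copying it:
Code:
# Sketch only - verify against camcontrol(8) on your FreeBSD version.
# Show the drive's identify data, including power-management capabilities:
camcontrol identify ada0

# Disable APM entirely (on my systems, omitting -l disables APM):
camcontrol apm ada0

# Or keep APM but forbid spin-down (levels 128-254 never allow standby):
camcontrol apm ada0 -l 254

# List the drive's EPC power conditions and timers:
camcontrol epc ada0 -c list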
My question is: why does disk idle have such a big impact on the system here, while on other systems it only causes a minor OS freeze while the disk wakes up?

Is there a general solution for this issue?
 
My question is: why does disk idle have such a big impact on the system here, while on other systems it only causes a minor OS freeze while the disk wakes up?
It's a tuning question. How long is the OS willing to wait until declaring a slow IO to be failed? Years ago, when I was doing low-level IO for a living, I knew the answer for one particular Linux distribution we were using, and it was 300 seconds (5 minutes) by default, but adjustable using parameters of kernel calls (since my code was in the kernel, I was able to specify the timeout directly). I don't know how big that timeout limit is in FreeBSD, but I bet there are sysctl parameters for it. Try "sysctl -a | grep timeout", and perhaps decorate it with "fgrep .ada" or "fgrep .da" to see just disk-related timeouts. Try adjusting those (once you understand which sysctl does what).
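For example, these are the CAM knobs I would start with; treat the names and defaults as examples from my own machines, they may differ between FreeBSD versions, so verify with the grep above before relying on them:
Code:
# List timeout/retry related sysctls, narrowed to the da(4) driver:
sysctl -a | grep timeout | fgrep .da
sysctl -a | grep retry | fgrep .da

# Typical knobs on my systems (verify locally):
sysctl kern.cam.da.default_timeout    # per-command timeout for da(4), in seconds
sysctl kern.cam.da.retry_count        # retries before CAM gives up on a command
sysctl kern.cam.ada.default_timeout   # same idea for ada(4) disks

# Give a drive waking up from standby more slack before I/O is declared failed:
sysctl kern.cam.da.default_timeout=90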

I have another question about your log file: You seem to be using the mrsas driver. That's the driver for high-end LSI/Avago SAS disk controller cards. Yet you say you are seeing this problem on USB backup disks. That seems inconsistent: How do you connect a USB disk to an LSI card?
 
I have another question about your log file: You seem to be using the mrsas driver. That's the driver for high-end LSI/Avago SAS disk controller cards. Yet you say you are seeing this problem on USB backup disks. That seems inconsistent: How do you connect a USB disk to an LSI card?
The log is just a sample. The same thing happens with USB drives and plain SATA drives on all types of machines.
Here is a removable USB backup drive which I didn't manage to fix with the APM/EPC options; I have tried everything... Note that a SMART long test shows no errors on the drives, and no performance or speed issue is detected. Just random events like this...
Code:
kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 4f 0d 4c f0 00 01 00 00
kernel: (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
kernel: (da0:umass-sim0:0:0:0): SCSI status: Check Condition
kernel: (da0:umass-sim0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
kernel: (da0:umass-sim0:0:0:0): Retrying command (per sense data)
kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 4f 7b b3 b8 00 01 00 00
kernel: (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
kernel: (da0:umass-sim0:0:0:0): SCSI status: Check Condition
kernel: (da0:umass-sim0:0:0:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
kernel: (da0:umass-sim0:0:0:0): Retrying command (per sense data)

And yet another issue: camcontrol is useless most of the time for mrsas devices... For those trying to make changes to APM, smartctl usually does the job. Usually...
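A rough example of what I mean (a sketch only; the device node and whether you need a -d type depend on how the drive is attached, so check smartctl(8) first):
Code:
# Sketch only - drives behind mrsas in JBOD/System PD mode usually still
# show up as /dev/daX; adjust the device node for your setup.
# Print identify info, but skip the check if the drive is already in standby
# so we don't spin it up just for this:
smartctl -i -n standby /dev/da1

# Turn the drive's APM off:
smartctl --set=apm,off /dev/da1

# Stop the drive's own standby (spindown) timer:
smartctl --set=standby,off /dev/da1

# If the controller hides the ATA device, a SAT passthrough hint may help:
smartctl -d sat --set=apm,off /dev/da1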
 
“CRC error detected” clearly does not sound like it’s just a timeout or power management issue.
 
Lots? This is the first I've heard of it.

including two of my own at the bottom.


Try to DDG/Google: freebsd disk detached
Most of the issues are related to disk idle time.

On new hard drives / SSDs with advanced power management this issue has become more prevalent. Almost ALL Seagate "green" drives just don't work in FreeBSD/ZFS RAIDs/mirrors without disabling APM and modifying EPC.
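As an illustration of what I mean by "modifying EPC" (a sketch only, written from memory; the power condition names and the exact camcontrol epc flags vary by drive and FreeBSD version, so check camcontrol(8) and the "-c list" output before applying anything):
Code:
# Sketch only - names like Idle_b / Standby_z come from the drive's EPC
# log page; "-c list" shows which ones your drive actually supports.
camcontrol epc da2 -c list

# Disable the conditions that park the heads / spin the drive down:
camcontrol epc da2 -c state -p Idle_b -d
camcontrol epc da2 -c state -p Standby_z -d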

~~~ edited ~~~
And some more, USB related:

This one contains a very good explanation of applying APM/EPC settings on BSD.
 
Looks like this is an issue with ZFS rather than FreeBSD. A friend of mine had the same issue with OpenZFS on Linux: the hard drive fell off with the same messages when it became idle...
 