Solved CAM status: SCSI Status Error

dvl@ · Sep 9, 2017

Same system, another SSD, but I don't think this is related. I suspect all might be cables.

Background:

Code:

ada1 at ahcich6 bus 0 scbus10 target 0 lun 0
ada1: <CT480BX200SSD1 MU01.4> ACS-2 ATA SATA 3.x device
ada1: Serial Number 1543F00F3587
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 1024bytes)
ada1: Command Queueing enabled
ada1: 457862MB (937703088 512 byte sectors)
ada1: Previously was known as ad16

The errors.

Code:

Sep  9 05:03:50 knew kernel: ahcich6: Timeout on slot 8 port 0
Sep  9 05:03:50 knew kernel: ahcich6: is 00000000 cs 00000000 ss 0003ff00 rs 0003ff00 tfd 40 serr 00000000 cmd 0004d117
Sep  9 05:03:50 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 32 02 40 00 00 00 00 00 00
Sep  9 05:03:50 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:03:50 knew kernel: (ada1:ahcich6:0:0:0): Retrying command
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 18 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00000000 cmd 0004d217
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Retrying command
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 19 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000 cmd 0004d317
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 20 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 80 serr 00000000 cmd 0004d417
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retry was blocked
Sep  9 05:11:34 knew kernel: ada1 at ahcich6 bus 0 scbus10 target 0 lun 0
Sep  9 05:11:34 knew kernel: ada1: <CT480BX200SSD1 MU01.4> s/n 1543F00F3587 detached
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 21 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 80 serr 00000000 cmd 0004d517
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Retrying command
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 22 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 80 serr 00000000 cmd 0004d617
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: ahcich6: Poll timeout on slot 24 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 80 serr 00000000 cmd 0004d817
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 25 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd 80 serr 00000000 cmd 0004d917
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: xptioctl: pass driver is not in the kernel
Sep  9 05:11:34 knew kernel: xptioctl: put "device pass" in your kernel config file
Sep  9 05:11:34 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:11:34 knew kernel: ahcich6: Poll timeout on slot 27 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00000000 cmd 0004db17
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted
Sep  9 05:11:34 knew kernel: ahcich6: Timeout on slot 28 port 0
Sep  9 05:11:34 knew kernel: ahcich6: is 00000000 cs f000003f ss f000003f rs f000003f tfd 80 serr 00000000 cmd 0004dc17
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): DSM TRIM. ACB: 06 01 00 00 00 40 00 00 00 00 01 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 32 02 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 36 02 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 98 0a 04 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 98 0b 04 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 98 0c 04 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 98 0d 04 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 98 0e 04 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 33 02 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 34 02 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 f8 28 35 02 40 00 00 00 00 00 00
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): CAM status: Unconditionally Re-queue Request
Sep  9 05:11:34 knew kernel: (ada1:ahcich6:0:0:0): Error 5, Periph was invalidated
Sep  9 05:11:35 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=18365655534929942323 vdev_guid=14409491293717520628''
Sep  9 05:11:35 knew kernel: (ada1:ahcich6:0:0:0): Periph destroyed
Sep  9 05:11:35 knew ZFS: vdev is removed, pool_guid=18365655534929942323 vdev_guid=14409491293717520628
Sep  9 05:11:35 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15378250086669402288 vdev_guid=158613527330193223''
Sep  9 05:11:35 knew ZFS: vdev is removed, pool_guid=15378250086669402288 vdev_guid=158613527330193223
Sep  9 05:12:07 knew kernel: ahcich6: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Sep  9 05:12:22 knew kernel: ahcich6: Poll timeout on slot 7 port 0
Sep  9 05:12:22 knew kernel: ahcich6: is 00000000 cs 00000080 ss 00000000 rs 00000080 tfd 80 serr 00000000 cmd 0004c717
Sep  9 05:12:22 knew kernel: (aprobe0:ahcich6:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Sep  9 05:12:22 knew kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command timeout
Sep  9 05:12:22 knew kernel: (aprobe0:ahcich6:0:0:0): Error 5, Retries exhausted

ralphbsz · Sep 9, 2017

The unit attention because of power cycle seems to be explained ... something is cycling the power.

The timeout is strange. That should not happen. It means that the OS sent a SCSI command down the stack (to the drive), and within a reasonable period (above it says 31 seconds), nobody has responded. That's impossible; a functioning drive will never take that long for a command. Therefore it means that the command was lost somewhere in the stack (either on the way to the drive, or the response on the way back to the host). Which means that there is a bug somewhere. The reaction of the OS is sensible: it resets things (hits them with a big hammer) to get back to a known state.

Given how complex a SAS stack is (with lots of firmware), it is unfortunately not surprising that bugs exist.
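
As a reading aid for the AHCI lines in the first post ("tfd 40", "tfd 80", "tfd = 00000080"), here is a minimal sketch that decodes the task file data value. It assumes the usual layout, with the ATA status byte in the low 8 bits and the error byte in the next 8; `decode_tfd` is a hypothetical helper name, not part of any FreeBSD tool.

```python
# Bits of the ATA status byte (assumed to be the low 8 bits of the AHCI tfd value).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}


def decode_tfd(tfd_hex: str) -> dict:
    """Split a hex tfd value from the log into status byte, error byte, and flag names."""
    value = int(tfd_hex, 16)
    status = value & 0xFF          # low byte: ATA status register
    error = (value >> 8) & 0xFF    # next byte: ATA error register
    flags = [name for bit, name in ATA_STATUS_BITS.items() if status & bit]
    return {"status": status, "error": error, "flags": flags}


if __name__ == "__main__":
    # Values copied from the log in the first post.
    for sample in ("40", "80", "00000080"):
        print(sample, decode_tfd(sample))
```

Read this way, the drive goes from ready (DRDY) before the first timeout to permanently busy (BSY) afterwards, which fits "device not ready after 31000ms".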

dvl@ · Sep 9, 2017

ralphbsz said:
The unit attention because of power cycle seems to be explained ... something is cycling the power.

What is drawing your attention to power cycling?

ralphbsz said:
That's impossible; a functioning drive will never take that long for a command.

I think it could be explained by either a bad cable or a now-dead drive. I will explore the cable issue ASAP.

ralphbsz · Sep 10, 2017

Sorry, my fault: the most recent error log from yesterday does not show power cycling. Usually, that's indicated by the device sending a check condition unit attention, with the reason being "power, reset, ...". But that's not visible in the recent log.

A timeout can easily be caused by bad connectivity, so hunting down cables is a good idea.

dvl@ · Sep 12, 2017

Please ignore. I'm just documenting as we go along.

Code:

Sep 12 18:00:09 knew kernel: (da19:mps2:0:12:0): WRITE(16). CDB: 8a 00 00 00 00 01 82 c1 2b 10 00 00 00 08 00 00 
Sep 12 18:00:09 knew kernel: (da19:mps2:0:12:0): CAM status: SCSI Status Error
Sep 12 18:00:09 knew kernel: (da19:mps2:0:12:0): SCSI status: Check Condition
Sep 12 18:00:09 knew kernel: (da19:mps2:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 12 18:00:09 knew kernel: (da19:mps2:0:12:0): Retrying command (per sense data)

Terry_Kennedy · Sep 12, 2017

dvl@ said:
Please ignore. I'm just documenting as we go along.

These are all happening on da18 and da19. Presumably you have a da0 through da17 where it isn't happening, and possibly da20 and above.

Did you ever try swapping the drives in da18 and da19 with drives from two other positions that don't have the problem? If the problem moves, you have 2 bad drives. If the problem doesn't move, you have a bad backplane, SAS/SATA cable, or power connection.

dvl@ · Sep 12, 2017

Terry_Kennedy said:
Did you ever try swapping the drives in da18 and da19 with drives from two other positions that don't have the problem?

No, I've not been near the server.

dvl@ · Sep 13, 2017

Still just documenting. I hope to visit the system soon.

Code:

Sep 12 22:00:09 knew kernel: (da18:mps2:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 02 2e 47 a4 28 00 00 00 08 00 00 
Sep 12 22:00:09 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Sep 12 22:00:09 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Sep 12 22:00:09 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Sep 12 22:00:09 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)

dvl@ · Sep 28, 2017

FYI, there have been several subsequent incidents. I have recorded them in a GitHub gist.

I now think there is a correlation with temperature. I should REALLY annotate a graph based on this information. For example, this graph is from the UPS at the bottom of the rack.

ralphbsz · Sep 29, 2017

You do know that you can use smartctl to read the temperatures of the disk drives? Might be a good idea to have a little periodic task (cron job?) that measures and records the temperatures regularly.

There is correlation between disk error rates and temperature. I vividly remember an extreme case; a very large data center that was kept so cold that the frontmost disk drives in the enclosure were below their "safety" temperature, and went into a verify cycle after every write, completely killing their performance. And a fraction of all disks being slow does nasty things to RAID performance. I think there was a research paper (by Google or NetApp people, don't remember the details) that was published in FAST or IEEE-MSS that showed correlation between temperature and disk errors. I think the net was that you want them to be medium, not too hot, not too cold, and in particular not changing rapidly, but I'm not sure I remember that right.
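
A minimal sketch of such a periodic task, for illustration only. It assumes the drives show the ATA-style attribute table in `smartctl -A` output and report temperature as attribute 194 (some drives use a different attribute, and SAS drives print a different format). The function names are made up for this example.

```python
import csv
import subprocess
import sys
import time


def parse_temperature(smartctl_output: str):
    """Return the raw value of SMART attribute 194 as an int, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; RAW_VALUE is the tenth.
        if len(fields) >= 10 and fields[0] == "194":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_temperature(device: str):
    """Run `smartctl -A` on one device and parse the temperature from its output."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_temperature(result.stdout)


def main(devices, out=sys.stdout):
    """Write one CSV row (timestamp, device, temperature) per device."""
    writer = csv.writer(out)
    now = time.strftime("%Y-%m-%dT%H:%M:%S")
    for dev in devices:
        writer.writerow([now, dev, read_temperature(dev)])


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run from cron with the device nodes as arguments and append the output to a file; the timestamps can then be lined up against the CAM messages in /var/log/messages.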

Eric A. Borisch · Sep 29, 2017

The smartd daemon (from smartmontools) has an option to write out log files of parameters: "-A /path/to/logs" ... can be very useful for correlating with error events...

dvl@ · Sep 29, 2017

ralphbsz said:
You do know that you can use smartctl to read the temperatures of the disk drives? Might be a good idea to have a little periodic task (cron job?) that measures and records the temperatures regularly.

Yes, I do know that. The drives are monitored by Nagios, via smartctl.

dvl@ · Oct 10, 2017

I'm adding to the record.

More CAM messages arrived today, the first event at about 4:33, and for different drives than originally reported.*

Also present, CPU temperature warnings starting at 05:28:22 UTC 2017

Looking at the graph for the UPS temperature, which monitors both chassis and rack temperature, the rack did get warmer. It was about 26C/79F on Sunday, but was at 30C/86F last night.

This makes me think the CAM issues are heat related.

Code:

Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): READ(10). CDB: 28 00 02 74 5d 58 00 00 98 00 length 77824 SMID 439 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 40960
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): READ(10). CDB: 28 00 02 74 5d f0 00 00 28 00 length 20480 SMID 468 terminated ioc 804b loginfo 31110d00 s(da17:mps2:0:10:0): READ(10). CDB: 28 00 02 74 5d 58 00 00 98 00 
Oct 10 04:33:24 knew kernel: csi 0 state c xfer 0
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): Retrying command
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): READ(10). CDB: 28 00 02 74 5d f0 00 00 28 00 
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): Retrying command
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): READ(10). CDB: 28 00 02 74 5d 58 00 00 98 00 
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): CAM status: SCSI Status Error
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): SCSI status: Check Condition
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): Retrying command (per sense data)
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): READ(10). CDB: 28 00 02 75 96 90 00 00 20 00 
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): CAM status: SCSI Status Error
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): SCSI status: Check Condition
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 10 04:33:24 knew kernel: (da17:mps2:0:10:0): Retrying command (per sense data)

Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d7 ad 28 00 00 28 00 length 20480 SMID 271 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d7 ad 50 00 00 20 00 length 16384 SMID 238 terminated ioc 804b loginfo 31110d00 sc(da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d7 ad 28 00 00 28 00 
Oct 10 05:19:29 knew kernel: si 0 state c xfer 0
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): Retrying command
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d7 ad 50 00 00 20 00 
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): Retrying command
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d7 ad 28 00 00 28 00 
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): CAM status: SCSI Status Error
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): SCSI status: Check Condition
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 10 05:19:29 knew kernel: (da16:mps2:0:2:0): Retrying command (per sense data)
Oct 10 05:19:30 knew kernel: (da16:mps2:0:2:0): READ(10). CDB: 28 00 1a d8 6e e0 00 00 28 00 
Oct 10 05:19:30 knew kernel: (da16:mps2:0:2:0): CAM status: SCSI Status Error
Oct 10 05:19:30 knew kernel: (da16:mps2:0:2:0): SCSI status: Check Condition
Oct 10 05:19:30 knew kernel: (da16:mps2:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 10 05:19:30 knew kernel: (da16:mps2:0:2:0): Retrying command (per sense data)

*NOTE: these are not the same drives as originally posted. In case they were renumbered, I verified the serial numbers via smartctl.
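
For anyone trying to read the CDB bytes in these messages, here is a small sketch that pulls the starting LBA and block count out of the 10-byte and 16-byte READ/WRITE commands. It assumes the standard layouts (4-byte LBA and 2-byte length for the 10-byte form, 8-byte LBA and 4-byte length for the 16-byte form) and 512-byte blocks when converting to bytes. `decode_cdb` is a made-up helper name.

```python
def decode_cdb(cdb_hex: str) -> dict:
    """Decode LBA and transfer length from a READ/WRITE(10) or READ/WRITE(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode in (0x28, 0x2A) and len(cdb) == 10:
        # READ(10)/WRITE(10): bytes 2-5 are the LBA, bytes 7-8 the block count.
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
    elif opcode in (0x88, 0x8A) and len(cdb) == 16:
        # READ(16)/WRITE(16): bytes 2-9 are the LBA, bytes 10-13 the block count.
        lba = int.from_bytes(cdb[2:10], "big")
        blocks = int.from_bytes(cdb[10:14], "big")
    else:
        raise ValueError(f"unsupported CDB: opcode 0x{opcode:02x}, {len(cdb)} bytes")
    return {"opcode": opcode, "lba": lba, "blocks": blocks}


if __name__ == "__main__":
    # First READ(10) from the da17 messages above.
    info = decode_cdb("28 00 02 74 5d 58 00 00 98 00")
    print(info, "bytes:", info["blocks"] * 512)
```

For the first da17 READ(10), the block count times 512 matches the "length 77824" that mps printed on the same line, which is a quick sanity check that the bytes are being read correctly.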

dvl@ · Nov 2, 2017

This just in, and it is from the original drive posted on Aug 25. Temperature is unlikely to be an issue today. The UPS says it has been 26C/79F for about two hours. The drives are all 41C/105F or below. The drive in question is 38C/100F.

Code:

Nov  1 18:00:16 knew kernel: (da18:mps2:0:11:0): WRITE(16). CDB: 8a 00 00 00 00 01 fd 81 5f d0 00 00 00 08 00 00
Nov  1 18:00:16 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Nov  1 18:00:16 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Nov  1 18:00:16 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov  1 18:00:16 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)

dvl@ · Nov 18, 2017

Same as two weeks ago:

Code:

Nov 17 00:00:00 knew newsyslog[72425]: logfile turned over
Nov 17 20:00:07 knew kernel: (da18:mps2:0:11:0): WRITE(10). CDB: 2a 00 03 85 86 e8 00 00 08 00
Nov 17 20:00:07 knew kernel: (da18:mps2:0:11:0): CAM status: SCSI Status Error
Nov 17 20:00:07 knew kernel: (da18:mps2:0:11:0): SCSI status: Check Condition
Nov 17 20:00:07 knew kernel: (da18:mps2:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 17 20:00:07 knew kernel: (da18:mps2:0:11:0): Retrying command (per sense data)
Nov 18 00:00:00 knew newsyslog[76971]: logfile turned over

The UPS (same rack) reports internal temperature of 20C/68F and the top of the rack was 26C/79F.

aht0 · Nov 18, 2017

dvl@ said:
Agreed.

All drives are behind LSI HBA (e.g. SAS2008).

I would expect the HBA to handle that, not the BIOS.

Problem is not so much in HBA but in SATA drives, which for the most part do not support the staggered startup feature. You appear to have Enterprise-grade drives, might be different with these.

EDIT: added a question.

Have you tried different firmwares on your LSI? I assume here that you're using some discrete card like LSI 9211 and it's flashable and not some motherboard that has LSI chipset on it?

dvl@ · Apr 26, 2018

These errors persist. No other issue with the system.

Latest smartctl output here. https://gist.github.com/dlangille/cb6abadd677364498563f5d213d1c94e

Checking back, these messages seem to affect only da16..da19. Those are the top four drives in the chassis.

There are 20 drives in the chassis, five rows of four.

I have moved the four drives in question to other drive bays. Let's see what happens next.
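
A rough sketch of how the "which drives are affected" check could be automated, counting CAM status lines per device from saved syslog files. The regular expression is written against the message format shown in this thread and is an assumption about other systems; it deliberately ignores `aprobe` lines.

```python
import re
import sys
from collections import Counter

# Matches e.g. "(da18:mps2:0:11:0): CAM status: SCSI Status Error"
LINE_RE = re.compile(
    r"\((?P<dev>(?:da|ada)\d+):[^)]*\): CAM status: (?P<status>.+)$"
)


def tally(lines):
    """Count (device, CAM status) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("status").strip())] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    # Pass one or more log files, e.g. /var/log/messages, as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            totals.update(tally(handle))
    for (dev, status), count in sorted(totals.items()):
        print(f"{dev}\t{count}\t{status}")
```

Running it before and after moving the drives should show whether the errors follow the drives or stay with the bays.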

ralphbsz · Apr 26, 2018

Just looked at your data. Those 4 drives are a bit warm ... mostly 41 degrees (one 39). That's not extreme, but 30 or 35 would be better. I don't know how warm the other (healthy) drives in the system are. And even if these four are a bit warm: It makes *NO* sense that being warm should cause "reset or power on" unit attentions.

I have another theory about the root cause, which is also just as wrong as the heat theory: It could be that because of their location, these drives have more vibration. That would be visible by looking at their raw corrected error rates, unfortunately that is not reported by the SATA version of SMART (it is reported in the SCSI version of SMART). But that's irrelevant, since these drives are not suffering from too many read/write errors, but from unit attention.

Good luck with trying to use switching locations to track down the root cause.

sammys · Jun 15, 2018

Hi ... I'm really new here ... just registered because of this thread.
I have the same problem on my FreeNAS (based on FreeBSD) system ... and I have also done a lot of troubleshooting in this case.
I'm still testing further, but in my case it looks like it's the hot-swap bay of the affected hard disk. Maybe the caddy for the disk or the backplane connector has an issue; that's why troubleshooting the cabling (switching cables) didn't help. To make a final test, I'm waiting for a brand new 4TB WD Red HD (which I mostly use in my FreeNAS) that should be delivered next week.
So I just want to share my issue ... not only cables but also hot-swap connector bays should be checked twice.

If I can really confirm this is my root cause, I will post an update here.

And many thanks for this forum and this thread ... it helped me a lot in my troubleshooting.

dvl@ · Oct 23, 2018

There have been additional but less frequent errors.

Last night, this happened:

Code:

Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00 length 4096 SMID 796 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 888 terminated ioc 804b loginfo 31(da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00
Oct 22 19:52:59 knew kernel: 110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: SCSI Status Error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI status: Check Condition
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Error 6, Retries exhausted
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Invalidating pack
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130

I powered off, opened the case, reseated the cable, the drive, and various other things. Powered up. All good. A short while later, the zpool degraded again. It was quickly fixed with an offline/online.

A short while later, the problem occurred again. Nothing more was done that night.

This morning, I replaced the cable between the HBA card and this level of the backplane. It has been only 5 hours, but so far, all good.

More detail than you want is in my blog post.

dvl@ · Oct 24, 2018

Well, damn.

Code:

Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 d2 cc 80 00 00 d8 00 length 110592 SMID 471 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 d2 cb a0 00 00 e0 00 length 114688 SMID 205 terminated ioc 804b loginfo 31110d01 (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 d2 cc 80 00 00 d8 00
Oct 24 12:24:40 knew kernel: scsi 0 state c xfer 98308
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 d2 cb a0 00 00 e0 00
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 d2 cb a0 00 00 e0 00
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:24:40 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:24:41 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 44 c7 c0 00 00 d8 00
Oct 24 12:24:41 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:24:41 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:24:41 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:24:41 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 32 2b c8 00 01 00 00 length 131072 SMID 284 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 65536
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 32 2d 68 00 01 00 00 length 131072 SMID 392 terminated ioc 804b loginfo 31110d00 (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 32 2b c8 00 01 00 00
Oct 24 12:25:42 knew kernel: scsi 0 state c xfer 0
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 32 2d 68 00 01 00 00
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 32 2b c8 00 01 00 00
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 34 a4 58 00 01 00 00
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:25:42 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f 78 08 d0 00 00 88 00 length 69632 SMID 514 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f 78 08 18 00 00 b8 00 length 94208 SMID 1008 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 8192
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f 78 08 d0 00 00 88 00
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f 78 08 18 00 00 b8 00
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f 78 08 18 00 00 b8 00
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:03 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:04 knew kernel: (da18:mps2:0:10:0): WRITE(16). CDB: 8a 00 00 00 00 01 5b 92 e0 d8 00 00 00 10 00 00
Oct 24 12:26:04 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:04 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:04 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:04 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 54 d8 30 00 00 c8 00 length 102400 SMID 210 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 16384
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 54 d8 f8 00 00 d8 00 length 110592 SMID 581 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 54 d8 30 00 00 c8 00
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 54 d8 f8 00 00 d8 00
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 54 d8 30 00 00 c8 00
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:43 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:44 knew kernel: (da18:mps2:0:10:0): WRITE(6). CDB: 0a 00 01 80 08 00
Oct 24 12:26:44 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:44 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:44 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:44 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 0f 1f f0 00 00 c0 00 length 98304 SMID 593 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 73728
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 97 09 d0 00 00 90 00 length 73728 SMID 923 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 0f 1f f0 00 00 c0 00
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 97 09 d0 00 00 90 00
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 0f 1f f0 00 00 c0 00
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 22 60 61 98 00 00 d8 00
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:26:53 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ec 25 48 00 00 c0 00 length 98304 SMID 234 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 81920
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ec 27 08 00 00 c8 00 length 102400 SMID 355 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ec 25 48 00 00 c0 00
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ec 27 08 00 00 c8 00
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ec 25 48 00 00 c0 00
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:27:29 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:27:30 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1f ed 39 00 00 01 00 00
Oct 24 12:27:30 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:27:30 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:27:30 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:27:30 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7e fb 60 00 01 00 00 length 131072 SMID 469 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 114692
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7e fc 60 00 01 00 00 length 131072 SMID 735 terminated ioc 804b loginfo 31110d01 scsi 0 state c xfer 0
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7e fb 60 00 01 00 00
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7e fc 60 00 01 00 00
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7e fb 60 00 01 00 00
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1d 7f 53 30 00 01 00 00
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:28:21 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1e b9 7f 78 00 01 00 00 length 131072 SMID 392 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1e b9 7e 78 00 01 00 00 length 131072 SMID 1080 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 65536
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1e b9 7f 78 00 01 00 00
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1e b9 7e 78 00 01 00 00
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1e b9 7e 78 00 01 00 00
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:29:42 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:29:43 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 21 85 d9 50 00 01 00 00
Oct 24 12:29:43 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:29:43 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:29:43 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:29:43 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 32 80 00 00 f0 00 length 122880 SMID 76 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 33 70 00 00 d0 00 length 106496 SMID 789 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 32 80 00 00 f0 00
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 33 70 00 00 d0 00
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:30:06 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 32 80 00 00 f0 00
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 20 b7 53 b8 00 00 80 00
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:30:07 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 29 63 c8 00 00 18 00 length 12288 SMID 568 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 29 5e d0 00 00 18 00 length 12288 SMID 283 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 29 63 c8 00 00 18 00
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 29 5e d0 00 00 18 00
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): CAM status: CCB request completed with an error
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): Retrying command
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 29 5e d0 00 00 18 00
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): READ(10). CDB: 28 00 1a 5f 77 e8 00 00 10 00
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 24 12:32:55 knew kernel: (da18:mps2:0:10:0): Retrying command (per sense data)
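
For anyone reading the log above, the CDB bytes can be decoded to see which LBAs and transfer sizes were involved. Here is a minimal sketch in Python. It assumes 512-byte logical sectors (which is what smartctl reports for this drive) and the standard 6-, 10- and 16-byte READ/WRITE CDB layouts. `decode_cdb` is a hypothetical helper name, not part of any FreeBSD tool.

```python
# Sketch only: decode the LBA and transfer length from a SCSI READ/WRITE CDB
# as printed in the kernel log. Assumes 512-byte logical sectors.

SECTOR_SIZE = 512  # assumption: logical sector size reported by smartctl

OPCODES = {
    0x08: "READ(6)", 0x0A: "WRITE(6)",
    0x28: "READ(10)", 0x2A: "WRITE(10)",
    0x88: "READ(16)", 0x8A: "WRITE(16)",
}


def decode_cdb(hex_bytes):
    """Return opcode name, starting LBA, block count and byte count."""
    cdb = bytes.fromhex(hex_bytes)  # whitespace between bytes is ignored
    opcode = cdb[0]
    if opcode not in OPCODES:
        raise ValueError("unsupported opcode 0x%02x" % opcode)

    if len(cdb) == 6:
        # 21-bit LBA in bytes 1-3, one length byte (0 means 256 blocks)
        lba = int.from_bytes(cdb[1:4], "big") & 0x1FFFFF
        blocks = cdb[4] or 256
    elif len(cdb) == 10:
        # 32-bit LBA in bytes 2-5, 16-bit length in bytes 7-8
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
    elif len(cdb) == 16:
        # 64-bit LBA in bytes 2-9, 32-bit length in bytes 10-13
        lba = int.from_bytes(cdb[2:10], "big")
        blocks = int.from_bytes(cdb[10:14], "big")
    else:
        raise ValueError("unexpected CDB length %d" % len(cdb))

    return {
        "op": OPCODES[opcode],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * SECTOR_SIZE,
    }


if __name__ == "__main__":
    # First terminated READ(10) in the log above
    print(decode_cdb("28 00 1f 78 08 d0 00 00 88 00"))
```

For that first terminated READ(10), this gives 136 blocks, i.e. 69632 bytes, which matches the `length 69632` that mps printed on the same line.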

The drive seems OK:

Code:

[dan@knew:~] $ sudo smartctl -a /dev/da18
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA500
Serial Number:    6539K3OJFS9A
LU WWN Device Id: 5 000039 65bb0025d
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Oct 24 13:51:21 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 541) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       522
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       102
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       48
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       25262
 10 Spin_Retry_Count        0x0033   102   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       102
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       149
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       93
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       738
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       39 (Min/Max 19/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   038   038   000    Old_age   Always       -       25093
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       203
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20994         -
# 2  Extended offline    Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[dan@knew:~] $

leebrown66 · Oct 24, 2018

dvl@ said:
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 149

First time I've seen that attribute, but it's shock/vibration, which I would not expect to be non-zero.

Any chance you have a train going past the building, or something similar? I know it sounds far-fetched, but I once had a customer whose building would experience power fluctuations when a train stopped next to their premises, waiting for a signal to clear.

dvl@ · Oct 24, 2018

This server sits in a rack on concrete in the basement.

No trains here. Everything on a UPS.

But here are the rest of the disks:

Code:

[dan@knew:~] $ /bin/sh
$ sysctl -n kern.disks
da19 da18 da17 da16 da15 da14 da13 da12 da11 da10 da9 da8 da7 da6 da5 da4 da3 da2 da1 da0 ada3 ada2 ada1 ada0
$ for disk in `sysctl -n kern.disks`
do
echo -n "${disk}: " && sudo smartctl -a /dev/${disk} | grep Sense_Error_Rate
done
da19: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       684
da18: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       150
da17: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       783
da16: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       50
da15: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       33
da14: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       53
da13: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       35
da12: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       607
da11: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       12
da10: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1652
da9: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       169
da8: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       448
da7: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       36
da6: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       19
da5: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       57
da4: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       133
da3: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1578
da2: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       15
da1: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       254
da0: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       470
ada3: ada2: ada1: ada0: $
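
To make the spread easier to see, the loop output can be sorted by raw value. A rough Python sketch follows. `rank_g_sense` is a made-up helper name, and it only assumes the `daN: 191 G-Sense_Error_Rate ... RAW` line format shown above.

```python
# Sketch only: rank disks by the raw G-Sense_Error_Rate value, given the
# text produced by the shell loop above.
import re

# device name, then attribute 191, then the raw value as the last field
G_SENSE_LINE = re.compile(r"^(\w+):\s+191\s+G-Sense_Error_Rate\s.*\s(\d+)\s*$")


def rank_g_sense(loop_output):
    """Return (disk, raw_value) pairs, highest raw value first."""
    rows = []
    for line in loop_output.splitlines():
        match = G_SENSE_LINE.match(line.strip())
        if match:  # lines without the attribute (e.g. the ada SSDs) are skipped
            rows.append((match.group(1), int(match.group(2))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    sample = (
        "da10: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1652\n"
        "da11: 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       12\n"
        "ada3: ada2: ada1: ada0: $\n"
    )
    for disk, raw in rank_g_sense(sample):
        print(disk, raw)
```

On the full output above, this puts da10 (1652), da3 (1578) and da17 (783) at the top.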

Here are the same values, I think, from a year ago: https://gist.github.com/dlangille/1882b58d913e78e74d6c15358d32774f

Looking back, I see this attribute in a blog post from 2013 and it is zero.

dvl@ · Oct 25, 2018

This morning, the zpool was unavailable. Several things happened.

da18 from above now looks like this:

EDIT: this error occurred 5 hours earlier.

Code:

[dan@knew:~] $ sudo smartctl -a /dev/da18
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA500
Serial Number:    6539K3OJFS9A
LU WWN Device Id: 5 000039 65bb0025d
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 25 13:19:22 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 541) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       9005
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       109
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       17664
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       25283
 10 Spin_Retry_Count        0x0033   102   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       109
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       150
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       99
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       747
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 19/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2121
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   038   038   000    Old_age   Always       -       25114
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       206
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 25278 hours (1053 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 31 00 17 b1 29 06

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 00 00      09:24:25.037  FLUSH CACHE EXT
  61 20 00 10 5a f4 40 00      09:24:25.036  WRITE FPDMA QUEUED
  61 50 40 78 2b a4 40 00      09:24:25.024  WRITE FPDMA QUEUED
  61 30 48 78 5a f4 40 00      09:24:25.024  WRITE FPDMA QUEUED
  61 28 38 50 5a f4 40 00      09:24:25.023  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20994         -
# 2  Extended offline    Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[dan@knew:~] $
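
Comparing two `smartctl -a` outputs by eye is error-prone, so here is a small Python sketch that diffs the raw values between two snapshots. It assumes the attribute table layout shown above (ID, name, flag, three value columns, type, updated, when-failed, raw). `parse_attributes` and `raw_value_changes` are hypothetical names for illustration.

```python
# Sketch only: report SMART attributes whose RAW_VALUE changed between two
# saved `smartctl -a` outputs.
import re

# ID  NAME  FLAG  VALUE WORST THRESH  TYPE  UPDATED  WHEN_FAILED  RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(smartctl_output):
    """Map attribute name to its raw value (leading integer only)."""
    attrs = {}
    for line in smartctl_output.splitlines():
        match = ATTR_ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def raw_value_changes(before, after):
    """Return {name: (old_raw, new_raw)} for attributes that differ."""
    old = parse_attributes(before)
    new = parse_attributes(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


if __name__ == "__main__":
    # Two rows taken from the da18 outputs in this thread
    oct24 = "  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       48\n"
    oct25 = "  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       17664\n"
    print(raw_value_changes(oct24, oct25))
```

Fed the Oct 24 and Oct 25 outputs for da18, this would flag, among others, Reallocated_Sector_Ct going from 48 to 17664 and Reallocated_Event_Count going from 6 to 2121.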

dvl@ · Oct 25, 2018

da17 has other errors:

EDIT: this error was 54 hours ago.

Code:

[dan@knew:~] $ sudo smartctl -a /dev/da17  
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA500
Serial Number:    653AK2MXFS9A
LU WWN Device Id: 5 000039 65bb8029e
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 25 13:19:08 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 540) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       9331
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       90
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       11192
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       25221
 10 Spin_Retry_Count        0x0033   101   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       783
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       685
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       33 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1332
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       7
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   038   038   000    Old_age   Always       -       25052
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       206
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 25167 hours (1048 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 27 62 d7 40  Error: ICRC, ABRT at LBA = 0x00d76227 = 14115367

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 60 91 c4 40 00      02:14:50.103  WRITE FPDMA QUEUED
  61 20 00 08 62 d7 40 00      02:14:49.857  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      02:14:49.683  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      02:14:49.683  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 00 00      02:14:49.682  SET FEATURES [Enable read look-ahead]

Error 6 occurred at disk power-on lifetime: 25167 hours (1048 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 5f 92 c4 40  Error: ICRC, ABRT at LBA = 0x00c4925f = 12882527

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 00 60 91 c4 40 00      02:14:49.357  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      02:14:49.108  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      02:14:49.108  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 00 00      02:14:49.108  SET FEATURES [Enable read look-ahead]
  ef 03 45 00 00 00 00 00      02:14:49.107  SET FEATURES [Set transfer mode]

Error 5 occurred at disk power-on lifetime: 25167 hours (1048 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 9f 61 d7 40  Error: ICRC, ABRT at LBA = 0x00d7619f = 14115231

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 28 39 c3 40 00      02:14:46.369  WRITE FPDMA QUEUED
  61 20 00 80 61 d7 40 00      02:14:46.369  WRITE FPDMA QUEUED
  b0 d5 01 00 4f c2 00 00      02:14:46.357  SMART READ LOG
  61 00 00 28 38 c3 40 00      02:14:46.356  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      02:14:46.213  SET FEATURES [Enable SATA feature]

Error 4 occurred at disk power-on lifetime: 25167 hours (1048 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 f7 60 d7 40  Error: ICRC, ABRT at LBA = 0x00d760f7 = 14115063

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 00 d8 60 d7 40 00      02:14:43.356  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      02:14:43.274  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      02:14:43.274  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 00 00      02:14:43.274  SET FEATURES [Enable read look-ahead]
  ef 03 45 00 00 00 00 00      02:14:43.274  SET FEATURES [Set transfer mode]

Error 3 occurred at disk power-on lifetime: 25167 hours (1048 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 f7 60 d7 40  Error: ICRC, ABRT at LBA = 0x00d760f7 = 14115063

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 20 00 d8 60 d7 40 00      02:14:42.945  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 00 00      02:14:42.857  FLUSH CACHE EXT
  61 20 00 b8 60 d7 40 00      02:14:42.857  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      02:14:42.653  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      02:14:42.652  SET FEATURES [Enable write cache]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20993         -
# 2  Extended offline    Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
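
For anyone reading the error log above: the "Error: ICRC, ABRT at LBA = ..." summary can be reproduced by hand from the ER/SN/CL/CH/DH register bytes. The following is a minimal sketch in Python, not how smartctl itself is implemented. The function name `decode_error` and the `ER_FLAGS` table are made up for illustration, only four of the error-register bits are listed, and it assumes 28-bit LBA register layout. ICRC is an interface CRC error, which, together with the UDMA_CRC_Error_Count of 7, would be consistent with the cable suspicion rather than a media fault.

```python
# Hypothetical helper: decode one "registers were" row from the SMART error log.
# Only a subset of the ATA error-register bits is listed here (assumption:
# the remaining bits are not needed for the rows shown in this log).
ER_FLAGS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error(er, sn, cl, ch, dh):
    """Return (list of flag names, LBA) for a 28-bit LBA register set."""
    flags = [name for bit, name in ER_FLAGS.items() if er & bit]
    # LBA28 layout: DH low nibble = bits 27..24, CH = 23..16, CL = 15..8, SN = 7..0
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


# Register bytes (ER, SN, CL, CH, DH) copied from Errors 7, 6, 5 and 4 above.
rows = [
    (0x84, 0x27, 0x62, 0xD7, 0x40),
    (0x84, 0x5F, 0x92, 0xC4, 0x40),
    (0x84, 0x9F, 0x61, 0xD7, 0x40),
    (0x84, 0xF7, 0x60, 0xD7, 0x40),
]

for er, sn, cl, ch, dh in rows:
    flags, lba = decode_error(er, sn, cl, ch, dh)
    print(f"Error: {', '.join(flags)} at LBA = 0x{lba:08x} = {lba}")
```

The first row should come out as `Error: ICRC, ABRT at LBA = 0x00d76227 = 14115367`, matching what smartctl printed for Error 7.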