I have been experiencing problems with one of two Intel SSDs since approximately
July 2017. Reformatting the disk improved the situation for a while, but
currently the disk is rejected during boot.
- 2 x SSD: INTEL SSDSC2BW480A4 DC32
- Intel Patsburg AHCI SATA controller
The disk is part of a ZFS mirror:
Code:
# zpool status -v zboot
  pool: zboot
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1.43G in 0h10m with 0 errors on Sun Aug  6 15:17:09 2017
config:

        NAME                                    STATE     READ WRITE CKSUM
        zboot                                   DEGRADED     0     0     0
          mirror-0                              DEGRADED     0     0     0
            diskid/DISK-PHDA409400P44805GNp4    ONLINE       0     0     0
            3706358200868667397                 REMOVED      0     0     0  was /dev/diskid/DISK-BTDA404403064805GNp4

errors: No known data errors
The logs show two commands that run into timeouts:
- WRITE_FPDMA_QUEUED
- SOFT_RESET
Code:
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 1 port 0
ahcich1: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 80 serr 00000000 cmd 0004c117
(aprobe1:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich1:0:0:0): CAM status: Command timeout
(aprobe1:ahcich1:0:0:0): Error 5, Retries exhausted
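As a reading aid for the "tfd" values in these logs, here is a minimal Python sketch. It assumes that the value printed by the driver is the AHCI task file data register, with the ATA status byte in bits 7:0 and the ATA error byte in bits 15:8. The function name decode_tfd is only illustrative.

```python
# Decode a "tfd" value as printed in the ahcich log lines.
# Assumption: bits 7:0 hold the ATA status register, bits 15:8 the error register.
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error, details in the error byte
}


def decode_tfd(hex_value):
    """Return (list of status flag names, error byte) for a hex tfd string."""
    value = int(hex_value, 16)
    status = value & 0xFF
    error = (value >> 8) & 0xFF
    flags = [name for bit, name in STATUS_BITS.items() if status & bit]
    return flags, error


if __name__ == "__main__":
    # "00000080" appears in the AHCI reset lines, "40" in the first write timeout.
    for raw in ("00000080", "40"):
        flags, error = decode_tfd(raw)
        print(raw, flags, "error=0x%02x" % error)
```

Under that assumption, tfd = 00000080 means only BSY is set, i.e. the device still reports busy after the 31000 ms reset wait.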
Code:
ahcich1: Timeout on slot 14 port 0
ahcich1: is 00000000 cs 00000000 ss 00004000 rs 00004000 tfd 40 serr 00000000 cmd 0004ce17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c8 10 43 6e 40 02 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 15 port 0
ahcich1: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0004cf17
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 16 port 0
ahcich1: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 80 serr 00000000 cmd 0004d017
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 17 port 0
ahcich1: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 80 serr 00000000 cmd 0004d117
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <INTEL SSDSC2BW480A4 DC32> s/n BTDA404403064805GN detached
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 18 port 0
ahcich1: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00000000 cmd 0004d217
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 19 port 0
ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000 cmd 0004d317
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 20 port 0
ahcich1: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 80 serr 00000000 cmd 0004d417
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 21 port 0
ahcich1: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 80 serr 00000000 cmd 0004d517
(ada1:ahcich1:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 80 serr 00000000 cmd 0004d617
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 23 port 0
ahcich1: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 80 serr 00000000 cmd 0004d717
(ada1:ahcich1:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 24 port 0
ahcich1: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 80 serr 00000000 cmd 0004d817
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 25 port 0
ahcich1: is 00000000 cs 02000000 ss 02000000 rs 02000000 tfd 80 serr 00000000 cmd 0004d917
(ada1:ahcich1:0:0:0): DSM TRIM. ACB: 06 01 00 00 00 40 00 00 00 00 01 00
(ada1:ahcich1:0:0:0): CAM status: Unconditionally Re-queue Request
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c8 10 43 6e 40 02 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
(ada1:ahcich1:0:0:0): Periph destroyed
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 26 port 0
ahcich1: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 80 serr 00000000 cmd 0004da17
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
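To see which write is affected, the ACB bytes of the WRITE_FPDMA_QUEUED line can be decoded. The following Python sketch is a reading aid, not a definitive decoder. It assumes that the 12 bytes after "ACB:" are printed in the order command, features, LBA 7:0, LBA 15:8, LBA 23:16, device, LBA 31:24, LBA 39:32, LBA 47:40, features_exp, sector_count, sector_count_exp, and that for NCQ commands the transfer length is carried in the two features bytes. The function name decode_ncq_acb is made up for this example.

```python
def decode_ncq_acb(acb):
    """Decode the 12 hex bytes printed after 'ACB:' for an NCQ read/write.

    The byte order is an assumption (see the text above), not taken from
    the log itself.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes, got %d" % len(b))
    # 48-bit LBA: low three bytes, then the three "exp" bytes.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    # NCQ: sector count lives in features (low) and features_exp (high).
    sectors = b[1] | b[9] << 8
    if sectors == 0:
        sectors = 65536  # a count of 0 means 65536 sectors for FPDMA commands
    return {"command": b[0], "lba": lba, "sectors": sectors}


if __name__ == "__main__":
    info = decode_ncq_acb("61 c8 10 43 6e 40 02 00 00 00 00 00")
    print("command 0x%02x, LBA %d, %d sectors" % (
        info["command"], info["lba"], info["sectors"]))
```

With these assumptions, both WRITE_FPDMA_QUEUED entries in the log refer to the same request: 200 sectors starting at LBA 40780560.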
The setup had been running fine since the initial installation (> 1 year), and
tests with other operating systems were successful (CentOS on the same hardware,
macOS with the disk on an external disk controller). At first I suspected r317080
as a possible cause. I was wrong: reverting the change didn't help.
The disk is also rejected while booting from a FreeBSD 11.1 rescue image.
What I've tried up to now (without success):
- hint.ahci.X.msi=0 # with proper X
- kern.cam.ada.X.quirks=2 # default: 1
- swapping disks ada0 and ada1
- changing cables
- after the first errors I wrote zeros to the disk (under macOS and
FreeBSD, while I could still mount the disk). There were no errors, but I
read later that this is not a good idea on an SSD...
Reading this Intel discussion confused me even further. A similar symptom
was fixed there by a firmware update, which is only available for the more
expensive professional S3610 and not for the 530 series. If this is the
same problem, why were my attempts to switch off TRIM in FreeBSD
unsuccessful, while other operating systems seem to work around the issue?
Am I digging in the wrong area?
Since the ada1 disk is rejected during boot, only the remaining SSD on
ada0 is shown here (same product).
Code:
# camcontrol devlist
<INTEL SSDSC2BW480A4 DC32> at scbus0 target 0 lun 0 (ada0,pass0)
<ST8000VN0002-1Z8112 SC60> at scbus2 target 0 lun 0 (ada1,pass1)
<ST8000VN0002-1Z8112 SC60> at scbus3 target 0 lun 0 (ada2,pass2)
<ST8000VN0002-1Z8112 SC60> at scbus4 target 0 lun 0 (ada3,pass3)
<ST8000VN0002-1Z8112 SC60> at scbus5 target 0 lun 0 (ada4,pass4)
<AHCI SGPIO Enclosure 1.00 0001> at scbus6 target 0 lun 0 (ses0,pass5)
# camcontrol identify ada0
pass0: <INTEL SSDSC2BW480A4 DC32> ACS-2 ATA SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
protocol ATA/ATAPI-9 SATA 3.x
device model INTEL SSDSC2BW480A4
firmware revision DC32
serial number PHDA409400P44805GN
WWN 55cd2e400038eb82
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 512, offset 0
LBA supported 268435455 sectors
LBA48 supported 937703088 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM non-rotating
Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management yes yes 254/0xFE
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload yes yes
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks yes 1
DSM - deterministic read yes any value
Host Protected Area (HPA) yes no 937703088/937703088
HPA - Security no
# smartctl -a /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Intel 53x and Pro 2500 Series SSDs
Device Model: INTEL SSDSC2BW480A4
Serial Number: PHDA409400P44805GN
LU WWN Device Id: 5 5cd2e4 00038eb82
Firmware Version: DC32
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Aug 6 22:30:28 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x05) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 33) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 5860) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0025) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 56
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 17627h+12m+48.210s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 202
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 28
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 24
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0032 032 047 000 Old_age Always - 32 (Min/Max 20/47)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 24
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 53219847
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 65535
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 0
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 65535
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 061 061 000 Old_age Always - 0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 53219847
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 199701
249 NAND_Writes_1GiB 0x0032 100 100 000 Old_age Always - 616365
SMART Error Log not supported
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Interrupted (host reset) 10% 17626 -
# 2 Offline Interrupted (host reset) 10% 17626 -
# 3 Offline Interrupted (host reset) 10% 17626 -
# 4 Offline Interrupted (host reset) 10% 17626 -
# 5 Offline Interrupted (host reset) 10% 17626 -
# 6 Offline Interrupted (host reset) 10% 17626 -
# 7 Offline Interrupted (host reset) 10% 17626 -
# 8 Offline Interrupted (host reset) 10% 17625 -
# 9 Offline Interrupted (host reset) 10% 17623 -
#10 Offline Interrupted (host reset) 10% 17623 -
#11 Offline Interrupted (host reset) 10% 17614 -
#12 Offline Interrupted (host reset) 10% 17614 -
#13 Offline Interrupted (host reset) 10% 17520 -
#14 Offline Interrupted (host reset) 10% 17519 -
#15 Offline Interrupted (host reset) 10% 17515 -
#16 Offline Interrupted (host reset) 10% 17515 -
#17 Offline Interrupted (host reset) 10% 17515 -
#18 Offline Interrupted (host reset) 10% 17515 -
#19 Offline Interrupted (host reset) 10% 17514 -
#20 Offline Interrupted (host reset) 10% 17512 -
#21 Offline Interrupted (host reset) 10% 17510 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
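A few error counters in the ada0 attribute table above are non-zero. The following Python sketch shows one way to pull them out of saved "smartctl -a" text. It is only an illustration: the list of watched attribute names is my own selection, and the parser assumes the ten-column attribute layout shown above.

```python
# Attribute names to watch; this selection is an assumption, not a standard list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Program_Fail_Count",
    "Erase_Fail_Count",
    "Uncorrectable_Error_Cnt",
    "UDMA_CRC_Error_Count",
}


def nonzero_error_counters(smart_text):
    """Return {attribute name: raw value} for watched attributes with raw != 0."""
    found = {}
    for line in smart_text.splitlines():
        # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        name, raw = fields[1], fields[9].strip()
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


SAMPLE = """\
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 56
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 28
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
187 Uncorrectable_Error_Cnt 0x0032 099 099 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
"""

if __name__ == "__main__":
    for name, raw in nonzero_error_counters(SAMPLE).items():
        print(name, raw)
```

The sample rows are copied from the output above; on a live system the text could instead come from redirecting the smartctl output to a file.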
System info (dmesg) is attached.