Solved Timeouts with INTEL SSD on Intel Patsburg AHCI SATA controller

I have been having problems with one of two Intel SSDs since approximately
July 2017. Reformatting the disk improved the situation for a while, but
the disk is now rejected during boot.

  • 2 x SSD: INTEL SSDSC2BW480A4 DC32
  • Intel Patsburg AHCI SATA controller

The disk is part of a ZFS mirror:

Code:
# zpool status -v zboot
  pool: zboot
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1.43G in 0h10m with 0 errors on Sun Aug  6 15:17:09 2017
config:

        NAME                                  STATE     READ WRITE CKSUM
        zboot                                 DEGRADED     0     0     0
          mirror-0                            DEGRADED     0     0     0
            diskid/DISK-PHDA409400P44805GNp4  ONLINE       0     0     0
            3706358200868667397               REMOVED      0     0     0  was /dev/diskid/DISK-BTDA404403064805GNp4

errors: No known data errors

The logs show two commands that run into timeouts:
  • WRITE_FPDMA_QUEUED
  • SOFT_RESET


Code:
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 1 port 0
ahcich1: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 80 serr 00000000 cmd 0004c117
(aprobe1:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich1:0:0:0): CAM status: Command timeout
(aprobe1:ahcich1:0:0:0): Error 5, Retries exhausted


Code:
ahcich1: Timeout on slot 14 port 0
ahcich1: is 00000000 cs 00000000 ss 00004000 rs 00004000 tfd 40 serr 00000000 cmd 0004ce17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c8 10 43 6e 40 02 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 15 port 0
ahcich1: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0004cf17
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 16 port 0
ahcich1: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 80 serr 00000000 cmd 0004d017
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 17 port 0
ahcich1: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 80 serr 00000000 cmd 0004d117
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retry was blocked
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <INTEL SSDSC2BW480A4 DC32> s/n BTDA404403064805GN detached
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 18 port 0
ahcich1: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00000000 cmd 0004d217
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Retrying command
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Timeout on slot 19 port 0
ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000 cmd 0004d317
(aprobe0:ahcich1:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 20 port 0
ahcich1: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 80 serr 00000000 cmd 0004d417
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 21 port 0
ahcich1: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 80 serr 00000000 cmd 0004d517
(ada1:ahcich1:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 22 port 0
ahcich1: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 80 serr 00000000 cmd 0004d617
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 23 port 0
ahcich1: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 80 serr 00000000 cmd 0004d717
(ada1:ahcich1:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 24 port 0
ahcich1: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 80 serr 00000000 cmd 0004d817
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
ahcich1: Timeout on slot 25 port 0
ahcich1: is 00000000 cs 02000000 ss 02000000 rs 02000000 tfd 80 serr 00000000 cmd 0004d917
(ada1:ahcich1:0:0:0): DSM TRIM. ACB: 06 01 00 00 00 40 00 00 00 00 01 00
(ada1:ahcich1:0:0:0): CAM status: Unconditionally Re-queue Request
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c8 10 43 6e 40 02 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Error 5, Periph was invalidated
(ada1:ahcich1:0:0:0): Periph destroyed
ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich1: Poll timeout on slot 26 port 0
ahcich1: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 80 serr 00000000 cmd 0004da17
(aprobe0:ahcich1:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: Command timeout
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
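
For anyone reading the ahcich1 lines above, here is a minimal sketch of how two of the fields can be decoded. It assumes (this is not stated in the post) that cs, ss and rs are 32-bit masks with one bit per command slot, and that the low byte of tfd is the ATA status register, where 0x80 is BSY and 0x40 is DRDY. The helper names are made up for illustration.

```python
# Assumption: the low byte of "tfd" is the ATA status register.
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x08: "DRQ", 0x01: "ERR"}


def slots_in_mask(mask_hex):
    """Return the slot numbers whose bit is set in a cs/ss/rs value."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


def decode_status(tfd_hex):
    """Return the names of the ATA status bits set in a tfd value."""
    status = int(tfd_hex, 16) & 0xFF
    return [name for bit, name in ATA_STATUS_BITS.items() if status & bit]


# Values copied from the log lines above.
print(slots_in_mask("00000002"))  # [1]
print(slots_in_mask("00004000"))  # [14]
print(decode_status("80"))        # ['BSY']
print(decode_status("40"))        # ['DRDY']
```

Under these assumptions, cs 00000002 lines up with "slot 1", ss 00004000 with "slot 14", and tfd 80 reads as a device that is still busy 31 seconds after the reset.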


The system had been running fine since its initial setup (> 1 year), and tests
with other operating systems were successful (CentOS on the same hardware,
MacOS with the disk on an external disk controller). At first I suspected
r317080 as a possible cause. I was wrong; reverting the change didn't help.


The disk is also rejected while booting from a FreeBSD 11.1 rescue image.

What I've tried up to now (without success):
  • hint.ahci.X.msi=0 # with proper X
  • kern.cam.ada.X.quirks=2 # default: 1
  • swapping disks ada0 and ada1
  • changing cables
  • after the first errors I wrote zeros to the disk (under MacOS and
    FreeBSD, while I could still mount the disk). No errors, but I read
    later that this is not a good idea on an SSD...

My confusion was increased further by reading this Intel discussion.
A similar symptom was fixed in a firmware update, but that update is only
available for the more expensive professional S3610 and not for the 530
series. If that's the same problem, does it mean that my attempts to switch
off TRIM in FreeBSD were not successful, while other operating systems work
around the issue? Am I digging in the wrong area?

Since the ada1 disk is rejected during boot, only the remaining SSD on
ada0 is shown here (same product).

Code:
# camcontrol devlist
<INTEL SSDSC2BW480A4 DC32>         at scbus0 target 0 lun 0 (ada0,pass0)
<ST8000VN0002-1Z8112 SC60>         at scbus2 target 0 lun 0 (ada1,pass1)
<ST8000VN0002-1Z8112 SC60>         at scbus3 target 0 lun 0 (ada2,pass2)
<ST8000VN0002-1Z8112 SC60>         at scbus4 target 0 lun 0 (ada3,pass3)
<ST8000VN0002-1Z8112 SC60>         at scbus5 target 0 lun 0 (ada4,pass4)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus6 target 0 lun 0 (ses0,pass5)

# camcontrol identify ada0
pass0: <INTEL SSDSC2BW480A4 DC32> ACS-2 ATA SATA 3.x device
pass0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-9 SATA 3.x
device model          INTEL SSDSC2BW480A4
firmware revision     DC32
serial number         PHDA409400P44805GN
WWN                   55cd2e400038eb82
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       937703088 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             non-rotating

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes     254/0xFE
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks       yes              1
DSM - deterministic read       yes              any value
Host Protected Area (HPA)      yes      no      937703088/937703088
HPA - Security                 no


# smartctl -a /dev/ada0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 53x and Pro 2500 Series SSDs
Device Model:     INTEL SSDSC2BW480A4
Serial Number:    PHDA409400P44805GN
LU WWN Device Id: 5 5cd2e4 00038eb82
Firmware Version: DC32
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Aug  6 22:30:28 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  33) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                ( 5860) seconds.
Offline data collection
capabilities:                    (0x7f) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       56
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       17627h+12m+48.210s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       202
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       28
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       24
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0032   032   047   000    Old_age   Always       -       32 (Min/Max 20/47)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       53219847
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   061   061   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       53219847
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       199701
249 NAND_Writes_1GiB        0x0032   100   100   000    Old_age   Always       -       616365

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Interrupted (host reset)      10%     17626         -
# 2  Offline             Interrupted (host reset)      10%     17626         -
# 3  Offline             Interrupted (host reset)      10%     17626         -
# 4  Offline             Interrupted (host reset)      10%     17626         -
# 5  Offline             Interrupted (host reset)      10%     17626         -
# 6  Offline             Interrupted (host reset)      10%     17626         -
# 7  Offline             Interrupted (host reset)      10%     17626         -
# 8  Offline             Interrupted (host reset)      10%     17625         -
# 9  Offline             Interrupted (host reset)      10%     17623         -
#10  Offline             Interrupted (host reset)      10%     17623         -
#11  Offline             Interrupted (host reset)      10%     17614         -
#12  Offline             Interrupted (host reset)      10%     17614         -
#13  Offline             Interrupted (host reset)      10%     17520         -
#14  Offline             Interrupted (host reset)      10%     17519         -
#15  Offline             Interrupted (host reset)      10%     17515         -
#16  Offline             Interrupted (host reset)      10%     17515         -
#17  Offline             Interrupted (host reset)      10%     17515         -
#18  Offline             Interrupted (host reset)      10%     17515         -
#19  Offline             Interrupted (host reset)      10%     17514         -
#20  Offline             Interrupted (host reset)      10%     17512         -
#21  Offline             Interrupted (host reset)      10%     17510         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

System info (dmesg) is attached.
 

Attachments

  • dmesg.txt

Sounds like a defective drive. The SMART data indicates that the host has written 1624 TB (attribute number 225 = 53219847, in units of 32 MiB), of which 601 TB went to the flash itself (attribute number 249 = 616365, in units of 1 GiB). Given a formatted capacity of 480 GB for the drive, that works out to between 1250 and 3400 write cycles used so far (actually less, since I don't know the raw capacity), which is high but not fatal.
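
The arithmetic behind these figures can be sketched as follows. This is only an illustration: it assumes the raw values of attributes 225/241 and 249 are in the units their names suggest (32 MiB and 1 GiB), and it uses the "User Capacity" from the smartctl output rather than the unknown raw flash capacity.

```python
# Raw values copied from the smartctl output earlier in the thread.
HOST_WRITES_32MIB = 53219847      # attribute 225 / 241
NAND_WRITES_1GIB = 616365         # attribute 249
CAPACITY_BYTES = 480_103_981_056  # "User Capacity", not the raw flash size

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

host_bytes = HOST_WRITES_32MIB * 32 * MIB
nand_bytes = NAND_WRITES_1GIB * GIB

host_tib = host_bytes / TIB
nand_tib = nand_bytes / TIB

# Full-drive overwrites, assuming the user capacity is all the flash there is.
host_cycles = host_bytes / CAPACITY_BYTES
nand_cycles = nand_bytes / CAPACITY_BYTES

print(f"host writes: {host_tib:.0f} TiB, about {host_cycles:.0f} drive overwrites")
print(f"NAND writes: {nand_tib:.0f} TiB, about {nand_cycles:.0f} drive overwrites")
```

With consistent units the two ratios come out somewhat above 1250 and 3400; the lower figures appear when a binary TiB total is divided by a decimal 0.48 TB capacity. Either way the order of magnitude is the same.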

Have you contacted Intel? The drive probably has a 3- or 5-year warranty.
 
Thanks for this hint, I will get in contact with Intel now.

I was just puzzled by the fact that the SSD still works under CentOS and MacOS.

Regarding the 5-year warranty: Intel's 5 years are based on a daily workload of 20 GB of host writes. Based on the SMART attributes, this disk had 2320 GB of host writes per day. I am surprised. Is my interpretation correct?
 
Sadly, your long division is correct: 53219847 host writes (in units of 32 MB) divided by 17627 hours comes out to about 2300 GB/day (the exact answer depends on whether you use 1024 or 1000 in several places). That is roughly 25 MByte/second, sustained for over two years. It might be a perfectly sensible workload for a busy system, but it would be a pretty busy system.
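
A minimal sketch of that division, showing how much the result moves with the choice of 1000 versus 1024. It assumes the attribute counts 32 MiB units and that the power-on hours from attribute 9 are the right time base; the variable names are only for illustration.

```python
# Raw values copied from the smartctl output earlier in the thread.
HOST_WRITES_32MIB = 53219847  # attribute 225 / 241
POWER_ON_HOURS = 17627        # attribute 9, fractional part dropped
WARRANTY_GB_PER_DAY = 20      # workload figure quoted in the thread

days = POWER_ON_HOURS / 24
mib_per_day = HOST_WRITES_32MIB * 32 / days

gb_loose = mib_per_day / 1000           # treating 1 GB as 1000 "MB"
gib_binary = mib_per_day / 1024         # binary GiB per day
gb_si = mib_per_day * 1024 ** 2 / 1e9   # decimal GB per day
mib_per_second = mib_per_day / 86400

print(f"{gb_loose:.0f} / {gib_binary:.0f} / {gb_si:.0f} per day, by convention")
print(f"{mib_per_second:.1f} MiB/s sustained")
print(f"about {gb_loose / WARRANTY_GB_PER_DAY:.0f}x the 20 GB/day workload")
```

Whichever convention is used, the daily volume is two orders of magnitude above the 20 GB/day figure.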

If that's true, then it's quite possible that you have worn the drive out. This is why I divided the total writes above by the capacity of the drive (depending on how you do it, you get around 1250 to 3400 write cycles, meaning that is how many times every byte of the drive has been overwritten). Unfortunately, the write endurance of modern MLC NAND flash is typically 10^3 to 3*10^3 cycles, while SLC is still around 10^4 to 10^5. Then one has to correct for overprovisioning (your 480 GB drive probably has anywhere between 481 and 2000 GB of actual flash in it, so it doesn't go bad right away when the first flash pages start failing). But you are getting dangerously close. It's quite possible that Intel won't honor the warranty, or they might honor it anyway just to be gracious.
 