Solved Seagate Archive 8TB - WRITE_FPDMA_QUEUED

Brian Watson · Jul 14, 2015

Hi,

I'm new here... I've been running FreeBSD for nearly 3 years (after many years of Linux before that) and this is the first time I have needed you.

I have just installed a new Seagate Archive 8TB (ST8000AS0002) in my box and started copying data to it. It is logging lots (tens of thousands) of:

Code:

Jul 15 08:05:54 slug kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 18 0d 42 40 60 02 00 01 00 00
Jul 15 08:05:54 slug kernel: (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Jul 15 08:05:54 slug kernel: (ada1:ahcich1:0:0:0): Retrying command
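
A rough way to tally these per device is sketched below. It assumes the kernel messages end up in /var/log/messages (the FreeBSD default) and that the lines look exactly like the ones above; the function name count_crc_errors is made up for illustration.

```sh
# Count "Uncorrectable parity/CRC error" lines per device from syslog text on stdin.
# Assumption: lines have the form "... kernel: (ada1:ahcich1:0:0:0): CAM status: ...".
count_crc_errors() {
    awk '/CAM status: Uncorrectable parity\/CRC error/ {
        # Pull the device name out of the leading "(adaN:" token.
        if (match($0, /\([a-z]+[0-9]+:/)) {
            dev = substr($0, RSTART + 1, RLENGTH - 2)
            count[dev]++
        }
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Possible usage (path is an assumption):
#   count_crc_errors < /var/log/messages
```

Each retried command logs three lines, so only the "CAM status" line is counted.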

I'm running:

Code:

FreeBSD slug.home 10.1-RELEASE-p10 FreeBSD 10.1-RELEASE-p10 #0: Wed May 13 06:54:13 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

The disk is connected to a Marvell SATA controller and as far as I can tell it looks normal:

Code:

root@slug:~ # grep ahci0 /var/run/dmesg.boot
ahci0: <Marvell 88SE9172 AHCI SATA controller> port 0xe040-0xe047,0xe030-0xe033,0xe020-0xe027,0xe010-0xe013,0xe000-0xe00f mem 0xf7d10000-0xf7d101ff irq 19 at device 0.0 on pci3
ahci0: AHCI v1.00 with 2 6Gbps ports, Port Multiplier supported with FBS
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
root@slug:~ # grep ada1 /var/run/dmesg.boot
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <ST8000AS0002-1NA17Z AR13> ATA-9 SATA 3.x device
ada1: Serial Number Z8406DMC
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 7630885MB (15628053168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6

I want to wait until the copy completes before I run SMART self tests. Current SMART status is http://paste2.org/Dy9aFEjA

I've built a ZFS pool with just this new disk and am using zfs send/receive for the copy. It ran at around 150MB/s for a long time (hundreds of GB), but I left it running overnight and it has now slowed down to around 20MB/s. I understand slow writes are a feature of these drives, but I was expecting it to slow down after a few tens of GB, not many hundreds. Maybe the unhealthy raid pool (tank) I am reading from can't supply data fast enough to fill the (in-disk) write buffers quickly?

I'm really hoping that what I am seeing is just the device driver's way of telling me that the Seagate's internal buffers are full and it has to wait for a bit before writing more data.

Does this sound reasonable?

Any way to confirm it?

Thanks,

Brian

junovitch@ · Jul 15, 2015

CAM Timeout errors aren't usually a good sign. You shouldn't see this on a healthy setup. Given the 18 hours of being powered on, the drive might have been handled poorly during shipping. A safe bet to start with for the CRC errors is checking/replacing all your cabling and/or trying a different hot swap bay if you are using one. It could be something as simple as that. In the meantime I wouldn't trust it fully until you get to the bottom of the cause.

Brian Watson · Jul 15, 2015

Thanks for your reply. However, I am not seeing CAM Timeout errors, I am seeing "CAM status: Uncorrectable parity/CRC error". I'm not just being pedantic - it's just that most things I found while searching related to timeout errors as well. My current copy is still running (back up to ~60MB/s now and 1TB to go) and I will let it complete then replace the SATA cable.

junovitch said:
In the meantime I wouldn't trust it fully until you get to the bottom of the cause.

I have no intention of trusting it until resolved - that's why I am here...

usdmatt · Jul 15, 2015

You have quite a high UDMA_CRC_Error_Count. Looking around, it does seem to relate to problems getting data between the disk and the system, so replacing the cable is probably a good start.

From Wikipedia:

199 0xC7 UltraDMA CRC Error Count
The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

In contrast, I have a few WD RED 1 TB disks in a ZFS pool that have 17,000 hours on them and not a single CRC error.

Code:

  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       16952
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
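
If you want a quick pass/fail on that attribute, here is a minimal sketch. It assumes the usual `smartctl -A` table layout shown above, with the attribute ID in the first column and the raw value in the last; the function name crc_check is hypothetical.

```sh
# Read `smartctl -A` output on stdin and report on attribute 199.
# Exit status: 0 = raw count is zero, 1 = count is non-zero, 2 = attribute missing.
crc_check() {
    awk '$1 == 199 { found = 1; raw = $NF }
         END {
             if (!found) { print "attribute 199 not found"; exit 2 }
             if (raw + 0 > 0) { print "WARN: " raw " interface CRC errors logged"; exit 1 }
             print "OK: no interface CRC errors logged"
         }'
}

# Possible usage (device name is an assumption):
#   smartctl -A /dev/ada1 | crc_check
```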

junovitch@ · Jul 16, 2015

Brian Watson said:
Thanks for your reply. However, I am not seeing CAM Timeout errors, I am seeing "CAM status: Uncorrectable parity/CRC error".

You are correct. I was thinking the right thing but between busy and tired that didn't come across in what I wrote.

For comparison, here is a healthy WD Green drive in my home NAS that is over three years old. There are no CRC errors.

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   155   148   021    Pre-fail  Always       -       9250
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31645
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       55
193 Load_Cycle_Count        0x0032   129   129   000    Old_age   Always       -       215381
194 Temperature_Celsius     0x0022   122   117   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Brian Watson · Jul 16, 2015

This is now going to be difficult to get to the bottom of. The error messages just stopped for no apparent reason. My 6TB zfs send/receive was still running and the last (approx.) 1TB ran with no errors at all. I ran both short and long SMART tests and both completed without error.

I'm going to just start using it for a while now, but not destroy the original data for a while.

Thanks for all responses.

Brian Watson · Jul 16, 2015

junovitch said:
For comparison, here is a healthy WD Green drive in my home NAS that is over three years old. There are no CRC errors.

Yep - it's unusual in my system as well:

Code:

# for x in 1 2 3 4 5 6 7
do
echo -n ada$x:; smartctl -a /dev/ada$x | grep UDMA_CRC
done

ada1:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       32460
ada2:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
ada3:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
ada4:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       45
ada5:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
ada6:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
ada7:199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

and these are not new drives:

Code:

ada0:  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       17223
ada1:  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       48
ada2:  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       23276
ada3:  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       21882
ada4:  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18551
ada5:  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       24839
ada6:  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       24839
ada7:  9 Power_On_Hours          0x0032   053   053   000    Old_age   Always       -       35032
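
To put the two tables side by side, a rough sketch that divides the raw CRC count by the power-on hours. The helper name crc_per_hour is made up for illustration, and the inputs in the comments are the ada1 and ada4 values above.

```sh
# Print CRC errors per power-on hour.
# $1 = raw UDMA_CRC_Error_Count, $2 = raw Power_On_Hours
crc_per_hour() {
    awk -v c="$1" -v h="$2" 'BEGIN {
        # Guard against a brand-new drive reporting zero hours.
        if (h <= 0) { print "n/a"; exit }
        printf "%.4f\n", c / h
    }'
}

# ada1: 32460 errors in 48 hours
#   crc_per_hour 32460 48
# ada4: 45 errors in 18551 hours
#   crc_per_hour 45 18551
```

That works out to roughly 676 errors per hour on the new disk against a small fraction of one per hour on ada4.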

protocelt · Jul 16, 2015

FWIW, if it's not the disk itself or a bad SATA cable, it could be the controller. While some Marvell controllers do of course work under FreeBSD, they're not really very good from what I understand. ZFS can stress the disk/controller quite a bit.

Brian Watson · Jul 21, 2015

Thanks for all the input. I kept using the drive and after a while the CAM parity errors escalated to CAM timeout errors and the filesystem hung. I shut the box down and replaced the SATA cable. It came up clean. I've stressed it quite hard and can't get it to error again so I am confident that it was the cable, or at least the cable seating.
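
One way to back that up with numbers: as far as I know the raw UDMA_CRC_Error_Count is a lifetime counter that does not reset, so it is the change across a stress run that matters, not the absolute value. A minimal sketch, assuming two saved copies of `smartctl -A` output (the file names and the function names crc_raw and crc_unchanged are hypothetical):

```sh
# Print the raw value (last column) of attribute 199 from `smartctl -A` text on stdin.
crc_raw() {
    awk '$1 == 199 { print $NF }'
}

# Compare two saved snapshots; succeeds only if the counter did not grow.
# $1 = snapshot taken before the stress run, $2 = snapshot taken after
crc_unchanged() {
    before=$(crc_raw < "$1")
    after=$(crc_raw < "$2")
    echo "before=$before after=$after"
    [ "$after" -eq "$before" ]
}

# Possible usage (device and file names are assumptions):
#   smartctl -A /dev/ada1 > before.txt
#   ... run the stress test ...
#   smartctl -A /dev/ada1 > after.txt
#   crc_unchanged before.txt after.txt && echo "no new CRC errors"
```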

Uniballer · Aug 11, 2015

Just FYI - You can't compare SMART attribute error numbers between Seagate and WD drives because they view these SMART values differently. Reallocated_Sector_Ct, yes. Raw_Read_Error_Rate or Seek_Error_Rate, no.