Solved CAM status: ATA Status Error

Hi everyone,

On one server I got some disk-related errors. There are not many (the dmesg(1) output shown covers about 5 months), but they are frightening anyway. I have had no data loss so far, thanks to mirrored ZFS. Do these messages point to a real hard disk controller failure? Or only to a badly configured kernel module? Are there kernel parameters to tweak, something like bus timing settings?

Any suggestions?

dmesg
Code:
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b8 0b 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b8 9f 50 40 5d 01 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b8 a1 50 40 5d 01 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 f0 2c 9c 40 60 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich1:0:0:0): Retrying command
ahcich0: Timeout on slot 18 port 0
ahcich0: is 00000000 cs 003c0000 ss 003c0000 rs 003c0000 tfd c0 serr 00000000 cmd 0000d217
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 88 d9 2d 40 5c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 113720, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1063093, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1058432, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 606048, size: 8192
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 35 fc 40 9d 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 18 36 fc 00 9d 00 00 00 01
(ada0:ahcich0:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 f8 31 1f 40 9d 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 00 32 1f 00 9d 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 70 55 f9 40 9c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 70 55 f9 00 9c 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 70 55 f9 40 9c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 70 55 f9 00 9c 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 70 55 f9 40 9c 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0:0): RES: 41 40 70 55 f9 00 9c 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: Timeout on slot 3 port 0
ahcich0: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd c0 serr 00000000 cmd 0000c317
(ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 290833, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 637539, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1082327, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 767227, size: 16384
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 586772, size: 12288
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 290833, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1057171, size: 24576
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 201066, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1055856, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 854055, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 637539, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1082327, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 767227, size: 16384
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 586772, size: 12288
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1057171, size: 24576
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 174964, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1051025, size: 36864
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1028930, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 201066, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1055856, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 854055, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1082327, size: 32768

smartctl -a /dev/ada0 shows:
Code:
Error 1027 occurred at disk power-on lifetime: 23571 hours (982 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 58 ff ff ff 4f 00  36d+18:37:39.417  WRITE FPDMA QUEUED
  61 00 20 ff ff ff 4f 00  36d+18:37:39.417  WRITE FPDMA QUEUED
  61 00 30 ff ff ff 4f 00  36d+18:37:39.417  WRITE FPDMA QUEUED
  61 00 20 ff ff ff 4f 00  36d+18:37:39.416  WRITE FPDMA QUEUED
  61 00 30 ff ff ff 4f 00  36d+18:37:39.416  WRITE FPDMA QUEUED

smartctl -a /dev/ada1 shows:
Code:
Error 392 occurred at disk power-on lifetime: 22009 hours (917 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 40 ff ff ff 4f 00  21d+09:42:49.392  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  21d+09:42:49.361  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  21d+09:42:49.355  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  21d+09:42:49.345  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00  21d+09:42:49.336  READ FPDMA QUEUED

uname -imor
Code:
FreeBSD 9.2-RELEASE amd64 GENERIC

pciconf -lv
Code:
ahci0@pci0:0:31:2:	class=0x010601 card=0x844d1043 chip=0x1c028086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller'
    class      = mass storage
    subclass   = SATA
 
Re: CAM status: ATA Status Error

Looking at the lifetime (982 days) and the type of errors my first guess would be a disk that's close to dying.
 
Re: CAM status: ATA Status Error

Ordoban said:
Both at the same time?

Not impossible in any way. If there's a problem in the manufacturing process of the disks, the same flaw tends to creep into multiple units that are built around the same time, and those disks tend to die at around the same age.
 
Re: CAM status: ATA Status Error

These are two different errors: the "CAM status" one and the "swap_pager" one. The first is rare and does not seem critical, but the second one pointed me to a real disk fault. The Reallocated_Sector_Ct of the first disk jumped from 0 to ~20k within the last 2 days! The disk has been replaced now and all is fine.
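
For reference, the "ATA Status Error" lines in the dmesg output from the first post can be decoded by hand. The Python sketch below is only an illustration: the function names are made up for this example, and the byte order of the ACB and RES fields is an assumption about how FreeBSD prints them. That assumption does agree with the logs in this thread, because the failing LBA lands inside the range the command asked for.

```python
# Bit names for the ATA status and error registers (subset relevant here).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flag_names(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def lba48(b, first):
    """Assemble a 48-bit LBA: three low bytes, one device byte, three high bytes."""
    low = b[first] | (b[first + 1] << 8) | (b[first + 2] << 16)
    high = b[first + 4] | (b[first + 5] << 8) | (b[first + 6] << 16)
    return low | (high << 24)


def decode_res(res_line):
    """Decode the bytes after 'RES:'.

    Assumed order: status, error, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, count, count_exp.
    """
    b = [int(x, 16) for x in res_line.split()]
    return {
        "status": flag_names(b[0], STATUS_BITS),
        "error": flag_names(b[1], ERROR_BITS),
        "lba": lba48(b, 2),
    }


def decode_fpdma_acb(acb_line):
    """Decode the bytes after 'ACB:' for READ/WRITE_FPDMA_QUEUED.

    Assumed order: command, features, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, count, count_exp.
    NCQ commands carry the sector count in the two features bytes.
    """
    b = [int(x, 16) for x in acb_line.split()]
    return {"lba": lba48(b, 2), "sectors": b[1] | (b[9] << 8)}


if __name__ == "__main__":
    # Byte strings copied from the ada0 messages in the first post.
    cmd = decode_fpdma_acb("60 00 30 35 fc 40 9d 00 00 01 00 00")
    res = decode_res("41 40 18 36 fc 00 9d 00 00 00 01")
    print(res["status"], res["error"])
    print(hex(cmd["lba"]), cmd["sectors"], hex(res["lba"]))
```

Read this way, status 41 is DRDY plus ERR and error 40 is UNC, meaning the drive reported that it could not read the data at that sector. That would fit a disk whose reallocated sector count is climbing, but it is only one reading of the log.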

(How can I mark this thread as solved?)
 
Re: CAM status: ATA Status Error

Ordoban said:
How can I mark this thread as solved?

Edit the first post of this thread. There's an input box labeled "Subject:" at the top. This is where you can change the title of the thread. Just put "[SOLVED]" in front of it.
 
Re: CAM status: ATA Status Error

Ordoban said:
Both at the same time?
This does seem a bit fishy:

Ordoban said:
Code:
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
Code:
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
You don't indicate the drive manufacturer / model, but it seems odd that 2 drives had an error on the same disk block, and that block just happened to be the "magic" last addressable LBA in pre-LBA48 mode. In theory, a drive should reject a command to access a block outside its capacity, but it may be that the model you're using barfs and logs a SMART error instead.
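
The "magic" value is easy to check with a little arithmetic. The Python sketch below is only an illustration; the function names are made up for this example, and the 512-byte logical sector size is an assumption.

```python
# Largest LBA that fits in the 28 address bits of the pre-LBA48 ATA commands.
LBA28_BITS = 28
LBA28_MAX = (1 << LBA28_BITS) - 1

# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512


def lba28_limit_bytes(sector_bytes=SECTOR_BYTES):
    """Capacity reachable with 28-bit addressing (2**28 sectors)."""
    return (LBA28_MAX + 1) * sector_bytes


def is_lba28_all_ones(lba):
    """True when a logged LBA is exactly the all-ones 28-bit value."""
    return lba == LBA28_MAX


if __name__ == "__main__":
    print(hex(LBA28_MAX), LBA28_MAX)      # 0xfffffff 268435455
    print(lba28_limit_bytes())            # 137438953472
    print(is_lba28_all_ones(0x0FFFFFFF))  # True
```

So 268435455 is exactly 2^28 - 1, the last block below the roughly 137 GB limit of 28-bit addressing. An all-ones value showing up on more than one drive looks more like a saturated field than a genuinely bad block at that address.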
 
Thanks for the reply, @Terry_Kennedy. Now I can relax, because it can't be a controller or FreeBSD kernel fault. One of these disks has been thrown away, and maybe the second will follow in the near future. If someone would like to know the exact disk type, here it is:
Code:
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Serial Number:    (removed)
LU WWN Device Id: (removed)
Firmware Version: CC47
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
 
Re: CAM status: ATA Status Error


Terry_Kennedy said:
This does seem a bit fishy:
You don't indicate the drive manufacturer / model, but it seems odd that 2 drives had an error on the same disk block, and that block just happened to be the "magic" last addressable LBA in pre-LBA48 mode. In theory, a drive should reject a command to access a block outside its capacity, but it may be that the model you're using barfs and logs a SMART error instead.

My 2TB Seagate Barracuda 7200.14 is reporting some errors at the same LBA:
Code:
Error 1909 occurred at disk power-on lifetime: 39990 hours (1666 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 ff ff ff 4f 00  3d+22:19:55.018  READ FPDMA QUEUED
  b0 da 00 00 4f c2 40 00  3d+22:19:54.986  SMART RETURN STATUS
  2f 00 01 10 00 00 00 00  3d+22:19:54.890  READ LOG EXT
  60 00 01 ff ff ff 4f 00  3d+22:19:52.121  READ FPDMA QUEUED
  e5 00 00 00 00 00 40 00  3d+22:19:52.094  CHECK POWER MODE
IMHO this is a firmware bug!
 
Re: CAM status: ATA Status Error

Looking at the lifetime (982 days) and the type of errors my first guess would be a disk that's close to dying.
This is another firmware bug in the Seagate Barracuda 7200.14.
My 2TB is reporting:
Code:
# smartctl -A /dev/ada5 | grep Power_On_Hours
  9 Power_On_Hours  0x0032  055  055  000  Old_age  Always  -  40206
The disk was bought on 2013-02-15 and the pool was created on 2013-02-22, less than 2 years ago, but 40206 hours is more than 4 years. When the drive is idle it counts time at least 5 times faster.
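
The mismatch can be put into numbers. The Python sketch below is only an illustration with made-up function names. It takes "less than 2 years ago" to mean at most the second anniversary of the purchase date (an assumption, since the date of the post is not given), which caps the real power-on time.

```python
from datetime import date

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # average year length, good enough for a rough check


def hours_to_years(hours):
    """Convert a Power_On_Hours raw value to years."""
    return hours / (HOURS_PER_DAY * DAYS_PER_YEAR)


def max_possible_hours(start, end):
    """Upper bound on power-on hours if the disk ran nonstop from start to end."""
    return (end - start).days * HOURS_PER_DAY


REPORTED_HOURS = 40206        # raw value from the smartctl output above
BOUGHT = date(2013, 2, 15)    # purchase date from the post
# Assumption: "less than 2 years ago" means no later than this date.
LATEST_NOW = date(2015, 2, 15)

if __name__ == "__main__":
    limit = max_possible_hours(BOUGHT, LATEST_NOW)
    print(limit)  # 17520
    print(round(hours_to_years(REPORTED_HOURS), 2))
    print(round(REPORTED_HOURS / limit, 2))
```

Even if the disk had been powered on continuously since the day it was bought, the counter is more than twice what is physically possible. The "5 times faster when idle" figure is the poster's own estimate and cannot be derived from these two numbers alone.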
 
My suggestion is not to throw any of these disks away (you can send them to me instead if you really want to get rid of them ;)), because they may be quite OK. Of course it may just be my beginner's bad luck with FreeBSD, but it seems that FreeBSD sometimes doesn't handle disks properly, and this causes problems. I'm still trying to figure out the exact causes, but I must finally bump my bug report even though I still don't know everything. In short, I had a similar problem (and some more): I got read errors, and smartctl was unable to run even the shortest test of the disk because of them. Creating a partition at the end of the disk also failed. I recovered the disk by dd /dev/zero'ing it (you can do this in OpenBSD, for example; it doesn't seem to have this problem, but I tried it for too short a time to be 100% sure). If that doesn't work for you, you can probably try some Linux instead. Cheers
 
You can also do a firmware update
Code:
==> WARNING: A firmware update for this drive is available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

In my case this may be a bad idea. It's a rented server and I have no physical access to it. I can't even power off the drives if something fails. This error is not a problem for me anyway. ZFS corrects any read error, and this error message is very rare (it has appeared once since my first post).
 
@Ordoban The firmware is the latest one. If you have a disk with bad blocks you can try sysutils/hdrecover, but some ZFS experts at ZFSguru.com suggest letting the file system deal with bad blocks, and that is probably the best solution.
@petrek I bought 2 of these drives less than two years ago, and one stopped working 2 weeks ago: it was not possible to read the disk even with the various utilities on the "Ultimate Boot CD" on a different PC. The other one has the errors reported in this thread.
What I have learned so far: never put identical disks in a pool. Next time I will use disks of different brands.
 
When you have errors on a disk, there are many things you should check before you can be 100% sure it's a real disk failure. First, the cable: make sure you use a known-good cable, connected to a known-good port. Preferably connect the disk through USB if you have a proper adapter cable (if you don't have one, buy one if you can); this way you can sometimes recover, or at least read, the data from a disk that is no longer seen by the BIOS. Once you no longer care about the data, don't try to read the disk anywhere; just fill the whole disk with zeroes using dd (preferably on a different operating system, to avoid OS-specific mistakes). The errors don't have to be vendor specific, just disk-size specific, and this could indicate some OS or file system bug. For example, I have two disks identical in size to the byte, one from Seagate and one from WD, and they both developed the same error. I wasn't able to recover them even by creating a new partition table; only zeroing them helped, and since then S.M.A.R.T. doesn't complain about any errors on them. I don't think I'm so special that such things happen only to me, so it's better (and much cheaper) to check all of the above first. If you still have errors on a disk after that, make sure your environment is not at fault (too much humidity, vibration, etc.) to avoid similar breakage in the future.
 
Thank you, petrek, for the tips, but the disk is not recognized by the OS even when connected via USB, whether on Windows 7, Linux or FreeBSD, on different PCs. When connected via USB the disk makes some strange noises. I opened the disk and saw that the noise is due to the head parking and unparking repeatedly, probably because of errors. Now the disk is, obviously, unusable, but I did appreciate its mechanical construction. Thanks again, petrek; I have bookmarked this thread for these tips.
 
Sorry to hear that. Let your disk rest in pieces ;) If your disk was not recognized, it could also be firmware corruption; there are some tools to write the firmware again. As for parking, many OSes have the wrong default setting for some disks and will park the heads as often as every few seconds, which can quickly destroy mechanical disks. I think on FreeBSD you can use ataidle from Ports to avoid it, but I haven't tried it yet. S.M.A.R.T. should have info on how many times the disk was parked.
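
In smartctl output the parking counter normally appears as attribute 193, Load_Cycle_Count. The Python sketch below is only an illustration of pulling raw values out of `smartctl -A` text. The function names are made up, the Power_On_Hours row is the one posted earlier in this thread, and the Load_Cycle_Count row is invented purely so the example has something to parse.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def load_cycles_per_hour(attrs):
    """Average head load/unload cycles per power-on hour, or None if unknown."""
    hours = attrs.get("Power_On_Hours")
    cycles = attrs.get("Load_Cycle_Count")
    if not hours or cycles is None:
        return None
    return cycles / hours


# First row: from this thread. Second row: INVENTED sample data, not a real drive.
SAMPLE = """\
  9 Power_On_Hours    0x0032  055  055  000  Old_age  Always  -  40206
193 Load_Cycle_Count  0x0032  001  001  000  Old_age  Always  -  250000
"""

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(attrs)
    print(load_cycles_per_hour(attrs))
```

On a real system the text would come from running `smartctl -A` on the device. A rate of many cycles per hour would point at the aggressive parking described above.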
 