Hard drive error questions

Good morning. I have two hard drives logging some errors. I've done some research, but can't find exactly what the errors mean. Both of the drives that are logging errors are plugged into a PCI-to-SATA expansion card. The errors are below:
Code:
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): CAM status: ATA Status Error
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): ATA status: 71 (DRDY DF SERV ERR), error: 04 (ABRT )
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): RES: 71 04 00 00 00 00 00 00 00 00 00
Feb 24 05:09:35 nasghoul kernel: (ada0:ahcich2:0:0:0): Retrying command, 0 more tries remain
I'd suspect the drive itself if it weren't both drives, plugged into the same card. Thanks. --jake
 
They're refurbs from Amazon. I've run the SMART tests. No issues. I'm on 12.2-RELEASE btw.
Edit: Power On Time is 5943 hours, 247 days.
 
Did you try them on another controller? I had an old Promise ATA300 card that decided to die one day. The controller chip on the card got really hot (as in "can't even touch it anymore" hot) and it gave all sorts of problems when there was more than one drive attached to it. So it could be the controller that's causing the problems.
 
I have not yet. I will try this weekend. Can you explain what the error codes mean? ATA status 71 for instance? I googled around a bit, but haven't found much. Thanks. --jake
 
Don't know, it's usually just some status from the driver. More often than not the fault is actually with the drive itself.

Edit: Power On Time is 5943 hours, 247 days.
That's fine, but it could still be broken. I have drives that worked for years on end and drives that broke within a few months; power-on time is just an indication of the age of the drive. What do 197 Current_Pending_Sector and 198 Offline_Uncorrectable tell you?
 
First thing is to check the cables. For SATA-III drives, be sure to use SATA-III-certified cables. Also, older cables don’t have clips, and sometimes they slowly lose contact due to vibration inside the PC case. Also check the drives’ power connectors.

BTW, can you please post the output from smartctl -a /dev/ada0?
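
For anyone who wants to pull just attributes 197 and 198 out of that output, here is a minimal Python sketch. It assumes the usual smartctl attribute table layout, where each row starts with the numeric ID and the raw value is the tenth whitespace-separated column. The function name is made up for illustration, and the text is expected to come from a saved copy of the smartctl output.

```python
def pending_and_uncorrectable(smartctl_text):
    """Return {attribute_name: raw_value} for SMART attributes 197 and 198.

    Assumes attribute rows look like:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    wanted = {"197", "198"}
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0] in wanted:
            # Raw values are kept as strings; only the first token is taken.
            found[fields[1]] = fields[9]
    return found
```

A non-zero raw value for either attribute would usually point at the platters rather than at cabling or the controller.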
 
Did you try them on another controller? I had an old Promise ATA300 card that decided to die one day. The controller chip on the card got really hot (as in "can't even touch it anymore" hot) and it gave all sorts of problems when there was more than one drive attached to it. So it could be the controller that's causing the problems.
If it's PCI, it's most likely ancient and pre-AHCI, so that wouldn't be too surprising.
 
It had been running with 4 drives attached for a number of years. No stellar performance, of course, but it worked without problems. Until it decided it couldn't take it any more and just burned out. I started getting a LOT of status errors and time-outs. When I touched the controller chip I knew its time was up. Bought a second-hand LSI-based SAS/SATA card to replace it. Much better performance (PCIx4 slot). It's still running with those 4 drives attached. Not the same 4 drives though; I think I replaced all of them at least once since then.
 
I think "DRDY DF SERV ERR" is some sort of communications error, not a head/platter error. So all the discussion above about cables and chips seems more important than smartctl.
 
Yes, the status is a bit mask: The value “71” from the driver message is a hexadecimal number (0x71) that consists of:
  • bit 6 (0x40 = DRDY) “drive ready” – This is normal. It means that the drive is ready to receive commands.
  • bit 5 (0x20 = DF) “device fault” – This is not good.
  • bit 4 (0x10 = SERV) “overlapped mode service request” – This flag depends on the command, in this case FLUSHCACHE48. Not sure what it means in this context.
  • bit 0 (0x01 = ERR) – An error condition; further details are in the error value.
The error value in this case is 0x04 = ABRT. That means the error condition was caused by an aborted command, in this case a FLUSHCACHE48 command. The ABRT error can be caused by an invalid command or by a device error.
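
The same decoding can be sketched in a few lines of Python. This is only an illustration: the bit tables are deliberately partial and list just the flags named in this thread, the helper names are made up, and the regular expression assumes the log line format shown in the first post.

```python
import re

# Partial tables: only the flags discussed in this thread are listed.
STATUS_BITS = {0x40: "DRDY", 0x20: "DF", 0x10: "SERV", 0x01: "ERR"}
ERROR_BITS = {0x04: "ABRT"}

def decode(value, table):
    """Return the names of all flags set in value, highest bit first."""
    names = [name for bit, name in sorted(table.items(), reverse=True) if value & bit]
    # Report any set bits that the partial table does not cover.
    unknown = value & ~sum(table)
    if unknown:
        names.append(f"unknown:0x{unknown:02x}")
    return names

def parse_log_line(line):
    """Extract the hex status and error values from an 'ATA status:' line."""
    match = re.search(r"ATA status: ([0-9a-f]{2}) .*error: ([0-9a-f]{2})", line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)

line = "(ada0:ahcich2:0:0:0): ATA status: 71 (DRDY DF SERV ERR), error: 04 (ABRT )"
status, error = parse_log_line(line)
print(decode(status, STATUS_BITS))  # ['DRDY', 'DF', 'SERV', 'ERR']
print(decode(error, ERROR_BITS))    # ['ABRT']
```

The result matches the names the kernel already prints in parentheses, which is a quick sanity check that 0x71 really is 0x40 + 0x20 + 0x10 + 0x01.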

Are these “WD Red” drives (e.g. WD40EFRX or WD60EFRX), by any chance?
 
They are HGST Ultrastar 7K4000s that I bought as refurbs off Amazon to, hopefully, avoid the whole SMR debacle. Also, re-seating all the cables and the PCI card seems to have done the trick. Thanks, everyone.
 