ZFS too many errors on disk

mig_40

New Member


Messages: 3

Hello,

we use a Supermicro server running FreeBSD 11.1-RELEASE to provide iSCSI devices based on zvols for VMware.
The server has one zpool made up of 6 raidz2 vdevs with 6 SATA disks in each, attached to an LSI 9300-8i HBA. Periodically the pool goes into DEGRADED status. The output of zpool status shows that one of the disks has the status
Code:
gpt/stor8  FAULTED  6  4  0  too many errors
The last time, several disks from different vdevs went to FAULTED status
and the pool itself became UNAVAIL. After rebooting the server, the disks returned to ONLINE and resilvering started. After it finished, we ran a scrub and no errors were found.

Another concern is the error records for different disks in /var/log/messages, for example:
Code:
kernel: (pass12:mpr0:0:17:0): ATA COMMAND PASS THROUGH(16). CDB: 85 07 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0 SMID 793 Aborting command 0xfffffe00015273f0
kernel: mpr0: Sending reset from mprsas_send_abort for target ID 17
kernel: (da11:mpr0:0:17:0): WRITE(10). CDB: 2a 00 f5 2d 49 88 00 00 08 00 length 4096 SMID 893 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
kernel: mpr0: Unfreezing devq for target ID 17
kernel: (da11:mpr0:0:17:0): WRITE(10). CDB: 2a 00 f5 2d 49 88 00 00 08 00
kernel: (da11:mpr0:0:17:0): CAM status: CCB request completed with an error
kernel: (da11:mpr0:0:17:0): Retrying command
kernel: (da11:mpr0:0:17:0): WRITE(10). CDB: 2a 00 f5 2d 49 88 00 00 08 00
kernel: (da11:mpr0:0:17:0): CAM status: SCSI Status Error
kernel: (da11:mpr0:0:17:0): SCSI status: Check Condition
(da11:mpr0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
kernel: (da11:mpr0:0:17:0): Retrying command (per sense data)
I would like to understand what could be causing this.
 

SirDice

Administrator
Staff member
Administrator
Moderator

Reaction score: 6,926
Messages: 28,850

It can have multiple causes: dodgy cables, insufficient power, or even disks that are on the verge of dying. One dodgy disk can hang up the entire bus.
 

mig_40


I'm wondering why the system defines the disk as a bad one, but after rebooting it works and there are no new errors on it. Perhaps there are some configurable parameters responsible for this? And why can one such disk lead to a hang of the entire pool?
 

SirDice


The disk can be a bit dodgy. Not entirely broken but intermittent. It can hang up the bus if it pushes bogus/bad data on it. I've had this happen a couple of times. The disk was having problems, ZFS marked it as bad, but the disk kept producing errors on the bus, causing all other disks to stall too. I left the disk in there while waiting for the new one I had ordered, but due to all the problems it caused I eventually removed it and things went back to "normal": the raid set was still in a degraded state, but access to the pool worked again.
 

ralphbsz

Daemon

Reaction score: 865
Messages: 1,394

I'm wondering why the system defines the disk as a bad one, but after rebooting it works and there are no new errors on it.
Perhaps the disk is intermittent?
Perhaps ZFS counts errors on disks, and if no errors occur for a while, the error count gets reset or shrinks back (it could be implemented using a "leaky bucket" technique)?
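
To make the "leaky bucket" idea concrete, here is a minimal Python sketch. It only illustrates the general technique; it is not a description of how ZFS actually decides to fault a device. The class name, the capacity of 5, and the leak rate are all made up for the example.

```python
class LeakyBucket:
    """Hypothetical error counter: each error fills the bucket, time drains it."""

    def __init__(self, capacity, leak_per_second):
        self.capacity = capacity                # level at which the disk is called "faulted"
        self.leak_per_second = leak_per_second  # how fast old errors are forgotten
        self.level = 0.0
        self.last_time = None

    def record_error(self, now):
        """Register one error at time `now` (seconds); return True once the bucket is full."""
        if self.last_time is not None:
            elapsed = now - self.last_time
            # Drain the bucket for the time that passed, never below empty.
            self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_time = now
        self.level += 1.0
        return self.level >= self.capacity


def first_fault(error_times, capacity=5, leak_per_second=0.1):
    """Return the time of the error that fills the bucket, or None if it never fills."""
    bucket = LeakyBucket(capacity, leak_per_second)
    for t in error_times:
        if bucket.record_error(t):
            return t
    return None
```

With these made-up numbers, a burst of one error per second fills the bucket within a few seconds, while the same number of errors spread a minute apart never does. That would match a disk that looks "bad" during a burst and "fine" afterwards.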

The important thing to remember is this: Whether a disk is good or bad is not a simple binary switch. A disk can be mostly good, with a few (physical) areas that give errors. A disk can be bad for a few seconds, hours, or days, and then go back to working fine. Due to firmware issues, a disk may be able to do most operations fine, but certain operations don't work well. It's shaded, multi-dimensional and time-dependent.

And why can one such disk lead to a hang of the entire pool?
Unfortunately, a single disk can cause a whole SCSI (SAS) or SATA bus to hang. In a perfect world, where all disk, HBA, and SAS expander firmware was error-free, that should probably not happen. In the real world (where firmware is written by humans, who are usually referred to as "wetware"), it does happen. Once the whole SCSI/SATA bus hangs, there is nothing ZFS can do but hang with the bus. In layman's terms, it means the computer is going down. Often a reboot or power cycle fixes it. Not always; I've seen disks that hang so badly they prevent the system from being booted. That's fun if you have 200 or 300 disks connected to a single computer, and one of them prevents booting, but you don't know which one: you will spend a few hours in the server room disconnecting cables and pulling disks until the system starts working again, and when you come out, you'll be freezing and your ears will be ringing.
 

ab2k

Member

Reaction score: 20
Messages: 73

Hi, have you tried sysutils/smartmontools? With the help of that great tool you can talk to the internal S.M.A.R.T. subsystem of your HDD, run disk tests, and read the status and metrics of your disk.

P.S. Have you also tried replacing the cable?
 

phoenix

Administrator
Staff member
Administrator
Moderator

Reaction score: 1,205
Messages: 4,044

Spurious errors like that, which affect multiple disks and disappear after a reboot, generally point to either cabling issues or power issues.

Double-check all the cables, making sure they are plugged in nice and snug. If the same disk keeps showing up as faulted, see if that cable can be replaced.

Make sure all power supplies are running normally and providing enough power. Brown-outs on the bus can be fatal in the long run.

If all of that checks out, it may be the backplane the drives plug into having issues. Or maybe the HBA.
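
One way to check whether the resets really are spread across several disks, rather than always hitting the same one, is to count them per target. The following is a rough Python sketch; the regular expression is an assumption based only on the mpr log lines quoted in this thread, and the function name is made up.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: mpr0: Sending reset from mprsas_send_abort for target ID 17
RESET_RE = re.compile(
    r"(mpr\d+): Sending reset from mprsas_send_abort for target ID (\d+)"
)


def count_resets(lines):
    """Count controller-issued resets per (controller, target ID) pair."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed directly, for example `count_resets(open("/var/log/messages", errors="replace"))`. Many targets with a few resets each would fit the cabling, power, backplane, or HBA theories; one target collecting nearly all of them would point more at that disk or its slot.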
 

mig_40


Hi,
This weekend the array degraded again:
zpool status
Code:
  pool: storage0
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 101h23m with 0 errors on Sun Nov  5 17:35:46 2017
config:

        NAME            STATE     READ WRITE CKSUM
        ...             ...        ...   ...   ...
        raidz2-4        DEGRADED     0     0     0
          gpt/stor25    ONLINE       0     0     0
          gpt/stor26    ONLINE       0     0     0
          gpt/stor27    FAULTED      6     4     0  too many errors
          gpt/stor28    ONLINE       0     0     0
          gpt/stor29    ONLINE       0     0     0
          gpt/stor30    FAULTED     10     4     0  too many errors
        ...             ...        ...   ...   ...
cat /var/log/messages
Code:
Nov 11 01:54:19 kernel: (pass34:mpr0:0:57:0): ATA COMMAND PASS THROUGH(16). CDB: 85 07 20 00 00 00 00 00 00 00 00 00 00 40 27 00 length 0 SMID 231 Aborting command 0xfffffe00014f4c10
Nov 11 01:54:19 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 57
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 54 27 4a d0 00 00 00 08 00 00 length 4096 SMID 294 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 60 00 00 00 08 00 00 length 4096 SMID 595 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 58 00 00 00 08 00 00 length 4096 SMID 437 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 54 27 4a d0 00 00 00 08 00 00
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 48 00 00 00 08 00 00 length 4096 SMID 483 terminated ioc 804b l(da32:mpr0:0:57:0): CAM status: CCB request completed with an error
Nov 11 01:54:19 kernel: (da32:oginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:19 kernel: mpr0:0:57:0): Retrying command
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 711 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 60 00 00 00 08 00 00
Nov 11 01:54:19 kernel: mpr0: (da32:mpr0:0:57:0): CAM status: CCB request completed with an error
Nov 11 01:54:19 kernel: Unfreezing devq for target ID 57
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): Retrying command
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 58 00 00 00 08 00 00
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): CAM status: CCB request completed with an error
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): Retrying command
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 48 00 00 00 08 00 00
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): CAM status: CCB request completed with an error
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): Retrying command
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): CAM status: CCB request completed with an error
Nov 11 01:54:19 kernel: (da32:mpr0:0:57:0): Retrying command
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): CAM status: SCSI Status Error
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): SCSI status: Check Condition
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): Error 6, Retries exhausted
Nov 11 01:54:20 kernel: (da32:mpr0:0:57:0): Invalidating pack
Nov 11 01:54:20 kernel: (pass31:mpr0:0:54:0): ATA COMMAND PASS THROUGH(16). CDB: 85 07 20 00 00 00 00 00 00 00 00 00 00 40 27 00 length 0 SMID 1011 Aborting command 0xfffffe000153ad50
Nov 11 01:54:20 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 54
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 80 00 00 00 08 00 00 length 4096 SMID 759 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 78 00 00 00 08 00 00 length 4096 SMID 639 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 80 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 70 00 00 00 08 00 00 length 4096 SMID 722 terminated ioc 804b l(da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: oginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 68 00 00 00 08 00 00 length 4096 SMID 366 terminated ioc 804b l(da29:oginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: mpr0:0:54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 98 00 00 00 08 00 00 length 4096 SMID 816 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 78 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 90 00 00 00 08 00 00 length 4096 SMID 441 terminated ioc 804b l(da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: oginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 88 00 00 00 08 00 00 length 4096 SMID 580 terminated ioc 804b l(da29:oginfo 31130000 scsi 0 state c xfer 0
Nov 11 01:54:21 kernel: mpr0: mpr0:0:Unfreezing devq for target ID 54
Nov 11 01:54:21 kernel: 54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 70 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d 8f 8f 68 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 98 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 90 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 88 00 00 00 08 00 00
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 01:54:21 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 01:54:22 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 02 3d ca 56 88 00 00 00 08 00 00
Nov 11 01:54:22 kernel: (da29:mpr0:0:54:0): CAM status: SCSI Status Error
Nov 11 01:54:22 kernel: (da29:mpr0:0:54:0): SCSI status: Check Condition
Nov 11 01:54:22 kernel: (da29:mpr0:0:54:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 11 01:54:22 kernel: (da29:mpr0:0:54:0): Retrying command (per sense data)
Nov 11 01:54:22 ZFS: vdev state changed, pool_guid=8676916677365712439 vdev_guid=11642213808745507144
Nov 11 01:54:22 ZFS: vdev state changed, pool_guid=8676916677365712439 vdev_guid=11642213808745507144
...
Nov 11 20:04:11 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 54
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 0f 32 b8 00 00 00 08 00 00 length 4096 SMID 673 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(10). CDB: 28 00 45 39 d9 a0 00 00 08 00 length 4096 SMID 442 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 9a e0 00 00 00 08 00 00 length 4096 SMID 514 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 0f 32 b8 00 00 00 08 00 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 9a 38 00 00 00 08 00 00 length 4096 SMID 508 terminated ioc 804b l(da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: (da29:oginfo 31130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: mpr0:0:54:0): Retrying command
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 99 d0 00 00 00 08 00 00 length 4096 SMID 698 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(10). CDB: 28 00 45 39 d9 a0 00 00 08 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 280 terminated ioc 804b loginfo 3(da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: 1130000 scsi 0 state c xfer 0
Nov 11 20:04:11 kernel: mpr0: Unfreezing devq for target ID 54
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 9a e0 00 00 00 08 00 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 9a 38 00 00 00 08 00 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): READ(16). CDB: 88 00 00 00 00 01 5d 59 99 d0 00 00 00 08 00 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): CAM status: CCB request completed with an error
Nov 11 20:04:11 kernel: (da29:mpr0:0:54:0): Retrying command
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): CAM status: SCSI Status Error
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): SCSI status: Check Condition
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): Error 6, Retries exhausted
Nov 11 20:04:12 kernel: (da29:mpr0:0:54:0): Invalidating pack
Nov 11 20:04:12 ZFS: vdev state changed, pool_guid=8676916677365712439 vdev_guid=14972506191944577353
Nov 11 20:04:12 ZFS: vdev state changed, pool_guid=8676916677365712439 vdev_guid=14972506191944577353
Hi, have you tried sysutils/smartmontools? With the help of that great tool you can talk to the internal S.M.A.R.T. subsystem of your HDD, run disk tests, and read the status and metrics of your disk.
Yes, we use it:
smartctl -a /dev/da29 (gpt/stor27)
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-RELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS726060ALE614
Serial Number:    K1H8X2JF
LU WWN Device Id: 5 000cca 255d223e3
Firmware Version: APGNW907
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 13 12:19:39 2017 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  113) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 842) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   138   138   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   137   137   024    Pre-fail  Always       -       497 (Average 462)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4596
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       230
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       230
194 Temperature_Celsius     0x0002   127   127   000    Old_age   Always       -       47 (Min/Max 22/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4596         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

ralphbsz


None of the errors in your log are disk-specific: they are not errors that involve the platter or the head. Instead, they are transport errors, problems getting the data to/from the disk. Some aren't even proper "errors", but check conditions caused by unit attention: the disk tells the computer that there is an interesting thing that happened, in this case a power-on, reset, or bus problem. As Phoenix said above, that's probably caused by wiring and power, not by the disks.
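
To illustrate the distinction, here is a small Python sketch that sorts kernel log lines into rough buckets. The substrings for the transport and unit attention cases are taken from the log posted in this thread; the MEDIUM ERROR case does not appear in that log and is included only as an example of what a disk-specific error would typically look like. The function names and bucket labels are made up.

```python
from collections import Counter


def classify(line):
    """Return 'media', 'attention', 'transport', or None for one kernel log line."""
    if "SCSI sense: MEDIUM ERROR" in line:
        return "media"       # the disk itself could not read or write a sector
    if "SCSI sense: UNIT ATTENTION" in line:
        return "attention"   # notification, e.g. after a reset; not a media failure
    if ("terminated ioc" in line
            or "Sending reset from" in line
            or "Aborting command" in line):
        return "transport"   # the command never completed normally on the way to the disk
    return None


def summarize(lines):
    """Count how many lines fall into each bucket, ignoring unclassified lines."""
    return Counter(kind for kind in map(classify, lines) if kind is not None)
```

Run over the log above, a sketch like this would report only "transport" and "attention" entries and no "media" entries, which is the pattern described in this post.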

Your SMART data shows no errors on the disk itself.
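
For anyone who wants to check this across many disks, here is a minimal Python sketch that pulls a few raw values out of the attribute table printed by smartctl. The choice of attributes is an assumption on my part: 5, 197, and 198 are commonly read as signs of media trouble, and 199 as a sign of link (cable) trouble. The column positions are assumed from the output posted above, and the function name is made up.

```python
# Attribute IDs to report; this selection is an assumption, adjust as needed.
WATCHED = {5, 197, 198, 199}


def watched_raw_values(smartctl_output):
    """Map attribute name to raw value for the watched IDs in `smartctl -a` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            if fields[9].isdigit():
                values[fields[1]] = int(fields[9])
    return values
```

On the output posted above, all four of these raw values are 0.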

Check your power supply, and your data wiring. Might be a good idea to get a spare SAS or SATA cable, and try replacing cables; maybe one is defective, and even if it isn't, the act of unplugging and replugging them may cure the problem. Checking whether your power supply is defective or overloaded is more difficult, since spare power supplies are expensive and typically aren't sitting around; maybe try reducing the load on the power supply (disconnect things, if possible).

Good luck!
 