Solved CAM status: SCSI Status Error

OP
dvl@

Aspiring Daemon
Developer

Reaction score: 63
Messages: 550

This morning, I rebooted the server and the system zpool came back online and resilvered successfully. Running zpool status -v originally listed many corrupted files, but now there is only one:

Code:
[dan@knew:~] $ sudo zpool status -v system
  pool: system
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 4.06G in 0h3m with 4 errors on Thu Oct 25 13:21:01 2018
config:

    NAME        STATE     READ WRITE CKSUM
    system      ONLINE       0     0     6
      raidz2-0  ONLINE       0     0     0
        da3p3   ONLINE       0     0     0
        da10p3  ONLINE       0     0     0
        da15p3  ONLINE       0     0     0
        da4p3   ONLINE       0     0     0
        da13p3  ONLINE       0     0     0
        da12p3  ONLINE       0     0     0
        da9p3   ONLINE       0     0     0
        da14p3  ONLINE       0     0     0
        da11p3  ONLINE       0     0     0
        da5p3   ONLINE       0     0     0
      raidz2-1  ONLINE       0     0    12
        da6p1   ONLINE       0     0     0
        da7p1   ONLINE       0     0     0
        da16p1  ONLINE       0     0     0
        da8p1   ONLINE       0     0     0
        da0p1   ONLINE       0     0     0
        da1p1   ONLINE       0     0     0
        da17p1  ONLINE       0     0     0
        da18p1  ONLINE       0     0     0
        da2p1   ONLINE       0     0     0
        da19p1  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /usr/jails/dbclone/var/db/postgres/data10/pg_wal/0000000100000007000000F0
[dan@knew:~] $
Losing that file might be serious in a production system, but that jail is for testing only.
 
OP
dvl@

This morning I also switched the disk fans away from being linked to the CPU fan ports on the m/b. Now they run at full speed 100% of the time, but I wish I got more airflow through them. I'll wait a bit longer and see what happens.

I also reseated the HBA card to which these drives are connected.
 
OP
dvl@

Of all the messages overnight, this seemed to be the most interesting: gpart got involved and resized a partition.

Code:
Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): READ(6). CDB: 08 00 00 28 01 00
Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): CAM status: SCSI Status Error
Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): SCSI status: Check Condition
Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): Error 5, Retries exhausted
Oct 25 07:12:57 knew kernel: GEOM_PART: da18 was automatically resized.
Oct 25 07:12:57 knew kernel: Use `gpart commit da18` to save changes or `gpart undo da18` to revert them.
Oct 25 07:12:57 knew kernel: GEOM_PART: integrity check failed (da18, GPT)
Oct 25 07:12:57 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=13603535286907309607
Oct 25 07:12:57 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=13603535286907309607
Why would that happen?

It does not seem changed, when comparing da18 to other similar drives:

Code:
[dan@knew:~] $ gpart show da19 da18 da17 da16 da1
=>        34  9767541101  da19  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da18  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da17  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da16  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da1  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

[dan@knew:~] $
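The by-eye comparison above can also be done mechanically. This is only a minimal sketch, assuming the output of `gpart show` has been saved as text in the format shown; the function names `parse_gpart_show` and `differing_layouts` are made up for illustration. It only compares what gpart reports, so it cannot tell whether an uncommitted automatic resize is still pending.

```python
def parse_gpart_show(text):
    """Map each provider name to its list of (start, size, index, type) rows."""
    layouts = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "=>":
            # Header row: "=> start size name scheme (human size)"
            current = fields[3]
            layouts[current] = []
        elif current is not None and fields[0].isdigit():
            # Data row: "start size index-or-dash type... (human size)"
            start, size, index = int(fields[0]), int(fields[1]), fields[2]
            kind = " ".join(fields[3:-1])
            layouts[current].append((start, size, index, kind))
    return layouts


def differing_layouts(layouts, reference):
    """Return the providers whose rows differ from the reference provider."""
    expected = layouts[reference]
    return sorted(name for name, rows in layouts.items() if rows != expected)


SAMPLE = """\
=>        34  9767541101  da19  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da18  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)
"""

if __name__ == "__main__":
    print(differing_layouts(parse_gpart_show(SAMPLE), "da19"))
```

An empty list means every drive in the saved output has the same partition rows as the reference drive.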
Oh but wait, da16 and da19 were also affected:

Code:
Oct 25 07:43:39 knew kernel: (da16:mps2:0:0:0): Error 5, Retries exhausted
Oct 25 07:43:39 knew kernel: GEOM_PART: da16 was automatically resized.
Oct 25 07:43:39 knew kernel: Use `gpart commit da16` to save changes or `gpart undo da16` to revert them.
Oct 25 07:43:39 knew kernel: GEOM_PART: integrity check failed (da16, GPT)
...
Oct 25 07:43:39 knew kernel: (da19:mps2:0:12:0): Error 5, Retries exhausted
Oct 25 07:43:39 knew kernel: GEOM_PART: da19 was automatically resized.
Oct 25 07:43:39 knew kernel: Use `gpart commit da19` to save changes or `gpart undo da19` to revert them.
Oct 25 07:43:39 knew kernel: GEOM_PART: integrity check failed (da19, GPT)
...
Oct 25 07:44:57 knew kernel: mps2: mpssas_prepare_remove: Sending reset for target ID 12
Oct 25 07:44:57 knew kernel: da19 at mps2 bus 0 scbus2 target 12 lun 0
Oct 25 07:44:57 knew kernel: da19: <ATA TOSHIBA MD04ACA5 FP2A> s/n 653BK12FFS9A detached
Oct 25 07:44:57 knew kernel: (da19:mps2:0:12:0): Periph destroyed
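To see at a glance which drives and which controller are involved, the kernel messages can be tallied. The following is a rough sketch, assuming log lines in the format quoted above; the regular expression and the function name `count_exhausted_retries` are illustrative, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches the CAM peripheral prefix seen in the log, e.g. "(da18:mps2:0:10:0): ..."
CAM_LINE = re.compile(
    r"\((?P<dev>da\d+):(?P<ctrl>mps\d+):\d+:\d+:\d+\): (?P<msg>.*)$"
)


def count_exhausted_retries(lines):
    """Count 'Retries exhausted' events per (controller, device) pair."""
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line)
        if match and "Retries exhausted" in match.group("msg"):
            counts[(match.group("ctrl"), match.group("dev"))] += 1
    return counts


SAMPLE = [
    "Oct 25 07:12:57 knew kernel: (da18:mps2:0:10:0): Error 5, Retries exhausted",
    "Oct 25 07:43:39 knew kernel: (da16:mps2:0:0:0): Error 5, Retries exhausted",
    "Oct 25 07:43:39 knew kernel: GEOM_PART: da16 was automatically resized.",
    "Oct 25 07:43:39 knew kernel: (da19:mps2:0:12:0): Error 5, Retries exhausted",
    "Oct 25 07:44:57 knew kernel: (da19:mps2:0:12:0): Periph destroyed",
]

if __name__ == "__main__":
    for (ctrl, dev), n in sorted(count_exhausted_retries(SAMPLE).items()):
        print(ctrl, dev, n)
```

If every affected drive hangs off the same controller, as in the excerpts here, that points at something shared (cable, backplane, HBA port) rather than at three drives failing independently.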
 
OP
dvl@

Here are all the gpart layouts, grouped by similarity.

zpool zroot:

Code:
=>       40  468862048  ada0  GPT  (224G)
         40       2008        - free -  (1.0M)
       2048       1024     1  freebsd-boot  (512K)
       3072       1024        - free -  (512K)
       4096    8388608     2  freebsd-swap  (4.0G)
    8392704   41943040     3  freebsd-zfs  (20G)
   50335744  418526344        - free -  (200G)

=>       40  468862048  ada3  GPT  (224G)
         40       2008        - free -  (1.0M)
       2048       1024     1  freebsd-boot  (512K)
       3072       1024        - free -  (512K)
       4096    8388608     2  freebsd-swap  (4.0G)
    8392704   41943040     3  freebsd-zfs  (20G)
   50335744  418526344        - free -  (200G)
unused:

Code:
=>       34  937703021  ada1  GPT  (447G)
         34       2014        - free -  (1.0M)
       2048       1024     1  freebsd-boot  (512K)
       3072       1024        - free -  (512K)
       4096    8388608     2  freebsd-swap  (4.0G)
    8392704   41943040     3  freebsd-zfs  (20G)
   50335744  887367311        - free -  (423G)

=>       34  937703021  ada2  GPT  (447G)
         34       2014        - free -  (1.0M)
       2048       1024     1  freebsd-boot  (512K)
       3072       1024        - free -  (512K)
       4096    8388608     2  freebsd-swap  (4.0G)
    8392704   41943040     3  freebsd-zfs  (20G)
   50335744  887367311        - free -  (423G)
zpool system raidz2-0

Code:
=>        34  9767541101  da3  GPT  (4.5T)
          34        2014       - free -  (1.0M)
        2048        1024    1  freebsd-boot  (512K)
        3072        1024       - free -  (512K)
        4096     8388608    2  freebsd-swap  (4.0G)
     8392704  9759129600    3  freebsd-zfs  (4.5T)
  9767522304       18831       - free -  (9.2M)

=>        34  9767541101  da4  GPT  (4.5T)
          34        2014       - free -  (1.0M)
        2048        1024    1  freebsd-boot  (512K)
        3072        1024       - free -  (512K)
        4096     8388608    2  freebsd-swap  (4.0G)
     8392704  9759129600    3  freebsd-zfs  (4.5T)
  9767522304       18831       - free -  (9.2M)

=>        34  9767541101  da5  GPT  (4.5T)
          34        2014       - free -  (1.0M)
        2048        1024    1  freebsd-boot  (512K)
        3072        1024       - free -  (512K)
        4096     8388608    2  freebsd-swap  (4.0G)
     8392704  9759129600    3  freebsd-zfs  (4.5T)
  9767522304       18831       - free -  (9.2M)

=>        34  9767541101  da9  GPT  (4.5T)
          34        2014       - free -  (1.0M)
        2048        1024    1  freebsd-boot  (512K)
        3072        1024       - free -  (512K)
        4096     8388608    2  freebsd-swap  (4.0G)
     8392704  9759129600    3  freebsd-zfs  (4.5T)
  9767522304       18831       - free -  (9.2M)

=>        34  9767541101  da10  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)

=>        34  9767541101  da11  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)

=>        34  9767541101  da12  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)

=>        34  9767541101  da13  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)

=>        34  9767541101  da14  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)

=>        34  9767541101  da15  GPT  (4.5T)
          34        2014        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     8388608     2  freebsd-swap  (4.0G)
     8392704  9759129600     3  freebsd-zfs  (4.5T)
  9767522304       18831        - free -  (9.2M)
zpool system raidz2-1
Code:
=>        34  9767541101  da0  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da1  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da2  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da6  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da7  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da8  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        34  9767541101  da16  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da17  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da18  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

=>        34  9767541101  da19  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541095        - free -  (752M)

[dan@knew:~] $
 
OP
dvl@

Last night I reverted the fan swap I did the previous night. I also put those fans on full power. They were previously throttled via PWM on the M/B.

I also moved the SFF-8087 cable back to the original port on the HBA card.

Over the past 17 hours, no zpool issues.

There is this repeating message in /var/log/messages:

Code:
Oct 26 10:22:59 knew smartd[1068]: Device: /dev/da7 [SAT], 64 Currently unreadable (pending) sectors
I am sure that can be removed via a smartctl long test, but I'll wait another day or two before doing that.
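While waiting, it may help to watch whether that pending count changes. Here is a small sketch, assuming smartd keeps logging lines in exactly the format shown above; the function name `latest_pending` is hypothetical.

```python
import re

# Matches smartd lines such as:
# "... smartd[1068]: Device: /dev/da7 [SAT], 64 Currently unreadable (pending) sectors"
PENDING = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+) \[[^\]]*\], "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def latest_pending(lines):
    """Return the most recently logged pending-sector count for each device."""
    latest = {}
    for line in lines:
        match = PENDING.search(line)
        if match:
            latest[match.group("dev")] = int(match.group("count"))
    return latest


SAMPLE = [
    "Oct 26 10:22:59 knew smartd[1068]: Device: /dev/da7 [SAT], "
    "64 Currently unreadable (pending) sectors",
]

if __name__ == "__main__":
    print(latest_pending(SAMPLE))
```

A count that stays flat is less worrying than one that keeps growing between log entries.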
 

ralphbsz

Daemon

Reaction score: 1,125
Messages: 1,783

Why would gpart repartition on the fly? I have no idea. That's too weird to be believed ... but I'm sure it's real.

Your real IO error 'internal target failure' is serious. The drive was incapable of communicating with other parts of itself. But since it got better, maybe you can ignore it. Still, your system keeps being a continuous source of amusement, which is NOT a good thing. Maybe eventually some combination of reseating, replacing cables, and getting cooling and power supply stabilized will make it run good; you might have already achieved that.
 
OP
dvl@

Code:
$ uptime
 2:37PM  up 1 day, 17:16, 1 user, load averages: 0.32, 0.32, 0.26
So far so good.
 
OP
dvl@

I now suspect the drive bay. Row 1, column 3.

From last night: https://gist.github.com/dlangille/d18502325d7f24194589657be262ea59

From this morning: https://gist.github.com/dlangille/b84993057299fa130f74c1c26f75b016

Current situation: the drive from r1.c3 was moved into the chassis and is sitting loose in there. Resilvering will continue to completion I hope.

When I swapped the cables between the top backplane and the second from the top, the SCSI Status Error issue moved from mps2 to mps1.

I think I want to buy a new chassis. ;)
 

diizzy

Well-Known Member

Reaction score: 61
Messages: 260

Seems like you've narrowed it down, but since you seem to get issues within a day or so, another way of testing would be to connect the HDDs (or at least the ones throwing errors) directly to the HBA. That would probably mean you need another cable and have to leave the case open, as the drives would be placed outside the case.
 
OP
dvl@

The two drives on the inside of the chassis (the existing device and its replacement) are both attached to the HBA via an SFF-8087 to 4 SATA Fan Out Cable.

The drives are just sitting bare-bones, inside the case like that.
 
OP
dvl@

It has now been 6 days without any errors. I still say it was the one connector (R1C3).

I have ordered a new chassis: a SuperMicro 846.

Thanks for all the support.
 