ZFS Random™ data errors on ZFS Raids

0x46f

New Member


Messages: 8

Hello everyone,
I've been struggling with a problem for quite some time now. It started a few months ago when I installed my first ZFS raid (anything beyond a single-disk stripe). Everything was fine for a while, until I suddenly got strange errors when cp/mv/rsync-ing files from one place to another. The message was something along the lines of "Input/output error". As far as I can tell, this usually points to a hardware failure. So I got new drives and reinstalled everything, but after a while the I/O errors came back. Then I replaced my SATA cables. The I/O errors returned. I replaced the controller (installed a PCIe SAS card). The errors returned. I even replaced the power supply, thinking it might be the cause. (It wasn't.)
I gave up and switched back to Linux for some time. I never experienced any errors on my Fedora/Ubuntu installs, and my FreeBSD laptop and server are also running fine.
A few days ago I bought some new hard drives for a bigger raidz1, but today I was greeted with 22 data errors. Whenever I read from a file, as soon as I get to the bad parts, I get an I/O error and the program crashes or terminates (in most cases). E.g. I can't calculate the checksum of the flagged files, nor can I play a video beyond a certain timestamp.

There are a few things I have noticed so far:
Sometimes when copying a corrupted file, only the good parts are copied.
Sometimes when copying a corrupted file, the new file is also corrupted and gets flagged in zpool status -v.
Some files I created yesterday are already corrupted. Someone in #freebsd told me I should scrub my pools on a regular basis. I did so before I created said files and afterwards. No data errors. They just appeared when I scrubbed my pool 30 mins ago (haven't even touched those files since).
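For reference, regular scrubbing on FreeBSD is usually set up through periodic(8). The fragment below is only a minimal sketch, assuming a stock FreeBSD install where the daily 800.scrub-zfs periodic script is present. The 7-day threshold is an illustrative choice and not something from this thread, and storage02 in the comments is simply one of the pool names that appears later in the thread.

```shell
# Fragment for /etc/periodic.conf (the file is read as sh variable assignments).
# Assumption: stock FreeBSD periodic(8) with the 800.scrub-zfs daily script.
daily_scrub_zfs_enable="YES"            # let the daily periodic run start scrubs
daily_scrub_zfs_default_threshold="7"   # days between scrubs (FreeBSD default is 35)

# Manual equivalent, run as root (kept as comments so this stays a pure config fragment):
#   zpool scrub storage02
#   zpool status -v storage02    # lists affected files once the scrub has finished
```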

I don't know what else to do now. I have replaced a lot of my hardware and tried different FreeBSD versions with different installer options (sometimes with encryption, but currently without).

Could someone please help me?


(sry for bad English, I'm not a native speaker)
 

SirDice

Administrator
Staff member
Administrator
Moderator

Reaction score: 9,283
Messages: 33,826

What kind of hardware do you have? What's the brand/model of the SAS/SATA card for example?
 

0x46f


Code:
CPU FX-8350
Mainboard ASRock Extreme4 970
Current hard drives Seagate ST6000VN0033  IronWolf 6 TB NAS internal (3x)
SAS card Avago Technologies (LSI) SAS2008
power supply Corsair RM 750x
 

SirDice


SAS card Avago Technologies (LSI) SAS2008
That's a good card. What version of the firmware does it have? It may be a firmware bug.
 

0x46f


The firmware version is 18.00.00.00.
Here is the complete output of dmesg | grep mps:
Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xd000-0xd0ff mem 0xfe9c0000-0xfe9c3fff,0xfe980000-0xfe9bffff irq 28 at device 0.0 on pci2
mps0: Firmware: 18.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
da2 at mps0 bus 0 scbus0 target 16 lun 0
da0 at mps0 bus 0 scbus0 target 14 lun 0
da1 at mps0 bus 0 scbus0 target 15 lun 0
 

SirDice


It looks like I have the same card:
Code:
# mpsutil show adapter
mps0 Adapter:
       Board Name: SMC2008-IT
   Board Assembly:
        Chip Name: LSISAS2008
    Chip Revision: ALL:
    BIOS Revision: 7.11.00.00
Firmware Revision: 7.00.00.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1                              N                 1.5    6.0    SAS Initiator
2                              N                 1.5    6.0    SAS Initiator
3                              N                 1.5    6.0    SAS Initiator
4                              N                 1.5    6.0    SAS Initiator
5                              N                 1.5    6.0    SAS Initiator
6                              N                 1.5    6.0    SAS Initiator
7                              N                 1.5    6.0    SAS Initiator
Code:
# dmesg | grep mps
mps0: <Avago Technologies (LSI) SAS2008> port 0x2000-0x20ff mem 0xc0000000-0xc0003fff,0xc0040000-0xc007ffff irq 16 at device 2.0 on pci0
mps0: Firmware: 07.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 8 lun 0
But I only have one disk attached to it. No issues with it though.
 

0x46f


I don't really think the card is the cause. It works without any problems on my Linux installs, and I just got some new I/O errors on my main SSD raid (the one that holds the FreeBSD operating system). That raid is not connected to the SAS card. I was just downloading some videos, and when I wanted to copy them to a backup server, rsync threw some I/O errors at me.
When I try to access the original file (still on rooty):
Code:
└[~]> md5 /usr/home/me/Downloads/ex3/15ws-ex3-141208-1080p.mp4
md5: /usr/home/me/Downloads/ex3/15ws-ex3-141208-1080p.mp4: Input/output error
Could there be other parts of my system that are failing?

Current storage configuration:
storage02 uses the SAS card
rooty and storage use the mainboard's onboard SATA ports
Code:
  pool: rooty
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:02:00 with 0 errors on Thu Nov 14 17:00:17 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rooty       ONLINE       0     0     3
      raidz1-0  ONLINE       0     0     6
        ada2p4  ONLINE       0     0     0
        ada3p4  ONLINE       0     0     0
        ada4p4  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 03:24:00 with 1 errors on Wed Nov 13 23:26:49 2019
config:

    NAME        STATE     READ WRITE CKSUM
    storage     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ada1    ONLINE       0     0     0
        ada0    ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: storage02
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 64K in 0 days 01:07:35 with 22 errors on Thu Nov 14 18:59:36 2019
config:

    NAME        STATE     READ WRITE CKSUM
    storage02   ONLINE       0     0    44
      raidz1-0  ONLINE       0     0    88
        da0     ONLINE       0     0     1
        da1     ONLINE       0     0     0
        da2     ONLINE       0     0     0

errors: 22 data errors, use '-v' for a list
Another thing I've noticed is that the number of errors depends on the user. When I run zpool status as root, it reports "errors: 18 data errors, use '-v' for a list" for storage02 (instead of the 22 shown above). Is this normal? Oo
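To keep track of which devices are accumulating errors between scrubs, the counter columns of the output above can be filtered mechanically. The following is a minimal sketch, not an official tool: zpool_nonzero is a hypothetical helper name, and it assumes the column layout shown above (NAME STATE READ WRITE CKSUM).

```shell
#!/bin/sh
# zpool_nonzero: hypothetical helper that reads `zpool status` text on stdin
# and prints "pool device READ WRITE CKSUM" for every config row whose
# counters are not all zero.
zpool_nonzero() {
    awk '
        $1 == "pool:"   { pool = $2; in_cfg = 0; next }    # remember the current pool
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next } # header row starts the table
        $1 == "errors:" { in_cfg = 0; next }               # table is over
        in_cfg && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            print pool, $1, $3, $4, $5
        }
    '
}

# Usage (assumes zpool is in PATH):
#   zpool status | zpool_nonzero
```

Fed the storage02 section above, this would list the storage02, raidz1-0 and da0 rows and skip da1 and da2, whose counters are all zero.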
 

inf3rno

Member

Reaction score: 6
Messages: 73

I personally don't trust Seagate since they sold me HDDs with a massive bug in the firmware. It was fun to fix their bricked drive with a USB Tx/Rx adapter (I had no idea what I was doing, but I still managed to fix it), but never again. I never trusted ASRock either; I have read too many bad reviews of their motherboards. But I mostly read about desktop products, so the server products of these companies are probably better. My guess would be that one of the Seagate HDDs causes these errors, but I am curious whether you managed to debug this somehow. Another reason could be that you are using non-ECC memory. I would at least run a memtest, but I think it is Russian roulette. You certainly need a backup, and even that might not protect you from data loss.
 

kira12

Member

Reaction score: 1
Messages: 31

Hi,

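First, you should upgrade the SAS firmware to version 20 (IT). That's the first issue. Then check or change your cables for testing. Right now da0 is the one showing errors, so swap its cable with the one on da2 and see whether the errors move with the cable. Change the power supply if you can.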

best regards

P.S. You can update with mpsutil, then reboot.
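A quick way to check whether a controller is still below the firmware release suggested above is to pull the version out of the mps driver's boot message. This is a minimal sketch, assuming the dmesg line format shown earlier in the thread (mps0: Firmware: 18.00.00.00, Driver: ...). The names mps_fw_major and fw_needs_update are hypothetical helpers, and the threshold of 20 simply mirrors the suggestion in this post.

```shell
#!/bin/sh
# mps_fw_major: hypothetical helper; reads dmesg text on stdin and prints the
# major firmware number of the first mps "Firmware:" line (e.g. 18 or 7).
mps_fw_major() {
    sed -n 's/^mps[0-9]*: Firmware: 0*\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# fw_needs_update: succeeds (exit 0) when the major version found on stdin is
# below the wanted one, which defaults to 20 as suggested in this thread.
fw_needs_update() {
    major=$(mps_fw_major)
    [ -n "$major" ] && [ "$major" -lt "${1:-20}" ]
}

# Usage (on the affected machine):
#   dmesg | fw_needs_update && echo "mps firmware is older than 20"
```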
 