Marvell 88SX7042 (mvs) Drive time out

I recently purchased a Rosewill RC-218 four port SATA card. This uses the Marvell 88SX7042 chipset, support for which I have seen mav commit in late 2010. I have four Samsung HD154UI 1.5TB drives connected to it in a Supermicro CSE-M35T1 backplane.

I backed up my existing 3 drive raidz and then created a newraidz2 pool using all seven drives. All seemed to be working fine so I began copying my backup data onto the new pool last night. Upon waking up this morning I see all of the drives have timed out were removed by mvs.

Code:
mvsch0: Timeout on slot 0
mvsch0: iec ffffffff sstat ffffffff serr ffffffff edma_s ffffffff dma_c ffffffff dma_s ffffffff rs 00000001 status ff
mvsch0: stopping EDMA engine failed
mvsch2: (ada0:Timeout on slot 0mvsch0:0:
mvsch2: 0:iec ffffffff sstat ffffffff serr ffffffff edma_s ffffffff dma_c ffffffff dma_s ffffffff rs 00000001 status ff0):
lost device
(ada0:mvsch0:0:0:0): Invalidating pack
mvsch2: stopping EDMA engine failed
(ada2:mvsch2:0:0:0): lost device
(ada2:mvsch2:0:0:0): Invalidating pack
(ada0:mvsch0:0:0:0): Synchronize cache failed
(ada0:mvsch0:0:0:0): removing device entry
(ada2:mvsch2:0:0:0): Synchronize cache failed
(ada2:mvsch2:0:0:0): removing device entry
mvsch1: Timeout on slot 0
mvsch1: iec ffffffff sstat ffffffff serr ffffffff edma_s ffffffff dma_c ffffffff dma_s ffffffff rs 00000001 status ff
mvsch1: stopping EDMA engine failed
mvsch3: (ada1:Timeout on slot 0mvsch1:0:
mvsch3: 0:iec ffffffff sstat ffffffff serr ffffffff edma_s ffffffff dma_c ffffffff dma_s ffffffff rs 00000001 status ff0):
lost device
(ada1:mvsch1:0:0:0): Invalidating pack
mvsch3: stopping EDMA engine failed
(ada3:mvsch3:0:0:0): lost device
(ada3:mvsch3:0:0:0): Invalidating pack
(ada1:mvsch1:0:0:0): Synchronize cache failed
(ada1:mvsch1:0:0:0): removing device entry
(ada3:mvsch3:0:0:0): Synchronize cache failed
(ada3:mvsch3:0:0:0): removing device entry

zpool status showed some read and write errors and I was instructed to run zpool clear once the devices had been reconnected. I had to reboot to get the drives recognized and online again - simply removing and re-inserting them in the hotswap bay did nothing. After a reboot zpool status shows the pool is health and happy, but this definitely has me nervous.

Code:
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors

Thoughts on what could be going on?
 
Timeout messages look suspicious. There are too much "ff"s. It looks like something gone very wrong with controller. Unluckily, without access to Marvell documentation I have no idea what can it be. Just on luck - try to move controller into the different slot. I don't know about PCIe, but according to Linux sources some PCI-X cards from that family have compatibility issues with some slots.
 
mav,

It's a PCIe 4x controller, unfortunately I have no other 4x slots. I could move it to the 16x slot if you think that might do something?
 
You could try to use the ata disk driver instead of ahci.
I've seen a similar problem where I kept loosing a disk and had to revert to ata to make it work reliably.
(On totally different hardware I should say. In my case it was the onboard ports on a ASUS mb)
 
mav@ said:
Timeout messages look suspicious. There are too much "ff"s. It looks like something gone very wrong with controller. Unluckily, without access to Marvell documentation I have no idea what can it be. Just on luck - try to move controller into the different slot. I don't know about PCIe, but according to Linux sources some PCI-X cards from that family have compatibility issues with some slots.
In case it helps you, this showed up in dmesg last night.

Code:
mvsch0: Timeout on slot 0
mvsch0: iec 00000000 sstat 00000123 serr 00400000 edma_s 00002900 dma_c 0e000700 dma_s 00000008 rs 00000001 status c0
 
monstrfolk said:
use this patch

No, this is completely different controller with completely different driver. There is a chance that disabling NCQ may help this controller also, but it is a last resort.
 
I am using the same controller on FreeBSD 9. It works like a charm. I made a couple of benchmarks for those who are interested:


Setup

System
Code:
 Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz (2999.73-MHz K8-class CPU)
 real memory  = 8589934592 (8192 MB)
 avail memory = 8252125184 (7869 MB)

HDD's
+ 4 HDD's on Sata Card
+ 4 HDD's on Mainboard.
+ All ports (Controller and Mainboard) are SATA-2
+ All HDD's are SATA-2
Code:
root> dmesg |grep SAMSUNG
ada0: <SAMSUNG HD103SJ 1AJ100E5> ATA-8 SATA 2.x device
ada1: <SAMSUNG HD103SJ 1AJ10001> ATA-8 SATA 2.x device
ada2: <SAMSUNG HD103UJ 1AA01118> ATA-7 SATA 2.x device
ada4: <SAMSUNG HD103UJ 1AA01118> ATA-7 SATA 2.x device
ada5: <SAMSUNG HD103SJ 1AJ10001> ATA-8 SATA 2.x device
ada6: <SAMSUNG HD103SJ 1AJ10001> ATA-8 SATA 2.x device
ada7: <SAMSUNG HD103SJ 1AJ10001> ATA-8 SATA 2.x device
ada8: <SAMSUNG HD103SJ 1AJ10001> ATA-8 SATA 2.x device

ZFS Pool
All HDD's are encrypted using geli 256bit and afterwards attached to the pool
Code:
root> zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0.eli  ONLINE       0     0     0
            ada1.eli  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            ada5.eli  ONLINE       0     0     0
            ada6.eli  ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            ada7.eli  ONLINE       0     0     0
            ada8.eli  ONLINE       0     0     0
          mirror-3    ONLINE       0     0     0
            ada2.eli  ONLINE       0     0     0
            ada4.eli  ONLINE       0     0     0

errors: No known data errors

root> zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank  3.62T  1.74T  1.88T    48%  1.00x  ONLINE  -

ZFS Properties
Code:
root> zfs get all
NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Fri Feb 17 19:15 2012  -
tank  used                  1.76T                  -
tank  available             1.81T                  -
tank  referenced            1.76T                  -
tank  compressratio         1.00x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           off                    default
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  jailed                off                    default
tank  snapdir               hidden                 default
tank  aclmode               discard                default
tank  aclinherit            restricted             default
tank  canmount              on                     default
tank  xattr                 off                    temporary
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0                      -
tank  usedbydataset         1.76T                  -
tank  usedbychildren        6.43M                  -
tank  usedbyrefreservation  0                      -
tank  logbias               latency                default
tank  dedup                 off                    default
tank  mlslabel                                     -
tank  sync                  standard               default
tank  refcompressratio      1.00x                  -
tank  written               1.76T                  -


Geli
Using Software
Code:
ZFS filesystem version 5
ZFS storage pool version 28
cryptosoft0: <software crypto> on motherboard
GEOM_ELI: Device ada0.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada5.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada6.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada8.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada7.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada2.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada4.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software

HDD Alignment
Code:
root> zdb |grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12
 
Benchmark

Bonnie++
Code:
root> bonnie++ -u root -d bonnie/
[...]
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fileserver      16G   104  89 54891  11 36734   8   278  92 183553  15 334.9   3
Latency              1352ms    2639ms    2399ms     359ms     235ms     537ms
Version  1.96       ------Sequential Create------ --------Random Create--------
fileserver          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20358  84 +++++ +++ 21561  96 20565  94 +++++ +++ 21031  96
Latency             34951us     118us     145us   38459us      36us      77us
1.96,1.96,fileserver,1,1329562136,16G,,104,89,54891,11,36734,8,278,92,183553,15,334.9,3,16
,,,,,20358,84,+++++,+++,21561,96,20565,94,+++++,+++,21031,96,1352ms,2639ms,2399ms,359ms,235ms,
537ms,34951us,118us,145us,38459us,36us,77us
Code:
root> bonnie++ -u root -d bonnie/
[...]
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fileserver      16G   108  95 53410  11 33628   8   263  91 173232  15 326.8   7
Latency              1306ms    4980ms    2568ms     427ms     251ms     554ms
Version  1.96       ------Sequential Create------ --------Random Create--------
fileserver          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22212  90 30560  47 16210  96 19587  95  8727  81  5518  98
Latency             14153us   71791us     685us   35300us     200ms     684us
1.96,1.96,fileserver,1,1329592341,16G,,108,95,53410,11,33628,8,263,91,173232,15,326.8,7,16
,,,,,22212,90,30560,47,16210,96,19587,95,8727,81,5518,98,1306ms,4980ms,2568ms,427ms,251ms,554ms,
14153us,71791us,685us,35300us,200ms,684us

zpool iostat (bonnie write)
Code:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.78T  1.85T    391    320  48.5M  29.2M
tank        1.78T  1.85T    262    361  32.6M  34.7M
tank        1.78T  1.85T    258    367  32.0M  36.4M
tank        1.78T  1.85T    472    352  58.5M  40.6M
tank        1.78T  1.85T    208    390  25.6M  38.9M
tank        1.78T  1.85T    219    422  26.8M  39.7M
tank        1.78T  1.85T    401    315  48.9M  29.2M
zpool iostat (bonnie read)
Code:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.78T  1.85T  1.04K      0   132M      0
tank        1.78T  1.85T  1.03K      0   131M      0
tank        1.78T  1.85T  1.02K      0   130M      0
tank        1.78T  1.85T  1.05K      0   134M      0
tank        1.78T  1.85T  1.05K      0   134M      0
tank        1.78T  1.85T  1.06K      0   134M      0

dd (/dev/zero)
Code:
root> dd bs=1M count=4096 if=/dev/zero of=test
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 61.024558 secs (70380965 bytes/sec)

root> dd bs=1M count=4096 if=/dev/zero of=test; sync
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 42.033025 secs (102180780 bytes/sec)

dd (/dev/random)
Code:
root> dd bs=1M count=4096 if=/dev/random of=test
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 120.814626 secs (35550061 bytes/sec)

root> dd bs=1M count=4096 if=/dev/random of=test; sync
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 116.464596 secs (36877879 bytes/sec)

dd (/dev/urandom)
Code:
root> dd bs=1M count=4096 if=/dev/urandom of=test
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 118.434588 secs (36264468 bytes/sec)

root> dd bs=1M count=4096 if=/dev/urandom of=test; sync
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 115.124315 secs (37307213 bytes/sec)

cp
Code:
root> ll -h | grep ap18
-rw-r-----  1 root  wheel   9.9G Feb 18 12:56 ap18.mkv
Code:
root> time cp ap18.mkv ap18_2.mkv

real    4m17.238s
user    0m0.016s
sys     0m7.002s
Code:
root> time cp ap18.mkv ap18_2.mkv

real    3m57.114s
user    0m0.008s
sys     0m8.086s

The copying seems to be relatively slow. So I attached the according top output.
I assume that the cpu cannot cope all the encryption and might be the bottleneck here.
Code:
root> top -S
last pid:  2464;  load averages:  9.67,  8.59,  6.84            up 0+00:54:06  12:58:57
65 processes:  13 running, 51 sleeping, 1 waiting
CPU 0:  0.0% user,  0.0% nice, 98.7% system,  0.3% interrupt,  1.0% idle
CPU 1:  0.0% user,  0.0% nice, 98.4% system,  0.0% interrupt,  1.6% idle
Mem: 17M Active, 14M Inact, 6318M Wired, 72K Cache, 28M Buf, 1574M Free
Swap: 5120M Total, 5120M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 2275 root        1  77    -     0K    16K RUN     1   3:25 17.19% g_eli[1] ada2
 2277 root        1  52    -     0K    16K RUN     0   3:32 16.80% g_eli[0] ada4
 2278 root        1  52    -     0K    16K RUN     1   3:02 15.87% g_eli[1] ada4
 2274 root        1  52    -     0K    16K RUN     0   3:06 15.58% g_eli[0] ada2
   11 root        2 155 ki31     0K    32K RUN     1  64:46 15.43% idle
 2169 root        1  31    -     0K    16K RUN     1   2:16  9.18% g_eli[1] ada6
 2160 root        1  33    -     0K    16K RUN     1   2:13  8.89% g_eli[1] ada0
 2166 root        1  32    -     0K    16K RUN     1   2:12  8.89% g_eli[1] ada5
 2163 root        1  26    -     0K    16K geli:w  1   2:08  8.69% g_eli[1] ada1
 2174 root        1  30    -     0K    16K RUN     0   2:04  8.59% g_eli[0] ada7
 2165 root        1  30    -     0K    16K RUN     0   2:09  8.50% g_eli[0] ada5
 2168 root        1  32    -     0K    16K RUN     0   2:04  8.40% g_eli[0] ada6
 2159 root        1  28    -     0K    16K CPU0    0   2:06  7.96% g_eli[0] ada0
 2172 root        1  28    -     0K    16K geli:w  1   2:05  7.76% g_eli[1] ada8
 2162 root        1  25    -     0K    16K geli:w  0   2:11  7.57% g_eli[0] ada1
 2171 root        1  26    -     0K    16K geli:w  0   1:59  7.47% g_eli[0] ada8
 2175 root        1  27    -     0K    16K geli:w  1   2:00  7.28% g_eli[1] ada7
   13 root        3  -8    -     0K    48K -       1   0:27  1.17% geom
 2463 root        1  21    0 10052K  2212K zio->i  1   0:03  1.07% cp
    0 root      153  -8    0     0K  2448K -       0   0:49  0.00% kernel
 
Back
Top