ZFS Don't ever buy from ACME hardware

I am changing the title of this thread because I want to warn other people.
If the moderators think that this is not appropriate then please delete it.




Hello and Merry Christmas to all

I am facing some weird checksum errors during scrub. The configuration is the following:

Code:
Board:        Supermicro Motherboard X10DRi-T4+
Controller:  LSI SAS 9300-8i
HDD:         21 x 6TB Western Digital WD60EFRX
SSD:         2 x Intel DC S3500 600GB (SSDSC2BB600G401) for SWAP, ZIL, CACHE
Chassis:    Supermicro 847BE1C-R1K28LPB 4U Storage Chassis
RAM:         64 GB

I initially installed FreeBSD 10.1-RELEASE and created one pool consisting of 3 x 7-disk VDEVs in RAIDZ3. I used NFS to start copying some data. After copying around 3TB I initiated a scrub.
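
For reference, the pool was created roughly like this (a sketch from memory; the real commands used the per-drive GPT labels that show up in the status output further down):

Code:
gpart create -s gpt da0                          # repeated for each of the 21 drives
gpart add -t freebsd-zfs -l WD-WX41D94RN5A3 da0

zpool create Pool \
    raidz3 gpt/disk01 gpt/disk02 gpt/disk03 gpt/disk04 gpt/disk05 gpt/disk06 gpt/disk07 \
    raidz3 gpt/disk08 gpt/disk09 gpt/disk10 gpt/disk11 gpt/disk12 gpt/disk13 gpt/disk14 \
    raidz3 gpt/disk15 gpt/disk16 gpt/disk17 gpt/disk18 gpt/disk19 gpt/disk20 gpt/disk21
zpool add Pool log mirror gpt/zil0 gpt/zil1
zpool add Pool cache gpt/cache0 gpt/cache1

zpool scrub Pool
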
The result was the following: http://pastebin.com/rswgCY2A and http://pastebin.com/DQ2urGXk

I tried to flash the controller, but the LSI utility did not recognize it. I then installed FreeBSD 9.3-RELEASE and used LSI's mpslsi3 driver; that way I was able to flash the latest BIOS and firmware.

Code:
LSI Corporation SAS3 Flash Utility
Version 07.00.00.00 (2014.08.14)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS3008(C0)

Controller Number              : 0
Controller                     : SAS3008(C0)
PCI Address                    : 00:82:00:00
SAS Address                    : 500605b-0-06ce-27e0
NVDATA Version (Default)       : 06.03.00.05
NVDATA Version (Persistent)    : 06.03.00.05
Firmware Product ID            : 0x2221 (IT)
Firmware Version               : 06.00.00.00
NVDATA Vendor                  : LSI
NVDATA Product ID              : SAS9300-8i
BIOS Version                   : 08.13.00.00
UEFI BSD Version               : 02.00.00.00
FCODE Version                  : N/A
Board Name                     : SAS9300-8i
Board Assembly                 : H3-25573-00E
Board Tracer Number            : SV32928040
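
For anyone who needs to repeat the flash: it was done with sas3flash along these lines (a sketch; the exact image file names depend on the firmware package downloaded from LSI):

Code:
sas3flash -listall                  # confirm the controller is visible
sas3flash -o -f SAS9300_8i_IT.bin   # flash the IT-mode firmware image
sas3flash -o -b mptsas3.rom         # flash the boot BIOS image
sas3flash -list                     # verify the new versions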

I recreated the pool and started writing data via NFS again. After 3 TB of data I started a scrub, and I am still getting checksum errors, though there are no longer any messages regarding the drives in /var/log/messages:

Code:
pool: Pool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P

  scan: scrub in progress since Thu Dec 25 08:46:21 2014
        2.28T scanned out of 5.54T at 816M/s, 1h9m to go
        11.9M repaired, 41.26% done
config:

NAME                     STATE     READ WRITE CKSUM
Pool                     ONLINE       0     0     0
  raidz3-0               ONLINE       0     0     0
    gpt/WD-WX41D94RN5A3  ONLINE       0     0    15  (repairing)
    gpt/WD-WX41D948YE1U  ONLINE       0     0    14  (repairing)
    gpt/WD-WX41D94RN879  ONLINE       0     0    16  (repairing)
    gpt/WD-WX21D947NC83  ONLINE       0     0    24  (repairing)
    gpt/WD-WX21D947NT77  ONLINE       0     0    15  (repairing)
    gpt/WD-WX41D948YAKV  ONLINE       0     0    19  (repairing)
    gpt/WD-WX21D9421SCV  ONLINE       0     0    20  (repairing)
  raidz3-1               ONLINE       0     0     0
    gpt/WD-WX21D9421F6F  ONLINE       0     0    16  (repairing)
    gpt/WD-WX41D948YPN4  ONLINE       0     0    14  (repairing)
    gpt/WD-WX21D947NE2K  ONLINE       0     0    22  (repairing)
    gpt/WD-WX41D948Y2PX  ONLINE       0     0    19  (repairing)
    gpt/WD-WX41D94RNAX7  ONLINE       0     0    17  (repairing)
    gpt/WD-WX21D947N1RP  ONLINE       0     0    12  (repairing)
    gpt/WD-WX21D94216X7  ONLINE       0     0    20  (repairing)
  raidz3-2               ONLINE       0     0     0
    gpt/WD-WX41D948YAHP  ONLINE       0     0    25  (repairing)
    gpt/WD-WX21D947N06F  ONLINE       0     0    18  (repairing)
    gpt/WD-WX21D947N3T1  ONLINE       0     0    21  (repairing)
    gpt/WD-WX41D94RNT7D  ONLINE       0     0     5  (repairing)
    gpt/WD-WX41D948Y9VV  ONLINE       0     0    18  (repairing)
    gpt/WD-WX41D94RNS62  ONLINE       0     0    24  (repairing)
    gpt/WD-WX21D9421ZP9  ONLINE       0     0    28  (repairing)
logs
  mirror-3               ONLINE       0     0     0
    gpt/zil0             ONLINE       0     0     0
    gpt/zil1             ONLINE       0     0     0
cache
  gpt/cache0             ONLINE       0     0     0
  gpt/cache1             ONLINE       0     0     0

errors: No known data errors

This is really driving me crazy since smartmon tools do not display any errors on the drives.

Any suggestions are most welcome!!!

Thank you for your time,
 
This is really driving me crazy since smartmon tools do not display any errors on the drives.

Any suggestions are most welcome!!!
sysutils/smartmontools only shows errors on the drive side of the interface connector. If these were SAS drives, you might see an increasing value for "Non-medium error count" but still no specific info.

I'd check to see if there is a newer FreeBSD driver (not firmware) on the LSI site and try that. You may have to build a kernel without the mpr(4) device in order to load the LSI version.
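
Something along these lines should do it (an untested sketch; the module name follows the mpslsi3 package mentioned above, and the exact file names may differ):

Code:
# custom kernel config, e.g. /usr/src/sys/amd64/conf/NOMPR
include GENERIC
ident   NOMPR
nodevice mpr        # drop the in-tree SAS3 driver so it cannot claim the HBA

# after building/installing that kernel, copy LSI's module to /boot/modules and add:
# /boot/loader.conf
mpslsi3_load="YES"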

I'm on vacation, or I'd dig through the CDBs to see what the driver was trying to do. If you don't get it solved by next week, post a followup and I'll look into it then.

One other thing - you've got an 8-port controller and 23 drives, so I'd also check to make sure whatever you're using for a storage expander (the Supermicro chassis?) is up-to-date, has good power & cabling, etc.
 
OK. Some updates...

The system has been upgraded to 10.1-RELEASE again. I am using the native FreeBSD mpr(4) driver, not the LSI one. So far scrubbing is going well and I see a big improvement in speed:

Code:
 scan: scrub in progress since Thu Dec 25 15:32:57 2014
        4.56T scanned out of 5.98T at 1.58G/s, 0h15m to go

However, this is disturbing (same report in 9.3):

Code:
ses1: 150.000MB/s transfers
ses2: 150.000MB/s transfers
da0: 150.000MB/s transfers
da5: 150.000MB/s transfers
da2: 150.000MB/s transfers
da7: 150.000MB/s transfers
da3: 150.000MB/s transfers
da1: 150.000MB/s transfers
da6: 150.000MB/s transfers
da4: 150.000MB/s transfers
da8: 150.000MB/s transfers
da13: 150.000MB/s transfers
da18: 150.000MB/s transfers
da10: 150.000MB/s transfers
da12: 150.000MB/s transfers
da17: 150.000MB/s transfers
da11: 150.000MB/s transfers
da16: 150.000MB/s transfers
da15: 150.000MB/s transfers
da20: 150.000MB/s transfers
da9: 150.000MB/s transfers
da14: 150.000MB/s transfers
da19: 150.000MB/s transfers
 
Those are almost certainly bogus. Our drivers have been making up numbers for years - none of the fixes for accurate reporting I did for BSD/OS made it over, and so when newer drivers were based on the older ones, they just copied over the same bogus values.

The BIOS utility on the LSI card can show you the actual negotiated transfer rate. It is somewhere in the Advanced / Phy menu, IIRC.
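
If you'd rather not reboot into the card's BIOS, camcontrol can usually pull the same PHY information through the expander, assuming it answers SMP passthrough (ses1/ses2 being the expander devices from your dmesg):

Code:
camcontrol smpphylist ses2    # negotiated link rate per expander PHY
camcontrol smpmaninfo ses2    # expander vendor / firmware info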
 
Those are almost certainly bogus. Our drivers have been making up numbers for years - none of the fixes for accurate reporting I did for BSD/OS made it over, and so when newer drivers were based on the older ones, they just copied over the same bogus values.

The BIOS utility on the LSI card can show you the actual negotiated transfer rate. It is somewhere in the Advanced / Phy menu, IIRC.

Right, that's my feeling also. However, it appears that the native FreeBSD driver mpr(4) has been upgraded in 10.1 and the performance that I get is really super fast.
I am waiting for the scrub to finish and I am also transferring data at the same time. I really hope that this issue is related to an old firmware.
 
Right, that's my feeling also. However, it appears that the native FreeBSD driver mpr(4) has been upgraded in 10.1 and the performance that I get is really super fast.
That driver is pretty new - it doesn't exist in my 8.4 systems (I have 8.4 and 10.1 here).

Post a reply once the scrub completes and you've copied more data.
 
Will do once I have more data to share. For the time being I am copying 20 TB of video and scrubbing at the same time. So far no errors, but I will update this tomorrow.

Thanks
 
Errors again!

Code:
 pool: Pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Dec 26 03:52:18 2014
        8.88T scanned out of 13.7T at 1.57G/s, 0h52m to go
        2M repaired, 64.65% done
config:

NAME                     STATE     READ WRITE CKSUM
Pool                     ONLINE       0     0     0
  raidz3-0               ONLINE       0     0     0
    gpt/WD-WX41D94RN5A3  ONLINE       0     0     4  (repairing)
    gpt/WD-WX41D948YE1U  ONLINE       0     0     0
    gpt/WD-WX41D94RN879  ONLINE       0     0     1  (repairing)
    gpt/WD-WX21D947NC83  ONLINE       0     0     6  (repairing)
    gpt/WD-WX21D947NT77  ONLINE       0     0     3  (repairing)
    gpt/WD-WX41D948YAKV  ONLINE       0     0     2  (repairing)
    gpt/WD-WX21D9421SCV  ONLINE       0     0     1  (repairing)
  raidz3-1               ONLINE       0     0     0
    gpt/WD-WX21D9421F6F  ONLINE       0     0     6  (repairing)
    gpt/WD-WX41D948YPN4  ONLINE       0     0     2  (repairing)
    gpt/WD-WX21D947NE2K  ONLINE       0     0     5  (repairing)
    gpt/WD-WX41D948Y2PX  ONLINE       0     0     6  (repairing)
    gpt/WD-WX41D94RNAX7  ONLINE       0     0     2  (repairing)
    gpt/WD-WX21D947N1RP  ONLINE       0     0     4  (repairing)
    gpt/WD-WX21D94216X7  ONLINE       0     0     3  (repairing)
  raidz3-2               ONLINE       0     0     0
    gpt/WD-WX41D948YAHP  ONLINE       0     0     5  (repairing)
    gpt/WD-WX21D947N06F  ONLINE       0     0     4  (repairing)
    gpt/WD-WX21D947N3T1  ONLINE       0     0     2  (repairing)
    gpt/WD-WX41D94RNT7D  ONLINE       0     0     2  (repairing)
    gpt/WD-WX41D948Y9VV  ONLINE       0     0     2  (repairing)
    gpt/WD-WX41D94RNS62  ONLINE       0     0     1  (repairing)
    gpt/WD-WX21D9421ZP9  ONLINE       0     0     3  (repairing)
logs
  mirror-3               ONLINE       0     0     0
    gpt/zil0             ONLINE       0     0     0
    gpt/zil1             ONLINE       0     0     0
cache
  gpt/cache0             ONLINE       0     0     0
  gpt/cache1             ONLINE       0     0     0

errors: No known data errors
 
We don't really know if these errors are old ones that are just being detected now, or new ones. I'd suggest destroying the pool and re-creating it - the data you copied to the pool was only for testing, correct?

When I switched from 10.1 to 9.3 I destroyed the pool and recreated it. The funny thing is that the checksum errors appear only on new data. For example, I transfer a few TB and start scrubbing. It will show errors, they will be corrected, and applications are not affected. If I scrub again, there are no errors.
Now, if I transfer a few more TB and scrub, it will start showing checksum errors only after a while, which I believe is the point where it starts examining the new data. If I scrub again, no errors will be displayed.
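
To be clear, the cycle I keep repeating is roughly:

Code:
zpool scrub Pool        # first pass over newly written data: checksum errors appear
zpool status -v Pool
zpool clear Pool
zpool scrub Pool        # second pass over the same data: clean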

I have sent an email to SuperMicro describing the problem and asking them if they have any firmware updates for their chassis.

Code:
root@storage:~ # zpool status

pool: Pool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 8.16M in 2h42m with 0 errors on Fri Dec 26 06:34:57 2014
config:

NAME                     STATE     READ WRITE CKSUM
Pool                     ONLINE       0     0     0
  raidz3-0               ONLINE       0     0     0
    gpt/WD-WX41D94RN5A3  ONLINE       0     0     9
    gpt/WD-WX41D948YE1U  ONLINE       0     0    11
    gpt/WD-WX41D94RN879  ONLINE       0     0     6
    gpt/WD-WX21D947NC83  ONLINE       0     0    15
    gpt/WD-WX21D947NT77  ONLINE       0     0    10
    gpt/WD-WX41D948YAKV  ONLINE       0     0    13
    gpt/WD-WX21D9421SCV  ONLINE       0     0    14
  raidz3-1               ONLINE       0     0     0
    gpt/WD-WX21D9421F6F  ONLINE       0     0    15
    gpt/WD-WX41D948YPN4  ONLINE       0     0    10
    gpt/WD-WX21D947NE2K  ONLINE       0     0    13
    gpt/WD-WX41D948Y2PX  ONLINE       0     0    16
    gpt/WD-WX41D94RNAX7  ONLINE       0     0    13
    gpt/WD-WX21D947N1RP  ONLINE       0     0    12
    gpt/WD-WX21D94216X7  ONLINE       0     0    11
  raidz3-2               ONLINE       0     0     0
    gpt/WD-WX41D948YAHP  ONLINE       0     0    11
    gpt/WD-WX21D947N06F  ONLINE       0     0    17
    gpt/WD-WX21D947N3T1  ONLINE       0     0    15
    gpt/WD-WX41D94RNT7D  ONLINE       0     0    19
    gpt/WD-WX41D948Y9VV  ONLINE       0     0     9
    gpt/WD-WX41D94RNS62  ONLINE       0     0    15
    gpt/WD-WX21D9421ZP9  ONLINE       0     0     8
logs
  mirror-3               ONLINE       0     0     0
    gpt/zil0             ONLINE       0     0     0
    gpt/zil1             ONLINE       0     0     0
cache
  gpt/cache0             ONLINE       0     0     0
  gpt/cache1             ONLINE       0     0     0

errors: No known data errors

Funny thing is that the error ratio is almost constant. No applications are affected and the errors go away after a zpool clear. The Pool is 7.83TB in size.
 
Just a while ago:

Code:
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00 length 4096 SMID 442 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00 length 131072 SMID 870 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00 length 32768 SMID 728 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00 length 4096 SMID 189 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00 length 32768 SMID 799 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00 length 131072 SMID 340 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00 
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00 
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00 
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
da0 at mpr0 bus 0 scbus7 target 39 lun 0
da0: <ATA WDC WD60EFRX-68M 0A82> s/n      WD-WX41D94RN5A3 detached
(da0:mpr0:0:39:0): Periph destroyed
 
Just a while ago:

Code:
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00 length 4096 SMID 442 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00 length 131072 SMID 870 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00 length 32768 SMID 728 terminated ioc 804b scsi 0 state c xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00 length 4096 SMID 189 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00 length 32768 SMID 799 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00 length 131072 SMID 340 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 58 00 01 00 00
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fc e5 10 00 00 40 00
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
(da0:mpr0:0:39:0): READ(10). CDB: 28 00 1c fd 19 a8 00 00 08 00
(da0:mpr0:0:39:0): CAM status: CCB request aborted by the host
(da0:mpr0:0:39:0): Retrying command
da0 at mpr0 bus 0 scbus7 target 39 lun 0
da0: <ATA WDC WD60EFRX-68M 0A82> s/n      WD-WX41D94RN5A3 detached
(da0:mpr0:0:39:0): Periph destroyed
These look like they were all on target 39. It might be useful to run a smartctl -t long test on that drive to see if it experienced an infant mortality failure unrelated to the other issue(s). You may need to power cycle the drive or put it in another system if it doesn't come back after doing a # camcontrol rescan all.
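
Roughly this, assuming the drive shows back up as da0 (adjust the device name; a SATA drive behind the HBA may also need -d sat):

Code:
camcontrol rescan all
smartctl -t long /dev/da0        # start the long self-test (takes hours on a 6TB drive)
smartctl -l selftest /dev/da0    # check the result once it finishes
smartctl -a /dev/da0             # full attribute / error-log dump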

Unfortunately, the firmware in the LSI controller is translating SCSI commands (coming from the FreeBSD CAM layer) to SAS for your drives, and then translating the returned responses back into SCSI. So it is difficult to say what is going on by looking at the SCSI read CDB in the error message. And of course, you have SAS expander(s) hiding things from the controller as well.

Other LSI drivers like mfi(4) provide some useful console messages when they see something interesting on the controller. For example:
Code:
Dec 1 03:42:37 hostname kernel: mfi0: 4970 (boot + 61s/0x0002/info) - Inserted: PD 05(e0x20/s5)
Dec 1 03:42:37 hostname kernel: mfi0: 4971 (boot + 61s/0x0002/info) - Inserted: PD 05(e0x20/s5) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=5000c5003b55189d,0000000000000000
Dec 1 03:42:37 hostname kernel: mfi0: 4972 (boot + 62s/0x0042/info) - Dedicated Hot Spare created on PD 05(e0x20/s5) (ded,rev,ea,ac=1)
Dec 1 03:42:37 hostname kernel: mfi0: 4973 (470738445s/0x0020/info) - Time established as 12/01/14 8:40:45; (63 seconds since power on)
Dec 1 03:42:37 hostname kernel: mfi0: 4974 (470738482s/0x0008/info) - Battery temperature is normal
Dec 1 03:42:37 hostname kernel: mfi0: 4975 (470738482s/0x0008/info) - Current capacity of the battery is above threshold
Dec 1 03:42:37 hostname kernel: mfi0: 4976 (470738482s/0x0008/info) - Battery started charging
Do you have any log messages like this?
 
Terry, the weirdest thing just happened. The drive does not appear anymore. I even rebooted the system and I can only see 20 drives now instead of 21.

Regarding the console messages, no I don't. Nothing at all.
 
Terry, the weirdest thing just happened. The drive does not appear anymore. I even rebooted the system and I can only see 20 drives now instead of 21.
Can you power cycle the system remotely? A vanishing drive generally indicates a problem with that drive (assuming all cabling, expanders, etc. are undamaged). I've had many WD drives (mostly RE4's) disappear when they failed. Sometimes they'd come back after a power cycle, but would show SMART errors.
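
If the BMC is reachable over the network, ipmitool from another box should do it (a sketch; interface, host and credentials are placeholders):

Code:
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P <password> chassis power cycle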
 
Power cycling did not help. I even powered off the system for 10 minutes from the IPMI. Then I went into the LSI BIOS, changed a setting, changed it back, and saved all settings. The drive suddenly appeared again. No SMART errors though. At this point I think that replacing the controller should be the first step, right?
 
It is odd that all the errors appeared on one drive. Normally I would say to check cabling, power supply, etc. but since you're in a different country that isn't really practical. Try re-adding the drive to the pool, continue testing, and see if the same drive drops out. If possible, you might want to swap it with another drive elsewhere in the shelf to see if the problem stays with the drive or the slot it was in.
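
Since the pool uses GPT labels, ZFS will find the drive wherever you put it, so the swap itself is straightforward (label taken from your status output):

Code:
zpool offline Pool gpt/WD-WX41D94RN5A3
# physically swap the drive with one from another slot / backplane connector
zpool online Pool gpt/WD-WX41D94RN5A3
zpool scrub Pool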
 
UPDATE!

The controller has been replaced. I tried FreeBSD 9.3-RELEASE (LSI driver & native driver), FreeBSD 10.1-RELEASE (native driver), and CentOS 6.6 (native driver, ZFS on Linux). In all cases I get checksum errors again when scrubbing the pool.

This is what the guy from the reseller's support suggested (obviously I am talking to an idiot here):

Hi,

I had checked with Supermicro engineer and he don't have same
experience. And that is not easy to simulate your condition. I can
not find the solution to fix your problem now.


I am asking if new firmware version for SAS3 expander available and waiting for the answer.


Can you confirm if all 21 HDDs are set as a ZFS pool? what is the rules setting for this pool?


I have suggestions to try,
1. Set less number of HDD as the pool to see if still have same problem.
2. Chang the rules for your pool, maybe just some function cause the problem.

3. Move 10 HDDs to the rear side slots, that will use both SAS3
expanders to make performance better and have quick response time. That maybe can work.
4. Use other Linux and ZFS pool version to try.


Thanks
Nick

Hi David,

I am surprised to read what your comments on the email. I know you are upset for the computer that does not function as what your software expected. But that does not mean our hardware has quality issue. You picked this totally new model (new motherboard, new CPU, new 12Gb SAS3 controller and new SAS3 expander) and try to set up the free software that is not many people used. That will have possible risk for compatibility issue. I had contacted with Supermicro's engineers, so far still not find who has close experience to talk.

We did not know what software will be used for this machine. You had no software bought from us. But I still try to think all possible to give you my suggestions. I have big confidence that I am a very good tech support guy.

another suggestion: Can you check if can find new software driver for the LSI 9300-8i for the OS you used. Maybe new driver can fix it.

Thanks
Nick


I have contacted SuperMicro directly again. If they don't respond within 2 days, the machine goes back and we are looking for a different vendor.

Dear all,

On 11/21/2014 we purchased the attached quotation from Acmemicro (http://www.acmemicro.com/). We are experiencing some serious issues with the machine (ZFS checksum errors) and we have received no support from them so far. We have repeatedly asked them to put us in touch with a SuperMicro engineer to troubleshoot our problem.

So far, we have used SuperMicro products to build our servers without any problems.

I know that your policy is to provide support only through your authorized resellers. However, it appears that we cannot get any support from them. So we are left with two choices here: either we get some support directly from you, or we ship the product back and look for an alternative vendor for our storage needs.

Thank you,
 
I have contacted SuperMicro directly again. If they don't respond within 2 days, the machine goes back and we are looking for a different vendor.
I have always had excellent support from Supermicro. And they definitely know what FreeBSD is (despite lumping it under "Linux" in their charts). Of course, I am the reseller, but I have received good support regardless of whether I identified myself or not. (One of my rules for selecting suppliers is to see how they treat "the little guy" before I begin a reseller relationship with them.)

For many years, I have done my own builds from scratch (example), because I kept running into the same sort of unintelligent responses from "integrators".

If I were less skilled at this or wanted to expend somewhat more money and a lot less time, I'd probably purchase a preconfigured system from a place like iXsystems.
 
I have always had excellent support from Supermicro. And they definitely know what FreeBSD is (despite lumping it under "Linux" in their charts). Of course, I am the reseller, but I have received good support regardless of whether I identified myself or not. (One of my rules for selecting suppliers is to see how they treat "the little guy" before I begin a reseller relationship with them.)

I'd sign that :)
 
The controller and the backplanes are well-tested parts, but the 6TB Reds are not. I would check with a mix of different HDDs. Since the controller has already been replaced, I would blame the HDDs and the driver first.

The problem, as others have mentioned, most likely lies in the firmware of the SAS expanders or the chassis. Unfortunately, SuperMicro does not have any updates on their website. I can't just buy 21 different drives to experiment with unless I have solid evidence that they are faulty, and so far smartmontools indicates otherwise.
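
For reference, the expander firmware revision can be read from FreeBSD with camcontrol (ses1 and ses2 being the expander devices from the boot messages earlier):

Code:
camcontrol inquiry ses1    # vendor / product / firmware revision of the expander
camcontrol inquiry ses2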
 
SM offers only a small set of backplanes, used across all their chassis products. I would be surprised if a firmware-related bug hit only you, as long as more than a handful of people are using this chassis with this backplane.
 
SM offers only a small set of backplanes, used across all their chassis products. I would be surprised if a firmware-related bug hit only you, as long as more than a handful of people are using this chassis with this backplane.

This is a brand new chassis that hit the market only a couple of months ago. I am not sure how many people are using it for their ZFS storage needs; so far I have not met any on the FreeBSD mailing lists. In any case, replacing 21 drives without any indication is an expense that cannot be justified. If SM suggests that this is the way we should go, then maybe that is what we need to do.

But the issue here so far has been the fact that we don't get any support. The response "try to set up the free software that is not many people used" is simply unprofessional.
 