Poor transfer rates on SATA drives on LSI SAS3008 controller

This is a brand-new computer, so I don't know if this is an issue with this controller or some other kind of interaction within the system. Suggestions for troubleshooting are welcome.

On a fresh install of FreeBSD 14.0-RELEASE, I'm only getting about 70MB/s while doing a dd if=/dev/zero of=/dev/daX bs=8M to drives connected to this controller. When the same disks are attached to a normal SATA controller, they'll do about 200MB/s. It's not a total bandwidth limitation, because if I run multiple instances of dd(1) against multiple drives, each will do 70-80MB/s.
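For reference, this is roughly what I'm running (device names and counts here are just illustrative):
Code:
# single-drive sequential write test (destroys data on the target drive!)
dd if=/dev/zero of=/dev/da0 bs=8m count=1024

# one dd per drive, in parallel
for d in da0 da1 da2 da3; do
    dd if=/dev/zero of=/dev/$d bs=8m count=1024 &
done
wait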

gstat -p shows 100% busy during these tests.

I don't know if it's related, but mprutil keeps causing this line to show up in syslog:
Code:
Dec 28 01:01:00 marathon kernel: mpr0: mpr_user_pass_thru: user reply buffer (64) smaller than returned buffer (68)


Code:
Dec 28 00:52:11 marathon kernel: mpr0: <Avago Technologies (LSI) SAS3008> port 0xf000-0xf0ff mem 0xfcc40000-0xfcc4ffff,0xfcc00000-0xfcc3ffff at device 0.0 on pci1
Dec 28 00:52:11 marathon kernel: mpr0: Firmware: 16.00.12.00, Driver: 23.00.00.00-fbsd
Dec 28 00:52:11 marathon kernel: mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

Dec 28 00:52:11 marathon kernel: mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 3
Dec 28 00:52:11 marathon kernel: mpr0: At enclosure level 0 and connector name (    )
Dec 28 00:52:11 marathon kernel: mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000a> enclosureHandle<0x0001> slot 2
Dec 28 00:52:11 marathon kernel: mpr0: At enclosure level 0 and connector name (    )
Dec 28 00:52:11 marathon kernel: mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000b> enclosureHandle<0x0001> slot 0
Dec 28 00:52:11 marathon kernel: mpr0: At enclosure level 0 and connector name (    )
Dec 28 00:52:11 marathon kernel: mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000c> enclosureHandle<0x0001> slot 1
.... etc

Dec 28 00:52:11 marathon kernel: da0: <ATA WDC WD80EZAZ-11T 0A83> Fixed Direct Access SPC-4 SCSI device
Dec 28 00:52:11 marathon kernel: da0: Serial Number JEKUR9VZ
Dec 28 00:52:11 marathon kernel: da0: 600.000MB/s transfers
Dec 28 00:52:11 marathon kernel: da0: Command Queueing enabled
Dec 28 00:52:11 marathon kernel: da0: 7630885MB (15628053168 512 byte sectors)
Dec 28 00:52:11 marathon kernel: da1 at mpr0 bus 0 scbus0 target 1 lun 0
Dec 28 00:52:11 marathon kernel: da1: <ATA WDC WD80EZAZ-11T 0A83> Fixed Direct Access SPC-4 SCSI device
Dec 28 00:52:11 marathon kernel: da1: Serial Number 2SG0JGSF
Dec 28 00:52:11 marathon kernel: da1: 600.000MB/s transfers
Dec 28 00:52:11 marathon kernel: da1: Command Queueing enabled
Dec 28 00:52:11 marathon kernel: da1: 7630885MB (15628053168 512 byte sectors)
Dec 28 00:52:11 marathon kernel: da3 at mpr0 bus 0 scbus0 target 3 lun 0
Dec 28 00:52:11 marathon kernel: da3: <ATA WDC WD80EDAZ-11T 0A81> Fixed Direct Access SPC-4 SCSI device
Dec 28 00:52:11 marathon kernel: da3: Serial Number VGH4P1JG
..... etc

Code:
# sysctl dev.mpr.0 | less
dev.mpr.0.prp_page_alloc_fail: 0
dev.mpr.0.prp_pages_free_lowwater: 0
dev.mpr.0.prp_pages_free: 0
dev.mpr.0.use_phy_num: 1
dev.mpr.0.dump_reqs_alltypes: 0
dev.mpr.0.spinup_wait_time: 3
dev.mpr.0.chain_alloc_fail: 0
dev.mpr.0.enable_ssu: 1
dev.mpr.0.max_io_pages: -1
dev.mpr.0.max_chains: 16384
dev.mpr.0.chain_free_lowwater: 16384
dev.mpr.0.chain_free: 16384
dev.mpr.0.io_cmds_highwater: 8
dev.mpr.0.io_cmds_active: 1
dev.mpr.0.msg_version: 2.5
dev.mpr.0.driver_version: 23.00.00.00-fbsd
dev.mpr.0.firmware_version: 16.00.12.00
dev.mpr.0.max_evtframes: 32
dev.mpr.0.max_replyframes: 2048
dev.mpr.0.max_prireqframes: 128
dev.mpr.0.max_reqframes: 2048
dev.mpr.0.msix_msgs: 1
dev.mpr.0.max_msix: 96
dev.mpr.0.disable_msix: 0
dev.mpr.0.debug_level: 0x3,info,fault
dev.mpr.0.%parent: pci1
dev.mpr.0.%pnpinfo: vendor=0x1000 device=0x0097 subvendor=0x1014 subdevice=0x0457 class=0x010700
dev.mpr.0.%location: slot=0 function=0 dbsf=pci0:1:0:0
dev.mpr.0.%driver: mpr
dev.mpr.0.%desc: Avago Technologies (LSI) SAS3008

Code:
# mprutil show adapter
mpr0 Adapter:
       Board Name: N2215
   Board Assembly: H3-25480-04D
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.12.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (8.0 GB/sec)
        IOC Speed: Full
      Temperature: 71 C

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     3.0    12     SAS Initiator
1       0002        000a       N         6.0     3.0    12     SAS Initiator
2       0003        000b       N         6.0     3.0    12     SAS Initiator
3       0004        000c       N         6.0     3.0    12     SAS Initiator
4       0005        000d       N         6.0     3.0    12     SAS Initiator
5       0006        000e       N         6.0     3.0    12     SAS Initiator
6       0007        000f       N         6.0     3.0    12     SAS Initiator
7       0008        0010       N         6.0     3.0    12     SAS Initiator

Code:
# mprutil show iocfacts
          MsgVersion: 2.5
           MsgLength: 17
            Function: 0x3
       HeaderVersion: 50,00
           IOCNumber: 0
            MsgFlags: 0x0
               VP_ID: 0
               VF_ID: 0
       IOCExceptions: 0
           IOCStatus: 0
          IOCLogInfo: 0x0
       MaxChainDepth: 128
             WhoInit: 0x4
       NumberOfPorts: 1
      MaxMSIxVectors: 96
       RequestCredit: 9856
           ProductID: 0x2221
     IOCCapabilities: 0x7a85c <ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
           FWVersion: 16.00.12.00
 IOCRequestFrameSize: 32
 MaxChainSegmentSize: 8
       MaxInitiators: 32
          MaxTargets: 1024
     MaxSasExpanders: 42
       MaxEnclosures: 43
       ProtocolFlags: 0x3 <ScsiTarget,ScsiInitiator>
  HighPriorityCredit: 104
MaxRepDescPostQDepth: 65504
      ReplyFrameSize: 32
          MaxVolumes: 0
        MaxDevHandle: 1106
MaxPersistentEntries: 128
        MinDevHandle: 9
 CurrentHostPageSize: 0
 
So I've done a bit of testing in some other OSes, with interesting results:

Linux - TrueNAS-SCALE - works fine, speeds as expected
Linux - Ubuntu Desktop 23.10 - prints a bunch of angry messages on startup (array out-of-bounds stuff, I think it was), but works and speeds are as expected
FreeBSD - TrueNAS-Core - works, but transfer rates are limited ~70MB/s
Windows 10 22H2 Out-of-box-driver - works, but transfer rates are limited ~70MB/s
 
> Dec 28 01:01:00 marathon kernel: mpr0: mpr_user_pass_thru: user reply buffer (64) smaller than returned buffer (68)

You can ignore this error. It's meaningless. It's just saying that whatever passthru commands you are using aren't reading the last 4 bytes of the buffer... mprutil never looks at those.

But the thing is, I thought that print was removed in 14.0, so that might indicate something isn't matching your expectation (or TrueNAS-Core is 13.x and still has the message).

You should be getting 200MB/s from this setup. We have something similar at work, and get 80-90MB/s from single actuator drives. But that's 1MB random access use pattern. We see closer to the 200MB/s sequentially. That's with SAS drives, though, not SATA drives.

What does sysctl kern.cam.da.0 show?
What does camcontrol tags report? It should be 32 (but may be 255). If it's not, then that's a problem to fix.
What does sysctl kern.maxphys report? I think it defaults to 1MB. Does dd bs=1m work any better?

100% busy for a single dd might indicate that something is saturated. But we know there's not a hardware issue since Linux can go faster (or at least it's a lot less likely).

Are the multiple dd's to one drive or multiple drives?

Warner
 
> But the thing is, I thought that print was removed in 14.0, so that might indicate something isn't matching your expectation (or TrueNAS-Core is 13.x and still has the message).

That print is on FreeBSD 14.0-RELEASE, not FreeNAS/TrueNAS-Core.

> Are the multiple dd's to one drive or multiple drives?
When I'm testing multiple dd's, I'm testing with one dd to each drive.

> What does sysctl kern.cam.da.0 show?
Code:
# sysctl kern.cam.da.0
kern.cam.da.0.trim_ticks: 0
kern.cam.da.0.trim_goal: 0
kern.cam.da.0.sort_io_queue: -1
kern.cam.da.0.unmapped_io: 1
kern.cam.da.0.rotating: 1
kern.cam.da.0.flags: 0x10ee50<ROTATING,WAS_OTAG,SCTX_INIT,CAN_RC16,PROBED,ANNOUCNED,CAN_ATA_DMA,CAN_ATA_LOG,UNMAPPEDIO>
kern.cam.da.0.p_type: 0
kern.cam.da.0.error_inject: 0
kern.cam.da.0.max_seq_zones: 0
kern.cam.da.0.optimal_nonseq_zones: 0
kern.cam.da.0.optimal_seq_zones: 0
kern.cam.da.0.zone_support: None
kern.cam.da.0.zone_mode: Not Zoned
kern.cam.da.0.trim_lbas: 0
kern.cam.da.0.trim_ranges: 0
kern.cam.da.0.trim_count: 0
kern.cam.da.0.minimum_cmd_size: 6
kern.cam.da.0.delete_max: 1048576
kern.cam.da.0.delete_method: NONE

> What does syctl kern.maxphys report? I think it defaults to 1MB.
Code:
# sysctl kern.maxphys
kern.maxphys: 1048576

> Does dd bs=1m work any better?
Code:
# dd if=/dev/zero of=/dev/da0 bs=1m count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 111.559817 secs (76998464 bytes/sec)

> What does camcontrol tags repot? It should be 32 (but may be 255). If it's not, then that's a problem to fix.
Code:
# camcontrol tags da0
(pass0:mpr0:0:0:0): device openings: 255

Code:
# camcontrol tags da0 -v
(pass0:mpr0:0:0:0): dev_openings  255
(pass0:mpr0:0:0:0): dev_active    0
(pass0:mpr0:0:0:0): allocated     0
(pass0:mpr0:0:0:0): queued        0
(pass0:mpr0:0:0:0): held          0
(pass0:mpr0:0:0:0): mintags       2
(pass0:mpr0:0:0:0): maxtags       255

I tried forcing the tags to 32, but this made no difference.
Code:
# camcontrol tags da0 -N 32
(pass0:mpr0:0:0:0): tagged openings now 32
(pass0:mpr0:0:0:0): device openings: 32

Perhaps relevant, this issue only seems to affect bulk writes, not reads.
Code:
# dd if=/dev/da1 of=/dev/null bs=1M count=8192
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 45.093077 secs (190493422 bytes/sec)
 
> So I've done a bit of testing in some other OSes, with interesting results:
>
> Linux - TrueNAS-SCALE - works fine, speeds as expected
> Linux - Ubuntu Desktop 23.10 - prints a bunch of angry messages on startup (array out-of-bounds stuff, I think it was), but works and speeds are as expected
> FreeBSD - TrueNAS-Core - works, but transfer rates are limited ~70MB/s
> Windows 10 22H2 Out-of-box-driver - works, but transfer rates are limited ~70MB/s
Is there any way you can measure the queue depth on the disk? The 70MB/s number is suspiciously close to what might happen if the disk can only work on one IO at a time and misses a whole revolution between two IOs. (The IO size at the disk is not necessarily the IO size that comes from userspace.)

Another question: Could it be that on some operating systems, the disk gets used in "verify after every write" mode? One of my favorite horror stories is from the early 2000s: A big computer customer deliberately set their data center temperature to be REALLY cold (about 10 or 12 degrees C), with the idea being that it would make the big computer more reliable and efficient. What really happened is that the disks noticed the extremely cold temperature, and for safety went into "verify after every write" mode. And with small IOs, having to wait one extra revolution for every 512-byte or 4KiB write was catastrophic, and their system became laughably slow.
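If it helps, one crude way to watch that queue depth is to sample camcontrol from a second terminal while the dd runs (purely illustrative, using the same camcontrol tags output shown elsewhere in this thread):
Code:
# sample the device queue depth once a second during the transfer
while true; do
    camcontrol tags da0 -v | grep dev_active
    sleep 1
done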
 
> Is there any way you can measure the queue depth on the disk? The 70MB/s number is suspiciously close to what might happen if the disk can only work on one IO at a time and misses a whole revolution between two IOs. (The IO size at the disk is not necessarily the IO size that comes from userspace.)
>
> Another question: Could it be that on some operating systems, the disk gets used in "verify after every write" mode? One of my favorite horror stories is from the early 2000s: A big computer customer deliberately set their data center temperature to be REALLY cold (about 10 or 12 degrees C), with the idea being that it would make the big computer more reliable and efficient. What really happened is that the disks noticed the extremely cold temperature, and for safety went into "verify after every write" mode. And with small IOs, having to wait one extra revolution for every 512-byte or 4KiB write was catastrophic, and their system became laughably slow.
I disconnected one drive from the LSI controller and plugged it into the motherboard, where it was picked up as ada1.

Code:
# dd if=/dev/zero of=/dev/ada1 bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 52.340632 secs (205144987 bytes/sec)

I ran camcontrol tags ada1 -v several times during this operation, and always got the same result:

Code:
# camcontrol tags ada1 -v
(pass4:ahcich5:0:0:0): dev_openings  31
(pass4:ahcich5:0:0:0): dev_active    1
(pass4:ahcich5:0:0:0): allocated     1
(pass4:ahcich5:0:0:0): queued        0
(pass4:ahcich5:0:0:0): held          0
(pass4:ahcich5:0:0:0): mintags       2
(pass4:ahcich5:0:0:0): maxtags       32

If I run against the LSI controller, the results are identical, but the write speed is much lower:

Code:
# dd if=/dev/zero of=/dev/da0 bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 141.008964 secs (76147062 bytes/sec)

Code:
# camcontrol tags da0 -v
(pass0:mpr0:0:0:0): dev_openings  254
(pass0:mpr0:0:0:0): dev_active    1
(pass0:mpr0:0:0:0): allocated     1
(pass0:mpr0:0:0:0): queued        0
(pass0:mpr0:0:0:0): held          0
(pass0:mpr0:0:0:0): mintags       2
(pass0:mpr0:0:0:0): maxtags       255

Both controllers do about 200MB/s on sequential read, and both report dev_active 1 during that operation.

Code:
# dd if=/dev/da0 of=/dev/null bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 54.279190 secs (197818322 bytes/sec)

# camcontrol tags da0 -v
(pass0:mpr0:0:0:0): dev_openings  254
(pass0:mpr0:0:0:0): dev_active    1
(pass0:mpr0:0:0:0): allocated     1
(pass0:mpr0:0:0:0): queued        0
(pass0:mpr0:0:0:0): held          0
(pass0:mpr0:0:0:0): mintags       2
(pass0:mpr0:0:0:0): maxtags       255

Code:
# dd if=/dev/ada1 of=/dev/null bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 52.217519 secs (205628656 bytes/sec)

# camcontrol tags ada1 -v
(pass4:ahcich5:0:0:0): dev_openings  31
(pass4:ahcich5:0:0:0): dev_active    1
(pass4:ahcich5:0:0:0): allocated     1
(pass4:ahcich5:0:0:0): queued        0
(pass4:ahcich5:0:0:0): held          0
(pass4:ahcich5:0:0:0): mintags       2
(pass4:ahcich5:0:0:0): maxtags       32

If I write to a file on a UFS filesystem on ada0, I get about 130MB/s (which I have no objection to, given I don't know where on the disk that's being written, plus filesystem updates, and this is a different, older, smaller drive), and lots of queue activity. dev_active was observed varying between 11 and 20.

Code:
# dd if=/dev/zero of=./junk bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 82.434079 secs (130254603 bytes/sec)


# camcontrol tags ada0 -v
(pass8:ahcich4:0:0:0): dev_openings  18
(pass8:ahcich4:0:0:0): dev_active    14
(pass8:ahcich4:0:0:0): allocated     14
(pass8:ahcich4:0:0:0): queued        0
(pass8:ahcich4:0:0:0): held          0
(pass8:ahcich4:0:0:0): mintags       2
(pass8:ahcich4:0:0:0): maxtags       32

Now this is interesting..... I created a UFS partition and filesystem on da0, and did the same write operation there..... dev_active was seen varying between 7 and 18 -- and the write performance was great.....

Code:
# dd if=/dev/zero of=./junk bs=1m count=10240
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 56.042128 secs (191595476 bytes/sec)


# camcontrol tags da0 -v
(pass0:mpr0:0:0:0): dev_openings  241
(pass0:mpr0:0:0:0): dev_active    14
(pass0:mpr0:0:0:0): allocated     14
(pass0:mpr0:0:0:0): queued        0
(pass0:mpr0:0:0:0): held          0
(pass0:mpr0:0:0:0): mintags       2
(pass0:mpr0:0:0:0): maxtags       255
 
> If I write to a file on a UFS filesystem on ada0, I get about 130MB/s (which I have no objection to, given I don't know where on the disk that's being written, plus filesystem updates, and this is a different, older, smaller drive), and lots of queue activity. dev_active was observed varying between 11 and 20.
>
> ...
>
> Now this is interesting..... I created a UFS partition and filesystem on da0, and did the same write operation there..... dev_active was seen varying between 7 and 18 -- and the write performance was great.....
The key to getting good disk performance *ON RANDOM SMALL IOs* is to have the IOs queued pretty deeply, and the queue going all the way to the disk drive itself (so it can sort the queue and execute the IOs in the optimal order). The funny thing here is: it sometimes works for you, and it sometimes doesn't. And I don't know why.

One other thing one could check (if one had infinite spare time) is whether one of the IO stacks breaks the IOs up into smaller pieces than the other. Doing that would require reading the code, or enabling kernel tracing low down in the IO stack. Or getting a SATA analyzer hooked up.
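Short of kernel tracing, one rough proxy is gstat: watch just that disk and divide kBps by the write ops/s; the ratio is the average write size actually reaching the device (flags here per gstat(8), adjust the regex to taste):
Code:
# physical providers only, da0 only, refresh once a second
gstat -p -I 1s -f '^da0$'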
 
Hrm.

I wonder if this is the reason the camdd(8) program exists.

Code:
# camdd -i file=/dev/zero -o pass=/dev/da1 -m 10G
10737418240 bytes read from /dev/zero
10737418240 bytes written to pass1
55.5901 seconds elapsed
184.21 MB/sec

# camcontrol tags da1 -v
(pass1:mpr0:0:1:0): dev_openings  252
(pass1:mpr0:0:1:0): dev_active    3
(pass1:mpr0:0:1:0): allocated     3
(pass1:mpr0:0:1:0): queued        0
(pass1:mpr0:0:1:0): held          0
(pass1:mpr0:0:1:0): mintags       2
(pass1:mpr0:0:1:0): maxtags       255
 
Is it possible that the 'mpr' driver is not ideal for your controller? I ran into similar trouble with a PERC H730, which is said to be a rebranded LSI 3108. FreeBSD 13 will select the 'mfi' driver automatically, but performance is poor and various errors appear in the system logs. All of these troubles go away if I manually specify the 'mrsas' driver.
 
I tried following the instructions for the mrsas(4) driver, but it would not pick up this card. Either I screwed something up, or the SAS3008 is only supported by mpr(4) in FreeBSD 14.
 
Did I get these right?
  1. problem appears only on seq write, not on read
  2. problem appears only with disk on SAS controller, not with same disk on mainboard SATA
  3. problem appears only with this model disk, not with other brand disk on same SAS controller
  4. problem appears only with FBSD14, not with Linux
In that case it would be something that FBSD does to this controller, when initializing or during writes, that triggers an incompatibility with this model of disk.
These controllers are intelligent; they do things on their own in the SATA-SAS translation layer.

Glancing thru the thread again, it seems point 3. above has not yet been tested. So the next step would be: get some scrap disks/SSDs, plug them into the SAS controller and see if any of them gets to decent write speeds. If not, it would be a genuine problem with the controller (or the controller/mainboard combo) on FBSD.
Depending on the results, a next step could be to fire up dtrace and get as much timing information from the device driver as there is to get.
 
> problem appears only on seq write, not on read
It also does not appear during non-(purely-)sequential writes - mount a filesystem on the disk and write to the filesystem, and the drive performs great. It's only sequential writes to the bare disk device that are unexpectedly slow; the camdd program works as a workaround for this.

It's been speculated that command-queue depth might be related - camdd runs a queue depth of four, while dd straight to the disk device runs a queue depth of one.

> problem appears only with disk on SAS controller, not with same disk on mainboard SATA
Correct. The same disk plugged into the motherboard works as expected.

> problem appears only with this model disk, not with other brand disk on same SAS controller
I haven't tested any other disk models. As I've just retired my old NAS I've got some spare Seagate and WD drives floating around now. I'll do some testing with those and report back.

> problem appears only with FBSD14, not with Linux
Correct. dd to bare disk device on TrueNAS-SCALE produced expected speeds. The controller is now in-service on TrueNAS-SCALE and the 8-disk raidz2 array can sustain write speeds over 1GB/sec.

I'm not familiar with dtrace. Can you provide some example commands that might get the information you're looking for?
 
Are those WD80EZAZ-11T drives by any chance coming from WD MyBooks? If yes (and IIRC 'EZAZ' were mainly used in MyBooks and later sold cheaply at ebay et al) there are several drawbacks to those drives:
1. 5400rpm (or 5400-ish; there were some variants with slightly higher peak but variable rpm available)
2. there were SMR-variants of those drives; IIRC the 256MB cache versions were SMR, 128MB PMR
3. IIRC back when those were new there were discussions about different firmware variants, mainly for power saving to make those MyBooks more tolerant to the low power budget from many USB-connectors (especially on laptops)

Given that the 'EZAZ' drives were rated at 180MB/s by WD, those 200MB/s figures might indicate there is some caching going on that inflates those numbers. Linux notoriously caches everything, which also results in comical figures like several-hundred-MB/s write speeds to USB drives, so I'd take those results with a grain of salt...

Also, using bs=8MB won't give any advantage on spinning rust with block sizes of 512b or 4k max - maybe those large blocks completely overwhelm the drive's firmware (and the puny SATA queue) and cause extensive throttling on the transfer. Try bs=1MB or even smaller to verify whether the rates change or are constant.
IIRC /dev/zero is cpu-bound on FreeBSD (can't find anything on that, please correct me if this artifact in the back of my head is wrong!), so for very slow CPUs this might also pose a bottleneck and can falsify such a dd 'benchmark'.
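(One quick sanity check for that on a given box: push zeros straight to /dev/null and see what dd alone can do; if that number isn't far above the disk speeds, the 'benchmark' is CPU-limited.)
Code:
# upper bound for dd itself, no disk involved
dd if=/dev/zero of=/dev/null bs=1m count=65536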

But if those drives really came out of MyBooks, I'd also get some "proper" SATA hard drives - i.e. something that spins at a full 7200rpm, isn't 4-5 years old, and doesn't have firmware crippled for USB use that might not play well with SATA tunneling on a SAS controller - and run some identical tests on those.


Regarding the mrsas/mpr drivers: mrsas is for RAID controllers (e.g. 3108); mpr is for the plain HBA (3008) like yours, so nothing to change on that side.

I'm running those LSI SAS 3008 HBAs almost everywhere (currently 8 hosts, 2 of them with 2 of those HBAs, so 10 in total) and they perform as expected. E.g. I'm running a bunch of 1.92 and 3.84TB SAS SSDs on them and they easily saturate the available bandwidth (either per drive or for the PCIe slot) even for large (multi-TB) data transfers.
I'm not using 14.0-R anywhere on production systems (currently only on my laptop), so those are all 13.2-R. If you are still certain this is an OS issue, maybe you could do some tests with 13.2-R to verify there is no regression (I don't think there is one), but my bet would be on that specific drive variant...


edit:
> It also does not appear during non-(purely-)sequential writes - mount a filesystem on the disk and write to the filesystem, and the drive performs great. It's only sequential writes to the bare disk device that are unexpectedly slow; the camdd program works as a workaround for this.
Which might also point to a too-large blocksize used with dd causing the disk/firmware to stall. Try bs=512 or bs=4K and report the results.
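For example, something along these lines would sweep a few sizes with the same amount of data per run (purely a sketch - and it overwrites da0):
Code:
# 2 GiB per run; bs/count pairs kept in sync by hand
# (add a bs=512 case with a smaller count if wanted; it will be slow)
for run in "4k 524288" "64k 32768" "1m 2048" "8m 256"; do
    set -- $run
    echo "bs=$1:"
    dd if=/dev/zero of=/dev/da0 bs=$1 count=$2 2>&1 | tail -1
done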
 
Yes, the drives are all shucked. I don't remember reading anything about SMR on them, and I haven't noticed SMR's performance impacts. The 5400 RPM doesn't bother me in the slightest - these are bulk storage drives.

That said, I haven't done a proper random-write test on one of these, either. I'll arrange to do that in the near future.

> Try bs=1MB or even smaller to verify whether the rates change or are constant.
I've tried a bunch of blocksizes, I think 4k, 256k, 512k, 1m, 2m, and 8m. It never gets any faster than the 70-80MB/s being discussed here.

> Given that the 'EZAZ' drives were rated at 180MB/s by WD, those 200MB/s figures might indicate there is some caching going on that inflates those numbers.
Keep in mind that simple dd tests are writing at the very outer edge of the disk, where it's fastest. It's not unreasonable to think that the 180MB/s spec was intended as more of an 'average write performance'.
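For what it's worth, diskinfo -t runs read-only transfer-rate tests at the outside, middle and inside of the disk, which shows that spread directly:
Code:
# read-only; reports transfer rates at the outer, middle and inner zones
diskinfo -tv da0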
 
> I'm not familiar with dtrace. Can you provide some example commands that might get the information you're looking for?
That's not that simple. dtrace is rather a language of its own. There are probe points in the source code, and dtrace can attach to them and measure the time passing between them. So one could probably figure out where the time is actually spent, and compare this to other scenarios. This can be done on a regular, live, even production system - but it requires some basic understanding of how the relevant source code is organized, and a little C coding experience.

Example commands are here: https://wiki.freebsd.org/DTrace/One-Liners
And, no, I'm not looking for specific output, as I'm not really in the mood to now go and start reading the mps driver code. ;)
But, if this happens reproducibly with any disk, it might justify a PR - and that might attract people who were/are actively working in the driver code.
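As a very rough, untested starting point (the io provider probes are real, but treat the exact field names as an assumption and adjust if they don't compile on your system), something like this would show how long each BIO takes and how big the I/Os actually issued to the disks are:
Code:
# histogram of time spent per BIO between io:::start and io:::done
# (traces every disk in the system - keep other I/O quiet while it runs)
dtrace -n 'io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ { @["ns per bio"] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'

# histogram of the I/O sizes reaching the driver
dtrace -n 'io:::start { @["bytes per bio"] = quantize(args[0]->b_bcount); }'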

OTOH, as this all adds up, it might even be a problem that one could live with. You may want to find out how things behave under ZFS (which can be quite different from dd), and specifically test a resilver.
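For instance, a throwaway pool along these lines (disk names purely illustrative, and it wipes them) would let you watch both sequential writes through ZFS and a resilver:
Code:
# scratch mirror; compression off so the /dev/zero writes aren't compressed away
zpool create -f -O compression=off scratch mirror da0 da1
dd if=/dev/zero of=/scratch/junk bs=1m count=10240

# knock one side out, write more, then bring it back and watch the resilver
zpool offline scratch da1
dd if=/dev/zero of=/scratch/junk2 bs=1m count=10240
zpool online scratch da1
zpool status scratch
zpool iostat -v scratch 1    # Ctrl-C to stop

zpool destroy scratch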
 
> 1. 5400rpm (or 5400-ish; there were some variants with slightly higher peak but variable rpm available)
Compared to 7200 rpm, this affects rotational latency, therefore affects random IOs and non-queued IOs, if they lose a revolution between IOs. Should not affect purely sequential IOs that are queue overlapped enough to start the second sequential IO immediately when the first one finishes. And the sequential bandwidth is mostly limited by the head bit frequency, not the platter speed.

> 2. there were SMR-variants of those drives; IIRC the 256MB cache versions were SMR, 128MB PMR
SMR should not affect very large groups of sequential writes. In particular, if one does purely sequential writes that cover a whole zone (zone = aligned 256 MB part of the disk) or better multiple zones, SMR has no effect at all. Where SMR hurts is random updating writes. (And it obviously helps, a lot, with disk capacity).
> 3. IIRC back when those were new there were discussions about different firmware variants, mainly for power saving to make those MyBooks more tolerant to the low power budget from many USB-connectors (especially on laptops)
If the disk itself were incapable of sustaining large runs of sequential IOs (due to power throttling), how could it show good performance on Linux or on a SATA connector? That is, unless FreeBSD with the LSI controller does something to the drive which slows it down.
> Given that the 'EZAZ' drives were rated at 180MB/s by WD, those 200MB/s figures might indicate there is some caching going on that inflates those numbers.
As starslab already said, the sequential speed varies from outer to inner diameter; the 180 MB/s rating might very well be an average number. And for large enough tests (a few GB, meaning a test that runs longer than 10 seconds), caching becomes a small effect.
> Also, using bs=8MB won't give any advantage on spinning rust with block sizes of 512b or 4k max - maybe those large blocks completely overwhelm the drive's firmware (and the puny SATA queue) and cause extensive throttling on the transfer. Try bs=1MB or even smaller to verify whether the rates change or are constant.
That is incorrect: For IOs that come from user space, a bs of many MB is nearly always more efficient than many small (sequential) IOs at a bs of 512 or 4K. Why? Because there is significant overhead in performing an IO, for example the user/system transition (going into the kernel), setting up the memory maps, or (god forbid) skipping DMA because the block is too small. Think of it this way: At 200 MB/s, each individual 4K IO would have to be executed in about 25us (microseconds), and doing anything that fast repeatedly is hard. Even worse: Due to a scheduling snafu, it's always possible that a long sequence of sequential small IOs gets interrupted, and that is very expensive: An occasional blip that costs a whole revolution (on the order of 10 ms) in a sequence of IOs each of which can only take 25 us is catastrophic for performance.

But ultimately, you're partially right. The difference between a sequential, overlapped (queue depth >= 2) stream of 1 MB IOs and single 8 MB IOs is very small, single-digit percent (and yes, my day job involved measuring stuff like this). On a well tuned system, larger IOs continue to be more efficient, until the point is reached where IO parallelism overwhelms the memory subsystem. For example, if you have 400 disk drives attached to a single machine, each disk drive is doing 8 MB IOs, and each disk drive has 10 IOs queued up, then you have 32 GB of RAM pinned down for IO, and you'd better have at least 64 gig of RAM in the box, or else something will blow up. And if you have a machine with enough hardware for this (I used to run on 1/4 TB machines with a half dozen PCI slots), you probably also have multiple CPU sockets, and you need to be very careful to make sure the path from PCI slot to RAM to CPU doesn't cross any CPU bridges multiple times.

Internally, there is another complexity, which is memory management of the LSI card driver and its DMA. When I used the previous-generation LSI HBAs on large Linux machines (about 8 years ago), they performed all IO in 32 KB chunks, mapping each chunk in memory separately. That is true even if the user-space application and the kernel drivers (which we had instrumented and tuned) offered the LSI card a physically contiguous memory range of several MB. For large IOs, this did not create a problem at all, UNLESS the LSI card gets confused (typically during error handling if an IO fails), and starts accessing the 32 KB chunks out of order. For small IOs, this sets the natural boundary at which IOs are too small to be efficient, and we found that 32 KB was really the smallest practical IO size. But note that most of my experience with this tuning was on Linux (and half of it was big-endian Linux on PowerPC), so FreeBSD's LSI drivers might work differently.
 
> Did I get these right?
> problem appears only with this model disk, not with other brand disk on same SAS controller

For sure this issue affects multiple vendors and models:

In addition to the WD drives previously discussed:

This disk gets 194.78MB/sec read via camdd, 193.35MB/sec write via camdd
193.53MB/sec read via dd, 73.67MB/sec write via dd
Code:
pass0: <ST1000DM003-9YN162 CC4D> ATA8-ACS SATA 3.x device
pass0: 600.000MB/s transfers, Command Queueing Enabled

protocol              ATA8-ACS SATA 3.x
device model          ST1000DM003-9YN162
firmware revision     CC4D
serial number         S1D0XDL9
WWN                   5000c5004aa3a10f
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes    yes
write cache                    yes    no
flush cache                    yes    yes
Native Command Queuing (NCQ)   yes        32 tags
NCQ Priority Information       no
NCQ Non-Data Command           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
NCQ Autosense                  no
SMART                          yes    yes
security                       yes    no
power management               yes    yes
microcode download             yes    yes
advanced power management      yes    yes    128/0x80
automatic acoustic management  no    no
media status notification      no    no
power-up in Standby            no    no
write-read-verify              yes    no    0/0x0
unload                         no    no
general purpose logging        yes    yes
free-fall                      no    no
sense data reporting           no    no
extended power conditions      no    no
device statistics notification no    no
Data Set Management (DSM/TRIM) no
Trusted Computing              no
encrypts all user data         no
Sanitize                       no
Host Protected Area (HPA)      yes      no      1953525168/1953525168
HPA - Security                 yes      no
Accessible Max Address Config  no


And this disk gets 167.13MB/sec reads via camdd, 163.33MB/sec writes via camdd
166.90MB/sec reads via dd, 61.95MB/sec writes via dd
Code:
pass0: <ST4000DM000-1F2168 CC54> ACS-2 ATA SATA 3.x device
pass0: 600.000MB/s transfers, Command Queueing Enabled

protocol              ACS-2 ATA SATA 3.x
device model          ST4000DM000-1F2168
firmware revision     CC54
serial number         Z3031TNN
WWN                   5000c5007a391441
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       7814037168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             5900
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes    yes
write cache                    yes    no
flush cache                    yes    yes
Native Command Queuing (NCQ)   yes        32 tags
NCQ Priority Information       no
NCQ Non-Data Command           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
NCQ Autosense                  no
SMART                          yes    yes
security                       yes    no
power management               yes    yes
microcode download             yes    yes
advanced power management      yes    yes    128/0x80
automatic acoustic management  no    no
media status notification      no    no
power-up in Standby            yes    no
write-read-verify              yes    no    0/0x0
unload                         no    no
general purpose logging        yes    yes
free-fall                      no    no
sense data reporting           no    no
extended power conditions      no    no
device statistics notification no    no
Data Set Management (DSM/TRIM) no
Trusted Computing              no
encrypts all user data         no
Sanitize                       no
Host Protected Area (HPA)      yes      no      7814037168/7814037168
HPA - Security                 yes      no
Accessible Max Address Config  no
 
During maintenance I did a short check on my machine. But I don't have a SAS3008; this is a SAS2008. And it is 13.2-RELEASE. And since my disks are encrypted, it has to run thru geli.

Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xb000-0xb0ff mem 0xf9a40000-0xf9a4ffff,0xf9a00000-0xf9a3ffff irq 42 at device 0.0 numa-domain 0 on pci8
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

ST3000DM008-2DM166 - that is the old CMR version of the desktop disk:
Code:
root@:/ # dd if=/dev/da10.elip19 of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.393762 secs (194405317 bytes/sec)
root@:/ # dd of=/dev/da10.elip19 if=/dev/zero bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.598524 secs (187295067 bytes/sec)

HUS726040AL - a 2017 Ultrastar:
Code:
root@:/ # dd if=/dev/da10.elip19 of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.204278 secs (201483489 bytes/sec)
root@:/ # dd of=/dev/da10.elip19 if=/dev/zero bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.184640 secs (202246646 bytes/sec)

So, no such strangeness here.
 