UFS: Unexpectedly slow performance of rotating drives

Hi forum. I'm trying to figure out whether my expectations are somehow out of tune, or whether I have something misconfigured on my FreeBSD 12.2-RELEASE box, or the drives themselves are misconfigured (alignment or whatever).

The box is a Dell PowerEdge R720 server with a PERC H710 Mini controller that I re-flashed to IT mode, which essentially turns it into a plain LSI SAS 2208 HBA. According to the specs it has 8 internal ports, each capable of 6 Gb/s. If memory serves, the R720 is PCIe 3.0, so we shouldn't bottleneck there. And I run two SFF-8087 cables to the backplane, one to each port, so all ports are covered (not that it matters for single-disk performance).

Now, the 8 drives are all enterprise-grade spinners. The stickers on them actually claim 12 Gb/s. In fact, when I plug them in, dmesg shows me 600 MB/s:
Code:
da0 at mps0 bus 0 scbus0 target 4 lun 0
da0: <SEAGATE ST8000NM0075 E003> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number ZA10P89D0000R622H5ZQ
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 7630885MB (15628053168 512 byte sectors, DIF type 2)

However, I've not seen anything even approaching 200 MB/s out of them. An in-place rsync consistently tops out at around 180 MB/s.

Here's diskinfo, which confirms I can't get much beyond 200 MB/s. But why?
Code:
sudo diskinfo -tv da0
da0
        512             # sectorsize
        8001563222016   # mediasize in bytes (7.3T)
        15628053168     # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        972801          # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        SEAGATE ST8000NM0075    # Disk descr.
        ZA10P89D0000R622H5ZQ    # Disk ident.
        No              # TRIM/UNMAP support
        7200            # Rotation rate in RPM
        Not_Zoned       # Zone Mode

Seek times:
        Full stroke:      250 iter in   4.839723 sec =   19.359 msec
        Half stroke:      250 iter in   3.385431 sec =   13.542 msec
        Quarter stroke:   500 iter in   4.759106 sec =    9.518 msec
        Short forward:    400 iter in   2.502540 sec =    6.256 msec
        Short backward:   400 iter in   1.919993 sec =    4.800 msec
        Seq outer:       2048 iter in   0.132948 sec =    0.065 msec
        Seq inner:       2048 iter in   0.202820 sec =    0.099 msec

Transfer rates:
        outside:       102400 kbytes in   0.470930 sec =   217442 kbytes/sec
        middle:        102400 kbytes in   0.565274 sec =   181151 kbytes/sec
        inside:        102400 kbytes in   0.946842 sec =   108149 kbytes/sec

I'm kinda at a loss here. Why so slow? Where should I even begin to look?
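
In case it helps, here's a minimal sketch of the kind of raw sequential-read timing I could run (as root) to cross-check diskinfo; the device path, chunk size and total below are just placeholders:
Code:
# Quick-and-dirty raw sequential read timing, to cross-check diskinfo.
# Run as root; /dev/da0 and the sizes below are placeholders.
import os, time

DEV = "/dev/da0"
CHUNK = 1024 * 1024        # 1 MiB per read, a multiple of the 512 B sector size
TOTAL = 1024 * CHUNK       # 1 GiB from the start of the disk (outer tracks)

fd = os.open(DEV, os.O_RDONLY)
try:
    done = 0
    start = time.monotonic()
    while done < TOTAL:
        buf = os.read(fd, CHUNK)
        if not buf:
            break
        done += len(buf)
    elapsed = time.monotonic() - start
finally:
    os.close(fd)

print(f"{done / 2**20:.0f} MiB in {elapsed:.2f} s = {done / 2**20 / elapsed:.1f} MiB/s")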
 
So the drives are in a RAID setup? If so, there's a small overhead from the RAID.
Without an elaboration of what 'slow' means: do you notice a slowdown vs. a single spinning disk or a ZFS RAID in some way?
 
So the drives are in a RAID setup? If so, there's a small overhead from the RAID.
Without an elaboration of what 'slow' means: do you notice a slowdown vs. a single spinning disk or a ZFS RAID in some way?
No RAID, JBOD style, as god intended. Every disk is separate. FWIW, RAID0 is slower, and ZFS is slower still.

I never even mentioned RAID. What would give that idea? :)
 
Well, I guess my expectations have been miscalibrated... The spec sheet for the disk suggests a best sustained throughput of 249 MB/s, so suddenly 180 MB/s no longer looks entirely unreasonable, sigh. My inner meter has been completely off when it comes to hardware, a topic I've only recently had to dive into. Everything trips me up: network performance, bandwidth and speed, memory consumption, CPU bottlenecks, spinning drives, SSDs, etc. I have no intuition I can rely on at all, and what I do have has been proven wrong time and again.
 
There is a series of bottlenecks. First, the interface itself, which is about 600 MByte/s. That's the speed from the computer to the RAM buffers on the disk. But the RAM buffers can't push data through the head and onto the platter at that rate. That bottleneck depends on the position on the platter; it's higher at the outside edge of the disk (the outer tracks are longer, so more bits pass under the head per revolution and can be read or written faster). That's the theoretical 249 MByte/s from the disk spec.

But that speed only applies for the duration of a single track, while the head doesn't have to move. The moment you start shifting the head back and forth, the speed drops. Every head movement costs you a "settling time" while the head centers itself on the new track, which used to be around 1/2 ms; at ~250 MByte/s that's over 100 kB of lost transfer per move. And if your accesses are not purely sequential, you need to move the head a significant distance, which takes anywhere from 1 to 10 or 15 ms, and then you have to wait for the correct data to rotate around. That's why large-block random accesses (which is probably what your rsync is really doing) drop significantly. With small files, the throughput of a disk can drop to around 1 MByte/s.
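
To put some rough numbers on that (all figures are assumptions, loosely taken from the diskinfo output above):
Code:
# Back-of-the-envelope numbers for why head movement kills throughput.
# All figures are assumptions loosely based on the diskinfo output above.
seq_rate = 217e6            # ~217 MB/s sequential on the outer tracks
settle   = 0.0005           # ~0.5 ms settling time after a track change
seek     = 0.008            # ~8 ms average seek
rotation = 60 / 7200 / 2    # ~4.2 ms average rotational latency at 7200 RPM

# Data "lost" while the head settles after a move:
print(f"settling cost: ~{seq_rate * settle / 1e3:.0f} kB per move")

# Effective rate when every 1 MB chunk needs a full seek plus rotation:
chunk = 1e6
t = seek + rotation + chunk / seq_rate
print(f"1 MB random chunks: ~{chunk / t / 1e6:.0f} MB/s")

# And with small 4 kB random reads:
chunk = 4e3
t = seek + rotation + chunk / seq_rate
print(f"4 kB random reads: ~{chunk / t / 1e6:.2f} MB/s")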
 
what I have has been proven wrong time and again
I'm going on the same journey of discovery. SATA2, SATA3, SAS, M.2, SSD, NVMe etc. And things like Samsung's "TurboWrite" (just another cache). And then UFS, ZFS, hardware RAID (and the different levels and different caches etc.)

I always thought 6 Gb/s was super-fast, but now I know it's "only" about 600 MB/s, which isn't that much when you've got tens of gigs of data.
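
If I've understood the line coding right, that factor of 10 is the whole story: at 6 Gb/s the wire carries 10 bits per data byte because of 8b/10b coding, so:
Code:
# 6 Gb/s SATA/SAS line rate with 8b/10b coding: 10 raw bits per data byte.
line_rate_bits = 6e9        # 6 Gbit/s on the wire
bits_per_data_byte = 10     # 8 data bits + 2 coding bits (8b/10b)
print(line_rate_bits / bits_per_data_byte / 1e6, "MB/s")   # -> 600.0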

And then the CPUs have boost modes, multiple cores, etc., so that can skew what you think you are measuring.

Lots to learn! :)
 
Ha! I recently had to re-learn that a byte can have more than 8 bits... Totally forgot about padding bits.
Vladilen, did you insert the I/O scheduler gsched(8)? It will not speed up sustained sequential access (a single I/O client), but it gives a significant boost for concurrent access patterns, especially random access, and also for concurrent mixed access patterns. Of course, if your workload is one client per disk, it does not help.
 
Your rotating disk is actually pretty fast compared to my SSD; the seq. outer/inner numbers are of the same magnitude! If you need speed (foremost: random access), go 15k RPM disks -> SSD -> NVMe devices.
You wrote: RAID0 is slower. That means either it was misconfigured, or your workload is so special that the overhead of striping is non-negligible.
Besides caching, which usually does not help with constant sustained writes, the only way to speed up rotating devices is striping, which fills the time spent repositioning the heads with useful work. Since you don't want that, you'll have to live with the disk's physical limits.
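
As a rough illustration of what striping could buy (per-disk rate taken from the diskinfo run above; the link budgets are just ballpark assumptions for an 8-phy 6 Gb/s HBA in a PCIe 3.0 x8 slot):
Code:
# Idealized sequential throughput of striping N spindles, capped by the links.
# Assumptions: per-disk rate from the diskinfo output above; link budgets are
# rough ballparks that ignore protocol overhead.
per_disk = 180              # MB/s sustained per spindle (middle of the platter)
sas_cap  = 8 * 600          # MB/s, eight 6 Gb/s phys
pcie_cap = 8 * 985          # MB/s, PCIe 3.0 x8 (~985 MB/s per lane)

for n in (2, 4, 8):
    ideal = n * per_disk
    print(f"{n} spindles striped: ~{min(ideal, sas_cap, pcie_cap)} MB/s ideal")
# Even with all 8 spindles, the disks, not the links, remain the bottleneck here.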
 
Ha! I recently had to re-learn that a byte can have more than 8 bits... Totally forgot about padding bits.
Vladilen, did you insert the I/O scheduler gsched(8)? It will not speed up sustained sequential access (a single I/O client), but it gives a significant boost for concurrent access patterns, especially random access, and also for concurrent mixed access patterns. Of course, if your workload is one client per disk, it does not help.
I hadn't even heard of this scheduler, but now that I think about it, I shouldn't be surprised. I haven't tried it. IIUC it arbitrates between multiple clients talking to the same disk, which isn't quite my load, unless you count the part of the algorithm that can be sped up with multiple threads. In the latter case I guess you might have "multiple clients" for the same drive... kind of. It may be worth testing. Thanks for mentioning it.
 
Your rotating disk is actually pretty fast compared to my SSD; the seq. outer/inner numbers are of the same magnitude! If you need speed (foremost: random access), go 15k RPM disks -> SSD -> NVMe devices.
You wrote: RAID0 is slower. That means either it was misconfigured, or your workload is so special that the overhead of striping is non-negligible.
Besides caching, which usually does not help with constant sustained writes, the only way to speed up rotating devices is striping, which fills the time spent repositioning the heads with useful work. Since you don't want that, you'll have to live with the disk's physical limits.
I shouldn't have brought that up! I have so many "experiments" going that I sometimes mix things up. I've only tried putting SSDs in RAID0 and in a ZFS "kinda RAID0", and the first was slower, the second slower still; that's what I described in the other threads. And even then, the RAID0 with `gstripe` only striped two SSDs, because I only had two pairs of the same size. I'm almost sure that if I put my 8 spinners in RAID0 I'd see a boost. I haven't done it simply because I actually need those drives for long-term storage, while the SSDs are kind of a scratch pad.

We could actually run a cool experiment on a 24-drive SuperMicro I have here by striping all of them :) But that beast is so crazy loud I can't imagine having it on for any sustained length of time. In fact, I'll be returning it soon; it's proven not to be up to the task anyway.
 
I hadn't even heard of this scheduler, but now that I think about it, I shouldn't be surprised. I haven't tried it. IIUC it arbitrates between multiple clients talking to the same disk, which isn't quite my load, unless you count the part of the algorithm that can be sped up with multiple threads. In the latter case I guess you might have "multiple clients" for the same drive... kind of. It may be worth testing. Thanks for mentioning it.
IIRC I mentioned that in an e-mail? You can find the service script to insert gsched(8) in the thread "Userland Programming..." -> "Useful scripts", last page IIRC. I should really dive into the ports(7) maintainer stuff and make that my first port...
 
More importantly, it's in a PCIe x8 slot. Verify that the slot isn't set to x4 or even lower; you can often change this in the BIOS of the machine itself.
Good point. This is, however, a very much Dell-proprietary (IIUC) placement of the card. It's not even a card you insert into a PCIe slot; it's their own fancy concoction, based on some LSI chip, that you attach flat (hat style?) to a special spot on the motherboard :) Saves you a PCIe slot, I suppose.
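
That said, I suppose I can still check what link width actually got negotiated from within FreeBSD. A rough sketch (I'm assuming pciconf -lc prints the PCI-Express capability with a "link xN(xM)" field, negotiated vs. maximum; reading the full output by eye works just as well):
Code:
# Rough check of negotiated PCIe link widths via pciconf(8).
# Assumption: `pciconf -lc` output contains lines like "... link x8(x8) ...".
import subprocess

out = subprocess.run(["pciconf", "-lc"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "link x" in line:
        print(line.strip())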
 