Other I/O queue and parallel operations

How can I find out, from the operating system, the maximum parallel queue depth a device can handle?
I mean, I want to discover how many operations I can enqueue on storage in parallel. So far I do this with iostat: I start the program, run a benchmark (e.g., bonnie++), and try to catch the maximum value in the qlen field:

Code:
% iostat -d -x -t da -w 2
...
device            kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b
ada0           70988.4     0    10     0    10    9  80

In the example above the queue is holding 9 requests.
In my opinion this is a poor approach because (i) I could miss the real peak value and (ii) I'm not sure the qlen field reflects the device queue rather than the kernel queue.
I suspect camcontrol is the right tool for this, but I think it just starts from sensible "default" values:

Code:
% sudo camcontrol tags da0 -v
...
(pass4:mpt0:0:0:0): mintags       2
(pass4:mpt0:0:0:0): maxtags       255


Suggestions?
 
To figure this out, you need to think through the whole storage software stack.

At the bottom is a disk device (spinning rust, SSD, or virtual disk created by an array controller below). Get the manufacturer's manual for the disk; it will have the information. I think for SCSI disks there is a mode page that shows the current maximum queue depth, but I don't remember which one. For SCSI disks as of ~10 years ago, 16 to 64 IOs per port is typical. For SCSI-connected disk arrays (RAID arrays), the number can be significantly bigger. For SATA disks, I don't know how to set or query the queue depth, and I think they typically handle 8 to 32. This is the layer you can query with camcontrol (or equivalently, on Linux with sg3_utils).
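On FreeBSD the per-device numbers live in the CAM layer, and camcontrol both reports and controls them. Roughly something like this (a sketch; the exact fields and limits depend on the device and controller):

Code:
# Report what CAM is currently willing to queue to the device (dev_openings, maxtags, ...)
% sudo camcontrol tags da0 -v
# Lower the number of simultaneous transactions CAM queues to da0, if the device struggles
% sudo camcontrol tags da0 -N 32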

Next layer up is the HBA. They typically don't store IO queues of their own. But they need to be able to manage the pending IOs of the disks connected to them. A typical limit for the LSI Logic (Broadcom, Avago) HBAs was 600 IOs a few years ago. To find those limits, you need to get documentation from the HBA vendor, and that is typically hard for an individual (large customers can reach directly into the vendor's engineering departments).

A lot of that limit comes from interaction between the HBA and the kernel's memory management: while an IO is in process on the device, the kernel and the HBA have to cooperate to pin the IO buffers, so they are reachable through PCI DMA from the HBA (and therefore the disk). With lots of outstanding IOs (say 600 IOs of 1MiB each), that means that hundreds of thousands of VM pages have to be pinned, and the kernel needs data structures for that.

If the kernel can't send an IO to the device (either because of a limitation of the device or the HBA, or because the kernel has configurable limits on how many IOs it wants to send), it holds it in a separate queue in the kernel. Again, the size of that queue is configurable. In Linux, that's done through the /sys file system, and I used to know exactly what the various settings mean. In FreeBSD, there are some sysctl settings for that. Note that this is a queue of IOs that have not yet been sent to the device; in addition, the kernel obviously keeps track of all IOs that are already pending on the device.
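I don't remember the exact FreeBSD sysctl names offhand; something along these lines will at least show the candidates (this is just a way to browse the tree, not specific tunables I'm vouching for):

Code:
# Browse the CAM-related sysctls and their descriptions for queue/tag knobs
% sysctl -d kern.cam | grep -i -e queue -e tags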

So far we have discussed the SCSI/SATA midlayer in the kernel, and all the hardware (disks and HBAs) below it. The next question is: what is using that IO subsystem? In most cases, people use the storage layer through a file system. File systems typically issue IOs either synchronously for user-space application requests (typically reads, sync writes, or metadata writes), or asynchronously for read-ahead and write-behind. Again, file systems have tuning parameters for how many asynchronous IOs they're willing to issue. For those, you need to read the UFS and ZFS sources (or better, read Kirk McKusick's book).
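For ZFS, as one example, the vdev queue limits are exposed as sysctls on FreeBSD. Assuming a reasonably recent OpenZFS, something like this shows how many IOs of each class it will keep outstanding per vdev:

Code:
% sysctl vfs.zfs.vdev | grep max_active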

Finally, there are applications that bypass built-in file systems and go directly to the disk layer. Those applications are typically either databases, custom file systems, or performance testers. Tuning the queue depth for those can be a little harder, since the typical queue management mechanisms are designed around the regular path through the file system.
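One hedged way to probe the effective depth empirically from that side is a raw-device tester such as fio, sweeping the queue depth and watching where throughput stops improving (device name and IO engine here are just assumptions; adjust for your system):

Code:
# Read-only random-read sweep against the raw device, queue depth 16
% sudo fio --name=qd-sweep --filename=/dev/da0 --rw=randread --bs=4k \
      --ioengine=posixaio --iodepth=16 --runtime=30 --time_based

Repeat with different --iodepth values and compare against the qlen column from iostat; the point where throughput stops scaling is roughly where something in the stack starts holding IOs back.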
 
For exactly this reason I wrote a little program, which reads files from a device in parallel (typically one thread per core) and calculates their hashes.

It is interesting to see how the operating system delivers data in parallel from the media, and to see when the CPU capacity is saturated with respect to the disk subsystem.
Using very light or hardware-accelerated hashes makes this essentially a test of actual parallelism capability. Of course it makes sense on NVMe and ramdisk systems, less so on SSDs, and not at all on magnetic disks.
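Reduced to a rough shell equivalent of what the program does (just a sketch, assuming FreeBSD's sha256 and one worker per core):

Code:
# Hash every file under /data with one xargs worker per core; discard the output,
# since we only care about how fast the data can be read and digested in parallel
% find /data -type f -print0 | xargs -0 -n 16 -P `sysctl -n hw.ncpu` sha256 > /dev/null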
 
At the bottom is a disk device (spinning rust, SSD, or virtual disk created by an array controller below). Get the manufacturer's manual for the disk; it will have the information. I think for SCSI disks there is a mode page that shows the current maximum queue depth, but I don't remember which one. For SCSI disks as of ~10 years ago, 16 to 64 IOs per port is typical. For SCSI-connected disk arrays (RAID arrays), the number can be significantly bigger. For SATA disks, I don't know how to set or query the queue depth, and I think they typically handle 8 to 32. This is the layer you can query with camcontrol (or equivalently, on Linux with sg3_utils).

This is an excellent explanation, but I was really wondering whether there is a way to query the storage layer (and hence the kernel) to get the queue size from the device itself. Assuming you don't have the manual for the device, or you are not given enough documentation, how can you tune the queue size (if there is any need to queue) without being nasty to the device? That's why I was thinking about iostat.
 
For exactly this reason I wrote a little program, which reads files from a device in parallel (typically one thread per core) and calculates their hashes.

Why not use an already established file system/disk benchmarking tool? I don't understand what value your own program adds.
 
As far as I know there is nothing like that (real-world benchmarking).
Of course maybe 1000 others do it much better.
This is not a benchmarking tool, but a "snapshot" archiver where I put the things that I need as a storage manager.
One of them is choosing hardware and OS based on the expected workload.
Is it fastest to read and process one file at a time (small or big?) with one thread, or is it better to read small and big files "in parallel" with N threads?

How much better?
And how much (if any) with HDD, SSD, NVMe, ramdisk?
On BSD? On Linux? On Windows? On NASes?
And how many cores saturate bandwidth and latency?
And hyperthreads?
CPU load?
ARC efficiency, and ARC compression?

And HT with HW-accelerated software?
Does it matter, or not, and how?
 
One of them is choosing hardware and OS based on the expected workload.

Correct, but I'm not sure I understand your reply here: even when you have chosen your stack to support your workload, you still need to tune it as well as possible. And in my opinion, you then need to benchmark it again to see whether the real results match expectations.
 
This is an excellent explanation, but I was really wondering whether there is a way to query the storage layer (and hence the kernel) to get the queue size from the device itself. Assuming you don't have the manual for the device, or you are not given enough documentation, how can you tune the queue size (if there is any need to queue) without being nasty to the device? That's why I was thinking about iostat.

I spent 10 minutes this morning looking for the SCSI mode page that shows the queue depth of SCSI devices. Strangely, I couldn't find it (in either the SCSI standard or Seagate's published SCSI documentation). The problem is that these documents are many hundreds of pages long, and searching for the term "queue" or "queueing" finds hundreds of matches. I'll do some more searching when I have time.
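In the meantime, one way to hunt for it on the device itself is to list and dump the mode pages the disk actually reports, instead of digging through the standard; with CAM, roughly:

Code:
# List the mode pages the device claims to support
% sudo camcontrol modepage da0 -l
# Dump an individual page (here the Control mode page, 0x0A) and inspect the fields
% sudo camcontrol modepage da0 -m 10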

The Linux kernel parameters can be found in /sys/block/.../queue/nr_requests (which then has to be tempered by various .../max_... parameters). I think there are a few other places, but right now I don't have a Linux system with physical (non-emulated) SCSI disks on it.
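For a concrete example (assuming a SCSI disk that shows up as sda):

Code:
# Kernel-side queue of requests not yet dispatched to the device
% cat /sys/block/sda/queue/nr_requests
# Queue depth used for the device itself; writable on most SCSI HBAs
% cat /sys/block/sda/device/queue_depth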
 