L(q) for the zroot is around 10-20. Is that a lot?
Depends. As SirDice said: for a system that is trying to optimize the throughput of each disk, and whose disks are spinning disks (not SSDs), that is not a lot. The underlying reason is this: disks are one of the few things in the universe that "work better" (get more efficient) the more overloaded they are. That's because if you give the IOs to the disk one at a time, the disk has to move the head to the correct track for each IO, and then wait for a partial rotation for each of them. This is inherently slow, since these are mechanical processes: a seek takes on average 10ms, and for a normal 7200 RPM disk, waiting half a rotation takes on average another ~4ms. Compared to that, the actual read/write time is small, typically a fraction of a ms up to about 10ms (for the largest IOs that occur in practice, about 2MB).
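As a rough sanity check, here is that arithmetic as a tiny Python sketch. The seek time, rotation speed, and the 200 MB/s media rate are assumptions picked to match the round numbers above, not measurements from any particular disk:

```python
# Back-of-the-envelope service time for ONE random IO handled in isolation.
# All constants are assumptions matching the rough numbers in the text.
AVG_SEEK_MS = 10.0                       # typical average seek, 7200 RPM disk
RPM = 7200
HALF_ROTATION_MS = 0.5 * 60_000 / RPM    # ~4.2 ms rotational latency
MEDIA_RATE_MB_S = 200.0                  # assumed sequential transfer rate

def per_io_ms(io_size_mb: float) -> float:
    """Seek + half rotation + transfer time for one IO, in milliseconds."""
    transfer_ms = io_size_mb / MEDIA_RATE_MB_S * 1000.0
    return AVG_SEEK_MS + HALF_ROTATION_MS + transfer_ms

for size_mb in (0.004, 0.128, 2.0):      # 4 KB, 128 KB, 2 MB IOs
    t = per_io_ms(size_mb)
    print(f"{size_mb * 1024:7.0f} KB IO: ~{t:5.1f} ms  -> ~{1000.0 / t:4.0f} IOPS")
```

With these inputs a small random IO costs roughly 14 ms, which is why a disk served one IO at a time tops out at well under 100 IOPS.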
Now in contrast, if you give the disk 20 IOs at once, and tell it "do them in any order, just be efficient about it", the disk will make a map of all the IOs that need to be done (on the platter), and find a path that minimizes seek distances and rotational waits. This typically improves the total throughput of the disk by roughly a factor of 2 or 3. So yes, overloading a disk (giving it lots of IOs to do, meaning a large L(q) = queue length) makes it MUCH faster, in the sense that each individual IO takes about half the time.
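To make the reordering idea concrete, here is a deliberately simplified sketch: it services 20 random track positions either in arrival order or sorted into one sweep, and only counts seek distance. Real firmware also weighs rotational position and transfer time, which is why the end-to-end gain is closer to the factor of 2-3 above rather than the larger seek-only improvement this toy model prints:

```python
# Toy model: FIFO order vs. one sorted sweep (a simplified elevator scheduler).
# Only seek distance is counted; rotation and transfer time are ignored.
import random

random.seed(1)
tracks = [random.randint(0, 100_000) for _ in range(20)]   # 20 pending IOs

def total_seek_distance(order, start=0):
    """Sum of head movements when servicing 'order' starting from track 'start'."""
    pos, dist = start, 0
    for track in order:
        dist += abs(track - pos)
        pos = track
    return dist

fifo = total_seek_distance(tracks)
swept = total_seek_distance(sorted(tracks))
print(f"FIFO seek distance   : {fifo}")
print(f"Sorted sweep distance: {swept}")
print(f"Improvement          : {fifo / swept:.1f}x less head movement")
```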
There is a price to pay: These 10 or 20 IOs were probably submitted by 10 or 20 different programs / processes / threads, which may or may not be related to each other. If you are a human user or a program/process/thread that is waiting for any one IO to finish, you don't know where in the queue that IO is, and you may have to wait on average for 5 to 10 other IOs to get done first.
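Roughly speaking, your wait is your IO's position in the queue times the (now shorter) per-IO time. A hedged illustration with made-up numbers:

```python
# Illustrative only: both numbers below are assumptions, not measurements.
per_io_ms = 5.0       # assumed per-IO time once the disk can reorder a full queue
queue_depth = 20      # assumed number of outstanding IOs (L(q))

avg_position = queue_depth / 2    # on average your IO sits mid-queue
print(f"average wait : ~{avg_position * per_io_ms:.0f} ms")
print(f"worst case   : ~{queue_depth * per_io_ms:.0f} ms (your IO is last in line)")
```

So even though each IO got cheaper, the latency any single waiter experiences can easily be several times the bare per-IO service time.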
The real question here is this: What are these 10-20 IOs in the queue? If a few are coming from a latency-critical interactive workload, while most are from a throughput-sensitive batch processing workload, then you can see how the interactive user will quickly become unhappy (or at least impatient).
I'd say on average about 50-60ms.
That is awfully long. A normal spinning disk should on average take 10ms per IO (reads perhaps a little faster), maybe twice as much for very large IOs (MB or so). SSDs should take about 1ms per IO. I wonder where that 50-60ms per IO comes from.
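One hedged way to cross-check a number like that is Little's Law: average time in the system equals the average number of outstanding IOs divided by the completion rate. Whether that accounts for this particular 50-60ms depends on where in the stack the tool measures latency, so the figures below are placeholders, not numbers from the original output:

```python
# Little's Law: W = L / lambda (time in system = outstanding IOs / completion rate).
# Placeholder inputs only; plug in the L(q) and reads+writes per second you observe.
outstanding_ios = 15      # e.g. an L(q) somewhere in the 10-20 range
iops = 300.0              # assumed completions per second (reads + writes)

latency_ms = outstanding_ios / iops * 1000.0
print(f"expected average latency: ~{latency_ms:.0f} ms")   # ~50 ms with these inputs
```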
I get that it's optimized for throughput. What's the way to optimize it for latency?
That is REALLY hard to do in general. On FreeBSD, the only thing to adjust is the nice parameter. Linux has way more knobs to tune, but unless you put weeks of effort into it (BTDT), the results will be bad or mixed. And heavily tuning your IO subsystem for one workload tends to screw over other workloads.
The other thing which makes this hard is that you need to be super clear about your goals: Are you optimizing to minimize the latency for ONE workload, ignoring all other workloads? Or maximizing the throughput for ONE workload (independent of latency, which only makes sense if that workload is multi-threaded or parallel, but most important workloads are these days), while ignoring all the others? Or do you need some combination of "good for one, decent for the others"? Or are you optimizing $/<some metric>, spending the least amount of money that still meets your SLOs?
For amateur operations, the best way to perform this optimization is to run only one workload at a time, and banish "unimportant" or "throughput-only" workloads to off hours (like the middle of the night).