kern.sched.quantum: Creepy, sadistic scheduler

Occasionally I noticed that the system would not quickly process the tasks I need done, but would instead prefer other, long-running tasks. I figured it must be related to the scheduler, and decided it hates me.

A closer look shows the behaviour as follows (single CPU):

Let's run an I/O-active task, e.g. a postgres VACUUM that continuously reads from big files (while doing compute as well [1]):
Code:
pool        alloc   free   read  write   read  write
cache           -      -      -      -      -      -
  ada1s4    7.08G  10.9G  1.58K      0  12.9M      0
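(The I/O numbers here and below are zpool iostat output; the exact invocation isn't shown, but something like this, sampling once per second, gives that per-vdev layout - the pool name is a placeholder:)
Code:
zpool iostat -v <poolname> 1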

Now start an endless loop:
Code:
# while true; do :; done

And the effect is:
Code:
pool        alloc   free   read  write   read  write
cache           -      -      -      -      -      -
  ada1s4    7.08G  10.9G      9      0  76.8K      0

The VACUUM gets almost stuck! This matches the WCPU column in "top":

Code:
  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
85583 root        99    0  7044K  1944K RUN      1:06  92.21% bash
53005 pgsql       52    0   620M 91856K RUN      5:47   0.50% postgres

Hacking on kern.sched.quantum makes it quite a bit better:
Code:
# sysctl kern.sched.quantum=1
kern.sched.quantum: 94488 -> 7874

pool        alloc   free   read  write   read  write
cache           -      -      -      -      -      -
  ada1s4    7.08G  10.9G    395      0  3.12M      0

  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
85583 root        94    0  7044K  1944K RUN      4:13  70.80% bash
53005 pgsql       52    0   276M 91856K RUN      5:52  11.83% postgres
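(Side note: if one wanted to keep such an override across reboots, the usual place would be /etc/sysctl.conf - whether that is advisable is exactly what the questions below are about; the value is simply the one the clamping above produced:)
Code:
# /etc/sysctl.conf - hypothetical persistent override, not a recommendation
kern.sched.quantum=7874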


Now, as usual, the "root-cause" questions arise: What exactly does this "quantum" do? Is this setting just a workaround, i.e. is something else actually wrong, and does it have trade-offs in other situations? Or otherwise, why was such a default value chosen, which appears to be ill-conceived?

The docs for the quantum parameter are a bit unsatisfying - they say it's the maximum number of ticks a process gets - but what happens when they're exhausted? If by default the endless loop is actually allowed to continue running for 94k ticks (or 94 ms, more likely) uninterrupted, then that explains the perceived behaviour - but that's certainly not what a scheduler should do when other processes are ready to run.
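Just to make the units explicit - this is my reading, assuming the sysctl value is in microseconds and is kept internally in stat-clock ticks of 1/127 s (stathz=127 is an assumption, but the numbers fit):
Code:
# all numbers from the outputs above; stathz=127 is my assumption
echo "scale=3; 94488 / 1000" | bc           # -> 94.488 (default quantum in ms)
echo "scale=2; 94488 * 127 / 1000000" | bc  # -> 11.99  (i.e. ~12 stat-clock ticks)
echo "1000000 / 127" | bc                   # -> 7874   (one tick; matches the clamped value above)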

11.1-RELEASE-p7, kern.hz=200. Switching tickless mode on or off does not influence the matter, nor does starting the endless loop with "nice".
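(For the record, these are the knobs in question - sysctl names as they exist on 11.x, as far as I know:)
Code:
sysctl kern.hz                    # tick rate (200 here)
sysctl kern.sched.quantum         # the quantum in question
sysctl kern.eventtimer.periodic   # 0 = tickless (one-shot), 1 = periodic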


[1]
A pure-I/O job without compute load, like "dd", does not show this behaviour. Also, when other tasks are running, the unjust behaviour is not so strongly pronounced.
 
Your question is almost as informative as an answer!

I am not experienced enough to give you an answer to this by heart.
But I may suggest you take a look at "The Design and Implementation of the FreeBSD Operating System". There is a description in it of how the scheduler works. [I just skimmed over it, but I have an old release of the book (5.2), so whatever I say might be outdated.]

Bye
 
Hi Nicola!

Your question is almost as informative as an answer!

Thanks. But that is only the "observation" part. There is another part: if this perceived behaviour is indeed "as designed" for the defaults, then it is not okay for a time-sharing system.

I am not experienced enough to give you an answer to this by heart.

Oh, we can take a closer look at the data. Let's suppose that the default kern.sched.quantum=94488 is indeed the time-slice for a compute job in µs. Then we get 10.58 such slices per second.
And we see the read throughput of postgres being 9 reads per second, transferring ~75 kB/s. The postgres block size is 8k - I'd say that math figures.
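Spelled out with the numbers from above (same assumption that the quantum is given in µs):
Code:
echo "scale=2; 1000000 / 94488" | bc   # -> 10.58 quanta per second
echo "9 * 8" | bc                      # -> 72 KB/s at 9 reads/s of 8k each, close to the ~75 kB/s seen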

So, we know with high probability what happens: the piglet (aka the endless loop) is allowed to run for 95 ms, then postgres is allowed to do one single I/O, and then the piglet runs again for 95 ms.
But then - I would say this is unacceptable.

So, then comes the more difficult part: sit down with the code and fix it.

But I may suggest you take a look at "The Design and Implementation of the FreeBSD Operating System". There is a description in it of how the scheduler works.

Yepp, I know. That book is very good. But it probably describes the old scheduler. The scheduler was replaced somewhere around release 7. For the new scheduler there is a presentation from BSDCon 09 which is difficult to find and utterly useless, and there is a paper on design criteria from BSDCon 03: "ULE: A Modern Scheduler for FreeBSD".

The matter is a different one: this thing seems to have worked the way it does for almost ten years now, and people seem happy with it. So, at first, I would like to ask: why? Therefore I posted this to the hackers list - but sadly, the instrumentality there decided to delete the post.

Then I found a VERY lengthy (and mind-boggling) old discussion from 2011, and somewhere deep in this heap somebody describes:
With nCPU compute-bound processes running, with SCHED_ULE, any other process that is interactive (which to me means frequently waiting for I/O) gets ABYSMAL performance -- over an order of magnitude worse than it gets with SCHED_4BSD under the same conditions. -- https://lists.freebsd.org/pipermail/freebsd-stable/2011-December/064984.html

This quite exactly describes my observation. So, I wonder what has been done about this issue since 2011.
 
This topic you raised is very interesting.

I am especially interested in this because my intention is to move my fintech startup to FreeBSD. And my application is based on PostgreSQL!

I can't fork onto this topic right now; I am already pulling out all of my hair over financial calculations, a big^2 mess.

Anyway, I hope I can give a hint on some available fresh doc to study ;)
There is a 2014 edition of "The Design and Implementation of the FreeBSD Operating System". There you may hopefully find updated info!

If you discover something, let us know!
 
Hi Nicola,

I'd think it is highly unlikely that this could practically influence an e-biz server (and even if it did, there seem to be workarounds) - so just relax and don't worry about that. It might influence certain scientific number-crunching setups, and even then, it seems to depend on the CPU being weaker than the I/O, while on most contemporary implementations it's the other way round.

Nevertheless, Unix is designed in a way that it can cope with almost all combinations of hardware, even very unbalanced ones (as long as they work reliably) - so if this is not the case somewhere, it may point to a design weakness that should be discussed. And in any case, we get a chance to look a bit deeper into the underlying technology...

Greetings and wish You Success!
P.
 
I wonder if it comes from the fact that postgres is using select() to sleep [1] rather than usleep(). ULE boosts processes that opportunistically sleep(); I don’t know if that extends to select(), but they are different beasts, so I wonder if that might be part of the problem.

You don’t have vacuum_cost_delay set to something non-zero, do you?
 
I wonder if it comes from the fact that postgres is using select() to sleep [1] rather than usleep(). ULE boosts processes that opportunistically sleep(); I don’t know if that extends to select(), but they are different beasts, so I wonder if that might be part of the problem.

You don’t have vacuum_cost_delay set to something non-zero, do you?

No. But we can exclude postgres completely from the issue; I can reproduce it as well with something else that continuously reads files *and* needs some compute power, e.g. "lz4" instead of the postgres VACUUM.
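(The exact lz4 invocation doesn't matter; something along these lines - file name made up - fits what I mean, since it both reads continuously and needs CPU:)
Code:
# hypothetical example; any sufficiently large file will do
lz4 -c /some/bigfile > /dev/null &
# and the piglet again:
while true; do :; done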

BTW: I started a discussion of the matter here also: https://lists.freebsd.org/pipermail/freebsd-stable/2018-April/088678.html
 