What is sysctl kern.sched.preempt_thresh

Hi

I have been wondering what sysctl kern.sched.preempt_thresh actually does. My guess is it is something about system throughput vs. interactivity. In all the guides I see this:

sysctl kern.sched.preempt_thresh=224
Default: 80

But it accepts other values, and the number people keep writing, 224, is a bit odd (my guess: 7*32). Why not 256? For me, higher numbers give an even more responsive desktop system. I have tried different numbers, and it accepts anything from 1 up to some very high value.

I am worried that everyone has been copying the same number for ages without questioning it. For now I have set sysctl kern.sched.preempt_thresh=384 (256*1.5), and the system is even more responsive with that.
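
In case anyone wants to experiment along, this is how I am trying values: change it on the running system and, when I am happy with one, put it in /etc/sysctl.conf so it survives a reboot (384 is just the value I am testing right now):

Code:
# set it on the running system (as root)
sysctl kern.sched.preempt_thresh=384

# check the current value and the kernel's one-line description
sysctl kern.sched.preempt_thresh
sysctl -d kern.sched.preempt_thresh

# keep it across reboots
echo 'kern.sched.preempt_thresh=384' >> /etc/sysctl.conf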

Have a nice day
 
Most of those values are a trade-off.
Set it low when you are running a database server.
Set it high when you want to play YouTube videos.
Set it somewhere in the middle when you want to do both.

I think with a higher value your CPU will do more context switching, and each context switch wastes useful CPU time.
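
If you want to see how much context switching is actually going on while you try different values, vmstat can show it; the cs column (under the faults heading) is the number of context switches per interval. Just a rough way to compare settings under the same workload:

Code:
# print one line per second; watch the "cs" column under "faults"
vmstat -w 1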
 
Thank you

It is the same thing I see. But what does it actually do? Time slices, a value for the maximum time per thread?

The strangest thing is that every guide for a very long time has suggested the value 224 without questioning it. That suggests it is not optimal for most systems. I am using my system for sound production, so a high value is most likely good.
 
Note, as root you can also give individual processes RT characteristics.
I would use rtprio rarely, mostly because on a well-optimized system it very often just moves the problem somewhere else. But giving the main program a slightly higher priority than the default, and stopping unneeded services, might be beneficial.
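
For reference, this is roughly how that looks (the program name and the PID here are only placeholders):

Code:
# start a program with realtime priority 10 (0 is the highest, 31 the lowest rtprio)
rtprio 10 someprogram

# or change an already running process by PID
rtprio 10 -p 1234

# put it back on normal timeshare scheduling
rtprio -t -p 1234

# idprio works the same way, for idle-only priorities
idprio 20 somebatchjob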
 
So it is time slicing? How are the values calculated, and what are meaningful values? Probably nothing extreme (but it accepts 1 and values over 12000).
Unfortunately, I'm not familiar with the FreeBSD code, so I can't answer with 100% accuracy. The command sysctl -d kern.sched.preempt_thresh gives: kern.sched.preempt_thresh: Maximal (lowest) priority for preemption. My guess is: the lower the value you are comfortable with, the better. If you set it too high, switching between tasks will probably become rarer, which can have the opposite effect from what you expect. I would prefer a value below 500. Maybe the right value also depends on the number of processes?
 
Not really time slicing; preemption. As in "process A is running, process B needs to run, the scheduler does some math and some checking and says: yes, B can start running, so I'll stop A and start B".
Time slicing, in general/simple terms, is more of "every process gets at least X amount of time to run".
The ULE scheduler code is the definitive source for how things are calculated and exactly what the different sysctls do, but scheduler code is typically non-trivial to understand.

I agree with thindil and his "guess". I don't have a feel for what values are good or bad, but I think they also depend greatly on your specific use case/workload. A system running a desktop so you can browse the web is a lot different than a server system spewing the content for the YouTube videos you are watching.

Another thing to keep in mind is the hardware. Hardware capability typically runs ahead of the software: faster CPUs, more and faster RAM, and the software has to catch up to fully utilize them, so the defaults in the software may be good for hardware from a few generations ago but merely OK for the current generation. The value of 224 has been around for a while as a recommendation for desktop use, so it likely dates from 3 or 4 hardware generations ago, and maybe bumping it is appropriate. If you look at the mailing lists there have been discussions about whether 0 is a better default value.
 
Would 0 not mean the scheduler is almost out of the way? That does not sound healthy, but it is certainly responsive. I tried it for a few minutes.
 
That is why I want someone to explain how it is calculated, because otherwise it is just guessing, and of course to test the result with different values. The try-out with 0 was interesting.
 
It somehow reminds me of swappiness on Linux. For desktop use they recommend a very low swappiness (10 or less), but the system is just as responsive with healthier values like 25, 40 and even 80. The worst for desktop is the default of 60. Right now I am trying kern.sched.preempt_thresh=40 and the system seems to behave well.
 
A link to one of the emails about a value of 0. It gets a bit technical, but the folks involved in the email are pretty knowledgeable about the kernel.

To me this is a key paragraph on the value and what it does.
> Is PRI_MIN_KERN=80 really a good default value for the preemption threshold?
Yeah, a good question... I am not really sure about this. In my opinion it would be better to set preempt_thresh to at least PRI_MAX_KERN, so that all threads running in kernel are allowed to preempt userland threads. But that would also allow kernel threads (with priorities between PRI_MIN_KERN and PRI_MAX_KERN) to preempt other kernel threads as well, not sure if that's always okay. The same argument applies to higher values for preempt_thresh as well.

To my knowledge this type of discussion has been around forever, on all systems I've ever worked with.
Too much depends on a specific use case (hardware, how many processes, what the user expects) to say "This is the best number".
 
I agree fully with the last part. Different systems need different configurations. That is why I started this thread, questioning the 224. Maybe the extremes (0 or 255) are OK in some cases (they seem to make a fast desktop), but I would like to use less extreme values until I fully understand what is going on. And many servers are probably fine with 80.
 
If you are using a USB mouse, take a look at your Xorg.0.log (over in /var/log). Check it for messages like:
[3079719.510] (EE) event4 - SIGMACHIP Usb Mouse, class 0/0, rev 1.10/1.10, addr 32: client bug: event processing lagging behind by 18ms, your system is too slow

and occasional mouse glitches.
A lot of people report them. I'm not sure if it's an X server issue, a USB infrastructure issue or a sysmouse issue, or whether it's even "real", but it would be interesting to see if the frequency of the messages or the "lagging behind by..." value changes.
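
Something like this is enough to see how often they show up and what lag values are reported (adjust the path if your log lives elsewhere):

Code:
# how many of those messages are in the log
grep -c "lagging behind" /var/log/Xorg.0.log

# show the most recent ones with the "lagging behind by ...ms" values
grep "lagging behind" /var/log/Xorg.0.log | tail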
 
I have no such problems. I have looked :) My mouse is a bit special as it is wireless with the same receiver as the keyboard (a nice Logitech). All this is not because my machine is slow, but because I like it to run so that it feels fast. And I am always trying to understand what my configuration is doing. I do not like black magic :).
 
After a little more testing I think a good value for kern.sched.preempt_thresh is 151. It is also the lowest value the author suggested for interactivity. I suggest you try it out. It also makes sure that not all processes run with high priority, so it reduces general load if the previous value was 223 :)


151   PRI_MAX_INTERACT   Time-share threads in the interactive category.
                         ULE puts threads with these priorities onto a real-time queue.
                         Distinguished value: PUSER (120, PRI_MIN_TIMESHARE)
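
To try it out, it is just the sysctl again. The names next to the numbers in the comment are the landmarks from the chart and this thread as I understand them; double-check them against sys/sys/priority.h on your version:

Code:
# landmarks on the 0..255 priority scale mentioned in this thread:
#    80  PRI_MIN_KERN       (the stock default for preempt_thresh)
#   120  PUSER / PRI_MIN_TIMESHARE
#   151  PRI_MAX_INTERACT   (top of ULE's interactive range)
#   224  PRI_MIN_IDLE       (start of the idprio range; the popular "224")
sysctl kern.sched.preempt_thresh=151
sysctl kern.sched.preempt_thresh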

 
I think it's going to wind up being a case of looking at the chart, thinking about it and trying different values. I'm going to set mine to 171 and leave it for a bit (week or so) and use the systems, then set it to 151 and try again.

Thanks.
 
Let's write and share how it goes. The values must be tried on different systems; maybe they will behave differently.

My system:
i7-8700 (non-K), 48 GB RAM and a GTX 1080.

On other systems a value of 171 might be necessary.
 
In order to make sense of these values, you first need to understand how timesharing was originally designed. There was only one core and an HZ value (HZ being the timer interrupt frequency).
When the kernel starts a process, that process gets control and owns the CPU. There is no way to take the CPU away from the process until either
  • an interrupt occurs from somewhere, or
  • the process does a system call and thereby gives control back to the kernel.
When an interrupt occurs, the kernel will put the process on hold and service the interrupt. That is why the timer interrupt is important: otherwise, in the absence of device I/O, processes could run forever.
Occasionally the scheduler would look at the activity patterns of the processes and calculate a priority for each of them (visible in ps axlH). This priority is then used to decide on the next process to run, so that smooth interactive processing can coexist with compute-intensive tasks.
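
If you want to look at these priorities on your own box, the columns are right there in ps (the exact field list below is just an example):

Code:
# long listing including every thread; the PRI column is the priority
ps axlH | head -20

# or pick the fields explicitly
ps -ax -o pid,pri,nice,rtprio,comm | head -20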

The mechanism was not so well suited to multiprocessor systems, and therefore preemption was introduced. Preemption allows the scheduler to interrupt and switch processes much more often, without having to wait for the (rather expensive) timer interrupt, and even while they are executing in kernel mode.

Then there are three special cases: the idprio processes, the rtprio processes and the kernel processes. All of these have a fixed priority. Together, the user processes with their ad hoc calculated priority, plus these three kinds, form a contiguous scale of priorities from somewhere around -99 to +155 (a lower number means a higher priority), or normalized from 0 to 255 (you never know which one you're currently dealing with), as follows (give or take one or two; if you need the exact numbers, figure them out yourself):
Code:
  -100  ..  -52   interrupt processing
   -51  ..  -21   rtprio tasks
   -20  ..  +19   kernel tasks
   +20  .. +120   user processes (dynamically assigned)
  +124  .. +154   idprio tasks
           +155   the idle process

The kern.sched.preempt_thresh is then the cutoff value in that sequence beyond which you do not allow preemption (but expressed on the internal 0..255 scale).

So, 224 is basically the starting point of the idprio scale. The code itself has three default values: normally 80, with FULL_PREEMPTION 255 (which afaik does not work well), and without PREEMPTION 0 (which didn't work at all for me). Somewhere it came out that 162 might be a good value, so I am running with that (but I forgot the detailed reason).
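
(Those compile-time defaults come from the kernel configuration; from memory the relevant options are the ones below, but check sys/conf/NOTES for your version:)

Code:
# in the kernel config file
options PREEMPTION        # enabled in GENERIC; gives the normal default of 80
options FULL_PREEMPTION   # makes the default 255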

Now for the logic (for the non-native English speakers): to preempt somebody means to jump the queue ahead of them.
So, if our own priority number is smaller than this threshold (meaning we have higher priority), then we are allowed to preempt others.
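
As a toy illustration of that rule (this is not the kernel code; the real decision lives in sys/kern/sched_ule.c and has more special cases, this is just the comparison from the previous paragraph written out):

Code:
#!/bin/sh
# may_preempt NEWPRI CURPRI - toy model of the threshold check,
# using the internal 0..255 scale (lower number = higher priority)
thresh=$(sysctl -n kern.sched.preempt_thresh)
may_preempt() {
    new=$1; cur=$2
    if [ "$new" -lt "$cur" ] && [ "$new" -le "$thresh" ]; then
        echo "priority $new may preempt priority $cur (threshold $thresh)"
    else
        echo "priority $new has to wait for priority $cur (threshold $thresh)"
    fi
}
may_preempt 84 130    # e.g. a kernel thread vs. a user process
may_preempt 180 130   # a lower-priority thread never preempts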

If this then does any good - well, that depends....

> I am worried about everyone copying the same number for ages without questioning.
And this is exactly the problem with the scheduler. From what I figured, the ULE scheduler was written because the old one was just unsatisfying for SMP. It was written by a guy who had a great design idea (which it is); it was put into the system, it solved the imminent problems, and end of story. It was never really honed to the system, some of the tunables are rather lab-testing instrumentation, and there are corner cases where it behaves just badly (which are solvable).
I once talked to the author, and when he figured that I really wanted to go into the depths of the stuff, he immediately went U-boot (submerged and out of sight). Certainly he has found other interesting things to put his attention to.
 
Thank you for the detailed explanation. It seems you are using a value in the middle of the two, 151 and 171. That makes a sort of sense. I hope all of us know a bit more after this conversation :)
 
Thank you for this overview, both historical and current. I also found this https://people.freebsd.org/~jeff/FBSD_14_10ULE_aj.jm_cx10_09.pdf which explains almost the same things.
 