kernel support for heterogeneous cores

Has the kernel ever been optimized for architectures like big.LITTLE, DynamIQ, or even Intel P/E cores? I've noticed the RPi 5/Orange Pi 5 use the Cortex-A76, a successor to the A75, which introduced DynamIQ, but I'm not sure where FreeBSD stands with this type of architecture, or how well it performs. The M1 Mac Minis are getting pretty cheap too; porting FreeBSD to them is looking enticing nowadays.

Any thoughts?
 
AFAIK, not yet. But work seems to be ongoing.
I think this review (D45393) is the starting point.
 
The review mentions the Ryzen 9 3900X; isn't Zen a homogeneous architecture? I wonder how they're going to handle ULE's scheduling with P/E cores on modern processors/SoCs. FreeBSD being a multi-purpose OS, I also wonder how the kernel will determine which type of core a given workload should run on. In the context of the desktop (and maybe embedded?), at least, this will probably be easier to evaluate. I wish I could find some info on how XNU handles this in macOS. Maybe we could steal some implementation details from it.
 
Yes, this alone does not help heterogeneous cores.
But I think it would be a prerequisite for supporting them.

This significantly increases the number of run queues. Schedulers would be able to map queues to core types, not only to priority categories (i.e., realtime, regular, idle). A scheduler supporting heterogeneous cores would want such fine-grained run queues.
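Just to make that concrete, here is a minimal C sketch of what "run queues per core type" could look like. Every identifier in it (coretype, tdq_sketch, runq_for) is invented for illustration and is not what sched_ule.c or D45393 actually use.

Code:
/*
 * Hypothetical sketch only: none of these names exist in the FreeBSD
 * scheduler; they just illustrate "a run queue per (core type,
 * priority class) pair" instead of "a run queue per priority class".
 */
struct runq;                            /* stand-in for FreeBSD's struct runq */

enum coretype  { CORE_PERF, CORE_EFF, NCORETYPE };
enum prioclass { PRI_RT, PRI_TIMESHARE, PRI_IDLE, NPRICLASS };

struct tdq_sketch {
        /* One queue per core type and priority class. */
        struct runq *queues[NCORETYPE][NPRICLASS];
};

/* Pick the queue matching this CPU's core type and the thread's class. */
static struct runq *
runq_for(struct tdq_sketch *tdq, enum coretype type, enum prioclass cls)
{
        return (tdq->queues[type][cls]);
}

A scheduler built this way could keep its existing per-class policies and add core-type awareness purely by choosing which of these queues to insert into and steal from.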

And the FreeBSD Foundation already knows about this requirement via the call for ideas (CFI) in 2021. But supporting heterogeneous cores without full support from hardware manufacturers is quite a difficult task.
 
Does anybody know of programmer-level documentation on Intel's Thread Director?

And as far as I know, other OSes solve this very interesting/difficult problem by detecting known software running and applying hardcoded rules for those situations. That won't be of much use for typical FreeBSD loads.
 
Take a look at this article here regarding the XNU kernel. Is there anything we could possibly draw from its implementation? It seems its scheduler was enhanced to be aware of heterogeneous cores alongside the debut of Apple Silicon. Unfortunately I don't have the chops to sift through Darwin code.
 
That just says that the scheduler relies on QoS classes, which are explicitly given to executables and threads. This doesn't work very well: e.g. a QoS class is assigned when using the real-time audio framework, but it makes DAWs ignore efficiency cores, although you certainly would want to use them when you have hundreds of tracks. Efficiency cores are as real-time capable as performance cores. They are slower, but that is a different concept.
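For reference, this is the public userland side of those QoS classes on macOS: a thread tags itself with pthread_set_qos_class_self_np() and XNU decides core placement from that. A minimal, compilable example (how XNU actually maps a class to P- or E-cores is not shown or claimed here):

Code:
/* macOS only: the QoS API lives in <pthread/qos.h>. */
#include <pthread.h>
#include <pthread/qos.h>
#include <stdio.h>

static void *
worker(void *arg)
{
        (void)arg;
        /* Declare this thread as background work; the scheduler may then
         * prefer E-cores for it, but placement remains its decision. */
        pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);
        /* ... low-priority work would go here ... */
        return (NULL);
}

int
main(void)
{
        pthread_t td;

        /* The main thread asks for user-interactive treatment instead. */
        pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);

        pthread_create(&td, NULL, worker, NULL);
        pthread_join(td, NULL);
        printf("done\n");
        return (0);
}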

Also keep in mind that Linux, with a Wine-like layer, is consistently faster than Windows for Windows applications including games.

I have the impression that smart scheduling on big.LITTLE architectures that has a measurable benefit is the exception rather than the rule, no matter how many lines of code you throw into the scheduler.
 
How QoS classes are actually treated would rely on the scheduler's policy.
For example, the scheduler (including the mapping to cores) could rely on the QoS class in the first place, then monitor how much of each thread's ticks is wasted (i.e., how quickly it requests a switch via syscall) and forcibly remap the thread to another type of core. That would be one idea (not sure whether it works well or not; just an example).

If a thread mapped to a P-core almost immediately invokes a syscall and its context is switched out, it would be better remapped to an E-core. OTOH, if a thread mapped to an E-core is almost always forcibly context-switched by the scheduler's timer interrupt, it could be better remapped to a P-core.

Note that this would only work when the fragmented ticks are NOT reused by another thread, so as described it is probably too simple.
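A rough sketch of that heuristic in C, with made-up names and thresholds; nothing here is taken from ULE, it only illustrates the "remap based on how the slice usually ends" idea:

Code:
#include <stdint.h>

/* Hypothetical per-thread bookkeeping; all names and numbers invented. */
enum coretype { CORE_PERF, CORE_EFF };

struct td_usage {
        enum coretype on;               /* core type the thread runs on now */
        uint64_t      slices;           /* slices observed in this window */
        uint64_t      full_slices;      /* ended by timer preemption */
        uint64_t      short_slices;     /* ended early by a syscall/sleep */
};

/* After a sampling window, suggest where the thread should run next. */
static enum coretype
remap_hint(const struct td_usage *u)
{
        if (u->slices < 32)                             /* too few samples */
                return (u->on);
        /* Mostly blocks right away on a P-core -> an E-core is enough. */
        if (u->on == CORE_PERF && u->short_slices * 10 > u->slices * 9)
                return (CORE_EFF);
        /* Always burns the whole slice on an E-core -> promote to P-core. */
        if (u->on == CORE_EFF && u->full_slices * 10 > u->slices * 9)
                return (CORE_PERF);
        return (u->on);
}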
 
I have the impression that smart scheduling on big.LITTLE architectures that has a measurable benefit is the exception rather than the rule, no matter how many lines of code you throw into the scheduler.
My impression has been that the benefit is to a large extent power saving: If the whole active workload consists of a small number of threads that are not using much CPU, then the slow/efficient cores do the work, at lower power consumption. On the other hand, if there are many threads, and all are CPU bound, then all cores will run at full. For this to work, one only needs to know the number of runnable threads and expected CPU dwell time.
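As a toy illustration of that policy, assuming the only inputs really are the runnable-thread count and a short-expected-dwell-time flag (all names invented):

Code:
#include <stdbool.h>

struct load_snapshot {
        int  nrunnable;         /* runnable threads right now */
        int  n_ecores;          /* number of efficiency cores */
        bool short_dwell;       /* expected CPU dwell time is short */
};

/* true: leave the P-cores asleep and run everything on the E-cores. */
static bool
ecores_only(const struct load_snapshot *ls)
{
        return (ls->short_dwell && ls->nrunnable <= ls->n_ecores);
}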
 
But what do you do for, say, a web browser? It has contradicting properties:
  • It should use P-cores because the user is directly waiting, and it's all slow JavaScript code, both in the web pages and in the browser machinery. You need to power through that for a snappy user experience.
  • At the same time it has bursty CPU demands, like a background process.

Assuming you don't hardcode how to deal with a web browser, plus a method to recognize a web browser, how do you code that?

And if you do, how does that change when a laptop is on battery? Or on low battery?
 
This makes me so glad I've never worked on mobile computing or power efficiency. Seems too difficult.
 
Yeah. I have a hard enough time imagining schemes for just speed, but once power efficiency comes in, scheduling for mixed cores gets hung up on tradeoffs that, in my imagination, the scheduler can't decide on with the information available.
 
Well, the lack of information on the part of the scheduler is a problem.

Intel's Thread Director is supposed to supply information to the scheduler. I don't know what information that is, and I can find no documentation.

I also think that some of the information needed can't possibly come from the CPU, especially about the nature of the process and thread at hand. For all we know, the CPU supplies trivial information such as whether a given thread uses SIMD instructions during a timeslice.
 
I managed to find this whitepaper about it. Look at references 15 and 16. The PDFs have to be downloaded so I can't link them here. Would that be a good starting point?
 
I wonder whether information can be salvaged out of the FreeBSD 14.0-RELEASE/main ULE scheduler Thread Director awareness PR series that went uncommented/unreviewed a year ago:

D44454: coredirector - Intel TD/HFI driver - Part2: Enable thermal interrupt handler for Local APIC's
D44455: coredirector - Intel TD/HFI driver - Part3: Add CPU core performance/efficiency score variable to SMP's cpu_group struct.
D44456: coredirector - Intel TD/HFI driver - Part4: Add coredirector driver's source-code & Makefile.
D44457: coredirector - Intel TD/HFI driver - Part5: Add kernel configuration file example for NOTES file.
D44458: coredirector - Intel TD/HFI driver - Part6: Add coredirector's man file & Makefile.
D44459: coredirector - Intel TD/HFI driver - Part7: Add kerneldoc's Doxyfile

... or the author could still be contacted and asked where the TD information that enabled the development of that code can be retrieved, especially for part 4, which seems to do the TD-specific register analysis and convert it into scheduler data...
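In the meantime, a quick userland probe can at least tell whether a machine advertises HFI and Thread Director at all, via CPUID leaf 0x06. The bit positions below (EAX bit 19 for HFI, bit 23 for Thread Director) are quoted from memory and should be verified against the Intel SDM before relying on them; the feedback table itself sits behind MSRs and needs kernel help to read.

Code:
#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 0x06: thermal and power management feature flags. */
        if (!__get_cpuid(0x06, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 0x06 not available");
                return (1);
        }
        printf("Hardware Feedback Interface: %s\n",
            (eax & (1u << 19)) ? "yes" : "no");
        printf("Intel Thread Director:       %s\n",
            (eax & (1u << 23)) ? "yes" : "no");
        return (0);
}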
 