CPU core groups

Well, I have a long-standing question. Today we have "Hyper-Threading", "semi-cores" and similar technologies that make SMT cores not truly independent of each other. For example, my AMD FX(tm)-8300 CPU has 8 "semi-"cores: each of the 4 pairs shares some execution units and cache. The same goes for Intel's Hyper-Threading. But I don't see any such grouping in sysctl (kern.ccpu, say, is quite plain), in powerd, or anywhere else. Tasks migrate freely across all CPUs (unless explicitly made not to).
So, my "semi-"theoretical question: is there any way to make FreeBSD take this lack of full independence between cores into account?
 
Maybe "kern.sched.topology_spec" is the sysctl variable you searching for.

Code:
sysctl kern.sched.topology_spec
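You can also check which scheduler is actually in use; on my machine it reports ULE:
Code:
sysctl kern.sched.name
kern.sched.name: ULE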
 
Yes, I've seen it, but are there any reasonable values for that sysctl (I couldn't even tell from it whether it is better to group CPU cores with shared resources or not), or does the FreeBSD kernel already use it efficiently enough?
 
Yes, I've seen it, but are there any reasonable values for that sysctl (I couldn't even tell from it whether it is better to group CPU cores with shared resources or not), or does the FreeBSD kernel already use it efficiently enough?
The sched_ule(4) scheduler (the default for quite a few FreeBSD versions now) is topology-aware:
manpage said:
o Thread CPU affinity.
o CPU topology awareness, including for hyper-threading.
Whether or not your system will benefit from hyper-threading depends on both your workload and the CPU model involved. Different generations/types of CPUs have varying amounts of independence between the virtual CPUs. If your workload doesn't have more simultaneously runnable processes than your system has CPU cores, then more cores (either real or hyper-threaded) won't help. On my systems I generally disable hyper-threading, as they have 8 or more real cores.
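I usually do that in the BIOS, but there is also a loader tunable for it; a minimal sketch, assuming machdep.hyperthreading_allowed behaves as it does on the FreeBSD versions I have used:
Code:
# /boot/loader.conf: keep the hyper-threaded logical CPUs from being
# scheduled (assumes the machdep.hyperthreading_allowed tunable)
machdep.hyperthreading_allowed="0"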
 
Yes, I know about that peculiarity. There is also the L2 (and L3) cache shared among a module's cores, so moving a task from one module to another causes additional cache misses.
As I can see in top -PIHS, long-lived threads jump across all the cores... So is only "manual" userland optimization possible?
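By "manual" I mean something like pinning with cpuset(1); the PID and program below are just examples:
Code:
# keep an already-running process (example PID 1234) inside one module
cpuset -l 0,1 -p 1234
# or start a program on one core from each module
cpuset -l 0,2,4,6 ./myprog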
 
And this output doesn't make me any more optimistic about the effectiveness of core scheduling:
Code:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="ff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="8" mask="ff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   </group>
  </children>
 </group>
</groups>
 
And this output doesn't make me any more optimistic about the effectiveness of core scheduling:
What does your system report during boot in the "FreeBSD/SMP" lines? My guess is that it thinks you have 8 full cores, since that is what AMD claims for that CPU.

Here's what the sysctl reports on a dual Xeon X5680 system with hyper-threading off:
Code:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="12" mask="fff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="6" mask="3f">0, 1, 2, 3, 4, 5</cpu>
   </group>
   <group level="2" cache-level="2">
    <cpu count="6" mask="fc0">6, 7, 8, 9, 10, 11</cpu>
   </group>
  </children>
 </group>
</groups>
Pretty simple - 6 cores in each socket. Presumably it avoids migrating threads between sockets to avoid losing cache performance. Now look what happens if I turn hyper-threading on in that same system:
Code:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="24" mask="ffffff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="12" mask="fff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
    <children>
     <group level="3" cache-level="1">
      <cpu count="2" mask="3">0, 1</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c">2, 3</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="30">4, 5</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c0">6, 7</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="300">8, 9</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c00">10, 11</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
    </children>
   </group>
   <group level="2" cache-level="2">
    <cpu count="12" mask="fff000">12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</cpu>
    <children>
     <group level="3" cache-level="1">
      <cpu count="2" mask="3000">12, 13</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c000">14, 15</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="30000">16, 17</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c0000">18, 19</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="300000">20, 21</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c00000">22, 23</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
    </children>
   </group>
  </children>
 </group>
</groups>
It has detected a much more complex topology and recognizes the pairs of hyper-threaded logical CPUs in each core.

If your CPU reports that it has 8 fully-functional cores and no hyper-threading, then I'd expect to see the output you reported. I don't think this sort of shared-resource module design has been repeated in any other x86-64 CPU family (although I could certainly be mistaken). Some recent ARM CPUs have cores of varying abilities (popular for mobile phones and tablets), but the ARM platform seems fragmented enough that there are separate FreeBSD images for different boards, so it is presumably a lot easier to handle special-case CPUs.
 
Surely it reports just 8 cores... But this module/core arrangement is well known for AMD's Bulldozer...
Code:
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
 cpu0 (BSP): APIC ID: 16
 cpu1 (AP): APIC ID: 17
 cpu2 (AP): APIC ID: 18
 cpu3 (AP): APIC ID: 19
 cpu4 (AP): APIC ID: 20
 cpu5 (AP): APIC ID: 21
 cpu6 (AP): APIC ID: 22
 cpu7 (AP): APIC ID: 23
The question is: can I improve the OS's knowledge of the system topology, maybe via some tunables?
 
Surely it reports just 8 cores... But this module/core arrangement is well known for AMD's Bulldozer...
Yup
The question is: can I improve the OS's knowledge of the system topology, maybe via some tunables?
You can force some topologies by setting kern.smp.topology (see /usr/src/sys/kern/kern_smp.c), but the 8-core Bulldozer layout is not one of them. And you might end up hurting performance, since you probably only want to avoid scheduling operations on both cores of a module when those operations use something the module has only one of, rather than one per core (mainly the floating-point unit). There is probably some way to achieve something like cpuset(1) affinity at the kernel level (the simplistic approach would be to simply force the available CPU mask to 01010101b, though if you have more than 4 runnable threads that will probably kill performance).
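A rough sketch of both ideas; the topology value is only an illustration taken from the canned list in kern_smp.c, so verify it against your source tree:
Code:
# /boot/loader.conf: force a canned topology, e.g. 3 = "dual core with
# shared L2" in the kern_smp.c switch; none of them matches 4 modules x 2 cores
kern.smp.topology=3

# runtime alternative: shrink the default cpuset (set 1) to one core
# per module, i.e. mask 01010101b = CPUs 0, 2, 4, 6
cpuset -l 0,2,4,6 -s 1
cpuset -g -s 1   # show the resulting mask
New processes inherit the default set, so the cpuset approach approximates forcing that mask system-wide.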

Given that there are a lot of those processors out there and I couldn't find any discussion newer than February 2012, this is apparently a complex issue with the potential for a lot of work and small returns. Even Microsoft just provided a pair of hotfixes for Windows 7, deferring further work to a larger rewrite in Windows 8.
 