Solved FreeBSD 11.1: Limit the number of CPUs

PacketMan

Aspiring Daemon

Thanks: 139
Messages: 873

#2
In this day and age of quicker faster better, I never thought I would see technology allowing us to limit how many CPUs are used. I'll be reading this discussion thread with lots of curiosity. :)
 

usdmatt

Daemon

Thanks: 476
Messages: 1,307

#3
A quick look around doesn't show any obvious setting to fully disable cores / limit the number of cores the system uses. On Linux that setting seems to be part of CPU hotplug support / functionality to disable/enable processors on-the-fly. I'm not sure what the use cases are for actually booting a system with cores disabled, though.

Depending on what you want to do, cpuset() can be used to specify exactly which processors a single process, command or group of processes can use.
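A quick sketch of cpuset(1) usage (the CPU list and PID here are illustrative, not from the thread):

```
# Run a command restricted to CPUs 0 and 1
cpuset -l 0-1 /path/to/command

# Restrict an already-running process (PID 1234 is made up)
cpuset -l 0-1 -p 1234
```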
 
OP

Daniel J Blueman

New Member

Thanks: 3
Messages: 3

#4
Adding hint.lapic.X.disabled="1" (eg hint.lapic.40.disabled="1") to /boot/loader.conf for every APIC ID you want to disable did the trick.
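For anyone with many cores to disable, a small loop can generate the hints (a sketch; the APIC ID range 8-31 is an assumption for illustration, real IDs are system-specific and show up in the lapic lines of dmesg or /var/run/dmesg.boot):

```shell
# Generate loader.conf lines disabling APIC IDs 8 through 31
# (example IDs only; substitute the IDs of the cores you want offline)
for id in $(seq 8 31); do
  printf 'hint.lapic.%d.disabled="1"\n' "$id"
done > disable_cores.conf
cat disable_cores.conf
```

Append the resulting lines to /boot/loader.conf and reboot.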

Since the workload is I/O-bound (storage server) and can saturate at most 2-3 cores, and all PCIe devices are on the first NUMA node, bringing up only the cores on the first NUMA node gives ~10% higher throughput for cached data, due to the deep NUMA hierarchy.
 

Eric A. Borisch

Well-Known Member

Thanks: 212
Messages: 338

#5
Fascinating. Do you lose half your memory (you mentioned NUMA, assuming you have a 2-socket system) when doing this? You don't happen to be running geli, do you?
 
OP

Daniel J Blueman

New Member

Thanks: 3
Messages: 3

#6
All the memory is available, we're just preventing cores from starting.

This was on a 4-socket Opteron system with 8 NUMA nodes, so the locality of unbound threads would typically be quite poor, resulting in significantly more cache-coherency traffic. If it were Linux, I'd use 'perf stat' to see the average instructions per clock, and compare without disabling cores on the 2nd to 8th NUMA nodes.

However, only disable cores if doing so won't make the workload compute-bound; you can verify this with 'vmstat 2' while benchmarking the application. In this case, 8 cores is already overkill for a storage server. In HPC applications, I'd pin the application threads according to the NUMA topology.
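As a sketch, sampling every two seconds during a benchmark run:

```
# vmstat(8): watch the 'id' (idle) column while the benchmark runs;
# if idle stays near zero on the remaining cores, the workload has
# become compute-bound and disabling cores will hurt
vmstat 2
```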
 

Eric A. Borisch

Well-Known Member

Thanks: 212
Messages: 338

#7
Aha. I was wondering if you had perhaps tried binding the irqs for the controller card & network ( cpuset -l 0,1 -x QQQ) to the desired (PCIe-controlling) node(s), leaving the other CPUs still available for menial tasks (cron jobs, etc) rather than having those (admittedly low-intensity) tasks force context switches. Cron and other tasks could also be explicitly bound to the other, not-running-IO CPUs.
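That might look something like this (the irq number and CPU lists are illustrative):

```
# Pin a device interrupt to CPUs on the PCIe-controlling node
# (find the irq numbers in vmstat -i output)
cpuset -l 0-1 -x 264

# Keep housekeeping off the I/O CPUs
cpuset -l 2-7 /usr/sbin/cron
```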

I asked about geli as by default it spawns (N geli devices) * (M CPUs) threads for IO, which for high N*M isn't really the best choice. But it doesn't sound like you were hitting that issue.
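If that were the issue, the per-provider thread count is tunable via a loader tunable (a sketch; the value 2 is just an example):

```
# /boot/loader.conf
kern.geom.eli.threads=2   # cap geli worker threads per provider
```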

Thanks for the description!

edit: fixed bbcode strike-thru; noted that geli statement is by default (adjustable)
 

Eric A. Borisch

Well-Known Member

Thanks: 212
Messages: 338

#8
If it were Linux, I'd use 'perf stat' to see the average instructions per clock, and compare without disabling cores on the 2nd to 8th NUMA nodes.
Install sysutils/intel-pcm and use pcm.x; it gives you an updating output like this:

Code:
 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses
 L2MISS: L2 cache misses (including other core's L2 cache *hits*)
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3MPI : number of L3 cache misses per instruction
 L2MPI : number of L2 cache misses per instruction
 READ  : bytes read from main memory controller (in GBytes)
 WRITE : bytes written to main memory controller (in GBytes)
 IO    : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
 energy: Energy in Joules


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

   0    0     0.25   1.19   0.21    1.00    1529 K   2402 K    0.36    0.37    0.00    0.00     77
   1    0     0.33   1.30   0.25    1.00    1542 K   2583 K    0.40    0.36    0.00    0.00     77
   2    0     0.20   1.08   0.19    1.00    1407 K   2293 K    0.39    0.40    0.00    0.00     78
   3    0     0.17   1.03   0.16    1.00    1450 K   2395 K    0.39    0.39    0.00    0.00     78
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.24   1.17   0.20    1.00    5929 K   9675 K    0.39    0.38    0.00    0.00     75
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.24   1.17   0.20    1.00    5929 K   9675 K    0.39    0.38    0.00    0.00     N/A

 Instructions retired: 3443 M ; Active cycles: 2948 M ; Time (TSC): 3612 Mticks ; C0 (active,non-halted) core residency: 20.14 %

 C1 core residency: 45.31 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 34.54 %;
 C2 package residency: 25.41 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 2.34 => corresponds to 58.39 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.47 => corresponds to 11.76 % core utilization over time interval
 SMI count: 0
---------------------------------------------------------------------------------------------------------------
MEM (GB)->|  READ |  WRITE |   IO   | CPU energy |
---------------------------------------------------------------------------------------------------------------
 SKT   0     1.11     0.46     0.57      20.19
---------------------------------------------------------------------------------------------------------------
Usage message:
Code:
 Usage:
 pcm.x --help | [delay] [options] [-- external_program [external_program_options]]
   <delay>                           => time interval to sample performance counters.
                                        If not specified, or 0, with external program given
                                        will read counters only after external program finishes
 Supported <options> are:
  -h    | --help      | /h           => print this help and exit
  -r    | --reset     | /reset       => reset PMU configuration (at your own risk)
  -nc   | --nocores   | /nc          => hide core related output
  -yc   | --yescores  | /yc          => enable specific cores to output
  -ns   | --nosockets | /ns          => hide socket related output
  -nsys | --nosystem  | /nsys        => hide system related output
  -m    | --multiple-instances | /m  => allow multiple PCM instances running in parallel
  -csv[=file.csv] | /csv[=file.csv]  => output compact CSV format to screen or
                                        to a file, in case filename is provided
  -i[=number] | /i[=number]          => allow to determine number of iterations
 Examples:
  pcm.x 1 -nc -ns          => print counters every second without core and socket output
  pcm.x 1 -i=10            => print counters every second 10 times and exit
  pcm.x 0.5 -csv=test.log  => twice a second save counter values to test.log in CSV format
  pcm.x /csv 5 2>/dev/null => one sample every 5 seconds, and discard all diagnostic output
 