C++ Clang C++ CPU profiling on FreeBSD 12.3?

Tim Rau · Mar 22, 2023

We have a few threads in a particular process that increase CPU usage over time on FreeBSD 12.3. Reviewing the source code associated with these threads did not identify anything that should be taking a lot of CPU time, much less anything that would increase for a longer-running process.

On Windows, I would debug using the Visual Studio CPU Usage profiler, which shows the time spent in each method (and some system calls). Is there a tool or feature that would allow something like this on FreeBSD for C++ built with Clang? I know there are plenty of instrumentation features available, but have not found the right one yet.

If there is a tool known to be useful for CPU usage profiling of C++ code, are there any particular Clang compile flags needed, and is there any documentation with some usage examples?

Background
I have experience developing primarily on Windows, and have used Microsoft Visual Studio's Performance Profiler to track down issues. Documentation includes step-by-step instructions and screenshots of example results. I have very little experience with either Linux or FreeBSD. Our software is compiled on Windows; we have makefiles that reference clang in a MinGW cross-compiler (originally prepared according to this page) to build our C++ code base for FreeBSD. After deploying to our FreeBSD 12.3 test environment, we interact with it using an SSH session. This test environment currently has very few packages that are not installed out-of-the-box with the FreeBSD installer.

Problem
We need to identify a similar tool to run CPU profiling on FreeBSD, also with step-by-step instructions and screenshots of example results (to confirm that this was the right tool for the job).

Solution
The solution that worked for me (see all posts for breadcrumbs):

If not already done, compile C++ code with the -g compiler option
Install the valgrind package with sudo pkg install valgrind
For documentation with screenshots, see the article by Paul Floyd: Valgrind Part 4: Cachegrind and Callgrind
Use valgrind --tool=callgrind --instr-atstart=no --log-file=cg.out ./<my_app> <my_app_args> to start the app with collection disabled.
1. Expect a process slowdown, even with collection disabled
Use callgrind_control --instr=on to enable collection
Use QCacheGrind on Windows to process and view results from valgrind on FreeBSD.

Other Suggestions
These may work for others (see all posts for breadcrumbs):

When using valgrind, use the lighter-weight cachegrind tool.
1. My attempt did not give meaningful results
Use dtrace(1) for collection and Flame Graphs for processing (per cracauer@)
1. My attempt encountered an error: dtrace: failed to initialize dtrace: DTrace device not available on system
Use pmcstat(8) for collection (per Paul Floyd)
1. My attempt encountered an error: pmcstat: ERROR: Initialization of the pmc(3) library failed: No such file or directory
Use google perftools / pprof (per Paul Floyd)
1. My attempt encountered an error when building with -lprofiler (using the cross-compiler)
2. I did not install the package, so not sure if this requires the "google-perftools" or the "pprof" package

cracauer@ · Mar 22, 2023

Flame graphs by Brendan Gregg, gathered via DTrace.

Flame Graphs

Homepage for Flame Graphs: a visualization for stack traces.

www.brendangregg.com

Tim Rau · Mar 22, 2023

cracauer@ Thanks for your response.

It appears Flame graphs is just a visualization tool for data already gathered by DTrace. Unfortunately, basic attempts to use dtrace got the error "DTrace device not available on system" as mentioned on https://forums.freebsd.org/threads/freebsds-dtrace-needs-specific-settings.31163/ It appears that in order to enable DTrace, one actually needs to "recompile the kernel" (per https://docs.freebsd.org/en/books/handbook/dtrace/#dtrace-enable). Modifying the FreeBSD operating system itself seems a bit daunting for me at this time, so I will defer attempting that for now.

The example dtrace command given on the flame graphs page seems to use plenty of keywords not mentioned on the dtrace(1) manual page, so I am not sure if it would need to be tweaked for FreeBSD:

dtrace -x ustackframes=100 -n 'profile-99 /execname == "mysqld" && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o out.stacks

cracauer@ · Mar 22, 2023

Are you issuing the DTrace commands as root?

As for that last command line, all you have to modify is the "mysqld" to match the process(es) you want to investigate.
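
For reference, the whole collect-and-render sequence from the Flame Graphs page can be sketched as below. This is a sketch under assumptions, not a tested FreeBSD recipe: my_app and the ./FlameGraph checkout path are placeholders, stackcollapse.pl and flamegraph.pl are the scripts from Brendan Gregg's FlameGraph repository, and the dtrace step has to run as root on a system where DTrace is available.

```shell
#!/bin/sh
# Sketch only. APP_NAME and FLAMEGRAPH_DIR are placeholders, not values from this thread.
APP_NAME="${APP_NAME:-my_app}"
FLAMEGRAPH_DIR="${FLAMEGRAPH_DIR:-./FlameGraph}"

# Build the D program from the Flame Graphs page for a given process name.
dtrace_program() {
    printf 'profile-99 /execname == "%s" && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' "$1"
}

# Sample user-level stacks at 99 Hz for 60 seconds (run as root).
collect_stacks() {
    dtrace -x ustackframes=100 -n "$(dtrace_program "$APP_NAME")" -o out.stacks
}

# Fold the sampled stacks and render an SVG that opens in any browser.
render_flamegraph() {
    "$FLAMEGRAPH_DIR/stackcollapse.pl" out.stacks > out.folded &&
        "$FLAMEGRAPH_DIR/flamegraph.pl" out.folded > out.svg
}

# Usage (as root): collect_stacks && render_flamegraph
```

The resulting out.svg can be copied off the FreeBSD machine and viewed on Windows, so nothing graphical is needed on the test environment itself.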

Tim Rau · Mar 22, 2023

cracauer@ Yes, sudo dtrace -l on FreeBSD 12.3-RELEASE gets dtrace: failed to initialize dtrace: DTrace device not available on system

cracauer@ · Mar 22, 2023

Tim Rau said:
cracauer@ Yes, sudo dtrace -l on FreeBSD 12.3-RELEASE gets dtrace: failed to initialize dtrace: DTrace device not available on system

Odd. Runs like a charm for me on 14-current/amd64 with GENERIC-NODEBUG kernel. What platform are you on?

Jose · Mar 23, 2023

Code:

$ uname -ir
12.4-RELEASE-p1 GENERIC
$ doas dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4        fbt            kernel                camstatusentrycomp entry
    5        fbt            kernel                camstatusentrycomp return
    6        fbt            kernel            cam_compat_handle_0x17 entry
    7        fbt            kernel            cam_compat_handle_0x17 return
    8        fbt            kernel            cam_compat_handle_0x18 entry
    9        fbt            kernel            cam_compat_handle_0x18 return
   10        fbt            kernel cam_compat_translate_dev_match_0x18 entry
   11        fbt            kernel cam_compat_translate_dev_match_0x18 return
...

Paul Floyd · Mar 23, 2023

Most of the profiling that I do is on Linux.

You might consider pmcstat, roughly the equivalent of Linux perf.

Next, google perftools.

Cachegrind and callgrind will give accurate instruction counts, but won't model either threads or modern cache/branch prediction/speculative execution accurately.

Tim Rau · Mar 23, 2023

TLDR version: It seems it might be possible to gather data with dtrace and process with flamegraphs, or gather data with some compiler-added feature and process with pprof, but I'm still unclear from documentation how to enable the associated features (using clang and FreeBSD 12.3-RELEASE amd64).

cracauer@: I am also running on amd64.

Jose:

Code:

$ uname -ir
12.3-RELEASE GENERIC
$ doas dtrace -l
-sh: doas: not found
$ dtrace -l
dtrace: failed to initialize dtrace: DTrace device not available on system

Paul Floyd :

It appears pmcstat(8) gathers statistics about performance of a process overall; I don't see anything in its documentation that would allow investigating a particular code module.
I could not find the actual home page for google perftools documentation. The pages I did find either mentioned it "is now deprecated" or "this page has moved" with dead links to the new location. I think the repo had several different things, with the pprof tool being the useful one. From skimming the deprecated documentation, I think using the CPU profiler requires:
1. Compile my source with the -lprofiler flag under gcc. I have not yet been able to verify a corresponding flag in clang: https://clang.llvm.org/docs/ClangCommandLineReference.html
2. Set an environment variable to turn on profiling when starting the app.
3. Run pprof for FreeBSD to process the raw output.
I found no documentation for cachegrind on FreeBSD. There is an associated "kcachegrind" package, but this seems to require something called KDE: https://forums.freebsd.org/threads/kcachegrind-depend-on-kde-when-using-gnome.19255/
I found no documentation for calgrind on FreeBSD. No such package seems to exist.

Code:

$ pkg search perftools
google-perftools-2.10_2        Fast, multi-threaded malloc() and nifty performance analysis tools
$ pkg search pprof
pprof-g20200905_8              Tool for visualization and analysis of profiling data
$ pkg search cachegrind
kcachegrind-22.12.0            Profiler frontend for KDE
$ pkg search calgrind
$

Jose · Mar 23, 2023

doas(1)

cracauer@ · Mar 23, 2023

Tim Rau said:
$ dtrace -l
dtrace: failed to initialize dtrace: DTrace device not available on system

Again, you need to do that as root.

Paul Floyd · Mar 25, 2023

I'm going to be harsher than I usually am. Reading your reply it seems to me that either you didn't read it carefully or you didn't understand what I said.

Tim Rau said:
Paul Floyd :

It appears pmcstat(8) gathers statistics about performance of a process overall; I don't see anything in its documentation that would allow investigating a particular code module.

Do you have any experience of either software development or profiling?

If you need to get up to speed on the basics of performance measurement, you could do worse than reading Brendan Gregg's older books

https://www.brendangregg.com/books.html

(the newer ones are oriented to Linux, especially the BPF book).

In short the only way to profile a single module is to use some form of instrumentation.

If you use a tool like pmcstat then it will either sample everything (kernel and all processes) or just one process. Both have their uses, the former will give you some idea of what is happening in the kernel, the latter is more focused on your test application.

The pmcstat man page has several examples. There's also info in the FreeBSD wiki

https://wiki.freebsd.org/PmcTools

and more concretely here

https://wiki.freebsd.org/PmcTools/PmcKcachegrind.
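
The two sampling modes described above can be sketched roughly as follows. This is a sketch under assumptions: hwpmc(4) must be loadable, the `instructions` event alias is assumed to exist on the CPU in question (pmccontrol -L lists what is actually available), and sample.out, callgraph.txt, callgrind.out and the application command are placeholder names. The -t form attaches to a process that is already running.

```shell
#!/bin/sh
# Sketch only. EVENT is an assumed event alias; list real ones with: pmccontrol -L
EVENT="${EVENT:-instructions}"

# System-wide sampling (kernel and all processes) while "$@" runs, e.g. sleep 60.
sample_system() {
    pmcstat -S "$EVENT" -O sample.out "$@"
}

# Process-mode sampling of one command started by pmcstat.
sample_command() {
    pmcstat -P "$EVENT" -O sample.out "$@"
}

# Process-mode sampling of an already-running process, selected by PID.
sample_pid() {
    pmcstat -P "$EVENT" -t "$1" -O sample.out
}

# Turn the sample log into a text call graph and a KCachegrind-format file.
report() {
    pmcstat -R sample.out -G callgraph.txt
    pmcstat -R sample.out -F callgrind.out
}

# Usage (as root): kldload hwpmc; sample_command ./my_app my_app_args; report
```

The file written by -F is in the calltree format that KCachegrind reads, so it should be viewable with the same front end used for Valgrind output.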

Tim Rau said:
I could not find the actual home page for google perftools documentation. The pages I did find either mentioned it "is now deprecated" or "this page has moved" with dead links to the new location. I think the repo had several different things, with the pprof tool being the useful one. From skimming the deprecated documentation, I think using the CPU profiler requires:

Compile my source with the -lprofiler flag under gcc. I have not yet been able to verify a corresponding flag in clang: https://clang.llvm.org/docs/ClangCommandLineReference.html

Set an environment variable to turn on profiling when starting the app.

Run pprof for FreeBSD to process the raw output.

"-l" (lower case L) is a flag for the linker or linker driver. It is not specific to GCC.

I have no problems using it with clang.

Code:

clang++ -g main.cpp -o main_prof -lprofiler -L /usr/local/lib
CPUPROFILE=prof.out ./main_prof
/usr/local/bin/perftools-pprof --pdf ./main_prof prof.out > o.pdf

Tim Rau said:
I found no documentation for cachegrind on FreeBSD. There is an associated "kcachegrind" package, but this seems to require something called KDE: https://forums.freebsd.org/threads/kcachegrind-depend-on-kde-when-using-gnome.19255/

I found no documentation for calgrind on FreeBSD. No such package seems to exist

'calgrind' was a typo. It should be 'callgrind'. Have you ever used a web search engine? It would have found answers, even with my typo, with no problem and in far less time than you spent answering.

cachegrind and callgrind are parts of the Valgrind suite,

Valgrind

Official Home Page for valgrind, a suite of tools for debugging and profiling. Automatically detect memory management and threading bugs, and perform detailed profiling. The current stable version is valgrind-3.26.0.

valgrind.org

Here's an article I wrote some while back on using them

Valgrind Part 4: Cachegrind and Callgrind

Cachegrind and Callgrind When your application is slow, you need a profiler. Paul Floyd shows us how callgrind and cachegrind can help.

accu.org

Tim Rau · Mar 28, 2023

My original post might have been unclear about what was needed. Yes, I have over 15 years of experience with software development - on Windows - and have occasionally used Microsoft Visual Studio's Performance Profiler to track down issues. Documentation that includes step-by-step instructions and screenshots of example results is available here. I have worked very little with Linux, and have been working with FreeBSD for about 3 months. I do not yet have the experience to tell the difference between a Linux vs. FreeBSD difference, a typo, or a command that requires installation of additional packages.

Our software is compiled on Windows; we have makefiles that reference clang in a MinGW cross-compiler (originally prepared according to this page) to build our C++ code base for FreeBSD. After deploying to our FreeBSD test environment, we interact with it using an SSH session. This test environment currently has very few packages that are not installed out-of-the-box with the FreeBSD installer.

Given that setup, I needed documentation for a similar CPU profiling tool on FreeBSD, with step-by-step instructions and screenshots of example results (to confirm that this was the right tool for the job).

This article seems to include exactly what I need, identifying cachegrind and associated tooling as the right tool for the job:

Here's an article I wrote some while back on using them

Valgrind Part 4: Cachegrind and Callgrind

After installing the valgrind package with sudo pkg install valgrind, the following created files cg.out and cachegrind.out.<pid>.

Code:

valgrind --tool=cachegrind --log-file=cg.out ./<my_app> <my_app_args>

Running cg_annotate cachegrind.out.<pid> as mentioned in the article seemed to output meaningful (but difficult to read) results.

To process the results in a user-friendly format, I think I need to find a valid Windows variant of KCachegrind and copy the results to my Windows developer machine where the source code is. I will confirm on this thread once I get a chance to actually try that out.

Side notes:

While writing this post, I discovered that when searching the manual, it is helpful to select an option that includes Ports (FreeBSD 12.3-RELEASE and Ports instead of just FreeBSD 12.3-RELEASE).

Regarding -lprofiler for perftools, I got an error attempting to use that; this might be a problem specific to the cross-compiler.

c:\MinGW\msys\1.0\opt\crosstool\x86_64-unknown-freebsd\x86_64-unknown-freebsd\bin\ld: cannot find -lprofiler

Tim Rau · Mar 29, 2023

I was able to use QCacheGrind on Windows to process and view results from valgrind on FreeBSD.

The --tool=cachegrind output was not very useful. Most of the processing showed up under unknown, even though the -g compiler option was used.
The --tool=callgrind output gave me exactly what I needed. The instrumentation can also be disabled at startup and enabled as needed:

Code:

valgrind --tool=callgrind --instr-atstart=no --log-file=cg.out ./<my_app> <my_app_args>
callgrind_control --instr=on

To show details on the FreeBSD libraries in QCacheGrind, I had to add the cross-compiler directory C:\MinGW\msys\1.0\opt\crosstool\x86_64-unknown-freebsd\usr\include to the source folder list.
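
The start, enable, and dump sequence can be sketched as follows. This is a sketch under assumptions rather than the exact commands used above: the application path and arguments are placeholders, --separate-threads=yes is an optional Callgrind flag that writes one profile per thread, and valgrind.log is an arbitrary name. Note that --log-file only captures Valgrind's own messages; the profile data itself goes to callgrind.out.<pid> files.

```shell
#!/bin/sh
# Sketch only. The application and its arguments are placeholders.

# Start the app under Callgrind in the background with instrumentation off.
start_profiled() {
    valgrind --tool=callgrind --instr-atstart=no --separate-threads=yes \
        --log-file=valgrind.log "$@" &
}

# Instrument for a fixed window (in seconds), write a profile dump,
# then switch instrumentation off again.
capture_window() {
    callgrind_control --instr=on &&
        sleep "$1" &&
        callgrind_control --dump &&
        callgrind_control --instr=off
}

# Usage: start_profiled ./my_app my_app_args; capture_window 30
```

Limiting instrumentation to a short window keeps the dump focused on the period of interest, although the emulation slowdown applies for the whole run.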

Observations:

Both cachegrind and callgrind slowed down my app to the extent that it was almost unresponsive, even with instrumentation turned off. This does not seem to be the experience for most users, so may not be of general concern.
valgrind does not seem to have any feature to attach to an already-running process. A different tool might be needed by any users that need to investigate issues that appear gradually over time.

Paul Floyd · Mar 29, 2023

Tim Rau said:
Observations:

Both cachegrind and callgrind slowed down my app to the extent that it was almost unresponsive, even with instrumentation turned off. This does not seem to be the experience for most users, so may not be of general concern.

valgrind does not seem to have any feature to attach to an already-running process. A different tool might be needed by any users that need to investigate issues that appear gradually over time.

Expect typical slowdowns of 10x or more.

Attaching is impossible. Valgrind does not work like debuggers (gdb and lldb). Debuggers have two processes, the debugger and the inferior, with the debugger controlling the inferior via ptrace system calls. Valgrind is totally different. There is only one process (the valgrind tool such as callgrind). The valgrind host emulates a CPU for the guest to run on (and it's this emulation that causes it to be slow).

I'm pretty sure that none of the Valgrind devs ever do any cross compilation. In theory the output binaries should be identical but it's not something that is tried and tested.

Paul Floyd · Mar 29, 2023

Tim Rau said:
This article seems to include exactly what I need, identifying cachegrind and associated tooling as the right tool for the job:

As I said earlier, cachegrind and callgrind have limited cache/branch predictor modelling, which can make them inaccurate.

That's why I recommend something like pmcstat to get a second opinion.

Tim Rau · Mar 30, 2023

Unfortunately, all my attempts to run pmcstat(8) commands get the following error:

pmcstat: ERROR: Initialization of the pmc(3) library failed: No such file or directory

The following threads reported similar issues, with no final resolution:

As suggested in those threads, I attempted sudo kldload hwpmc, but that got the error:

kldload: can't load hwpmc: Operation not permitted

One observation regarding hwpmc(4) suggested it is only supported on some CPUs:

trev said:
An apropos hwpmc suggests hwpmc may not support your CPU

I could not find the mentioned limitation in the man pages. I found BIOS settings mentioned under hwpmc(4) BUGS, but only for single-processor systems:

On the i386 architecture, the driver requires that the local APIC on the
CPU be enabled for sampling mode to be supported. Many single-processor
motherboards keep the APIC disabled in BIOS; on such systems hwpmc will
not support sampling PMCs.

Paul Floyd · Mar 30, 2023

It works on my fairly ancient HP workstation. What hardware are you using? (The "Model name" output from lscpu).

Tim Rau · Apr 4, 2023

This is on a Beckhoff CX2042, with lscpu model name: Intel(R) Xeon(R) CPU D-1527 @ 2.20GHz

I got access to another test environment where FreeBSD is running in a VM (virtual machine). This initially gave the error message below, but sudo kldload hwpmc then enabled pmcstat(8):

pmcstat: pmu features not supported on host or hwpmc not loaded

On the VM, sudo dtrace -l also seems to work, so these may be limitations with the industrial Beckhoff hardware we work with.

Jose · Apr 4, 2023

Is the thing running a custom kernel? If so, it's likely they didn't enable hwpmc(4) in it.
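
A few read-only checks may help tell these cases apart. This is a sketch under assumptions: it uses only uname, sysctl, and kldstat from the base system, and the interpretation in the comments is a guess at likely causes, not a diagnosis of the Beckhoff image. In particular, a securelevel above 0 prevents loading kernel modules even as root, which would be one possible explanation for the "Operation not permitted" error from kldload.

```shell
#!/bin/sh
# Sketch only: read-only checks, safe to run without root.

check_profiling_support() {
    # Kernel release and ident; an ident other than GENERIC suggests a custom kernel.
    echo "kernel: $(uname -ir)"

    # Above 0, kernel modules cannot be loaded or unloaded, even by root.
    echo "securelevel: $(sysctl -n kern.securelevel)"

    # List loaded module files for hwpmc or dtrace. Support compiled into
    # the kernel itself only shows up with kldstat -v.
    kldstat | grep -E 'hwpmc|dtrace' || echo "modules: hwpmc/dtrace not loaded"
}

# Usage: check_profiling_support
```

If the modules are absent and the securelevel is 0, sudo kldload hwpmc or sudo kldload dtraceall would be the next thing to try.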