Threading-related problem with x265 parallel motion estimation on FreeBSD

I'm not sure if this is the right subforum for this, and if it's not, my apologies! I just didn't know where exactly this would fit in.

Here's my problem: I'm using the x265 command-line video encoder on FreeBSD 12.1-RELEASE-p1, and it works perfectly fine unless I specify the --pme parameter, which activates the encoder's "parallel motion estimation" feature. Once this is active, the kernel load on the machine rises significantly, eating up 30-40% of an entire 32-core, 64-thread processor. Here are the specs of my test system:
  • OS: FreeBSD 12.1-RELEASE-p1 running the GENERIC kernel
  • CPU: AMD Ryzen Threadripper 3970X (32 cores, 64 threads, bare metal)
  • Architecture: amd64
  • x265 version: Any from 2.5+48-bd438ce10843 up to 3.2.1+1-b5c86a64bbbe
  • Compiler: clang (any version with C++11 support), GCC (any version with C++11 support)
  • Assembler: yasm 1.3.0 and nasm 2.13.x or 2.14.x

I don't know much about debugging, but I tried to find some clues using $ truss -c -D -H -s 256 and $ truss -D -H -s 256. The encoder seems to get stuck issuing huge numbers of _umtx_op system calls, which take an enormous amount of time:

Code:
syscall                     seconds   calls  errors
thr_new                 0.007680868      71       0
getcontext              0.000003631       1       0
getpid                  0.000005630       2       0
__sysctl                0.000016060       4       0
issetugid               0.000007411       2       0
write                   0.000235482      21       0
thr_self                0.000002880       1       0
sysarch                 0.000003029       1       0
sigprocmask             0.000068142      17       0
sigaction               0.000014991       2       0
rtprio_thread           0.000002880       1       0
readlink                0.000004490       1       1
read                   17.799654291  249306       0
pread                   0.000005131       1       0
openat                  0.000061553      13       3
open                    0.000148284       4       1
munmap                  0.010717130      47       0
mprotect                0.001165332      83       0
mmap                    1.810752957    3062       0
madvise                 0.000003110       1       0
getrlimit               0.000002850       1       0
fstat                   0.000068339      13       0
close                   0.000035990      10       0
_umtx_op            26676.881285215  747621    1076
                      ------------- ------- -------
                    26696.511945676 1000286    1081
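
To put those numbers in proportion, here is a rough Python sketch that parses a truss -c style summary and works out what share of the recorded time one system call accounts for. It is not part of truss or x265; the function names are made up for illustration, and the sample contains only three rows copied from the table above, so the share is relative to those rows rather than to the full run.

```python
# Sketch only: parse_truss_summary and share_of_total are hypothetical helpers.
SAMPLE = """\
syscall                     seconds   calls  errors
read                   17.799654291  249306       0
mmap                    1.810752957    3062       0
_umtx_op            26676.881285215  747621    1076
                      ------------- ------- -------
                    26696.511945676 1000286    1081
"""


def parse_truss_summary(text):
    """Return {syscall: (seconds, calls, errors)} for each data row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # separator and totals lines have only three fields
        name, seconds, calls, errors = parts
        try:
            rows[name] = (float(seconds), int(calls), int(errors))
        except ValueError:
            continue  # the header row is not numeric
    return rows


def share_of_total(rows, name):
    """Fraction of the summed seconds (over the parsed rows) spent in `name`."""
    total = sum(seconds for seconds, _, _ in rows.values())
    return rows[name][0] / total


if __name__ == "__main__":
    rows = parse_truss_summary(SAMPLE)
    seconds, calls, errors = rows["_umtx_op"]
    print(f"_umtx_op share of parsed time: {share_of_total(rows, '_umtx_op'):.4%}")
    print(f"mean seconds per _umtx_op call: {seconds / calls:.6f}")
    print(f"_umtx_op calls per error: {calls / errors:.0f}")
```

Feeding it the complete table instead of the three-row sample should only change the share marginally, since every other system call adds up to well under 20 seconds.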

I assume the 17 seconds in "read" are there because I'm feeding large amounts of uncompressed 8K video data to the encoder via a pipe. But _umtx_op consumed 26676 (CPU?) seconds in less than 10 minutes of runtime. I dug further and found lots and lots of those calls in the truss output. Here's a very tiny snippet:

Code:
101861: 0.011574437 _umtx_op(0x800cb3db8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101875: 0.001024685 _umtx_op(0x800d2cbb8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
102685: 0.249927998 _umtx_op(0x800cea650,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd79b9d68) ERR#60 'Operation timed out'
101905: 0.021449979 _umtx_op(0x800cea350,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
101884: 0.008779694 _umtx_op(0x824c3dd00,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
101865: 0.000615929 _umtx_op(0x800d2b7b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101895: 0.008443820 _umtx_op(0x800d527b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)

I guess that 'Operation timed out' thing might be part of the issue? But I'm unsure; it shows up every 50-100 or so system calls. According to the summary above, there are roughly 750,000 _umtx_op calls (out of about a million system calls in total) for less than 10 minutes of runtime...
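
To see which _umtx_op operations dominate and how often the timeouts really occur, the per-call trace can be tallied with a small script. The following is a minimal Python sketch, assuming every line looks like the snippet above (thread ID, time delta, call, result); tally_umtx is a name I made up, and the regular expression has only been checked against those seven lines.

```python
# Sketch only: tally_umtx is a hypothetical helper, not a truss feature.
import re
from collections import Counter

# Assumed layout: "<tid>: <delta> _umtx_op(<addr>,<OP>,<args>) <result>"
UMTX_RE = re.compile(
    r"^\s*(\d+):\s+(\d+\.\d+)\s+_umtx_op\(0x[0-9a-fA-F]+,(\w+),[^)]*\)\s+(.*)$"
)

SAMPLE_LINES = """\
101861: 0.011574437 _umtx_op(0x800cb3db8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101875: 0.001024685 _umtx_op(0x800d2cbb8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
102685: 0.249927998 _umtx_op(0x800cea650,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd79b9d68) ERR#60 'Operation timed out'
101905: 0.021449979 _umtx_op(0x800cea350,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
101884: 0.008779694 _umtx_op(0x824c3dd00,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
101865: 0.000615929 _umtx_op(0x800d2b7b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101895: 0.008443820 _umtx_op(0x800d527b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
""".splitlines()


def tally_umtx(lines):
    """Count calls, summed time deltas and ERR#60 timeouts per _umtx_op operation."""
    calls, seconds, timeouts = Counter(), Counter(), Counter()
    for line in lines:
        match = UMTX_RE.match(line)
        if match is None:
            continue  # not a _umtx_op line, or a layout this sketch does not know
        _tid, delta, op, result = match.groups()
        calls[op] += 1
        seconds[op] += float(delta)
        if result.startswith("ERR#60"):
            timeouts[op] += 1
    return calls, seconds, timeouts


if __name__ == "__main__":
    calls, seconds, timeouts = tally_umtx(SAMPLE_LINES)
    for op, count in calls.most_common():
        print(f"{op:28s} calls={count:3d} seconds={seconds[op]:.6f} timeouts={timeouts[op]}")
```

For a real run, the same function could be applied to a saved trace, for example `tally_umtx(open("trace.txt"))`, where trace.txt is a placeholder name for wherever the truss output was stored.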

This did not happen on FreeBSD 11.1-RELEASE before (using clang to compile x265). It also does not happen on modern Fedora 31 Linux (GCC 9) or on Microsoft Windows 10 1909 (MSVC++ 2017), same source code in every case. I tried several versions of clang as well as GCC on FreeBSD 12.1-RELEASE-p1 to see whether the compiler makes a difference, but it doesn't.

My x265 source code is modified to allow for higher resolutions, however. So, just to make sure, I also tried the vanilla source code (the much newer version 3.2.1+1 of x265), and the problem stays the same! As soon as I specify --pme, the encoder becomes really slow and gets stuck because the CPU is being eaten up by the kernel load.

I know one might say that this is a problem I should report to the x265 developers, but given that this appears to be FreeBSD-specific somehow, I thought I'd ask here first.

Does anybody have an idea about how I can narrow the problem down further, or what might be causing this behavior?

This isn't a critical thing for me, because I don't need that parameter for production, but there is a specific test I'd like to run, which relies on it. Using that, I'd love to make a comparison between Windows, Linux and FreeBSD in terms of performance, but that won't make much sense with the encoder being in this state on FreeBSD.

I'd be thankful for any ideas! :)
 