Solved net/samba413 encryption, security/gnutls on Aarch64

Hello. It seems security/gnutls, whether built from ports or installed with pkg, does not use the AES acceleration provided by the ARMv8 Cryptographic Extensions. This makes Samba server encryption extremely slow.

I've observed this on a FreeBSD 13-RELEASE virtual machine on a Mac M1 and also on a RockPro64 running 13-RELEASE.

All the examples below use the same software versions:

- FreeBSD 13-RELEASE
- security/gnutls 3.6.16
- security/nettle 3.7.3

The following output suggests the armv8crypto module is already compiled into the kernel:
Code:
kldload armv8crypto
kldload: can't load armv8crypto: module already loaded or in kernel

I can also see that OpenSSL from pkg benefits greatly from this, producing test results roughly 8 to 16 times faster than base OpenSSL:

With base OpenSSL on RockPro64:
Code:
for ALG in aes-128-ccm aes-128-gcm; do  
    openssl speed -evp ${ALG} -bytes 1500 2> /dev/null | grep "^${ALG}"
done

aes-128-gcm      15808.85k
aes-128-ccm      13117.70k

With OpenSSL from pkg on RockPro64:
Code:
for ALG in aes-128-ccm aes-128-gcm; do  
    /usr/local/bin/openssl speed -evp ${ALG} -bytes 1500 2> /dev/null | grep "^${ALG}"
done

aes-128-ccm     108811.86k
aes-128-gcm     246024.50k
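
To put a number on the difference, here is a small sketch that divides the pkg figures by the base figures from the two RockPro64 runs above. The helper name speedup is made up for illustration; the numbers are copied as reported by openssl speed (thousands of bytes per second).

```shell
#!/bin/sh
# speedup BASE NEW: print NEW/BASE rounded to one decimal place.
# "speedup" is a hypothetical helper name, not part of any tool.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", new / base }'
}

# Figures copied from the base and pkg OpenSSL runs on the RockPro64.
echo "aes-128-ccm: $(speedup 13117.70 108811.86)x"   # aes-128-ccm: 8.3x
echo "aes-128-gcm: $(speedup 15808.85 246024.50)x"   # aes-128-gcm: 15.6x
```

So on this board the pkg build is about 8x faster for CCM and about 16x faster for GCM.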

Here are some comparisons:

RockPro64:
Code:
gnutls-cli --benchmark-tls-ciphers
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  6.67 MB/sec
                   AES-128-GCM - TLS1.3  7.91 MB/sec
                   AES-128-CCM - TLS1.2  6.12 MB/sec
                   AES-128-CCM - TLS1.3  5.77 MB/sec
             CHACHA20-POLY1305 - TLS1.2  14.24 MB/sec
             CHACHA20-POLY1305 - TLS1.3  14.29 MB/sec
                   AES-128-CBC - TLS1.0  7.76 MB/sec
              CAMELLIA-128-CBC - TLS1.0  6.59 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  3.18 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  7.08 MB/sec
                   AES-128-GCM - TLS1.3  8.36 MB/sec
                   AES-128-CCM - TLS1.2  5.64 MB/sec
                   AES-128-CCM - TLS1.3  5.98 MB/sec
             CHACHA20-POLY1305 - TLS1.2  15.79 MB/sec
             CHACHA20-POLY1305 - TLS1.3  15.59 MB/sec
                   AES-128-CBC - TLS1.0  8.30 MB/sec
              CAMELLIA-128-CBC - TLS1.0  6.95 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  3.28 MB/sec

FreeBSD on M1:
Code:
gnutls-cli --benchmark-tls-ciphers
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  87.56 MB/sec
                   AES-128-GCM - TLS1.3  87.37 MB/sec
                   AES-128-CCM - TLS1.2  75.69 MB/sec
                   AES-128-CCM - TLS1.3  75.71 MB/sec
             CHACHA20-POLY1305 - TLS1.2  172.27 MB/sec
             CHACHA20-POLY1305 - TLS1.3  171.13 MB/sec
                   AES-128-CBC - TLS1.0  96.21 MB/sec
              CAMELLIA-128-CBC - TLS1.0  59.41 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  23.07 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  90.89 MB/sec
                   AES-128-GCM - TLS1.3  90.77 MB/sec
                   AES-128-CCM - TLS1.2  78.60 MB/sec
                   AES-128-CCM - TLS1.3  78.57 MB/sec
             CHACHA20-POLY1305 - TLS1.2  186.75 MB/sec
             CHACHA20-POLY1305 - TLS1.3  185.87 MB/sec
                   AES-128-CBC - TLS1.0  103.75 MB/sec
              CAMELLIA-128-CBC - TLS1.0  62.15 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  23.45 MB/sec

For comparison, here is the same command running on an old APU2C2 machine from 2012, which should be slower than the RockPro64:
Code:
gnutls-cli --benchmark-tls-ciphers
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  74.29 MB/sec
                   AES-128-GCM - TLS1.3  67.16 MB/sec
                   AES-128-CCM - TLS1.2  29.43 MB/sec
                   AES-128-CCM - TLS1.3  28.55 MB/sec
             CHACHA20-POLY1305 - TLS1.2  23.21 MB/sec
             CHACHA20-POLY1305 - TLS1.3  21.89 MB/sec
                   AES-128-CBC - TLS1.0  20.25 MB/sec
              CAMELLIA-128-CBC - TLS1.0  8.69 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  3.79 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  131.10 MB/sec
                   AES-128-GCM - TLS1.3  127.85 MB/sec
                   AES-128-CCM - TLS1.2  36.79 MB/sec
                   AES-128-CCM - TLS1.3  35.60 MB/sec
             CHACHA20-POLY1305 - TLS1.2  27.30 MB/sec
             CHACHA20-POLY1305 - TLS1.3  26.94 MB/sec
                   AES-128-CBC - TLS1.0  31.21 MB/sec
              CAMELLIA-128-CBC - TLS1.0  11.26 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  4.01 MB/sec


As for the M1, this is the same benchmark in a Linux VM running on the same M1 (with an older GnuTLS, 3.6.12):
Code:
gnutls-cli --benchmark-tls-ciphers
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  0.72 GB/sec
                   AES-128-GCM - TLS1.3  0.70 GB/sec
                   AES-128-CCM - TLS1.2  0.35 GB/sec
                   AES-128-CCM - TLS1.3  0.34 GB/sec
             CHACHA20-POLY1305 - TLS1.2  178.56 MB/sec
             CHACHA20-POLY1305 - TLS1.3  177.00 MB/sec
                   AES-128-CBC - TLS1.0  0.36 GB/sec
              CAMELLIA-128-CBC - TLS1.0  63.72 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  23.59 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  0.87 GB/sec
                   AES-128-GCM - TLS1.3  0.87 GB/sec
                   AES-128-CCM - TLS1.2  0.38 GB/sec
                   AES-128-CCM - TLS1.3  0.37 GB/sec
             CHACHA20-POLY1305 - TLS1.2  193.52 MB/sec
             CHACHA20-POLY1305 - TLS1.3  192.91 MB/sec
                   AES-128-CBC - TLS1.0  0.56 GB/sec
              CAMELLIA-128-CBC - TLS1.0  65.77 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  23.98 MB/sec
[root@alarm ~]# gnutls-cli --version
gnutls-cli 3.6.12

Base OpenSSL is equally slow. However, if I install OpenSSL from ports or pkg, operations are roughly 8 to 16 times faster in the tests above. There seems to be no option to enable something similar in GnuTLS or Nettle. Considering how fast an older version performs on Linux on the same hardware, it also doesn't look like it's related to the version. In any case, this makes securing net/samba413 on ARM-based SBC hardware a bit complicated.

I would appreciate any insights and discussion.
 
As far as I can tell, Nettle doesn't support AArch64 crypto acceleration in its current release. Support is in the master branch, but I haven't checked whether it gets detected correctly on FreeBSD.
Regarding benchmarks in general, there are at least two more issues you need to take into account. First, FreeBSD's kernel isn't big.LITTLE-aware (it treats all cores as equal), so you may end up running the benchmark on a little core, which will affect performance even with hardware crypto. Second, hardware crypto instructions are optional and the detection written for Linux doesn't work on FreeBSD, so some ports may end up relying on compiler flags. I would therefore recommend setting, for example, CPUTYPE?=cortex-a53+crc+crypto in /etc/make.conf if your target hardware supports it.

For example, this is what I get on my RockPro64 (13-STABLE).
I'm only using CPUTYPE?=cortex-a53, as I have a few devices that don't do crypto (which may not be detected at run time).

Code:
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  21.47 MB/sec
                   AES-128-GCM - TLS1.3  26.41 MB/sec
                   AES-128-CCM - TLS1.2  16.07 MB/sec
                   AES-128-CCM - TLS1.3  12.04 MB/sec
             CHACHA20-POLY1305 - TLS1.2  31.92 MB/sec
             CHACHA20-POLY1305 - TLS1.3  31.07 MB/sec
                   AES-128-CBC - TLS1.0  17.15 MB/sec
              CAMELLIA-128-CBC - TLS1.0  17.27 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  7.30 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  25.08 MB/sec
                   AES-128-GCM - TLS1.3  20.41 MB/sec
                   AES-128-CCM - TLS1.2  12.79 MB/sec
                   AES-128-CCM - TLS1.3  13.05 MB/sec
             CHACHA20-POLY1305 - TLS1.2  42.21 MB/sec
             CHACHA20-POLY1305 - TLS1.3  41.03 MB/sec
                   AES-128-CBC - TLS1.0  19.58 MB/sec
              CAMELLIA-128-CBC - TLS1.0  19.92 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  8.12 MB/sec
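
One way to take the big.LITTLE scheduling out of the picture is to pin the benchmark with cpuset(1). The sketch below assumes that CPUs 4-5 are the A72 pair on the RockPro64, which is worth checking against the boot messages first. The wrapper name run_pinned and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# run_pinned CPULIST COMMAND...: run COMMAND restricted to the CPUs in CPULIST.
# With DRY_RUN set, only print the cpuset(1) command line that would be run.
# (run_pinned and DRY_RUN are hypothetical names for this sketch.)
run_pinned() {
    cpus="$1"
    shift
    if [ -n "${DRY_RUN:-}" ]; then
        echo "cpuset -l ${cpus} $*"
    else
        cpuset -l "${cpus}" "$@"
    fi
}

# Example on the RockPro64 (assumes CPUs 4-5 are the A72 cores; not run here):
#   run_pinned 4-5 gnutls-cli --benchmark-tls-ciphers
```

Running the same benchmark once pinned to the little cores and once to the big ones should show how much of the run-to-run spread comes from scheduling.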
 
GnuTLS does CPU feature detection at runtime if the getauxval API is found. We don't have it on FreeBSD, but we do have an equivalent one,
elf_aux_info(3). It shouldn't be hard to add the missing bits.
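
Before blaming detection, it may help to confirm that the CPU advertises AES at all. The sketch below is not how GnuTLS detects features; it only greps the boot messages, and it assumes the FreeBSD arm64 kernel prints a line of the form "Instruction Set Attributes 0 = <...,AES+PMULL>". The function name has_aes_ext is made up for illustration.

```shell
#!/bin/sh
# has_aes_ext: read boot messages on stdin and succeed if the
# "Instruction Set Attributes 0" line mentions AES.
# Assumption: the arm64 kernel prints that line at boot.
has_aes_ext() {
    grep 'Instruction Set Attributes 0' | grep -q 'AES'
}

# Typical use on the target machine (not run here):
#   has_aes_ext < /var/run/dmesg.boot && echo "CPU advertises AES"
```

If this succeeds but the benchmark is still slow, the missing piece is most likely the run-time detection in the library rather than the hardware.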
 
Try
GNUTLS_CPUID_OVERRIDE=0x24 gnutls-cli --benchmark-tls-ciphers
on the Rockchip box.
I have an Amlogic S905X3 box running Linux, and the speed improvement there is about 6-7x (it works without the override, but you can use the override to run it software-only).
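
For anyone wondering what 0x24 stands for, here is a sketch that decodes the value into flag names. The bit positions are an assumption: they follow the ARM capability flags as I remember them from the GnuTLS aarch64 code (NEON bit 0, AES bit 2, SHA1 bit 3, SHA256 bit 4, PMULL bit 5), so check them against the source of your version. decode_cpuid is a made-up helper name.

```shell
#!/bin/sh
# decode_cpuid VALUE: print the names of the capability bits set in VALUE.
# Assumed bit layout (verify against your GnuTLS source):
#   bit 0 NEON, bit 2 AES, bit 3 SHA1, bit 4 SHA256, bit 5 PMULL
decode_cpuid() {
    value=$(( $1 ))
    out=""
    for pair in 0:NEON 2:AES 3:SHA1 4:SHA256 5:PMULL; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            out="${out}${out:+ }${name}"
        fi
    done
    echo "${out}"
}

decode_cpuid 0x24   # AES PMULL (under the assumed layout)
```

Under that layout, 0x24 turns on exactly the two features AES-GCM needs: the AES instructions and the polynomial multiply used for GHASH.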
 
I've rebuilt Nettle and GnuTLS with those flags, but it didn't make a difference. Right now I'm building a full image with those flags to check whether that changes anything.

I went from 6-7MB/s to 35-42MB/s! Now I wonder if I can make Samba use GNUTLS with those flags.


Code:
GNUTLS_CPUID_OVERRIDE=0x24 gnutls-cli --benchmark-tls-ciphers
Testing throughput in cipher/MAC combinations (payload: 1400 bytes)
                   AES-128-GCM - TLS1.2  35.37 MB/sec
                   AES-128-GCM - TLS1.3  26.27 MB/sec
                   AES-128-CCM - TLS1.2  19.80 MB/sec
                   AES-128-CCM - TLS1.3  20.49 MB/sec
             CHACHA20-POLY1305 - TLS1.2  14.79 MB/sec
             CHACHA20-POLY1305 - TLS1.3  14.51 MB/sec
                   AES-128-CBC - TLS1.0  19.59 MB/sec
              CAMELLIA-128-CBC - TLS1.0  7.50 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  3.28 MB/sec

Testing throughput in cipher/MAC combinations (payload: 16384 bytes)
                   AES-128-GCM - TLS1.2  42.84 MB/sec
                   AES-128-GCM - TLS1.3  43.46 MB/sec
                   AES-128-CCM - TLS1.2  25.65 MB/sec
                   AES-128-CCM - TLS1.3  34.18 MB/sec
             CHACHA20-POLY1305 - TLS1.2  16.37 MB/sec
             CHACHA20-POLY1305 - TLS1.3  16.18 MB/sec
                   AES-128-CBC - TLS1.0  25.10 MB/sec
              CAMELLIA-128-CBC - TLS1.0  7.14 MB/sec
           GOST28147-TC26Z-CNT - TLS1.2  3.31 MB/sec
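
To compare runs without eyeballing the whole table, a small filter can pull out one cipher's figures. This is only a sketch: rates_for is a made-up name, and it assumes the benchmark lines end in "<number> <unit>" as in the output above. It does not convert units, so MB/sec and GB/sec results have to be compared by hand.

```shell
#!/bin/sh
# rates_for LABEL: read `gnutls-cli --benchmark-tls-ciphers` output on stdin
# and print the throughput figure and unit from every line containing LABEL.
# (rates_for is a hypothetical helper name.)
rates_for() {
    awk -v label="$1" 'index($0, label) { print $(NF-1), $NF }'
}

# Typical use on the target machine (not run here; add 2>&1 if the
# benchmark output does not arrive on stdout):
#   gnutls-cli --benchmark-tls-ciphers | rates_for 'AES-128-GCM - TLS1.2'
#   GNUTLS_CPUID_OVERRIDE=0x24 gnutls-cli --benchmark-tls-ciphers | rates_for 'AES-128-GCM - TLS1.2'
```

Each invocation prints one line per payload size, so the two runs can be compared line by line.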

 
Try samba_server_env or samba_server_env_file in rc.conf.
I got a 30% boost from that. A macOS client reaches 10 MB/s on reads, while Windows 10 can go up to 14 MB/s. Also, using cpuset to restrict the jail to the A72 cores gives more consistent results.
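
For reference, a sketch of what the rc.conf entry could look like. It assumes the Samba rc script is named samba_server, as the variable names in the suggestion imply, and that the value to pass is the same override used in the benchmark.

```shell
# /etc/rc.conf fragment (sh syntax).
# Pass the GnuTLS override into the Samba daemons' environment through
# the ${name}_env variable. The value mirrors the benchmark run above.
samba_server_env="GNUTLS_CPUID_OVERRIDE=0x24"

# Equivalent one-liner using sysrc(8) (not run here):
#   sysrc samba_server_env="GNUTLS_CPUID_OVERRIDE=0x24"
```

The service has to be restarted afterwards for the new environment to take effect.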

Performance is still poor, so maybe the issue now goes beyond the crypto libraries. I will report back after I've rebuilt my FreeBSD image.

EDIT: information provided was enough for me. Marking as solved. Thank you.
 