Hi All,
Need some help and direction please.
I have a system with an Intel QAT 8955 installed. I am seeing roughly a 30% throughput increase when qat(4) and geli(8) are enabled, but nowhere near Intel's stated 5 gigabits per second (about 625 megabytes per second).
I have put together a simple test script as follows:
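For reference, the conversion from Intel's quoted figure depends on whether the prefixes are decimal or binary; a quick check (plain awk arithmetic, nothing system-specific):

```shell
# Intel's quoted 5 Gbit/s, converted to bytes per second.
# Decimal (SI) gigabits:
awk 'BEGIN { printf "5 Gbit/s  = %.0f MB/s\n", 5e9 / 8 / 1e6 }'
# Binary gibibits, which is where a ~671 MB/s figure would come from:
awk 'BEGIN { printf "5 Gibit/s = %.0f MB/s\n", 5 * 1024^3 / 8 / 1e6 }'
```

Either way, the target is more than an order of magnitude above what I am measuring.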
sh:
#!/bin/sh
# Benchmark every geli auth/cipher/key-length combination on nda0.
for a in HMAC/SHA1 HMAC/RIPEMD160 HMAC/SHA256 HMAC/SHA384 HMAC/SHA512; do
    for e in AES-XTS AES-CBC Camellia-CBC NULL; do
        for l in 128 256; do
            echo "Parameters: $a $e $l"
            dd if=/dev/random of=/root/nda0_test.key bs=128k count=1 > /dev/null 2>&1
            gpart create -s GPT nda0 > /dev/null
            gpart add -a 4096 -t freebsd-ufs -l nda0_test nda0 > /dev/null
            geli init -P -a "$a" -e "$e" -l "$l" -s 4096 -K /root/nda0_test.key -B /root/nda0_test.eli gpt/nda0_test > /dev/null
            geli attach -p -k /root/nda0_test.key gpt/nda0_test
            dd if=/dev/zero of=/dev/gpt/nda0_test.eli bs=10m count=25 status=progress 2>&1 | grep sec | awk '{print "write: " $7 " " $8}'
            dd if=/dev/gpt/nda0_test.eli of=/dev/null bs=10m count=25 status=progress 2>&1 | grep sec | awk '{print "read : " $7 " " $8}'
            geli kill gpt/nda0_test.eli > /dev/null
            gpart destroy -F nda0 > /dev/null
            echo
        done
    done
done
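Between runs it is worth confirming that geli is actually dispatching to the hardware at all. A rough diagnostic sketch (the exact dmesg strings are system-dependent; the sysctl and geli output fields are from geli(8)):

```shell
# Did the qat(4) driver attach to the 8955?
dmesg | grep -i qat
# geli list reports per-provider whether crypto is done in
# "hardware", "software", or "accelerated software":
geli list | grep -E 'Name|Crypto'
# geli worker threads per provider; a single CPU-bound thread
# would match one core maxed out at 100%:
sysctl kern.geom.eli.threads
```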
The results are as follows (all figures are bytes per second, as reported by dd):
Read:
| AESNI only | AES-XTS-128 | AES-XTS-256 | AES-CBC-128 | AES-CBC-256 | Camellia-CBC-128 | Camellia-CBC-256 | NULL-128 | NULL-256 |
| HMAC/SHA1 | 30,702,908 | 27,704,552 | 32,433,531 | 29,327,177 | 28,142,544 | 24,942,065 | 54,497,500 | 54,513,682 |
| HMAC/RIPEMD160 | 33,000,080 | 29,583,744 | 35,001,742 | 31,410,990 | 30,043,360 | 26,430,374 | 62,091,234 | 62,108,164 |
| HMAC/SHA256 | 27,655,169 | 25,204,265 | 29,096,825 | 26,582,631 | 25,629,302 | 22,968,065 | 45,561,232 | 45,550,977 |
| HMAC/SHA384 | 29,819,811 | 26,967,032 | 31,466,261 | 28,541,287 | 27,358,907 | 24,397,319 | 51,732,107 | 51,713,366 |
| HMAC/SHA512 | 28,361,483 | 25,713,693 | 29,757,040 | 27,130,714 | 26,178,791 | 23,392,581 | 48,074,248 | 48,076,202 |
| AESNI + QAT(8955: sym;asym) | AES-XTS-128 | AES-XTS-256 | AES-CBC-128 | AES-CBC-256 | Camellia-CBC-128 | Camellia-CBC-256 | NULL-128 | NULL-256 |
| HMAC/SHA1 | 36,461,684 | 36,244,038 | 35,709,570 | 35,257,384 | 28,148,318 | 24,955,089 | 54,493,776 | 54,488,583 |
| HMAC/RIPEMD160 | 32,979,281 | 29,567,102 | 35,002,866 | 31,424,410 | 30,029,677 | 26,433,818 | 62,093,092 | 62,100,242 |
| HMAC/SHA256 | 33,882,860 | 33,719,146 | 33,798,358 | 35,220,940 | 25,637,079 | 22,981,363 | 45,571,151 | 45,491,652 |
| HMAC/SHA384 | 30,826,526 | 30,204,090 | 30,421,097 | 30,692,806 | 27,361,179 | 24,387,271 | 51,740,498 | 51,729,234 |
| HMAC/SHA512 | 28,060,222 | 27,506,714 | 27,839,866 | 28,689,258 | 26,179,396 | 23,391,794 | 48,084,271 | 48,065,621 |
Write:
| AESNI | AES-XTS-128 | AES-XTS-256 | AES-CBC-128 | AES-CBC-256 | Camellia-CBC-128 | Camellia-CBC-256 | NULL-128 | NULL-256 |
| HMAC/SHA1 | 30,418,357 | 27,727,278 | 31,907,911 | 28,930,783 | 27,632,754 | 24,501,878 | 54,319,135 | 54,366,034 |
| HMAC/RIPEMD160 | 33,010,929 | 29,561,264 | 34,420,730 | 30,961,114 | 29,456,883 | 25,903,132 | 61,916,543 | 61,859,619 |
| HMAC/SHA256 | 27,678,885 | 25,216,331 | 28,674,284 | 26,247,519 | 25,218,913 | 22,613,495 | 45,544,125 | 45,497,415 |
| HMAC/SHA384 | 29,880,117 | 27,090,289 | 31,042,452 | 28,240,152 | 26,970,339 | 23,982,172 | 51,816,632 | 51,842,341 |
| HMAC/SHA512 | 28,487,979 | 25,838,583 | 29,483,993 | 26,918,142 | 25,832,218 | 23,065,068 | 48,369,646 | 48,301,360 |
| AESNI + QAT(8955: sym;asym) | AES-XTS-128 | AES-XTS-256 | AES-CBC-128 | AES-CBC-256 | Camellia-CBC-128 | Camellia-CBC-256 | NULL-128 | NULL-256 |
| HMAC/SHA1 | 35,207,868 | 35,656,579 | 35,061,065 | 27,653,261 | 24,506,907 | 24,506,907 | 54,438,692 | 54,385,295 |
| HMAC/RIPEMD160 | 33,029,922 | 29,603,298 | 34,425,763 | 30,990,814 | 29,482,257 | 25,919,554 | 62,052,646 | 61,944,810 |
| HMAC/SHA256 | 33,253,545 | 32,882,889 | 33,244,963 | 32,874,411 | 25,236,309 | 22,608,838 | 45,569,662 | 45,610,269 |
| HMAC/SHA384 | 29,828,802 | 29,552,863 | 29,834,282 | 29,653,120 | 26,997,699 | 24,005,477 | 52,001,647 | 51,941,500 |
| HMAC/SHA512 | 27,210,910 | 26,600,296 | 27,166,874 | 26,880,947 | 25,847,093 | 23,057,189 | 48,394,109 | 48,335,688 |
The maximum throughput I am seeing with the QAT card installed is about 35.6 megabytes per second (0.285 gigabits per second), or only about 5.7% of the card's rated capacity.
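That utilisation figure is just a back-of-envelope calculation in decimal units:

```shell
# 35.6 MB/s observed versus the rated 5 Gbit/s:
awk 'BEGIN { printf "%.3f Gbit/s observed, %.1f%% of rated\n",
             35.6e6 * 8 / 1e9, 35.6e6 * 8 / 5e9 * 100 }'
```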
The underlying disk is a PCIe NVMe drive, which tests without geli(8) at 1,082,423,799 bytes per second (about 1 gigabyte per second) write and 1,279,609,700 bytes per second (about 1.2 gigabytes per second) read. Even the no-encryption runs (NULL-128 or NULL-256) are faster than the runs with encryption offloaded to the card.
The throughput does seem to be limited by the CPU, which maxes out at 100% on a single core for geli(8) on the host machine (an older Intel(R) Xeon(R) CPU E5-2403 0 @ 1.80GHz).
If the crypto functions are being offloaded to the QAT card, why is the CPU still maxed out?
Any ideas or signposting please?