FreeBSD runs slowly on Raspberry Pi

maks · Jun 27, 2020

Hi community

I have tried FreeBSD 12.1 on my Pi3 and found that it is about two times slower than Raspbian Linux for the same tasks. The results are below.
P.S. I think this happens because FreeBSD does not switch the CPU to "Turbo" mode, and as a result it runs really slowly.

Code:

[user@freebsd4pi ~]$ uname -a
FreeBSD freebsd4pi.home 12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC  arm64

[user@freebsd4pi ~/_test]$ ./my_test
1048576+0 records in
1048576+0 records out
1073741824 bytes transferred in 137.195754 secs (7826349 bytes/sec)
Execution time: 137 seconds.

[user@freebsd4pi ~]$ python --version
Python 3.7.7

[user@generic ~/_bench_lang]$ time python ./bench.py
Python bench.
[3, 4, 1, 3, 5, 1, 92, 2, 4124, 424, 52, 12]
[1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124]

real    0m57.174s
user    0m57.135s
sys     0m0.033s

-------------------------------------------
pi@raspberrypi:~/_test $ uname -a
Linux raspberrypi 4.19.118-v7+ #1311 SMP Mon Apr 27 14:21:24 BST 2020 armv7l GNU/Linux

pi@raspberrypi:~/_test $ ./my_test
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 86.8891 s, 12.4 MB/s
Execution time: 87 seconds.

pi@raspberrypi:~/_bench_lang $ python --version
Python 2.7.16

pi@raspberrypi:~/_bench_lang $ time python ./bench.py
Python bench.
[3, 4, 1, 3, 5, 1, 92, 2, 4124, 424, 52, 12]
[1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124]

real    0m17.224s
user    0m17.070s
sys     0m0.040s

The "my_test" script looks like this:

Code:

#!/bin/sh
# record the start time
START=$(date +%s)
# start the compression run
dd bs=1k count=1M if=/dev/urandom | pigz -p 8 - > /dev/null

END=$(date +%s)
DIFF=$(( $END - $START ))
echo "Execution time: $DIFF seconds."

P.S. This is a simple benchmark that tests the CPU and memory at the same time.

George · Jun 27, 2020

What CPU frequency do you use?
Raspberry Pis have no fan, so the cpu is usually throttled down a lot.

ralphbsz · Jun 27, 2020

Sorry, this is not a simple benchmark. It is a very complex benchmark. Just looking at the dd statement: it uses the kernel's urandom device to generate random bytes, which involves lots of user/kernel transitions, then it buffers those up 1 KiB at a time and feeds them into pigz, which in turn runs a very complex algorithm (gzip compression) with 8 parallel threads, so it exercises multithreading. I think (but that's just intuition, not knowledge) that you are mostly CPU constrained in the kernel on random number generation, and then heavily exercising threading primitives, in the particular situation of having more active threads than cores.
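One way to test that intuition is to time the two stages separately instead of as one pipeline. The following is only a rough sketch, not a replacement for my_test: it assumes os.urandom is a fair stand-in for reading the kernel's random device, it uses single-threaded zlib where the script uses pigz -p 8, and the function name time_stages is made up for illustration.

Code:

```python
import os
import time
import zlib


def time_stages(total_bytes=4 * 1024 * 1024, chunk=1024):
    """Time random-byte generation and compression as two separate stages.

    Returns (seconds_random, seconds_compress, raw_size, compressed_size).
    """
    # Stage 1: ask the kernel for random bytes in small chunks, similar in
    # spirit to dd reading /dev/urandom with bs=1k.
    start = time.perf_counter()
    data = b"".join(os.urandom(chunk) for _ in range(total_bytes // chunk))
    seconds_random = time.perf_counter() - start

    # Stage 2: compress the same bytes. zlib is single-threaded here, unlike
    # pigz -p 8; that simplification is an assumption of this sketch.
    start = time.perf_counter()
    compressed = zlib.compress(data, 6)
    seconds_compress = time.perf_counter() - start

    return seconds_random, seconds_compress, len(data), len(compressed)


if __name__ == "__main__":
    t_rand, t_comp, raw, packed = time_stages()
    print("random bytes: %.3f s for %d bytes" % (t_rand, raw))
    print("compression:  %.3f s, %d -> %d bytes" % (t_comp, raw, packed))
```

If the first number dominates on one system but not the other, the difference is probably in the random number path rather than in compression or threading.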

For the Python benchmark, we can't see the source code, so I have no idea what it does. Most Python code tends to be CPU constrained in the interpreter, and single-threaded (due to the GIL). Interpreters tend to bottleneck on very specific operations; many decades ago I was trying to performance tune a Python interpreter that ran in Java (the Jython version), on the Symantec JVM, and that one tended to be completely limited by string operations, to be exact: allocating memory for strings, abandoning that memory, and then having to garbage collect all the time. I think at some point I counted that the average line of Python used several hundred string allocation operations, completely getting killed in the GC. So you might be running into just one specific operation being completely out of whack here.

And before SirDice shows up: The arm is a second-tier platform. Support for it is ... not first tier, to say it politely.

20-100-2fe · Jun 27, 2020

You are also comparing apples and oranges: FreeBSD here is an arm64 (64-bit) build, while Raspbian is armv7l (32-bit). Comparing with an armv8/arm64 build of Raspbian would be more relevant.

maks · Jun 27, 2020

Elazar said:
What CPU frequency do you use?
Raspberry Pis have no fan, so the cpu is usually throttled down a lot.

My Pi has a fan. Good question about the frequency, by the way.

Code:

[user@freebsd4pi /boot/msdos]$ sysctl -a | grep dev.cpu.
dev.cpu.0.temperature: 45.0C
dev.cpu.0.freq_levels: 1200/-1 600/-1
dev.cpu.0.freq: 600

It seems the frequency is 600 MHz, which means that this is not turbo. The question is: how can I change the frequency in FreeBSD on the Raspberry Pi?
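For anyone who wants to script this check, here is a small sketch that reads the freq_levels string shown above and reports whether the current frequency is the highest available one. It assumes the "MHz/power" pair format seen in that output; the helper names parse_freq_levels and is_at_max are made up for illustration and are not part of any FreeBSD tool.

Code:

```python
def parse_freq_levels(value):
    """Parse a dev.cpu.N.freq_levels string such as '1200/-1 600/-1'.

    Each entry is assumed to be 'MHz/power'. Returns the frequencies in MHz,
    highest first.
    """
    freqs = []
    for entry in value.split():
        mhz, _, _power = entry.partition("/")
        freqs.append(int(mhz))
    return sorted(freqs, reverse=True)


def is_at_max(current_mhz, levels):
    """Return True if current_mhz is at least the highest listed level."""
    return current_mhz >= parse_freq_levels(levels)[0]


if __name__ == "__main__":
    levels = "1200/-1 600/-1"  # copied from the sysctl output in this post
    print(parse_freq_levels(levels))  # [1200, 600]
    print(is_at_max(600, levels))     # False
```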

maks · Jun 27, 2020

ralphbsz said:
Sorry, this is not a simple benchmark.

The Python code is below. But it does not matter what the code is; what matters most is that the code was the same in both cases. It should show at least fairly similar times, like 17s on BSD and maybe 16s on Raspbian, not 57s vs 17s as we see above. Obviously FreeBSD did not use "turbo" mode on the CPU.

Code:

print("Python bench.")

array = [3,4,1,3,5,1,92,2,4124,424,52,12]
t = 0

print(array)

for c in range (0, 100000):
    for i in range (0, len(array)):
        for y in range (0, (len(array)-1)):
            if array[y+1] < array[y]:
                t = array[y]
                array[y] = array[y + 1]
                array[y + 1] = t

print(array)

# after test should be [ 1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124 ]

George · Jun 27, 2020

sysctl dev.cpu.0.freq=1200. Temperature should be sysctl dev.cpu.0.temperature.

You can also use powerd(8).

It is perfectly possible that one operating system is a bit faster, but it shouldn't be a 40-second difference.

ralphbsz · Jun 27, 2020

OK, that Python is pretty simple, it's just a bubble sort being run 100K times. If you are using the normal Python interpreter, this will all be precompiled to bytecode, and turned into relatively efficient list accesses. Given that the arrays here are pretty small, everything should fit in cache, so memory bandwidth should not be an issue. In that case, you should be CPU constrained on integer operations. As 20-100-2fe said, the word size on Raspbian will be 32 bits (easy to verify with a test program that prints sizeof() an int or a pointer), but the speed of cache to register operations should pretty much only depend on clock speed, since the CPI is very similar in 32- and 64-bit mode. So your bet that the clock speed is the source of all these problems seems correct.
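The sizeof() check mentioned above can also be done without a compiler. This is a minimal sketch using ctypes from the standard library; it reports the sizes used by the C ABI that the running Python interpreter was built for, which is assumed to match the userland of the system. The function name word_sizes is made up for illustration.

Code:

```python
import ctypes
import struct


def word_sizes():
    """Return the sizes in bytes of a few C types for this interpreter's ABI."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    for name, size in word_sizes().items():
        print("%-8s %d bytes (%d bits)" % (name, size, size * 8))
    # struct reports the same pointer size through its native "P" format.
    print("struct   %d bytes per pointer" % struct.calcsize("P"))
```

On a 32-bit userland the pointer line should read 4 bytes, and on a 64-bit one 8 bytes.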

maks · Jun 27, 2020

I figured it out already. By adding these lines to /etc/rc.conf I got 1200 MHz

Code:

powerd_enable="YES"
performance_cx_lowest="Cmax"
economy_cx_lowest="Cmax"

Okay. Let's check it again.

maks · Jun 27, 2020

Elazar said:
sysctl dev.cpu.0.freq=1200. Temperature should be sysctl dev.cpu.0.temperature.

Much better now. Thanks for the note about freq.

Code:

[user@freebsd4pi ~/_test]$ ./my_test
1048576+0 records in
1048576+0 records out
1073741824 bytes transferred in 74.301539 secs (14451138 bytes/sec)
Execution time: 75 seconds.

[user@freebsd4pi ~/_bench_lang]$ time python bench.py
Python bench.
[3, 4, 1, 3, 5, 1, 92, 2, 4124, 424, 52, 12]
[1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124]

real    0m28.538s
user    0m28.518s
sys     0m0.018s

The first test is now even faster than on Raspbian Linux, as it should be.

maks · Jun 27, 2020

Python 2.7 showed a better result, but it is still slower than the same version on Linux.

Code:

[user@freebsd4pi ~/_bench_lang]$ time python2.7 ./bench.py
Python bench.
[3, 4, 1, 3, 5, 1, 92, 2, 4124, 424, 52, 12]
[1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124]

real    0m19.772s
user    0m19.747s
sys     0m0.027s
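To put the numbers side by side, here is a small worked calculation using only the timings posted in this thread. It is just arithmetic, with two caveats: the clock speed on the Raspbian side was never posted, and the first Python comparison mixed 3.7.7 with 2.7.16, so only the Python 2.7 pair compares like with like. The function name speedup is made up for illustration.

Code:

```python
def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


# Wall-clock seconds copied from the outputs posted in this thread.
PY3_FREEBSD_600MHZ = 57.174
PY3_FREEBSD_1200MHZ = 28.538
PY27_FREEBSD_1200MHZ = 19.772
PY27_RASPBIAN = 17.224
PIGZ_FREEBSD_600MHZ = 137.195754
PIGZ_FREEBSD_1200MHZ = 74.301539

if __name__ == "__main__":
    print("clock ratio 1200/600:          %.2f" % (1200 / 600))
    print("Python 3 speedup on FreeBSD:   %.2f"
          % speedup(PY3_FREEBSD_600MHZ, PY3_FREEBSD_1200MHZ))
    print("dd | pigz speedup on FreeBSD:  %.2f"
          % speedup(PIGZ_FREEBSD_600MHZ, PIGZ_FREEBSD_1200MHZ))
    print("Python 2.7 FreeBSD / Raspbian: %.2f"
          % speedup(PY27_FREEBSD_1200MHZ, PY27_RASPBIAN))
```

The Python 3 run speeds up by almost exactly the clock ratio, which fits the idea that it is bound by CPU clock speed, while the pigz pipeline gains somewhat less than 2x.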

ralphbsz · Jun 27, 2020

The /dev/urandom test is a little insane as a generic benchmark. The kernel random number generator is not intended to create a gigabyte of random bytes. It is intended to be a high-quality random number generator, which is unguessable (because it is seeded with entropy), for purposes such as cryptographic communication. But you are using it for a huge quantity of random numbers, and speed at that consumption is not a design goal. If you need a billion random numbers, the generally accepted technique is to set up a random number generator of your own (that is in and of itself a huge and complex science), and if you need the entropy that the kernel provides, then just start the random number generator from /dev/urandom.

So I just wouldn't worry about that result.
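As a rough illustration of the "seed your own generator" approach, here is a minimal sketch in Python. It takes a few bytes of entropy from the kernel once and then produces bulk data from a userland generator. Note the assumption: random.Random is a Mersenne Twister, which is fine for benchmark input but not suitable for cryptographic use. The helper names make_seeded_rng and random_block are made up for illustration.

Code:

```python
import os
import random


def make_seeded_rng(seed_bytes=16):
    """Create a userland PRNG seeded once from the kernel's entropy source."""
    seed = int.from_bytes(os.urandom(seed_bytes), "big")
    return random.Random(seed)


def random_block(rng, n):
    """Return n pseudo-random bytes from rng without further kernel calls.

    n must be a positive integer.
    """
    return rng.getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    rng = make_seeded_rng()
    total = 0
    for _ in range(1024):            # 1024 blocks of 1 KiB = 1 MiB in total
        total += len(random_block(rng, 1024))
    print("generated %d bytes from one kernel seed" % total)
```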

The python test is a pretty good test of generic CPU speed, with some memory access thrown in. And the remaining difference between Linux (Raspbian) and FreeBSD could easily be explained by how they handle things like pointers and integers, with the 32- versus 64-bit difference.

Let me ask you a serious question: What are you really after? This sounds, as is common, like an XY problem: You are asking us about how to do one individual thing (benchmarking and adjusting CPU frequency), without telling us what your grand plan is. What do you want to use the Pi for? What are your requirements for it? How will you know when you are done performance tuning? Do these differences really matter?

maks · Jun 29, 2020

ralphbsz said:
So I just wouldn't worry about that result.

You may not be worried about the results I got with this benchmark method; that is your right.

ralphbsz said:
Do these differences really matter?

The answer is: yes, of course. Otherwise, why don't you buy an 8086 and work with it?

a6h · Jun 29, 2020

ralphbsz
Regardless of python, pie or chocolate trifle; is there any merit to the following code:

Code:

#include <stdio.h>
#include <time.h>
#include <sys/types.h>
#include <limits.h>
int
main()
{
  volatile unsigned long u = UINT_MAX;
  clock_t start = clock();
  while (--u > 0);
  clock_t stop = clock();
  printf("%f\n", (double)(stop - start) / CLOCKS_PER_SEC);
  return 0;
}

cc -O0 compute.c -o compute.out

Say, run it 8 times (an arbitrary number) to calculate an arithmetic mean of some sort:
for i in `seq 0 7`; do ./compute.out; done
or
for i in `seq 0 7`; do /usr/bin/time ./compute.out > /dev/null; done
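The shell loops above print eight separate timings but leave the averaging to the reader. Here is a hedged sketch of the same idea that also computes the mean. It measures wall-clock time including process startup, which is an assumption that differs from the clock() value the C program prints. The stand-in command and the function name time_runs are made up so the sketch runs anywhere; swap in ["./compute.out"] to time the real binary.

Code:

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=8):
    """Run cmd the given number of times and return wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in command; replace with ["./compute.out"] for the real test.
    cmd = [sys.executable, "-c", "pass"]
    samples = time_runs(cmd, runs=8)
    print("mean  %.4f s" % statistics.mean(samples))
    print("stdev %.4f s" % statistics.stdev(samples))
```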

maks said:
why don't you buy an 8086 and work with it

At a fair price and in good condition, why not! It would be fine playing with Turbo C 2.0, Turbo Pascal 7.0 (5.5 is free) and MASM 5.0.

ralphbsz · Jun 29, 2020

Vigole: What are you trying to accomplish with this code? Measure the performance of your computer? The only thing that it measures is one line of code: "while (--u > 0)". So all it does is run decrement, test, and branch instructions in a loop. The instructions will all be in L1 cache; I don't know enough about the architecture of modern chips, but those instructions might even be combined, and kept in microcode cache. The data is a single register. So you are not testing the memory interface at all, just CPU clock speed, and for an unknown number of instructions (you'd need to look at the assembly listing of the compiler to see what instructions it actually generates).

And now it gets worse. A compiler is free to optimize that line away, because (a) the result of the calculation is not used, and (b) the loop can be unrolled. And if the counter starts at the full 64-bit maximum, this loop will never finish unless it is optimized away: ULONG_MAX on a 64-bit machine is 2^64 - 1, which is roughly 2 * 10^19 (UINT_MAX itself is normally still 2^32 - 1 there), but there are only pi * 10^16 nanoseconds per year, so this code would take more than 500 years to run if we assume one loop iteration per nanosecond (roughly 3 instructions at 3GHz and a CPI of 1). On the other hand, with a 32-bit counter, it will take seconds.

So, let me ask the same question that I asked maks: what are you trying to accomplish? Benchmarking is not done in a vacuum, it has to have a purpose, otherwise it becomes pointless. Are you trying to measure clock frequency? There are easier and more accurate ways to do that. Are you trying to verify the microarchitecture of the CPU and the compiler? There are better ways to do that (read the assembly listing, and read Intel's architecture documentation for how the instructions actually work).

In reality, benchmarking needs to be tied to the intended use of the computer. For example: you might want to use it to perform an everyday task like "run accounts payable overnight", so the benchmark might be: How many hours does it take to run accounts payable, and then estimate whether it will fit between 5pm and 8am. Or you might need to run a simulation (like of the interaction of distant galaxies), you want to have at least 1000 iterations of the simulation to get sufficient statistical accuracy, so you need to know how many minutes it takes per iteration, since you only get to use the supercomputer for 2 days.

Which is why I keep asking: What is it that maks and you are after, doing with the computer? What are your performance needs, and your performance wants? How are the results of the measurements that you are doing connected to the task you want to accomplish? How are you going to know whether the performance is "good enough", or whether it is hopeless? How will you know that you have succeeded in benchmarking, and have found a (good or bad) answer? Are you intending to tune something, and do you have a plan for tuning it?
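The running-time estimate in this post is easy to reproduce. The sketch below assumes one loop iteration per nanosecond, as the post does, and compares a 32-bit starting value with a 64-bit one. For reference, on the usual ILP32 and LP64 C data models UINT_MAX is 2^32 - 1 in both cases, while ULONG_MAX is 2^64 - 1 only on LP64. The function name loop_seconds is made up for illustration.

Code:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7, roughly pi * 10^7


def loop_seconds(iterations, ns_per_iteration=1.0):
    """Return the running time in seconds for a given iteration count."""
    return iterations * ns_per_iteration * 1e-9


if __name__ == "__main__":
    start_32 = 2**32 - 1   # UINT_MAX on common ILP32 and LP64 systems
    start_64 = 2**64 - 1   # ULONG_MAX on common LP64 systems

    print("32-bit start: %.1f seconds" % loop_seconds(start_32))
    print("64-bit start: %.0f years"
          % (loop_seconds(start_64) / SECONDS_PER_YEAR))
```

At 600 MHz with an unoptimized loop the cost per iteration will be well above one nanosecond, so the real numbers scale up accordingly.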

a6h · Jun 29, 2020

ralphbsz said:
1. The instructions will all be in L1 cache
2. The data is a single register
3. you are not testing the memory interface at all, just CPU clock speed
4. A compiler is free to optimize that line away

That is the answer to my question. Thank you.

I didn't want to benchmark an imaginary system, or perform an A/B comparison test on different architectures/ISAs. There are lots of suggestions/snippets/scripts on the internet, jumping up and down here and there, ... similar to the one that I've posted, to test/compare the execution time of different programs on one system, or in some cases to compare one program on different arch/ISA. I'm not a benchmarker, I just wanted to make sure that this kind of test/result is going nowhere.

maks · Jul 5, 2020

vigole said:
ralphbsz
Regardless of python, pie or chocolate trifle; is there any merit to the following code:

Here is my C code, equivalent to the Python version.

Code:

#include <locale.h>
#include <stdio.h>

void sortarray( int array[], int size ) {

    int t = 0;

    for (int c = 0; c < 100000; c++) {
        for (int i = 0; i < size; i++) {
            for (int y = 0; y < size - 1; y++) {
                if (array[y + 1] < array[y]) {
                    t = array[y];
                    array[y] = array[y + 1];
                    array[y + 1] = t;
                }
            }
        }
    }

    // after test should be [ 1, 1, 2, 3, 3, 4, 5, 12, 52, 92, 424, 4124 ]

    printf ("[");
    for (int i = 0; i < size; i++) {
        printf ("%d, ", array[i]);
    }
    printf ("]");

    printf ("\n~end!");

} // end of function sortarray


int main (void) {

    setlocale(LC_ALL,"");

    int array[12] = { 3,4,1,3,5,1,92,2,4124,424,52,12 };

    sortarray(array, 12);
    return 0;

}

vigole said:
At a fair price and in good condition, why not! It would be fine playing with Turbo C 2.0, Turbo Pascal 7.0 (5.5 is free) and MASM 5.0.

I agree 100%. I really miss those times. And the sound of a modem establishing a connection was like music!

hoobastank69 · Jul 22, 2020

It being that slow isn't normal? It's like that for me too on my Haswell i5 laptop.
I am not trying to be funny

JonaEngel · Oct 28, 2022

hoobastank69 said:
It being that slow isn't normal? It's like that for me too on my Haswell i5 laptop.
I am not trying to be funny

Being that slow is not normal.
It is just easy to configure that little circuit to be that slow.

However, it should be enough to set the lowest performance mode frequency in /etc/rc.conf:

Code:

powerd_enable="YES"
performance_cx_lowest="Cmax"

In my experience, increasing the economy mode frequency will mainly help heat up your room.

maks said:
I figured it out already. By adding these lines to /etc/rc.conf I got 1200 MHz

Code:

powerd_enable="YES"
performance_cx_lowest="Cmax"
economy_cx_lowest="Cmax"

Okay. Let's check it again.