How is the UNIX Calling Convention faster than the Microsoft® convention?

I was reading through the System Calls section in the Developers' Handbook, and in chapter A.3.3 it is mentioned that the preferred calling convention is the UNIX calling convention, for a number of reasons (unless you need Linux compatibility). The thing that interested me a lot was that the UNIX calling convention is faster. What does that "faster" mean? Faster to write, faster to read, or faster at runtime? If the latter is the case, then I cannot see why. If anything, I would expect the UNIX calling convention to be slower!

Based on my understanding, the UNIX calling convention pushes values onto the stack, so it uses memory instead of registers. On top of that, the kernel will then have to read that memory and load the values into registers. The Microsoft® convention, on the other hand, puts the values directly into the registers, so it eliminates the need to both read and write memory.

So yeah, I'm kind of a newbie to calling conventions (and to assembly in general), so I would appreciate it if anyone could explain that.
 
The Handbook's theory is that when passing args in registers you need to save/restore them before/after the syscall, and this is bulkier and probably slower.
The slow/fast argument may have counted for MS-DOS, but for a syscall that does a user/kernel mode switch, a few cycles more or less for the call itself won't matter.
 
Most likely that's what's meant. I would recommend directing that question to the freebsd-hackers mailing list; you're more likely to get a deeper technical explanation there.
Thanks for the advice! I didn't want to make such a big deal about it, as I'm going to use the UNIX convention anyway and the compiler will take care of portability. I just posted out of curiosity.

I may post on the mailing list later, depending on how my day goes. Have a great day, my friend!
 
The Handbook's theory is that when passing args in registers you need to save/restore them before/after the syscall, and this is bulkier and probably slower.
The slow/fast argument may have counted for MS-DOS, but for a syscall that does a user/kernel mode switch, a few cycles more or less for the call itself won't matter.
Hmm... I see! It may indeed allow better optimizations, as the values will be right there on the stack, so if you need something, you grab it!
It would still be nice if a FreeBSD developer made a showcase example so we could see, though!
 
Yeah, that would make sense for the past, but it wouldn't be relevant today.
It depends. Typically not all registers are available all the time: some registers may have a specific use (floating-point registers), you may have instruction-only registers, data-only registers, etc.
Now toss in a potential context switch and data gets stored in memory at some point.
A syscall may wind up in a context switch (note I said may), which could lead to "push values directly into the registers, make the syscall, which could trigger a context switch, resulting in register-to-memory for the current process, then memory-to-register for the switched-in process".
Convoluted? Sure. It won't happen on every single syscall, but as a worst case that's the upper bound.

I think it can also depend on definitions: one could make an ABI that says "every syscall uses one register that points to a memory block containing all the parameters". Sure, you're pushing a single register, but you're also copying a memory block from userspace to kernel space (yes, it can be a simple remap rather than a true copy), and then all kernel work is indirect accesses.
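Just to illustrate that hypothetical design, here is a minimal userspace-only sketch; all the names are invented for illustration and this is not FreeBSD's actual interface. The caller hands over a single pointer to an argument block and the "kernel" side reads everything indirectly through it.
C++:
#include <cstddef>
#include <cstdio>

// Purely hypothetical argument-block ABI, invented for illustration.
// None of these names exist in FreeBSD; the kernel boundary is modelled
// here by an ordinary function call.
struct write_args {
    int fd;
    const void *buf;
    std::size_t len;
};

// Stand-in for the kernel side: it receives one pointer, and every
// parameter access is an indirect load through that pointer.
long fake_sys_write(const write_args *uap) {
    // a real kernel would first copy (or remap) *uap into kernel space
    return static_cast<long>(std::fwrite(uap->buf, 1, uap->len, stdout));
}

int main() {
    const char msg[] = "argument-block ABI sketch\n";
    write_args args{1, msg, sizeof(msg) - 1};
    // Userspace only passes a single register's worth of data:
    // the address of the block.
    return fake_sys_write(&args) == static_cast<long>(sizeof(msg) - 1) ? 0 : 1;
}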

Toss in "software always expands to use all hardware" and you need more registers because the compiler allocated all of them.

These are my opinions based on my understanding, so I could be far off base.
 
A calling convention is determined by who cleans up the stack after the function call.

x86 architecture uses the stack.
Non-integer parameters are passed on the stack, even when writing in pure Assembler.

Register variables are indeed MUCH faster to access than memory-based ones.
Stack variables are indexed types, where the Stack Segment is indexed by the Stack Pointer.
Students of Intel ASM understand there is a significant amount of clock cycles required for this setup.
Compiled languages require the preservation of most registers and flags, then restoration of the same when the function returns.
Simple function return values (integer, boolean) are returned in the xAX register, which is allowed to be destroyed.

Being a decades-long ASM coder, I learned how to use register variables as well as stuffing the NDP registers in lieu of memory stores.
I'm not familiar with UNIX compilers, but only with Borland/Embarcadero products, which allow for inline ASM statements.

Those working in compiled languages are encouraged to learn to employ a Profiler to optimize their code.
GUI environments are huge pigs, so saving one or two cycles is a wasted effort unless calling that routine a million times.
 
It depends. Typically not all registers are available all the time: some registers may have a specific use (floating-point registers), you may have instruction-only registers, data-only registers, etc.
Now toss in a potential context switch and data gets stored in memory at some point.
A syscall may wind up in a context switch (note I said may), which could lead to "push values directly into the registers, make the syscall, which could trigger a context switch, resulting in register-to-memory for the current process, then memory-to-register for the switched-in process".
Convoluted? Sure. It won't happen on every single syscall, but as a worst case that's the upper bound.

I think it can also depend on definitions: one could make an ABI that says "every syscall uses one register that points to a memory block containing all the parameters". Sure, you're pushing a single register, but you're also copying a memory block from userspace to kernel space (yes, it can be a simple remap rather than a true copy), and then all kernel work is indirect accesses.

Toss in "software always expands to use all hardware" and you need more registers because the compiler allocated all of them.

These are my opinions based on my understanding, so I could be far off base.
Thank you for adding your opinions and helping everyone else learn!
 
A calling convention is determined by who cleans up the stack after the function call.
Wait, that's the function calling convention and not the system call one, right? If I'm not mistaken, the function calling convention is the same across UNIX systems: the AMD64 System V ABI. But when it comes to the system call calling convention, FreeBSD (and I suppose other BSDs) uses the C/UNIX calling convention while Linux uses Microsoft's MS-DOS one!

x86 architecture uses the stack.
Non-integer parameters are passed on the stack, even when writing in pure Assembler.
I mean, at the hardware level, numbers don't have types, right? Everything is just a series of bytes, from my knowledge and understanding. In the same way, there aren't really unsigned and signed numbers; it's just how the instructions (and higher-level functions) treat them.
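As a quick illustration of that point, here is a tiny sketch in plain C++ (nothing FreeBSD-specific): the same bit pattern yields different values depending on whether the code treats it as signed or unsigned.
C++:
#include <cstdint>
#include <iostream>

int main() {
    // Same 8-bit pattern, two interpretations: the hardware just stores
    // the byte; the instruction (here, the C++ type) decides how to read it.
    std::uint8_t raw = 0xFF;
    std::cout << static_cast<unsigned>(raw) << '\n';                      // 255
    std::cout << static_cast<int>(static_cast<std::int8_t>(raw)) << '\n'; // -1
}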

Register variables are indeed MUCH faster to access than memory-based ones.
Stack variables are indexed types, where the Stack Segment is indexed by the Stack Pointer.
Students of Intel ASM understand there is a significant amount of clock cycles required for this setup.
Compiled languages require the preservation of most registers and flags, then restoration of the same when the function returns.
Simple function return values (integer, boolean) are returned in the xAX register, which is allowed to be destroyed.
That's interesting! I suppose that in a compiler this is the job of the backend, though the frontend can probably provide more info to create more optimization opportunities!

Being a decades-long ASM coder, I learned how to use register variables as well as stuffing the NDP registers in lieu of memory stores.
I'm not familiar with UNIX compilers, but only with Borland/Embarcadero products, which allow for inline ASM statements.
When you say "UNIX compilers", what exactly are you talking about?

Those working in compiled languages are encouraged to learn to employ a Profiler to optimize their code.
Me: sweating intensely

GUI environments are huge pigs, so saving one or two cycles is a wasted effort unless calling that routine a million times.
Yeah, it doesn't matter! Like I said, this post is both out of curiosity and a way for me to properly learn and understand how things work.
 
Speaking of optimization, check out this program (pulled from Stack Overflow).
It inits an array of 32k elements to random 8-bit numbers (0-255),
then sums the numbers in the array that are larger than 127.
It does this 50k times so it takes more time.
If you run the program with the argument "sort", the array is sorted before the summing operations.

Counter-intuitively, when the array is sorted the program runs about 2.5x faster.
The explanation is that the branch predictor will fail less often.
It's not necessary for the array to be sorted, just partitioned so the numbers < 128 are not mixed with the ones >= 128.
C++:
#include <algorithm>
#include <cstdlib>    // std::rand
#include <cstring>    // std::strcmp
#include <ctime>
#include <iostream>

int main(int argc, char **argv)
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    if (argc == 2 && !std::strcmp(argv[1], "sort"))
        std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 50000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {   // Primary loop.
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}
Now compile it with c++ a.cc -o acc (do not use -O2 or other optimizations) and run:
./acc
./acc sort
 
I was reading through the System Calls section in the Developers' Handbook, and in chapter A.3.3 it is mentioned that the preferred calling convention is the UNIX calling

The Handbook is rather dated. The example is for i386, which passes syscall arguments on the stack (whilst Linux i386 passes them in registers). amd64 passes syscall arguments in registers. In that case the interface is pretty much the amd64 System V ABI, except that the syscall number is passed in RAX.
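For the curious, here is a minimal sketch of that amd64 register convention, assuming FreeBSD/amd64 and a GCC/Clang-style compiler with extended inline asm:
C++:
#include <sys/syscall.h>   // SYS_write
#include <cstddef>

// Syscall number goes in RAX, arguments in RDI, RSI, RDX; the result comes
// back in RAX. The "syscall" instruction itself clobbers RCX and R11.
static long raw_write(int fd, const void *buf, std::size_t len) {
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)                    // RAX out: return value
                 : "a"(SYS_write),              // RAX in:  syscall number
                   "D"(static_cast<long>(fd)),  // RDI: 1st argument
                   "S"(buf),                    // RSI: 2nd argument
                   "d"(len)                     // RDX: 3rd argument
                 : "rcx", "r11", "memory");
    return ret;
}

int main() {
    const char msg[] = "hello from a raw syscall\n";
    raw_write(1, msg, sizeof(msg) - 1);
}
On FreeBSD an error is reported via the carry flag plus a positive errno value in RAX; the libc wrappers translate that into errno for you, which this sketch skips.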

There is one slight complication. FreeBSD has three "syscall" syscalls, which is rather confusing.

  1. Standard syscalls. This is what libc does for you: if you call "printf", under the hood libc will make the "write" syscall.
  2. syscall 0 (SYS_syscall). This is what the libc "syscall()" function does. Because 0 is in RAX, all the arguments get shifted by one (that means the target syscall number will be in RDI, the first syscall argument in RSI instead of RDI, etc.); see the usage sketch after this list. I've never really understood why libc doesn't just shuffle the args and make a standard syscall. Perhaps it's to make debugging and tracing easier.
  3. syscall 198 (SYS___syscall). This time it's libc's "__syscall()". Same as point 2, but it does extra argument alignment checking. On amd64, AFAIK, arguments are always correctly aligned, so this does nothing.
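Here is a small usage sketch of point 2, just calling libc's syscall() wrapper (nothing beyond <unistd.h> and <sys/syscall.h> is assumed):
C++:
#include <sys/syscall.h>   // SYS_write
#include <unistd.h>        // syscall()

int main() {
    const char msg[] = "hello via libc syscall()\n";
    // The first argument is the *target* syscall number. Because syscall()
    // itself is syscall 0, that number ends up in RDI and the real write()
    // arguments are shifted one register to the right, as described above.
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}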
 
Speaking of optimization, check out this program (pulled from Stack Overflow).
It inits an array of 32k elements to random 8-bit numbers (0-255),
then sums the numbers in the array that are larger than 127.
It does this 50k times so it takes more time.
If you run the program with the argument "sort", the array is sorted before the summing operations.

Counter-intuitively, when the array is sorted the program runs about 2.5x faster.
The explanation is that the branch predictor will fail less often.
It's not necessary for the array to be sorted, just partitioned so the numbers < 128 are not mixed with the ones >= 128.
Yeah, branch prediction indeed makes a HUGE difference! It sounds unreal, but it truly does. Thanks for sharing!
 
The Handbook is rather dated. The example is for i386, which passes syscall arguments on the stack (whilst Linux i386 passes them in registers). amd64 passes syscall arguments in registers. In that case the interface is pretty much the amd64 System V ABI, except that the syscall number is passed in RAX.
Oh! So it does it like Linux, right? Is the instruction to make the system call also "syscall", or is it still "int 80h"?

There is one slight complication. FreeBSD has three "syscall" syscalls, which is rather confusing.

  1. Standard syscalls. This is what libc does for you: if you call "printf", under the hood libc will make the "write" syscall.
And of course, it also does formatting when needed ;)

  2. syscall 0 (SYS_syscall). This is what the libc "syscall()" function does. Because 0 is in RAX, all the arguments get shifted by one (that means the target syscall number will be in RDI, the first syscall argument in RSI instead of RDI, etc.). I've never really understood why libc doesn't just shuffle the args and make a standard syscall. Perhaps it's to make debugging and tracing easier.
Yeah, sounds weird! It will also result in a slight performance hit. Of course, the difference is not noticeable, but still, it sits badly with me; it seems wrong. Even if it was for easier debugging and tracing, they could still make the system calls normally when compiling without debug symbols and with optimizations enabled.

  3. syscall 198 (SYS___syscall). This time it's libc's "__syscall()". Same as point 2, but it does extra argument alignment checking. On amd64, AFAIK, arguments are always correctly aligned, so this does nothing.
Sad that they are making things bloated. Unless there is a reason that we don't know about, however...
 
You are all also forgetting something that happened here. The MS-DOS calling convention passes the values in registers instead of on the stack. On Unix, the values are not passed in registers but through the cache: you can expect the first KB or so of the stack to reside in the L1 cache. Comparing the speed of the CPU core to main memory, this is like an original PC running not from RAM but from an MFM/RLL hard disk. Unix machines have usually had some cache since, well, almost forever. MS-DOS machines had not.
 
Also, the MS calling convention in Windows was, IIRC, absolutely not aligned. And they passed packed structs on the stack. This made NT for Alpha such a snail, as the compiler had to byte-read almost everything from memory. Or use a trap handler for that.
 
You are all also forgetting something that happened here. The MS-DOS calling convention passes the values in registers instead of on the stack. On Unix, the values are not passed in registers but through the cache: you can expect the first KB or so of the stack to reside in the L1 cache. Comparing the speed of the CPU core to main memory, this is like an original PC running not from RAM but from an MFM/RLL hard disk. Unix machines have usually had some cache since, well, almost forever. MS-DOS machines had not.
So it makes sense why the decisions were made that way: UNIX didn't think that using the registers would make much of a difference compared to using the L1 cache, and MS-DOS couldn't have done the same even if they had wanted to, because those machines didn't always have cache. But in the end, the Handbook is not only outdated but also wrong. Or, like it was said before, the UNIX way may give more chances for optimization, but that still isn't 100% true in every case.

Also, the MS calling convention in Windows was, IIRC, absolutely not aligned. And they passed packed structs on the stack. This made NT for Alpha such a snail, as the compiler had to byte-read almost everything from memory. Or use a trap handler for that.
Not having cache memory really seems so weird to me. I do believe it of course, but it seems weird nonetheless! Guess I'm very young...
 
When you say "UNIX compilers", what exactly are you talking about?
GCC (Gnu Compiler Collection, available as lang/gcc for FreeBSD)
LLVM (available as devel/llvm for FreeBSD)
Java (FreeBSD ports have a whole category dedicated to it)
Generally, lots of well-known (and some obscure ones, too!) compiled languages are available for FreeBSD and UNIX in general. An exception would be stuff that is specifically for the Microsoft Windows platform like Visual C#...
Me: sweating intensely
I personally find the idea of learning how to use a profiler interesting... there's prof, and gprof, for starters, and lots of languages/compiler packages include a profiler. It's not always easy to find the correct port for that stuff, granted. My understanding is that knowing how to use a profiler can help with optimizing the code.
 
Not having cache memory really seems so weird to me. I do believe it of course, but it seems weird nonetheless! Guess I'm very young...
MS-DOS was designed for the Intel 8088, technically a 16-bit CPU but with only an 8-bit-wide data bus, so the whole system built around it was more or less a typical 8-bit microcomputer. These never had any cache, and they wouldn't have profited from it much anyway: back then, the bottleneck in accessing RAM was not the RAM being much slower than the CPU, but the CPU calculating addresses and executing the bus cycles.

I'm not too familiar with the 8088 or the original IBM PC, but I am very familiar with the 6502, especially in the C64, which is an even simpler system (completely 8-bit, with a 16-bit-wide address bus) and is surprisingly more efficient at accessing RAM than the 8088 (but, on the other hand, it offers only a single "general purpose" register and two index registers). The 6502 has a nice and simple trick called the "zero page": many instructions have a special addressing mode with only 8-bit addresses, the upper 8 bits then being hardwired to 0. This speeds up memory accesses by one cycle, so you have 256 bytes of "faster" RAM available :cool:
 
The original PC used the 8088, a later, cost-reduced variant of the 8086 with an 8-bit data bus (both have a 20-bit address bus); the 8088 was also used in the IBM XT. External cache first appeared with the 386 and internal cache with the 486.
Atari ST TOS (a kind of DOS) passed args on the stack for syscalls, like a standard C call.
Early Windows versions had a Pascal calling convention for WinAPI exported functions because it reduced code size. In the Pascal cc the callee cleans the stack, not the caller as in C.
Pascal pushes args in order, C in reverse order, so Pascal can't easily support a variable number of arguments.
WinAPI may still use Pascal these days, but I'm not sure.
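A quick sketch of why that matters, in plain C++ with <cstdarg> (nothing platform-specific; sum_ints is a made-up helper): with the C convention only the caller knows how many arguments it pushed, so only the caller can clean them up, whereas a callee-cleans convention like Pascal would need the count baked into the callee.
C++:
#include <cstdarg>
#include <iostream>

// Made-up variadic helper: it walks however many int arguments the caller
// supplied; the caller is the one that laid them out and the one that
// removes them again after the call returns.
long sum_ints(int count, ...) {
    va_list ap;
    va_start(ap, count);
    long total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

int main() {
    // Two calls with different argument counts work fine, because the
    // caller adjusts the stack after each call; the callee never needs
    // to know how many arguments there were.
    std::cout << sum_ints(2, 10, 20) << '\n';      // 30
    std::cout << sum_ints(4, 1, 2, 3, 4) << '\n';  // 10
}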
 
I wrote the Xmodem protocol on an IBM PC using BASCOM (a BASIC compiler), then wrote the same on an Atari 800 (6502) using the BASIC cartridge.
The Atari had a faster file transfer rate over a null modem cable.
It beat the 8088 / BASCOM configuration quite handily.
LDA.. STA... LDA... STA...
 
GCC (Gnu Compiler Collection, available as lang/gcc for FreeBSD)
LLVM (available as devel/llvm for FreeBSD)
Java (FreeBSD ports have a whole category dedicated to it)
Generally, lots of well-known (and some obscure ones, too!) compiled languages are available for FreeBSD and UNIX in general. An exception would be stuff that is specifically for the Microsoft Windows platform like Visual C#...
Oh, I thought so! But the way they said "UNIX compilers", I didn't know whether they were referring to compilers (or backends, since very few compilers these days have their own backends) that create binaries for UNIX platforms, or compilers that create binaries using the UNIX calling convention, so I wanted clarification on that one.

I personally find the idea of learning how to use a profiler interesting... there's prof, and gprof, for starters, and lots of languages/compiler packages include a profiler. It's not always easy to find the correct port for that stuff, granted. My understanding is that knowing how to use a profiler can help with optimizing the code.
Haha, yeah, tbh that was a joke. Profilers are indeed important, either for making micro-optimizations or for finding the part that makes your code veeeery slow.
 
MS-DOS was designed for the Intel 8088, technically a 16-bit CPU but with only an 8-bit-wide data bus, so the whole system built around it was more or less a typical 8-bit microcomputer. These never had any cache, and they wouldn't have profited from it much anyway: back then, the bottleneck in accessing RAM was not the RAM being much slower than the CPU, but the CPU calculating addresses and executing the bus cycles.
Thanks for sharing! The more I learn, the more I realize that I am indeed too young. And yeah, I remember hearing about that last part! It's Moore's law, as shown in diagrams like this one! These days, it's the exact opposite of what it was back then: cache is highly valuable today.

I'm not too familiar with the 8088 or the original IBM PC, but I am very familiar with the 6502, especially in the C64, which is an even simpler system (completely 8-bit, with a 16-bit-wide address bus) and is surprisingly more efficient at accessing RAM than the 8088 (but, on the other hand, it offers only a single "general purpose" register and two index registers). The 6502 has a nice and simple trick called the "zero page": many instructions have a special addressing mode with only 8-bit addresses, the upper 8 bits then being hardwired to 0. This speeds up memory accesses by one cycle, so you have 256 bytes of "faster" RAM available :cool:
Given that the slower part was the CPU accessing the RAM, it makes sense that they would design it this way. But at the same time, hardware design is much more complicated than software design, so these people knew better!
 
The original PC used the 8088, a later, cost-reduced variant of the 8086 with an 8-bit data bus (both have a 20-bit address bus); the 8088 was also used in the IBM XT. External cache first appeared with the 386 and internal cache with the 486.
Wait, so "8088" was released after "8086" and it had a smaller data bus and address bit? What's the point? Was it much cheaper or something? Also, what was the case for "external" cache? Does an external cache even counts for the definition of "cache"?

Atari ST TOS (a kind of DOS) passed args on the stack for syscalls, like a standard C call.
Early Windows versions had a Pascal calling convention for WinAPI exported functions because it reduced code size.
It's interesting how important code size used to be; nowadays you see compilers creating bloated binaries...

In the Pascal cc the callee cleans the stack, not the caller as in C.
Pascal pushes args in order, C in reverse order, so Pascal can't easily support a variable number of arguments.
WinAPI may still use Pascal these days, but I'm not sure.
Other than not being able to easily support a variable number of arguments, were there any other problems with the Pascal calling convention? Because modern languages don't even use C's varargs and use templates instead. So if that is the only problem, it won't affect modern languages.
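For what it's worth, here is a tiny sketch of the "templates instead of varargs" point, using a C++17 fold expression: each call gets its own fully typed instantiation, so no runtime argument-count mechanism is involved at all.
C++:
#include <iostream>

// A variadic template: the compiler generates a separate, type-checked
// function for every combination of argument types and counts used below.
template <typename... Ts>
auto sum(Ts... xs) {
    return (xs + ... + 0);   // C++17 fold expression
}

int main() {
    std::cout << sum(1, 2, 3) << '\n';    // instantiates sum<int, int, int>
    std::cout << sum(1.5, 2.5) << '\n';   // instantiates sum<double, double>
}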
 