Solved syscall wrappers

What is the point of the prolog in syscall wrappers? For instance, here is msync from FreeBSD 12.4 amd64, in Intel syntax:

Code:
0000000000092b20 <msync>:
   92b20:       55                      push   rbp
   92b21:       48 89 e5                mov    rbp,rsp
   92b24:       48 8b 05 95 52 13 00    mov    rax,QWORD PTR [rip+0x135295]        # 1c7dc0 <svc_maxfd+0x78>
   92b2b:       5d                      pop    rbp
   92b2c:       ff e0                   jmp    rax
   92b2e:       cc                      int3 
   92b2f:       cc                      int3

In C that's just making a function pointer call via __libc_interposing. Step by step:
  1. Push the base pointer
  2. Copy the stack pointer to the base pointer
  3. Load the address of _msync from a rip-relative location into rax (objdump doesn't seem to be able to figure that out)
  4. Pop the base pointer again
  5. Jump to _msync, which looks like this in gdb (AT&T syntax this time)
Code:
0x800397440 <_msync>    mov    $0x41,%eax
0x800397445 <_msync+5>  mov    %rcx,%r10
0x800397448 <_msync+8>  syscall
0x80039744a <_msync+10> jb     0x8004055b4 <.cerror>
0x800397450 <_msync+16> ret

The tail-call optimization (TCO) is clear enough: the wrapper ends in a jmp to _msync rather than a call followed by a ret.
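
For what it's worth, here is roughly what that function pointer call looks like in C. This is only a sketch: interpose_table, SLOT_MSYNC, stub_msync and wrapper_msync are made-up names standing in for __libc_interposing and friends, not the actual libc source, and the stub just echoes its flags argument instead of entering the kernel.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the real libc names or table layout. */
typedef int (*generic_fn)(void);               /* generic slot type */
typedef int (*msync_fn)(void *, size_t, int);  /* real signature of the slot */

enum { SLOT_MSYNC = 0, SLOT_MAX };

/* Stand-in for the system call stub (_msync in the disassembly). */
static int stub_msync(void *addr, size_t len, int flags)
{
    (void)addr;
    (void)len;
    return flags;   /* echo flags so the call path is observable */
}

/* Table of implementations, indexed by slot. */
static generic_fn interpose_table[SLOT_MAX] = {
    [SLOT_MSYNC] = (generic_fn)stub_msync,
};

/* The wrapper: load the pointer from the table, cast it back, call it. */
int wrapper_msync(void *addr, size_t len, int flags)
{
    return ((msync_fn)interpose_table[SLOT_MSYNC])(addr, len, flags);
}
```

Calling wrapper_msync(NULL, 0, 7) goes through the table and returns 7 from the stub. Because the wrapper returns the callee's result unchanged, the compiler is free to turn the call into a plain jmp.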

What is the use of pushing and popping rbp? Is this always done with -fno-omit-frame-pointer?

The only side effect that I see is that the contents of rbp will still be just below the stack pointer.
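
For context, this is what the prolog normally buys you while the frame stays live: rbp points at the saved rbp of the caller, with the return address one slot above it, so a debugger or profiler can walk the chain. A small sketch using the GCC/clang builtins __builtin_frame_address and __builtin_return_address; the function name is made up and the layout assumption is the usual amd64 one.

```c
/* Hypothetical helper; relies on GCC/clang builtins and the usual
   amd64 frame layout (saved rbp at [rbp], return address at [rbp+8]). */
__attribute__((noinline))
int frame_links_to_return_address(void)
{
    /* Address of this function's frame, i.e. the value of rbp after
       the push rbp / mov rbp,rsp prolog. */
    void **fp = (void **)__builtin_frame_address(0);

    /* One slot above the saved rbp sits the return address. */
    return fp[1] == __builtin_return_address(0);
}
```

In the msync wrapper the frame is torn down again before the jmp, so none of this is available to the callee, which is what makes the push/pop pair look pointless.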
 
OK, to answer my own question with an example, here is some made-up code:

Code:
extern int(*pt[])(int, int);

int tc1(int a, int b)
{
   return pt[1](a, b);
}
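
To actually link and run this, pt needs a definition somewhere. A minimal sketch, assuming two throwaway functions add and sub as table entries (both made up for illustration); with pt defined in the same file the generated code may differ slightly from the listings in this post.

```c
/* Hypothetical table entries, only here so tc1 has something to call. */
static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }

/* The snippet above declares pt as extern; this supplies a definition. */
int (*pt[])(int, int) = { add, sub };

int tc1(int a, int b)
{
    /* Sibling call: same argument list, result returned unchanged,
       so the optimizer may emit a jmp instead of call + ret. */
    return pt[1](a, b);
}
```

With this table tc1(5, 3) calls sub and returns 2.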

If I compile that without optimization that disassembles to

Code:
0000000000000000 <tc1>:
   0: 55                         push   rbp
   1: 48 89 e5                   mov    rbp, rsp
   4: 48 83 ec 10                sub    rsp, 0x10
   8: 89 7d fc                   mov    dword ptr [rbp - 0x4], edi
   b: 89 75 f8                   mov    dword ptr [rbp - 0x8], esi
   e: 48 8b 04 25 00 00 00 00    mov    rax, qword ptr [0x0]
  16: 8b 7d fc                   mov    edi, dword ptr [rbp - 0x4]
  19: 8b 75 f8                   mov    esi, dword ptr [rbp - 0x8]
  1c: ff d0                      call   rax
  1e: 48 83 c4 10                add    rsp, 0x10
  22: 5d                         pop    rbp
  23: c3                         ret

  1. prolog
  2. make space for temporaries
  3. store args in temporaries
  4. load the call address from pt[1] into rax (an absolute address in this build, not rip-relative)
  5. put the exact same values back from the temporaries to the registers where they came from
  6. call function
  7. reset stack pointer
  8. epilog
  9. return
If I compile with -O3 I get

Code:
0000000000000000 <tc1>:
   0: 55                         push   rbp
   1: 48 89 e5                   mov    rbp, rsp
   4: 48 8b 05 00 00 00 00       mov    rax, qword ptr [rip]    # 0xb <tc1+0xb>
   b: 5d                         pop    rbp
   c: ff e0                      jmp    rax
That's the same as the syscall wrapper. So clang keeps the frame pointer even with -O3. GCC doesn't do that.

If I compile with clang -O3 -fomit-frame-pointer then I get

Code:
0000000000000000 <tc1>:
   0: 48 8b 05 00 00 00 00       mov    rax, qword ptr [rip]    # 0x7 <tc1+0x7>
   7: ff e0                      jmp    rax

That's almost the same as GCC, which doesn't even bother with rax and does an indirect rip-relative jump.

For completeness, if I want to turn off the TCO then compiling with
-O3 -fomit-frame-pointer -fno-optimize-sibling-calls
results in

Code:
0000000000000000 <tc1>:
   0: 50                         push   rax
   1: ff 15 00 00 00 00          call   qword ptr [rip]         # 0x7 <tc1+0x7>
   7: 59                         pop    rcx
   8: c3                         ret

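Pushing rax and popping rcx looks a bit mysterious at first. The values themselves don't matter: the push is just a one-byte way of moving rsp down by 8 so that the stack is 16-byte aligned at the call, and the pop goes into rcx because rax holds the return value by then.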

So in summary, what surprised me is that clang -O3 here doesn't imply -fomit-frame-pointer the way GCC's -O3 does.
 