FreeBSD libc - memcpy()/memmove() SSE2 support on AMD64

Hi!

Prologue:
I started writing my own implementation of libc for educational purposes. Some parts I have implemented directly in assembly; there is a C version as well. In my variant of memcpy() I use the SSE2 instruction movdqa for 128-bit copies, where possible and available.

Extract from my current code:
Code:
#ifdef __SSE2__
    /* Check if we can copy double quad words (128 Bit) */
    movq %rdx, %rax
    movq $0, %rdx
    movq $16, %rcx
    div %rcx
    movq %rax, %rcx
    cmpq $0, %rcx
    je 2f
    /* Copy double quad words (128 Bit) with use of SSE2 */
1:
    movdqa (%rsi), %xmm1
    movdqa %xmm1, (%rbx)
    addq $16, %rsi
    addq $16, %rbx
    decq %rcx
    jnz 1b
#endif
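For reference, the same 16-byte block loop can be sketched in C with SSE2 intrinsics (my sketch, not the assembly above; it assumes, as movdqa requires, that both pointers are 16-byte aligned and the length is a multiple of 16):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes in 16-byte blocks. Both pointers must be 16-byte
 * aligned and n a multiple of 16, matching movdqa's contract. */
static void copy_dqwords(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]); /* aligned 128-bit load (movdqa) */
        _mm_store_si128(&d[i], v);         /* aligned 128-bit store (movdqa) */
    }
}
```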

On my (very old) system it is really about twice as fast as copying 64-bit values. I have looked at the current FreeBSD libc code and was wondering why there is no SSE2 support.
I think the current code in FreeBSD's libc could be faster too...

Extract from FreeBSD memmove()/memcpy() source...

(The 256-bit copy part.)
Code:
103200:
    movq    (%rsi),%rdx
    movq    %rdx,(%rdi)
    movq    8(%rsi),%rdx
    movq    %rdx,8(%rdi)
    movq    16(%rsi),%rdx
    movq    %rdx,16(%rdi)
    movq    24(%rsi),%rdx
    movq    %rdx,24(%rdi)

...could be changed to:
Code:
103200:
    movdqa    (%rsi),%xmm1
    movdqa    %xmm1,(%rdi)
    movdqa    16(%rsi),%xmm1
    movdqa    %xmm1,16(%rdi)

...if SSE2 is available.

Are there any plans to add at least SSE2 support to speed up memcpy()/memmove()? AFAIK, all AMD64 platforms support SSE2.
 
Okay, thanks for the detailed information.

See also simd(7). Some of these improvements are in place in 14.1.
As far as I have seen, memmove/memcpy/memset are currently scalar only.

I have now tried to modify the current FreeBSD libc source for memmove/memcpy, but ran into an issue: I sometimes get a bus fault. I think something is wrong with the 16-byte alignment. I simply changed the code as posted here, and it failed.
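A likely explanation (my assumption): movdqa faults when either address is not 16-byte aligned, and memcpy()'s arguments carry no alignment guarantee at all. The unaligned variants (movdqu, exposed as _mm_loadu_si128/_mm_storeu_si128) accept any address, at some cost on older CPUs. A minimal C sketch:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* 16-byte block copy that tolerates unaligned pointers, unlike movdqa.
 * Copies the largest multiple of 16 bytes <= n; a real memcpy would
 * handle the tail separately. */
static void copy_dqwords_unaligned(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i)); /* movdqu load */
        _mm_storeu_si128((__m128i *)(d + i), v);               /* movdqu store */
    }
}
```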

I have also tried a simple benchmark using the x86 TSC.

The results (a lower number is faster), with my implementation of memcpy() for comparison:
Code:
FreeBSD memcpy():
Min: 4123
Max: 11761
Avg: 4590
My memcpy():
Min: 2118
Max: 3819
Avg: 2380

These results are on FreeBSD 14.1 running on my 15 year old Intel Xeon X5460.

My used benchmark (maybe not ideal):
bench.c:
C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern unsigned long get_tsc(void);
extern void *my_memcpy(void *, const void *, size_t);

int compare(const void* a, const void* b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    /* Subtracting unsigned longs and truncating to int can give wrong
     * results for large values; compare explicitly instead. */
    return (x > y) - (x < y);
}

void print_result(unsigned long result[])
{
    int i;
    unsigned long avg = 0;
    
    qsort(result, 20, sizeof(unsigned long), compare);
    printf("Min: %lu\n", result[0]);
    printf("Max: %lu\n", result[19]);
    
    for (i = 0; i < 20; i++) {
        avg += result[i];
    }
    
    printf("Avg: %lu\n", avg / 20);
}

int main(int argc, char *argv[])
{
    _Alignas(16) char buf1[16384]; /* char, not char *; 16-byte aligned for movdqa */
    _Alignas(16) char buf2[16384];
    unsigned long result[20];
    int i;
    unsigned long tsc_start, tsc_end;
    
    memset(buf1, 0xFF, 16384);
    memset(buf2, 0x00, 16384);
    printf("FreeBSD memcpy():\n");
    
    for (i = 0; i < 20; i++) {
        tsc_start = get_tsc();
        memcpy(buf2, buf1, 16384);
        tsc_end = get_tsc();
        result[i] = tsc_end - tsc_start;
    }
    
    print_result(result);
    
    memset(buf1, 0xFF, 16384);
    memset(buf2, 0x00, 16384);
    printf("My memcpy():\n");
    
    for (i = 0; i < 20; i++) {
        tsc_start = get_tsc();
        my_memcpy(buf2, buf1, 16384);
        tsc_end = get_tsc();
        result[i] = tsc_end - tsc_start;
    }
    
    print_result(result);
    
    return 0;
}

cycle.S (Read TSC):
Code:
.text

.global get_tsc
get_tsc:
    pushq %rbp
    movq %rsp, %rbp
    rdtsc               /* TSC low 32 bits in EAX, high 32 bits in EDX */
    shl $32, %rdx
    or %rdx, %rax       /* combine into a 64-bit value in RAX */
    popq %rbp
    ret

.end
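One caveat with this timer (my assumption, not something I have measured): rdtsc is not a serializing instruction, so the CPU may reorder it relative to the copy being measured. A fenced variant using compiler intrinsics could look like:

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* Read the TSC with fences so the measured region cannot be
 * reordered around the timestamp reads. */
static uint64_t tsc_begin(void)
{
    _mm_lfence();            /* wait for earlier instructions to retire */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* keep measured code after the read */
    return t;
}

static uint64_t tsc_end(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux); /* waits for prior instructions itself */
    _mm_lfence();
    return t;
}
```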

memcpy.S (My current implementation of memcpy from my own libc):
Code:
.text

.global my_memcpy
my_memcpy:
    pushq %rbp
    movq %rsp, %rbp
    pushq %rbx
    pushq %rsi
    movq %rdi, %rbx
    movq %rdx, %rcx
    cmpq $0, %rcx
    je 8f
    /* Check if we can copy 256 Bit values */
    movq %rdx, %rax
    movq $0, %rdx
    movq $32, %rcx
    div %rcx
    movq %rax, %rcx
    cmpq $0, %rcx
    je 2f
    /* Copy 256 Bit values with use of SSE2
     * (note: movdqa faults unless both addresses are 16-byte aligned) */
1:
    movdqa (%rsi), %xmm1
    movdqa %xmm1, (%rbx)
    movdqa 16(%rsi), %xmm1
    movdqa %xmm1, 16(%rbx)
    addq $32, %rsi
    addq $32, %rbx
    decq %rcx
    jnz 1b
2:
    /* Check if we can copy quad words (64 Bit) */
    movq %rdx, %rax
    movq $0, %rdx
    movq $8, %rcx
    div %rcx
    movq %rax, %rcx
    cmpq $0, %rcx
    je 4f
    /* Copy quad words (64 Bit) */
3:
    movq (%rsi), %rax
    movq %rax, (%rbx)
    addq $8, %rsi
    addq $8, %rbx
    decq %rcx
    jnz 3b
4:
    /* Check if we can copy (remaining) double words (32 Bit) */
    movq %rdx, %rax
    movq $0, %rdx
    movq $4, %rcx
    div %rcx
    movq %rax, %rcx
    cmpq $0, %rcx
    je 6f
5:
    /* Copy (remaining) double words (32 Bit) */
    movl (%rsi), %eax
    movl %eax, (%rbx)
    addq $4, %rsi
    addq $4, %rbx
    decq %rcx
    jnz 5b
6:
    movq %rdx, %rcx
    cmpq $0, %rcx
    je 8f
7:
    /* Copy remaining bytes */
    movb (%rsi), %al
    movb %al, (%rbx)
    incq %rsi
    incq %rbx
    decq %rcx
    jnz 7b
8:
    popq %rsi
    movq %rdi, %rax
    popq %rbx
    popq %rbp
    ret

.end
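A side note on the bookkeeping above: all three divisors (32, 8, 4) are powers of two, so the comparatively slow div instructions can be replaced by shifts and masks. In C terms, the split could be sketched as:

```c
#include <stddef.h>

/* Split a byte count into 32-byte blocks, 8-byte words, 4-byte words
 * and leftover bytes using shifts/masks instead of div, since every
 * divisor is a power of two. */
struct copy_plan {
    size_t blocks32, words8, words4, bytes;
};

static struct copy_plan plan_copy(size_t n)
{
    struct copy_plan p;

    p.blocks32 = n >> 5; /* n / 32 */
    n &= 31;             /* n % 32 */
    p.words8 = n >> 3;   /* remainder / 8 */
    n &= 7;
    p.words4 = n >> 2;   /* remainder / 4 */
    p.bytes = n & 3;     /* final leftover bytes */
    return p;
}
```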

Seriously, is it worth it if I successfully add SSE2 support to FreeBSD's libc memcpy and release a patch? If not, I'll leave it at that.
 
I would expect that patches you come up with now are also coming out of the Foundation's effort.

And as you see, it is not trivial to get all the edge cases right.
 
Seriously, is it worth it if I successfully add SSE2 support to FreeBSD's libc memcpy and release a patch?
Yes! FreeBSD is a community project, and patches are always welcome.

Just be prepared that it may not immediately be accepted.

Stability trumps performance, so things will need to be rock solid. Also, reach out to the developer(s) who have been doing the other SIMD work to see if there are any gotchas / best practices they can suggest, etc. The work for the runtime selection of code paths should be applicable here, too, for example.
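For illustration, runtime selection of code paths can be sketched with the GCC/Clang feature builtins (hypothetical function names; FreeBSD's libc actually uses ifunc resolvers rather than this):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; both simply delegate
 * to libc memcpy here. Real implementations would differ. */
static void *memcpy_scalar(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *memcpy_sse2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Pick an implementation once at startup based on CPU features. */
static memcpy_fn select_memcpy(void)
{
    __builtin_cpu_init(); /* required before __builtin_cpu_supports */
    if (__builtin_cpu_supports("sse2"))
        return memcpy_sse2;
    return memcpy_scalar;
}
```

The selected pointer would then be stored once and called for every subsequent copy, avoiding a feature check per call.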
 