Hi!
Prologue:
I have started writing my own implementation of libc for educational purposes. Some parts are implemented directly in assembly; there is also a C version. In my variant of memcpy() I use the SSE2 instruction movdqa to copy 128 bits at a time, where it is applicable and available.
Extract from my current code:
C:
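#ifdef __SSE2__
/* Check if we can copy double quad words (128 bits): %rcx = count / 16 */
movq %rdx, %rax          /* low half of the dividend = byte count */
movq $0, %rdx            /* high half of the dividend = 0 */
movq $16, %rcx
div %rcx                 /* %rax = count / 16, %rdx = count % 16 */
movq %rax, %rcx
cmpq $0, %rcx
je 2f                    /* no full 16-byte block: skip ahead to label 2 (not shown) */
/* Copy double quad words (128 bits) using SSE2; movdqa needs 16-byte aligned addresses */
1:
movdqa (%rsi), %xmm1
movdqa %xmm1, (%rbx)
addq $16, %rsi
addq $16, %rbx
decq %rcx
jnz 1b
#endif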
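For comparison, here is roughly how the same loop might look in C with SSE2 intrinsics. This is only a sketch: the function name copy128() is made up for this post, and the alignment test is my assumption about what "if necessary and available" has to cover, since movdqa (_mm_load_si128 / _mm_store_si128) faults on addresses that are not 16-byte aligned.

```c
#include <emmintrin.h> /* SSE2 intrinsics: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not taken from any libc. */
static void *copy128(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Take the SSE2 path only when both pointers are 16-byte aligned,
     * because the aligned load/store (movdqa) requires it. */
    if ((((uintptr_t)d | (uintptr_t)s) & 15) == 0) {
        size_t blocks = n / 16; /* same quotient the div instruction computes */
        n %= 16;                /* remainder, copied byte by byte below */
        while (blocks-- != 0) {
            __m128i v = _mm_load_si128((const __m128i *)s);
            _mm_store_si128((__m128i *)d, v);
            s += 16;
            d += 16;
        }
    }

    /* Tail bytes, or the whole buffer if the pointers were not aligned. */
    while (n-- != 0)
        *d++ = *s++;
    return dst;
}
```

Like memcpy(), this assumes the two buffers do not overlap.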
On my (very old) system this is about twice as fast as copying 64-bit values. I looked at the current FreeBSD libc code and wondered why it has no SSE2 support.
I think the current code in FreeBSD's libc could be faster too...
Extract from the FreeBSD memmove()/memcpy() source...
(The part that copies 256 bits per iteration.)
C:
103200:
movq (%rsi),%rdx
movq %rdx,(%rdi)
movq 8(%rsi),%rdx
movq %rdx,8(%rdi)
movq 16(%rsi),%rdx
movq %rdx,16(%rdi)
movq 24(%rsi),%rdx
movq %rdx,24(%rdi)
...could be changed to:
C:
103200:
movdqa (%rsi),%xmm1
movdqa %xmm1,(%rdi)
movdqa 16(%rsi),%xmm1
movdqa %xmm1,16(%rdi)
...if SSE2 is available.
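One thing to keep in mind: the pointers that reach this loop are not guaranteed to be 16-byte aligned, and movdqa faults in that case, so a general-purpose version would probably need movdqu (or an alignment step first). As a rough sketch of the idea in C, with a made-up function name and the unaligned intrinsics as my own assumption rather than anything from the FreeBSD source:

```c
#include <emmintrin.h> /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

/* Hypothetical helper for illustration only; mirrors the shape of a
 * 32-bytes-per-iteration copy loop, not the actual FreeBSD code. */
static void *copy32_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Two 128-bit moves per iteration = 256 bits, as in the 4 x movq version. */
    while (n >= 32) {
        __m128i lo = _mm_loadu_si128((const __m128i *)s);        /* movdqu load */
        __m128i hi = _mm_loadu_si128((const __m128i *)(s + 16));
        _mm_storeu_si128((__m128i *)d, lo);                       /* movdqu store */
        _mm_storeu_si128((__m128i *)(d + 16), hi);
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Remaining 0..31 bytes. */
    while (n-- != 0)
        *d++ = *s++;
    return dst;
}
```

This only covers the non-overlapping memcpy() case; memmove() would also need a backward variant for overlapping buffers.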
Are there any plans to add at least SSE2 support to speed up memcpy()/memmove()? AFAIK, all AMD64 platforms support SSE2.