Optimizing memcpy Performance on Intel Core i7 10700K: SIMD and Compiler Flags

I am analyzing the performance of memcpy on an Intel Core i7 10700K CPU, using GCC 10.2 on Linux kernel 5.10. My assumption is that its running time should be close to the time it takes to transfer one long multiplied by the number of longs being copied. Could memcpy be optimized to beat this expectation, possibly using SIMD or other CPU-specific features? Are there any compiler flags or hardware optimizations I should be aware of to get the best performance out of memcpy?
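One way to test the assumption in the question is to time a plain long-by-long loop against the library memcpy on the same buffers. The following is only a minimal sketch: `copy_longs` and `time_copy` are hypothetical helper names, and `clock()` measures CPU time, which is assumed here to be close enough to wall time for a single-threaded copy.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Naive baseline: one long per iteration, as in the question's estimate. */
static void copy_longs(long *dst, const long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Time `reps` copies of `n` longs and return the elapsed CPU seconds,
   or -1.0 if n is zero or allocation fails.
   use_memcpy != 0 selects the library memcpy instead of the naive loop. */
double time_copy(size_t n, int reps, int use_memcpy)
{
    if (n == 0)
        return -1.0;

    long *src = malloc(n * sizeof *src);
    long *dst = malloc(n * sizeof *dst);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++)
        src[i] = (long)i;

    volatile unsigned long sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        if (use_memcpy)
            memcpy(dst, src, n * sizeof *src);
        else
            copy_longs(dst, src, n);
        /* Read one element back so the copy cannot be discarded. */
        sink += (unsigned long)dst[(size_t)r % n];
    }
    clock_t t1 = clock();

    free(src);
    free(dst);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_copy(n, reps, 0)` and `time_copy(n, reps, 1)` with a buffer much larger than the caches, and again with a small one, gives a rough comparison. Keep in mind that at higher optimisation levels (for example `gcc -O2 -march=native`) the compiler may vectorise the naive loop or even replace it with a call to memcpy, so it is worth comparing several optimisation levels.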
5 Replies
ke7c2mi•2mo ago
"Could memcpy be optimized to exceed this expectation, possibly using SIMD or other CPU specific features?"

Yes, and it will usually be very well optimised.

Here is one example, the Linux kernel's generic 32-bit x86 memcpy: https://elixir.bootlin.com/linux/v5.10/source/arch/x86/lib/memcpy_32.c
And here is the MMX-based variant from the same tree: https://elixir.bootlin.com/linux/v5.10/source/arch/x86/lib/mmx_32.c#L29
(A user-space program built with GCC normally calls the C library's memcpy rather than these kernel routines, but the principle is the same.)

I think the general principle to understand is that any time a CPU grabs something from memory or cache, it is doing a transaction on some bus. One 8-byte transaction has less overhead than eight 1-byte transactions. The game pretty much becomes: what are the biggest chunks of data we can move at a time? Here we see MMX registers used; on Arm we would expect to see the load/store-multiple instructions used. In both cases it is more data for fewer bus transactions, so faster 🤓
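The "bigger chunks" idea can be sketched in portable C by comparing a byte-at-a-time copy with one that moves 8 bytes per iteration. This is an illustration only, not how any particular library implements memcpy; `copy_bytes` and `copy_words` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One byte per iteration: n loads and n stores. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Eight bytes per iteration, then a byte-wise tail for the remainder. */
void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        /* A fixed-size memcpy is the alignment-safe way to express an
           8-byte load/store; compilers typically lower each call to a
           single move instruction. */
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Both functions produce the same result; the second one simply needs roughly an eighth as many load/store operations for large `n`, which is the effect described above. An optimising compiler may widen the byte loop on its own, so the difference is clearest at low optimisation levels.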
Marvee Amasi•2mo ago
Wow, thanks @ke7c2mi. It is interesting how memcpy can use SIMD instructions like MMX to move more data per transfer and reduce the number of bus transactions. I wasn't aware of the exact mechanisms behind this 🫠
Marvee Amasi•2mo ago
I saw the use of MMX in the code you linked. How does that compare with more modern SIMD instructions like SSE, AVX, or AVX-512? Could the memcpy implementations for newer processors make use of these extensions for even more significant performance gains?
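To make the comparison concrete: MMX registers are 64 bits wide, while SSE registers are 128 bits, AVX registers 256 bits, and AVX-512 registers 512 bits, so each step doubles the bytes moved per load/store. Below is a minimal sketch of a 16-byte-per-iteration copy using SSE2 intrinsics; `copy_sse2` is a hypothetical name, and the code falls back to plain memcpy where SSE2 is not available. It illustrates the technique only and is not taken from any library's implementation.

```c
#include <stddef.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Copy n bytes, 16 at a time through an XMM register where SSE2 exists. */
void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 16 <= n; i += 16) {
        /* Unaligned load/store, so no alignment requirement on the buffers. */
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
#endif
    /* Remaining tail (or everything, if SSE2 is unavailable). */
    memcpy(d + i, s + i, n - i);
}
```

The AVX analogue would use `__m256i` with `_mm256_loadu_si256` and `_mm256_storeu_si256` from `<immintrin.h>` and step by 32 bytes, which requires building with AVX enabled (for example `-mavx` or `-march=native`). Whether wider registers actually help depends on where the data lives: for buffers far larger than the caches, memory bandwidth rather than register width tends to be the limit.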
ke7c2mi•2mo ago
I suspect there are, but it's not something I know much about, just the general concept. It would be worth looking into the gcc/clang source to see if you can find the various arch-specific memcpy implementations there.
Marvee Amasi•2mo ago
Aight, time to dig into the gcc and clang source code. It might take a while, I know, but I'll see what I find. I appreciate your help pointing me in the right direction, @ke7c2mi.