Optimizing memcpy Performance on the Intel Core i7-10700K: SIMD and Compiler Flags
I am analyzing the performance of memcpy on an Intel Core i7-10700K CPU, using GCC 10.2 on Linux kernel 5.10. My assumption is that its speed should be close to the time it takes to transfer one long multiplied by the number of longs being copied.
Could memcpy be optimized to exceed this expectation, possibly using SIMD or other CPU-specific features? Are there any compiler flags or hardware optimizations I should be aware of to get the best performance out of memcpy?
Could memcpy be optimized to exceed this expectation, possibly using SIMD or other CPU specific features?
Yes, and it will usually be very well optimised
This is probably the memcpy you are running:
https://elixir.bootlin.com/linux/v5.10/source/arch/x86/lib/memcpy_32.c
This is probably the one it actually uses:
https://elixir.bootlin.com/linux/v5.10/source/arch/x86/lib/mmx_32.c#L29
I think the general principle to understand is that any time a CPU grabs something from memory or cache, it is doing a transaction on some bus. One 8-byte transaction has less overhead than eight 1-byte transactions.
The game pretty much becomes: what are the biggest chunks of data we can move at a time?
Here we see MMX registers used; on Arm we would expect to find the load/store-multiple instructions used instead. In both cases, more data for fewer bus transactions means faster copies 🤓
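To make the bigger-chunks idea concrete, here is a toy sketch of my own (not taken from the kernel source) of a word-at-a-time copy in plain C. Real implementations additionally handle alignment, overlapping buffers, and use wider SIMD registers:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy illustration only: move 8 bytes per iteration where possible,
 * then mop up any remaining bytes one at a time. */
static void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Main loop: one 8-byte transfer instead of eight 1-byte ones. */
    while (n >= sizeof(uint64_t)) {
        uint64_t tmp;
        memcpy(&tmp, s, sizeof tmp);  /* safe unaligned load  */
        memcpy(d, &tmp, sizeof tmp);  /* safe unaligned store */
        s += sizeof tmp;
        d += sizeof tmp;
        n -= sizeof tmp;
    }

    /* Tail: leftover bytes, copied individually. */
    while (n--)
        *d++ = *s++;
}
```

The fixed-size inner memcpy calls are idiomatic here: compilers turn them into single load/store instructions, and they sidestep the undefined behaviour of dereferencing a misaligned uint64_t pointer directly.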
Wow, thanks @ke7c2mi. It is interesting how memcpy can use SIMD instructions like MMX to optimize the memory transfer and minimize bus transactions. I wasn't aware of the exact mechanisms behind this 🫠
I saw the use of MMX in the code you linked. How does that compare with more modern SIMD instructions like SSE, AVX, or AVX-512? Could the memcpy implementations for newer processors make use of these extensions for even more significant performance gains?
I suspect they do, but it's not something I know much about beyond the general concept. It would be worth looking into the gcc/clang sources to see if you can find the various arch-specific memcpy implementations there.
Alright, time to dig into the gcc and clang source code. It might take a while, but I'll see what I find. I appreciate your help pointing me in the right direction @ke7c2mi