DevHeads IoT Integration Server•8mo ago

How to optimize SIMD instructions for double precision floating point operations on Intel Core i7

I want to optimize a computationally intensive loop using SIMD instructions on an Intel Core i7 12700K processor with 32GB of DDR4 3200 memory, to speed up a double-precision floating-point vector addition that is part of a larger scientific computation.

section .data
data_array: dq 1.0, 2.0, 3.0, 4.0, ..., 1000000.0  ; Array of 1 million double-precision values

section .text
global my_function

my_function:
  mov rcx, 1000000 / 4  ; Loop counter: each iteration handles 4 doubles (two 128-bit chunks)
  mov rsi, data_array

loop_start:
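  movupd xmm0, [rsi]       ; load 2 doubles (unaligned)
  movupd xmm1, [rsi + 16]  ; load the next 2 doubles
  addpd xmm0, xmm1         ; packed double-precision add (addps would treat the data as 4 floats)
  movupd [rsi], xmm0       ; store the 2 sums back over the first pair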
  add rsi, 32
  dec rcx
  jnz loop_start
  ret

1 Reply

Marvee Amasi•8mo ago

I've compiled the code with GCC using the -O3 optimization flag. There is some performance improvement over the scalar version, but it is significantly less than expected: I measured a speedup of approximately 1.5x on an Intel Core i7 12700K processor. I'm looking for suggestions on how to optimize this code further for maximum performance. Are there any specific SIMD instructions or techniques that could be beneficial? I'm also thinking of exploring memory optimization strategies like prefetching or possibly cache blocking.
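
One direction worth trying is the 256-bit AVX form of the add, which handles four doubles per instruction instead of two. Below is a minimal sketch in C with intrinsics, assuming the intended operation is an element-wise `dst[i] = a[i] + b[i]` over separate arrays (the posted loop works in place, so this is an assumption). The function name `add_f64` and its parameters are hypothetical. The AVX path is only compiled when the compiler defines `__AVX__` (for example with `-mavx2` or `-march=native` on GCC); otherwise the plain loop runs.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the thread): dst[i] = a[i] + b[i] for n doubles. */
void add_f64(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    /* 256-bit path: each _mm256_add_pd adds four doubles.
       Unrolled twice so two independent adds are in flight per iteration.
       Unaligned loads/stores are used, so no alignment is assumed. */
    for (; i + 8 <= n; i += 8) {
        __m256d a0 = _mm256_loadu_pd(a + i);
        __m256d a1 = _mm256_loadu_pd(a + i + 4);
        __m256d b0 = _mm256_loadu_pd(b + i);
        __m256d b1 = _mm256_loadu_pd(b + i + 4);
        _mm256_storeu_pd(dst + i,     _mm256_add_pd(a0, b0));
        _mm256_storeu_pd(dst + i + 4, _mm256_add_pd(a1, b1));
    }
#endif
    /* Scalar tail, and the full loop when AVX is not enabled. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling `add_f64(c, a, b, 1000000)` on three separate arrays would cover the whole data set. Keep in mind that a streaming add does one addition for every two loads and one store, so it is often limited by memory bandwidth rather than by the arithmetic. That may account for part of the modest speedup, and it is worth measuring before investing in prefetching or cache blocking.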
