What Are the Architectural Constraints in Haswell That Limit CPE Optimization?

I want to understand why no scalar version of the inner product procedure can achieve a CPE (cycles per element) below 1.00 on an Intel Core i7 4790 (Haswell) processor, running Ubuntu 20.04 with Linux kernel 5.4 and GCC 9.3.0. I am optimizing the inner product procedure using 6x1a loop unrolling. For integer data, my unrolled version gives a CPE of 1.07. For floating-point data, it remains at 3.01. I understand that pipelining and vectorization offer opportunities for parallelism, but is there a fundamental limitation in scalar code that prevents the CPE from dropping below 1.00, even with loop unrolling? Are there architectural constraints in the Haswell processor that make a CPE below 1.00 impossible? What would be the best approach to optimize further?
Renuel Roberts, 2mo ago
Getting a CPE (cycles per element) below 1.00 with a scalar inner product on an Intel Core i7 4790 (Haswell) runs into fundamental architectural limits. In scalar execution each instruction processes only one data element, and instruction latency, the limited number of functional units, and data dependencies together cap the rate at one element per cycle. For the inner product specifically, every element needs two loads (one from each vector), and Haswell has two load units, so scalar code cannot consume more than one element per cycle. Your floating-point result of 3.01 matches the 3-cycle latency of floating-point addition: with a single accumulator, each addition has to wait for the previous one. Loop unrolling reduces loop overhead, but on its own it removes neither of these bottlenecks. Haswell supports SIMD (AVX2), which lets one instruction process several elements, so to get a lower CPE you need to use vectorization. Compiler flags (-O3 -march=native) can help, but for more control you may need to write SIMD intrinsics by hand. Optimizing memory access and reducing cache misses can improve performance further.
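To make the data-dependency point concrete, here is a minimal scalar sketch in C. The function names `inner_6x1` and `inner_6x6` are illustrative and do not come from the original post, and the use of `double` is an assumption. The first version keeps one accumulator, so the additions form a single serial chain; the second spreads the work over six independent accumulators, which is one common way to hide addition latency while staying scalar.

```c
#include <stddef.h>

/* 6x1 unrolling: a single accumulator, so each addition must wait
   for the previous one to finish (one long dependency chain). */
double inner_6x1(const double *u, const double *v, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 6 <= n; i += 6) {
        sum += u[i] * v[i];
        sum += u[i + 1] * v[i + 1];
        sum += u[i + 2] * v[i + 2];
        sum += u[i + 3] * v[i + 3];
        sum += u[i + 4] * v[i + 4];
        sum += u[i + 5] * v[i + 5];
    }
    for (; i < n; i++)          /* leftover elements */
        sum += u[i] * v[i];
    return sum;
}

/* 6x6 unrolling: six independent accumulators, so the additions in
   one iteration do not depend on each other and can overlap. */
double inner_6x6(const double *u, const double *v, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0, s5 = 0.0;
    size_t i = 0;
    for (; i + 6 <= n; i += 6) {
        s0 += u[i] * v[i];
        s1 += u[i + 1] * v[i + 1];
        s2 += u[i + 2] * v[i + 2];
        s3 += u[i + 3] * v[i + 3];
        s4 += u[i + 4] * v[i + 4];
        s5 += u[i + 5] * v[i + 5];
    }
    double sum = (s0 + s1) + (s2 + s3) + (s4 + s5);
    for (; i < n; i++)          /* leftover elements */
        sum += u[i] * v[i];
    return sum;
}
```

Because the two versions add the products in a different order, floating-point results can differ in the last bits. Even with the latency hidden, both versions still issue two loads per element, so the scalar load bound remains.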
Marvee Amasi, 2mo ago
Yes, I agree that scalar code has its inherent limitations, and vectorization seems to be the clear path forward for lowering the CPE. I have already tried the -O3 -march=native flags, and while they help, I have not explored manual AVX2 intrinsics in depth yet. I will dive into that next.
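As a starting point for the manual intrinsics route, here is a minimal sketch for `double` data. It assumes an x86-64 target and GCC or Clang; the function name `inner_avx` is hypothetical. It uses the 256-bit AVX floating-point intrinsics, which Haswell supports alongside AVX2 (integer data would need the AVX2 integer intrinsics instead). The `target("avx")` attribute is used so that the function compiles even without `-mavx` or `-march=native` on the command line.

```c
#include <immintrin.h>
#include <stddef.h>

/* Inner product using 256-bit vectors: each vector holds 4 doubles,
   and two vector accumulators give 8 elements per loop iteration. */
__attribute__((target("avx")))
double inner_avx(const double *u, const double *v, size_t n)
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads, so the caller's arrays need no special alignment. */
        __m256d p0 = _mm256_mul_pd(_mm256_loadu_pd(u + i),
                                   _mm256_loadu_pd(v + i));
        __m256d p1 = _mm256_mul_pd(_mm256_loadu_pd(u + i + 4),
                                   _mm256_loadu_pd(v + i + 4));
        acc0 = _mm256_add_pd(acc0, p0);
        acc1 = _mm256_add_pd(acc1, p1);
    }
    /* Combine the two accumulators, then sum the four lanes. */
    double lanes[4];
    _mm256_storeu_pd(lanes, _mm256_add_pd(acc0, acc1));
    double sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; i++)          /* leftover elements */
        sum += u[i] * v[i];
    return sum;
}
```

The choice of two accumulators is only illustrative; the best number depends on the latency and throughput of the instructions involved and is worth measuring. As with any reordered summation, the result may differ from the scalar version in the last bits.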
Marvee Amasi, 2mo ago
On the memory side, I am curious: do you think cache alignment or prefetching techniques could significantly impact the CPE, even with SIMD? Is there a way to track how well my cache usage is optimized on the Haswell architecture, perhaps using perf or another tool?
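For anyone wanting to experiment with the alignment and prefetching ideas raised here, this is a minimal sketch, not a tuned implementation. The helper names `alloc_aligned_doubles` and `inner_prefetch` and the prefetch distance are hypothetical; `aligned_alloc` is standard C11 and `__builtin_prefetch` is a GCC/Clang builtin. The 64-byte figure assumes the cache line size of Haswell.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles starting on a 64-byte (cache line) boundary.
   aligned_alloc requires the size to be a multiple of the alignment. */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 63) / 64 * 64;
    if (rounded == 0)
        rounded = 64;
    return aligned_alloc(64, rounded);
}

/* Inner product that requests data a fixed distance ahead, once per
   cache line (8 doubles). The distance is a guess that needs tuning. */
double inner_prefetch(const double *u, const double *v, size_t n)
{
    enum { LINE = 8, AHEAD = 64 };   /* in elements; assumed values */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE == 0 && i + AHEAD < n) {
            __builtin_prefetch(&u[i + AHEAD], 0, 3);  /* read, keep in cache */
            __builtin_prefetch(&v[i + AHEAD], 0, 3);
        }
        sum += u[i] * v[i];
    }
    return sum;
}
```

For a simple sequential pass like this one, the hardware prefetcher often already does the job, so the gain from software prefetching may be small or even negative; it is worth measuring rather than assuming. One common starting point for measurement is `perf stat -e cache-references,cache-misses` run on the benchmark binary, comparing the counts with and without these changes.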