Marvee Amasi
DevHeads IoT Integration Server
Created by Marvee Amasi on 7/31/2024 in #code-review
How can I optimize matrix multiplication performance and reduce L3 cache misses in my C++ library?
I started a C++ library for efficient matrix operations, with a primary focus on matrix multiplication. The target application is scientific computing, so performance is critical. I implemented a basic matrix class and a matrix multiplication function, and used SSE instructions for optimization. My setup is an Intel Core i7 12700K with 32GB DDR4 3200 RAM, and I work in Visual Studio Code with the clang-format extension. Repository: https://github.com/Marveeamasi/image-processing-matrix-multiplier
Even with SSE instructions, the current matrix multiplication implementation shows significant performance bottlenecks, especially with large matrices. Profiling results indicate high L3 cache miss rates as the primary culprit.
Matrix Matrix::operator*(const Matrix& other) const {
    if (cols_ != other.rows()) {
        exit(1);
    }

    Matrix result(rows_, other.cols_);

    for (int i = 0; i < rows_; ++i) {
        for (int j = 0; j < other.cols_; ++j) {
            double sum = 0.0;
            for (int k = 0; k < cols_; ++k) {
                sum += (*this)(i, k) * other(k, j);
            }
            result(i, j) = sum;
        }
    }

    return result;
}
I have tried to optimize the memory access patterns and loop structure, but the performance gains are still limited. I need help with strategies to improve cache locality, reduce cache misses, and further improve the overall efficiency of the matrix multiplication.
I'm eager to learn about different approaches and best practices for high-performance matrix computations.
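For reference, one widely used technique for this access pattern is loop tiling (cache blocking) combined with an i-k-j loop order, so that the innermost loop walks both the right-hand matrix and the result along contiguous rows. The sketch below is only an illustration under stated assumptions: the Matrix class internals are not shown in this post, so it uses flat row-major std::vector<double> storage instead, the function name multiply_tiled is hypothetical, and the default tile size of 64 is a placeholder that would need tuning against the actual cache sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the library in this post.
// Multiplies an (n x m) matrix a by an (m x p) matrix b, both stored
// row-major in flat vectors, and returns the (n x p) product.
std::vector<double> multiply_tiled(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n, std::size_t m, std::size_t p,
                                   std::size_t tile = 64) {  // assumed tile size
    assert(a.size() == n * m && b.size() == m * p && tile > 0);
    std::vector<double> c(n * p, 0.0);

    // Outer three loops step over tiles so that each tile of a, b and c
    // is reused while it is still likely to be in cache.
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < m; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, m);
            for (std::size_t jj = 0; jj < p; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, p);

                // i-k-j order inside the tile: the innermost loop reads
                // b and writes c along contiguous rows.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = a[i * m + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * p + j] += a_ik * b[k * p + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Compared with the i-j-k loop above, the innermost loop here no longer strides down a column of the right-hand matrix, which is the access that typically causes the cache misses when storage is row-major. Whether this helps in practice, and which tile size works best, would have to be confirmed by profiling on the target machine.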
10 replies