How can I optimize matrix multiplication performance and reduce L3 cache misses in my C++ library?

I started a C++ library for efficient matrix operations, with a primary focus on matrix multiplication. The target application is scientific computing, of course performance is critical. I implemented a start matrix class and a matrix multiplication function, used SSE instructions for optimization on Intel Core i7 12700K, 32GB DDR4 3200 RAM on visual studio code with clang format extension . https://github.com/Marveeamasi/image-processing-matrix-multiplier even after using SSE instructions, the current matrix multiplication implementation started to show significant performance bottlenecks, especially when dealing with large matrices. Profiling results indicate high L3 cache miss rates as the primary culprit
Matrix Matrix::operator*(const Matrix& other) const {
if (cols_ != other.rows()) {
exit(1);
}

Matrix result(rows_, other.cols_);

for (int i = 0; i < rows_; ++i) {
for (int j = 0; j < other.cols_; ++j) {
double sum = 0.0;
for (int k = 0; k < cols_; ++k) {
sum += (*this)(i, k) * other(k, j);
}
result(i, j) = sum;
}
}

return result;
}
Matrix Matrix::operator*(const Matrix& other) const {
if (cols_ != other.rows()) {
exit(1);
}

Matrix result(rows_, other.cols_);

for (int i = 0; i < rows_; ++i) {
for (int j = 0; j < other.cols_; ++j) {
double sum = 0.0;
for (int k = 0; k < cols_; ++k) {
sum += (*this)(i, k) * other(k, j);
}
result(i, j) = sum;
}
}

return result;
}
tried to optimize memory access patterns and loop structure, but performance gains are still limited. Please need help on strategies to improve cache locality, reduce cache misses, and further enhance the overall efficiency of the matrix multiplication operation.
I'm eager to know about different approaches and best practices for high performance matrix computations.
GitHub
GitHub - Marveeamasi/image-processing-matrix-multiplier: An x86-64 ...
An x86-64 bit processor application to optimizing a matrix multiplication routine for an image processing - Marveeamasi/image-processing-matrix-multiplier
8 Replies
Ming
Ming6mo ago
You may already have come across this, but while I have not looked at it in (many!) years - the bible when I did my MSC in Computational Mathematics was "Numerical Recipes in C" https://www.grad.hr/nastava/gs/prg/NumericalRecipesinC.pdf
Manuel
Manuel6mo ago
Where do you actually use SSE instructions? The code you provided does not contain any, as far as I see.
Marvee Amasi
Marvee Amasi6mo ago
Thanks guys, my matrix operation suffered from poor cache locality, was accessing elements of the matrices in a scattered manner, so I didn't do something complex to fix bottleneck 👌, but it fixed it. Dividing the matrices into smaller blocks to improve cache locality using blocking function, now more data is likely to be found in the processor's cache when needed, reducing the number of costly memory accesses
Marvee Amasi
Marvee Amasi6mo ago
No .. it's not used in the code I pushed. It was an alternative , didn't push it
Manuel
Manuel6mo ago
Regarding cache locality, try to access memory consecutively. So, if it is your first index, that is multiplied to get the memory position, then keep this fixed to prevent big steps in memory. In your case this would mean the inner most loop should iterate over j not over k, as k is used as first index for the access to other. So try:
for (int i = 0; i < rows_; ++i) {
for (int j = 0; j < other.cols_; ++j) {
result(i, j) = 0.0;
}
for (int k = 0; k < cols_; ++k) {
for (int j = 0; j < other.cols_; ++j) {
result(i, j) += (*this)(i, k) * other(k, j);
}
}
}
for (int i = 0; i < rows_; ++i) {
for (int j = 0; j < other.cols_; ++j) {
result(i, j) = 0.0;
}
for (int k = 0; k < cols_; ++k) {
for (int j = 0; j < other.cols_; ++j) {
result(i, j) += (*this)(i, k) * other(k, j);
}
}
}
This should boost performance quite a bit.
Marvee Amasi
Marvee Amasi6mo ago
Thanks @Manuel I understand the logic behind iterating over j in the innermost loop to improve cache locality. By keeping the first memory access index fixed (i), I would increase the chance of accessing adjacent elements in the same cache line. It makes perfect sense in the context of reducing L3 cache misses. I'll definitely give this approach a try and compare the performance with the original loop structure. It would be interesting to see how much this optimization impacts the overall execution time, especially for larger matrices ✅ , If you check out my new commits I explored using blocking also Thanks @Ming 👍
Manuel
Manuel6mo ago
you may also search the web for efficient matrix multiplication. this is a standard topic and therefore there should be enough articles how to optimize it
Marvee Amasi
Marvee Amasi6mo ago
Yh thanks @Manuel , I tried out using blocking, been pretty okay using it since then , I could try out other optimization techniques for some other projects prolly

Did you find this page helpful?