Optimizing a bubble sort implementation in C for an x86-64 architecture

Hello , I'm working on optimizing a bubble sort implementation in C for an x86-64 architecture specifically targeting an Intel Core i7 processor using GCC 11.2 . I noticed that the -O3 flag resulted in slower performance compared to -O2 when sorting large arrays of integers 1 million elements in this case. Here are the timings, average of multiple runs:
Solution:
Compile your code with profiling enabled using ```sh Copy code gcc -pg -O3 -o sort sort.c ./sort 1000000...
Jump to solution
11 Replies
Marvee Amasi
Marvee Amasi5mo ago
Without optimizations:
./sort 1000000 2.54s user 0.01s system 99% cpu 2.552 total
./sort 1000000 2.54s user 0.01s system 99% cpu 2.552 total
-O2:
./sort 1000000 1.25s user 0.00s system 99% cpu 1.253 total
./sort 1000000 1.25s user 0.00s system 99% cpu 1.253 total
-O3:
./sort 1000000 1.78s user 0.01s system 99% cpu 1.790 total
./sort 1000000 1.78s user 0.01s system 99% cpu 1.790 total
The relevant part of my bubble sort function is included below:
Marvee Amasi
Marvee Amasi5mo ago
void bubblesort(int *arr, int n) {
for (int i = 0; i < n-1; i++) {
for (int j = 0; j < n-i-1; j++) {
if (arr[j] > arr[j+1]) {
int temp = arr[j];
arr[j] = arr[j+1];
arr[j+1] = temp;
}
}
}
}
void bubblesort(int *arr, int n) {
for (int i = 0; i < n-1; i++) {
for (int j = 0; j < n-i-1; j++) {
if (arr[j] > arr[j+1]) {
int temp = arr[j];
arr[j] = arr[j+1];
arr[j+1] = temp;
}
}
}
}
Marvee Amasi
Marvee Amasi5mo ago
I've examined the assembly output for both -O2 and -O3 versions. While -O3 seems to attempt vector instructions like movdqa, there might be other factors affecting performance, like unnecessary register spills or missed opportunities for further vectorization. Could additional instructions introduced by -O3 outweigh the benefit of vectorization for bubble sort on x86-64? Are there any compiler flags or optimization techniques specific to x86-64 that might be more suitable for this scenario, considering the limitations of bubble sort for large datasets?
RED HAT
RED HAT5mo ago
Hi @Marvee Amasi ! It's interesting that -O3 is slower than -O2 for your case. -O3 enables more aggressive optimizations, but these optimizations can sometimes lead to unexpected performance
Marvee Amasi
Marvee Amasi5mo ago
Thanks man @RED HAT I'm curious to dig deeper into why this might be happening
RED HAT
RED HAT5mo ago
@Marvee Amasi The -O3 flag can sometimes result in larger binary sizes, which might not fit well in the CPU cache, leading to slower performance and the -O3 might inline functions more aggressively, which can sometimes degrade performance if the inline functions are large or if they lead to increased instruction cache misses.
RED HAT
RED HAT5mo ago
you should use profiling for your application to understand where the issues are, tools like gprof, perf, or even valgrind can provide insights into which parts of your code are consuming the most time.
Marvee Amasi
Marvee Amasi5mo ago
Oh profiling sounds like the next step . I wonder if anyone else has encountered similar behavior with -O3 and bubble sort on x86. Perhaps there are known patterns to watch out for in the profiling data?
Marvee Amasi
Marvee Amasi5mo ago
So cache behavior and inlining decisions can play a significant role here. Any profiling techniques using gprof or perf to pinpoint the bottlenecks in my bubble sort implementation @RED HAT
Solution
RED HAT
RED HAT5mo ago
Compile your code with profiling enabled using
Copy code
gcc -pg -O3 -o sort sort.c
./sort 1000000
gprof sort gmon.out > analysis.txt
Copy code
gcc -pg -O3 -o sort sort.c
./sort 1000000
gprof sort gmon.out > analysis.txt
For perf, use
perf record -g ./sort 1000000
perf record -g ./sort 1000000
Generate a report to get full detail of where time is being spent on your program
perf report
perf report
Marvee Amasi
Marvee Amasi5mo ago
Gracious 👍
Want results from more Discord servers?
Add your server