DevHeads IoT Integration Server•10mo ago

Optimizing a bubble sort implementation in C for an x86-64 architecture

Hello , I'm working on optimizing a bubble sort implementation in C for an x86-64 architecture specifically targeting an Intel Core i7 processor using GCC 11.2 . I noticed that the -O3 flag resulted in slower performance compared to -O2 when sorting large arrays of integers 1 million elements in this case. Here are the timings, average of multiple runs:

Solution:

Compile your code with profiling enabled using ```sh Copy code gcc -pg -O3 -o sort sort.c ./sort 1000000...

Jump to solution

11 Replies

Marvee Amasi•10mo ago

Without optimizations:

./sort 1000000  2.54s user 0.01s system 99% cpu 2.552 total

./sort 1000000  2.54s user 0.01s system 99% cpu 2.552 total

-O2:

./sort 1000000  1.25s user 0.00s system 99% cpu 1.253 total

./sort 1000000  1.25s user 0.00s system 99% cpu 1.253 total

-O3:

./sort 1000000  1.78s user 0.01s system 99% cpu 1.790 total

./sort 1000000  1.78s user 0.01s system 99% cpu 1.790 total

The relevant part of my bubble sort function is included below:

Marvee Amasi•10mo ago

void bubblesort(int *arr, int n) {
    for (int i = 0; i < n-1; i++) {
        for (int j = 0; j < n-i-1; j++) {
            if (arr[j] > arr[j+1]) {
                int temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            }
        }
    }
}

void bubblesort(int *arr, int n) {
    for (int i = 0; i < n-1; i++) {
        for (int j = 0; j < n-i-1; j++) {
            if (arr[j] > arr[j+1]) {
                int temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            }
        }
    }
}

Marvee Amasi•10mo ago

I've examined the assembly output for both -O2 and -O3 versions. While -O3 seems to attempt vector instructions like movdqa, there might be other factors affecting performance, like unnecessary register spills or missed opportunities for further vectorization. Could additional instructions introduced by -O3 outweigh the benefit of vectorization for bubble sort on x86-64? Are there any compiler flags or optimization techniques specific to x86-64 that might be more suitable for this scenario, considering the limitations of bubble sort for large datasets?

RED HAT•10mo ago

Hi @Marvee Amasi ! It's interesting that -O3 is slower than -O2 for your case. -O3 enables more aggressive optimizations, but these optimizations can sometimes lead to unexpected performance

Marvee Amasi•10mo ago

Thanks man @RED HAT I'm curious to dig deeper into why this might be happening

RED HAT•10mo ago

@Marvee Amasi The -O3 flag can sometimes result in larger binary sizes, which might not fit well in the CPU cache, leading to slower performance and the -O3 might inline functions more aggressively, which can sometimes degrade performance if the inline functions are large or if they lead to increased instruction cache misses.

RED HAT•10mo ago

you should use profiling for your application to understand where the issues are, tools like gprof, perf, or even valgrind can provide insights into which parts of your code are consuming the most time.

Marvee Amasi•10mo ago

Oh profiling sounds like the next step . I wonder if anyone else has encountered similar behavior with -O3 and bubble sort on x86. Perhaps there are known patterns to watch out for in the profiling data?

Marvee Amasi•10mo ago

So cache behavior and inlining decisions can play a significant role here. Any profiling techniques using gprof or perf to pinpoint the bottlenecks in my bubble sort implementation @RED HAT

Solution

RED HAT•10mo ago

Compile your code with profiling enabled using

Copy code
gcc -pg -O3 -o sort sort.c
./sort 1000000
gprof sort gmon.out > analysis.txt

Copy code
gcc -pg -O3 -o sort sort.c
./sort 1000000
gprof sort gmon.out > analysis.txt

For perf, use

perf record -g ./sort 1000000

perf record -g ./sort 1000000

Generate a report to get full detail of where time is being spent on your program

perf report

perf report

Marvee Amasi•10mo ago

Gracious 👍

Gaming

Programming

Optimizing a bubble sort implementation in C for an x86-64 architecture

Did you find this page helpful?