mojo-sort
https://github.com/mzaks/mojo-sort
By @Maxim
Added parallel TimSort. On large lists it seems to be up to 3x faster than sequential TimSort (tried on an M1 and an Intel i7 with 8 cores), but still slower than Radix sort. I guess on a machine with 64 cores that might be a different story, though.
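As an illustration of the general approach (not the mojo-sort code itself): split the list into one chunk per worker, sort the chunks in parallel, then merge. A rough sketch assuming the mid-2024 nightly APIs (DTypePointer, parallelize), with insertion sort standing in for the real per-chunk TimSort and a sequential merge:

```mojo
from algorithm import parallelize
from memory import DTypePointer
from random import random_si64


fn insertion_sort(data: DTypePointer[DType.int64], start: Int, end: Int):
    # Sorts data[start:end] in place; stands in for the per-chunk TimSort.
    for i in range(start + 1, end):
        var key = data.load(i)
        var j = i - 1
        while j >= start:
            if data.load(j) <= key:
                break
            data.store(j + 1, data.load(j))
            j -= 1
        data.store(j + 1, key)


fn parallel_chunk_sort(data: DTypePointer[DType.int64], size: Int, num_chunks: Int):
    var chunk_size = (size + num_chunks - 1) // num_chunks

    @parameter
    fn sort_chunk(c: Int):
        var start = c * chunk_size
        var end = start + chunk_size
        if end > size:
            end = size
        if start < end:
            insertion_sort(data, start, end)

    # Each worker sorts a disjoint chunk, so there are no data races.
    parallelize[sort_chunk](num_chunks)

    # Merge the sorted chunks one by one into a scratch buffer (sequential).
    var scratch = DTypePointer[DType.int64].alloc(size)
    var sorted_len = chunk_size
    if sorted_len > size:
        sorted_len = size
    for c in range(1, num_chunks):
        var start = c * chunk_size
        var end = start + chunk_size
        if end > size:
            end = size
        if start >= end:
            break
        var i = 0
        var j = start
        var k = 0
        # Standard two-way merge of data[0:sorted_len] and data[start:end].
        while i < sorted_len and j < end:
            if data.load(i) <= data.load(j):
                scratch.store(k, data.load(i))
                i += 1
            else:
                scratch.store(k, data.load(j))
                j += 1
            k += 1
        while i < sorted_len:
            scratch.store(k, data.load(i))
            i += 1
            k += 1
        while j < end:
            scratch.store(k, data.load(j))
            j += 1
            k += 1
        for t in range(k):
            data.store(t, scratch.load(t))
        sorted_len = end
    scratch.free()


fn main():
    var n = 100000
    var data = DTypePointer[DType.int64].alloc(n)
    for i in range(n):
        data.store(i, random_si64(0, 1000000))
    parallel_chunk_sort(data, n, 8)
    data.free()
```

In a real parallel TimSort the merge step would itself be parallelized (for example, pairwise tree merges), which is presumably where a 64-core machine would start to pay off; the sequential merge here is the obvious bottleneck.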
If it would be useful and there is a benchmark script, I can run it on one of our quiet servers (96-core Threadripper, Ubuntu Server 22.04).
That would be awesome. The benchmark script can be found here: https://github.com/mzaks/mojo-sort/blob/main/benchmark.mojo. If you give me a few minutes, I can also make it output proper CSV.
No worries, just ping me whenever you are happy with the script.
it's ready 🙂
BTW benchmark results for M1 are here: https://github.com/mzaks/mojo-sort/blob/main/benchmark_result_m1.csv
and the i7 results are here: https://github.com/mzaks/mojo-sort/blob/main/benchmark_result_i7.csv
The 0 values in the M1 benchmark for smaller lists are due to the fact that the time function on macOS does not give nanosecond precision; it rounds to microseconds.
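To make the rounding concrete, a tiny sketch (assuming time.now() returns nanoseconds, as in the nightlies of that period): any run shorter than a microsecond measures as zero ticks on a microsecond-granularity clock.

```mojo
from time import now


fn main():
    var start = now()
    var acc = 0
    for i in range(100):  # tiny workload, typically well under 1 µs
        acc += i
    var elapsed_ns = now() - start
    # If the underlying clock only advances in microsecond steps, both
    # readings can land on the same tick and elapsed_ns prints as 0.
    print("elapsed ns:", elapsed_ns, "acc:", acc)
```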
Should I run it on Mojo nightly or the main branch?
Nightly please.
Speaking of, I think it is possible to implement a parallel Radix sort, which is kind of cool 🙂
Here are the results.
I ran the benchmark using
The server I ran it on has dual AMD EPYC 7401 CPUs: 24 cores each, 2 threads/core.
All the cores were working, but they didn't reach 100%.
Hm, interesting
Generally the single-core algorithms seem to be much slower than on the M1 (I guess the machine has lots of CPUs, but they are not as powerful). You know what, I think I am doing the parallelisation wrong 🙂 I need to rewrite the algorithm.
ok, will continue on it tomorrow 🙂
afaik LLVM struggles to interleave vector reads on ARM
might be a factor
Updated the lib after ComparableCollectionElement landed in nightly. Benchmarks for string sorting are quite interesting.
Interesting bad or interesting good? Or just weird?
Generally interesting 😄 Before, MSB Radix sort was an obvious winner. Now TimSort / parallel TimSort is faster in some cases.
oh, interesting
String comparisons are not vectorized at the moment, so there's some room to optimize still
they are actually vectorized
Internally memcmp is used: https://github.com/modularml/mojo/blob/32e7c06bd832f9e781d00bf11cc6911c7b678c18/stdlib/src/memory/memory.mojo#L45
Not at the moment. String used to be, but all string comparisons were delegated to StringRef, and it's using a basic local (temporary, hopefully) memcmp implementation due to an internal bug. (StringLiteral had the same issue before they were all bunched together.)
EDIT: string comparisons are now vectorized as of 2024.6.512: https://github.com/modularml/mojo/compare/1febed7a4e55feea94d9a10f06e27ea7441e9e9d...a752f0d7279d26ecf34ff1efb6e0b471b3e9afe5
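For readers wondering what "vectorized" buys here, a hedged sketch of SIMD byte comparison, assuming the DTypePointer/SIMD APIs of those nightlies. It only checks equality; the real memcmp linked above returns an ordering and handles widths, alignment and tails more carefully:

```mojo
from memory import DTypePointer


fn bytes_equal(a: DTypePointer[DType.uint8], b: DTypePointer[DType.uint8], count: Int) -> Bool:
    alias width = 16
    var i = 0
    # Full 16-byte chunks: one SIMD compare per iteration instead of
    # sixteen scalar compares.
    while i + width <= count:
        var va = a.load[width=width](i)
        var vb = b.load[width=width](i)
        if (va != vb).reduce_or():
            return False
        i += width
    # Scalar tail for the remaining bytes.
    while i < count:
        if a.load(i) != b.load(i):
            return False
        i += 1
    return True
```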
Is anyone actively working on making quicksort -> introsort?
No. I have a PR in the pipeline that introduces a benchmark for sorting scalars and starts a data-driven approach to algorithm optimizations. What I had in mind is: if you have an algorithm improvement you want to contribute, first make a PR with a separate benchmark that runs both the current algorithm and the contending algorithm. As there is no infrastructure to check performance on different hardware, we ask Modular folks and other contributors to run the benchmarks locally and share results. Based on those we can discuss whether it is suitable to replace the current algorithm with the proposed one, or maybe just add another algorithm to the module.
BTW, many other standard libraries use pattern-defeating quicksort (https://youtu.be/jz-PBiWwNjc?si=2ad9-LvxRbA_S8VY). @dan13llljws, if you intend to write a quicksort derivative, I think that will be more interesting. But on the other hand
Oh, I forgot to post the link to the mentioned PR: https://github.com/modularml/mojo/issues/3022
([stdlib] benchmark sort scalar list · Pull Request #3022: adds a benchmark for sorting a list of scalars, run on the public sort and the internal _insertion_sort and _small_sort functions.)
The benchmark PR is great, and I am looking forward to it. I was more interested in making comparison possible (comparing past implementations and perhaps std::sort).
As I played around with the sort module a bit more, I noticed that the current implementation's runtime actually shoots up drastically as the number of elements in the list blows up. (I implemented heapsort for fun and the current quicksort is actually ~20x worse, not sure why.)
I am interested in refactoring the entire module algorithm-wise if no one else is actively working on it. 👀
Have you had a look at my repo and the algorithms there? Curious how they compare to what you implemented.
I have looked at a few; haven't run benchmarks on them, though.
Please let me know if you do! There are also benchmarks in the repo.
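For reference, a minimal heapsort over a List[Int] in the spirit of the one mentioned above: a generic textbook version in the Mojo syntax of that time, not the actual code from this thread.

```mojo
from collections import List


fn sift_down(inout a: List[Int], start: Int, end: Int):
    # Restore the max-heap property for the subtree rooted at `start`.
    var root = start
    while 2 * root + 1 <= end:
        var child = 2 * root + 1
        if child + 1 <= end and a[child] < a[child + 1]:
            child += 1
        if a[root] < a[child]:
            var tmp = a[root]
            a[root] = a[child]
            a[child] = tmp
            root = child
        else:
            return


fn heap_sort(inout a: List[Int]):
    var n = len(a)
    # Build a max-heap bottom-up.
    var start = n // 2 - 1
    while start >= 0:
        sift_down(a, start, n - 1)
        start -= 1
    # Repeatedly move the current maximum to the end and shrink the heap.
    var end = n - 1
    while end > 0:
        var tmp = a[0]
        a[0] = a[end]
        a[end] = tmp
        end -= 1
        sift_down(a, 0, end)
```

Heapsort is O(n log n) in the worst case, which is why it makes a useful reference when a quicksort variant starts degrading on large inputs.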
I noticed something really weird about the benchmark. Running insertion sort on 64 elements takes longer than 32??
I might be running the benchmark wrong, but somehow std sort appears a lot slower in the benchmark, while if I just do
then std sort consistently does better than raw insertion sort.
@Maxim I made a few comments on your PR. I don’t know a way to work around the list not being captured. You can still put the random list initialization in call_fn and still get a good comparison result.
@Maxim I benchmarked various partition schemes for quicksort (the current stdlib implementation, an improved version I have internally, and your implementation). Here is the benchmark result (old: stdlib, new: my implementation, new left only: your implementation, heapsort: just for reference):
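For context on what a "partition scheme" covers, here are two classic variants (Lomuto, pivot last; Hoare, two converging scans) as generic textbook versions in Mojo, not the stdlib or the benchmarked implementations:

```mojo
from collections import List


fn partition_lomuto(inout a: List[Int], lo: Int, hi: Int) -> Int:
    # Pivot is the last element; a single left-to-right scan.
    var pivot = a[hi]
    var i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            var tmp = a[i]
            a[i] = a[j]
            a[j] = tmp
            i += 1
    var tmp = a[i]
    a[i] = a[hi]
    a[hi] = tmp
    return i


fn partition_hoare(inout a: List[Int], lo: Int, hi: Int) -> Int:
    # Pivot is the first element; two indices scan toward each other,
    # which typically does fewer swaps than Lomuto.
    var pivot = a[lo]
    var i = lo - 1
    var j = hi + 1
    while i < j:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i < j:
            var tmp = a[i]
            a[i] = a[j]
            a[j] = tmp
    return j
```

A quicksort built on Hoare recurses on [lo, j] and [j + 1, hi], while one built on Lomuto recurses on [lo, i - 1] and [i + 1, hi].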
Thanks @dan13llljws ! I will have a look
I was curious: why does our cmp_fn implement "less than or equal" rather than "less than"? This seems unconventional.
It's also discussed in the video I posted above. Here is the link with the timestamp directly relevant to the comparison function: https://youtu.be/jz-PBiWwNjc?si=BQLYQYj-BTusYjy-&t=827
ahhhh, I haven't got a chance to take a look yet
Long story short, with just < you might get undefined behaviour and it is also unstable (I think :)).
Really? I guess it really depends on how you implement your partition. I know for the implementation where you pick the pivot as the last element and you do:
It really requires you to have <, since otherwise when a[left] == pivot_value, left will go out of bounds.
Take a look at
I think whether we use < or <= is just a convention; we can make either work. But it is ultimately a question of "what is the best convention for the users".
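A minimal sketch of the out-of-bounds point above, with the pivot taken as the last element of the range (illustration only, not the stdlib code):

```mojo
from collections import List


fn scan_left(a: List[Int], lo: Int, hi: Int) -> Int:
    # Inner scan of a partition where a[hi] is the pivot.
    var pivot = a[hi]
    var left = lo
    # Strict `<` lets the pivot act as a sentinel: the scan stops at `hi`
    # at the latest, because `pivot < pivot` is False.
    while a[left] < pivot:
        left += 1
    # With `<=` the loop would also accept elements equal to the pivot
    # (including the pivot itself), so it would need an explicit bound check:
    #     while left < hi and a[left] <= pivot:
    #         left += 1
    return left
```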
When I really get a chance to dig into pdqsort, I will be able to discuss more!
Just watched the video today. I think it's the other way around?