mojo-sort
https://github.com/mzaks/mojo-sort
By @Maxim
Added parallel TimSort. On large lists it seems to be up to 3x faster than sequential TimSort (tried on an M1 and an Intel i7 with 8 cores), but still slower than Radix sort. I guess on a machine with 64 cores that might be a different story, though.
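As an illustration of the general approach (not the mojo-sort code itself): split the list into one chunk per worker, sort the chunks in parallel, then merge. A rough sketch assuming the mid-2024 nightly APIs (DTypePointer, parallelize), with insertion sort standing in for the real per-chunk TimSort and a sequential merge:

```mojo
from algorithm import parallelize
from memory import DTypePointer
from random import random_si64


fn insertion_sort(data: DTypePointer[DType.int64], start: Int, end: Int):
    # Sorts data[start:end] in place; stands in for the per-chunk TimSort.
    for i in range(start + 1, end):
        var key = data.load(i)
        var j = i - 1
        while j >= start:
            if data.load(j) <= key:
                break
            data.store(j + 1, data.load(j))
            j -= 1
        data.store(j + 1, key)


fn parallel_chunk_sort(data: DTypePointer[DType.int64], size: Int, num_chunks: Int):
    var chunk_size = (size + num_chunks - 1) // num_chunks

    @parameter
    fn sort_chunk(c: Int):
        var start = c * chunk_size
        var end = start + chunk_size
        if end > size:
            end = size
        if start < end:
            insertion_sort(data, start, end)

    # Each worker sorts a disjoint chunk, so there are no data races.
    parallelize[sort_chunk](num_chunks)

    # Merge the sorted chunks one by one into a scratch buffer (sequential).
    var scratch = DTypePointer[DType.int64].alloc(size)
    var sorted_len = chunk_size
    if sorted_len > size:
        sorted_len = size
    for c in range(1, num_chunks):
        var start = c * chunk_size
        var end = start + chunk_size
        if end > size:
            end = size
        if start >= end:
            break
        var i = 0
        var j = start
        var k = 0
        # Standard two-way merge of data[0:sorted_len] and data[start:end].
        while i < sorted_len and j < end:
            if data.load(i) <= data.load(j):
                scratch.store(k, data.load(i))
                i += 1
            else:
                scratch.store(k, data.load(j))
                j += 1
            k += 1
        while i < sorted_len:
            scratch.store(k, data.load(i))
            i += 1
            k += 1
        while j < end:
            scratch.store(k, data.load(j))
            j += 1
            k += 1
        for t in range(k):
            data.store(t, scratch.load(t))
        sorted_len = end
    scratch.free()


fn main():
    var n = 100000
    var data = DTypePointer[DType.int64].alloc(n)
    for i in range(n):
        data.store(i, random_si64(0, 1000000))
    parallel_chunk_sort(data, n, 8)
    data.free()
```

In a real parallel TimSort the merge step would itself be parallelized (for example, pairwise tree merges), which is presumably where a 64-core machine would start to pay off; the sequential merge here is the obvious bottleneck.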
If it would be useful and there is a benchmark script, I can run it on one of our quiet servers (96-core Threadripper, Ubuntu Server 22.04).
That would be awesome. The benchmark script can be found here: https://github.com/mzaks/mojo-sort/blob/main/benchmark.mojo. If you give me a few minutes, I can also make it output proper CSV.
No worries, just ping me whenever you are happy with the script.
it's ready 🙂
BTW benchmark results for M1 are here: https://github.com/mzaks/mojo-sort/blob/main/benchmark_result_m1.csv
and the i7 results are here: https://github.com/mzaks/mojo-sort/blob/main/benchmark_result_i7.csv
The 0 values in the M1 benchmark for smaller lists are due to the fact that the time function on macOS does not give nanosecond precision; it rounds to microseconds.
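To make the rounding concrete, a tiny sketch (assuming time.now() returns nanoseconds, as in the nightlies of that period): any run shorter than a microsecond measures as zero ticks on a microsecond-granularity clock.

```mojo
from time import now


fn main():
    var start = now()
    var acc = 0
    for i in range(100):  # tiny workload, typically well under 1 µs
        acc += i
    var elapsed_ns = now() - start
    # If the underlying clock only advances in microsecond steps, both
    # readings can land on the same tick and elapsed_ns prints as 0.
    print("elapsed ns:", elapsed_ns, "acc:", acc)
```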
Should I run it on Mojo nightly or the main branch?
Nightly please.
Speaking of, I think it is possible to implement a parallel Radix sort, which is kind of cool 🙂
Here are the results.
I ran the benchmark using
The server I ran it on has dual AMD EPYC 7401 CPUs: 24 cores each, 2 threads/core.
All the cores were working, but they didn't reach 100%.
Hm, interesting
Generally the single-core algorithms seem to be much slower than on the M1 (I guess the machine has lots of CPUs, but they are not as powerful). You know what, I think I am doing the parallelisation wrong 🙂 I need to rewrite the algorithm.
ok, will continue on it tomorrow 🙂
afaik LLVM struggles to interleave vector reads on ARM
might be a factor
Updated the lib after ComparableCollectionElement landed in nightly. Benchmarks for string sorting are quite interesting.
Interesting bad or interesting good? Or just weird?
Generally interesting 😄 Before, MSB Radix sort was an obvious winner. Now TimSort / parallel TimSort is faster in some cases.
oh, interesting
String comparisons are not vectorized at the moment, so there's some room to optimize still
they are actually vectorized
Internally memcmp is used: https://github.com/modularml/mojo/blob/32e7c06bd832f9e781d00bf11cc6911c7b678c18/stdlib/src/memory/memory.mojo#L45
Not at the moment. String used to be, but all string comparisons were delegated to StringRef, and it's using a basic local (temporary, hopefully) memcmp implementation due to an internal bug. (StringLiteral had the same issue before they were all bunched together.)
EDIT: string comparisons are now vectorized as of 2024.6.512: https://github.com/modularml/mojo/compare/1febed7a4e55feea94d9a10f06e27ea7441e9e9d...a752f0d7279d26ecf34ff1efb6e0b471b3e9afe5
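For readers wondering what "vectorized" buys here, a hedged sketch of SIMD byte comparison, assuming the DTypePointer/SIMD APIs of those nightlies. It only checks equality; the real memcmp linked above returns an ordering and handles widths, alignment and tails more carefully:

```mojo
from memory import DTypePointer


fn bytes_equal(a: DTypePointer[DType.uint8], b: DTypePointer[DType.uint8], count: Int) -> Bool:
    alias width = 16
    var i = 0
    # Full 16-byte chunks: one SIMD compare per iteration instead of
    # sixteen scalar compares.
    while i + width <= count:
        var va = a.load[width=width](i)
        var vb = b.load[width=width](i)
        if (va != vb).reduce_or():
            return False
        i += width
    # Scalar tail for the remaining bytes.
    while i < count:
        if a.load(i) != b.load(i):
            return False
        i += 1
    return True
```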
Is anyone actively working on making quicksort -> introsort?
No. I have a PR in the pipeline that introduces a benchmark for sorting scalars and starts a data-driven approach to algorithm optimizations. What I had in mind is: if you have an algorithm improvement you want to contribute, first make a PR with a separate benchmark that runs both the current algorithm and the contending algorithm. As there is no infrastructure to check performance on different hardware, we ask Modular folks and other contributors to run the benchmarks locally and share results. Based on those we can discuss whether it is suitable to replace the current algorithm with the proposed one, or maybe just add another algorithm to the module.
BTW, many other standard libraries use pattern-defeating quicksort (https://youtu.be/jz-PBiWwNjc?si=2ad9-LvxRbA_S8VY). @dan13llljws, if you intend to write a quicksort derivative, I think that will be more interesting. But on the other hand
Oh, I forgot to post the link to the mentioned PR: https://github.com/modularml/mojo/issues/3022
([stdlib] benchmark sort scalar list · Pull Request #3022: adds a benchmark for sorting a list of scalars, run on the public sort and the internal _insertion_sort and _small_sort functions.)
The benchmark PR is great, and I am looking forward to it. I was more interested in making comparison possible (comparing past implementations and perhaps std::sort).
As I played around with the sort module a bit more, I noticed that the current implementation's runtime actually shoots up drastically as the number of elements in the list blows up. (I implemented heapsort for fun and the current quicksort is actually ~20x worse, not sure why.)
I am interested in refactoring the entire module algorithm-wise if no one else is actively working on it. 👀
Have you had a look at my repo and the algorithms there? Curious how they compare to what you implemented.
I have looked at a few; haven't run benchmarks on them, though.
Please let me know if you do! There are also benchmarks in the repo.
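For reference, a minimal heapsort over a List[Int] in the spirit of the one mentioned above: a generic textbook version in the Mojo syntax of that time, not the actual code from this thread.

```mojo
from collections import List


fn sift_down(inout a: List[Int], start: Int, end: Int):
    # Restore the max-heap property for the subtree rooted at `start`.
    var root = start
    while 2 * root + 1 <= end:
        var child = 2 * root + 1
        if child + 1 <= end and a[child] < a[child + 1]:
            child += 1
        if a[root] < a[child]:
            var tmp = a[root]
            a[root] = a[child]
            a[child] = tmp
            root = child
        else:
            return


fn heap_sort(inout a: List[Int]):
    var n = len(a)
    # Build a max-heap bottom-up.
    var start = n // 2 - 1
    while start >= 0:
        sift_down(a, start, n - 1)
        start -= 1
    # Repeatedly move the current maximum to the end and shrink the heap.
    var end = n - 1
    while end > 0:
        var tmp = a[0]
        a[0] = a[end]
        a[end] = tmp
        end -= 1
        sift_down(a, 0, end)
```

Heapsort is O(n log n) in the worst case, which is why it makes a useful reference when a quicksort variant starts degrading on large inputs.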
I noticed something really weird about the benchmark. Running insertion sort on 64 elements takes longer than 32??
I might be running the benchmark wrong, but somehow std sort appears a lot slower in the benchmark, while if I just do
then std sort consistently does better than raw insertion sort.
@Maxim I made a few comments on your PR. I don’t know a way to work around the list not being captured. You can still put the random list initialization in call_fn and still get a good comparison result.
@Maxim I benchmarked various partition schemes for quicksort (the current stdlib implementation, an improved version I have internally, and your implementation). Here is the benchmark result (old: stdlib, new: my implementation, new left only: your implementation, heapsort: just for reference):
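For context on what a "partition scheme" covers, here are two classic variants (Lomuto, pivot last; Hoare, two converging scans) as generic textbook versions in Mojo, not the stdlib or the benchmarked implementations:

```mojo
from collections import List


fn partition_lomuto(inout a: List[Int], lo: Int, hi: Int) -> Int:
    # Pivot is the last element; a single left-to-right scan.
    var pivot = a[hi]
    var i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            var tmp = a[i]
            a[i] = a[j]
            a[j] = tmp
            i += 1
    var tmp = a[i]
    a[i] = a[hi]
    a[hi] = tmp
    return i


fn partition_hoare(inout a: List[Int], lo: Int, hi: Int) -> Int:
    # Pivot is the first element; two indices scan toward each other,
    # which typically does fewer swaps than Lomuto.
    var pivot = a[lo]
    var i = lo - 1
    var j = hi + 1
    while i < j:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i < j:
            var tmp = a[i]
            a[i] = a[j]
            a[j] = tmp
    return j
```

A quicksort built on Hoare recurses on [lo, j] and [j + 1, hi], while one built on Lomuto recurses on [lo, i - 1] and [i + 1, hi].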
Thanks @dan13llljws ! I will have a look
I was curious: why does our cmp_fn implement "less than or equal" rather than "less than"? This seems unconventional.
It's also discussed in the video I posted above. Here is the link with the timestamp directly relevant to the comparison function: https://youtu.be/jz-PBiWwNjc?si=BQLYQYj-BTusYjy-&t=827
ahhhh, I haven't got a chance to take a look yet
Long story short, with just < you might get undefined behaviour and it is also unstable (I think :)).
Really? I guess it really depends on how you implement your partition. I know for the implementation where you pick the pivot as the last element and you do:
It really requires you to have <, since otherwise when a[left] == pivot_value, left will go out of bounds.
Take a look at
I think whether we use < or <= is just a convention; we can make either work. But it is ultimately a question of "what is the best convention for the users".
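A minimal sketch of the out-of-bounds point above, with the pivot taken as the last element of the range (illustration only, not the stdlib code):

```mojo
from collections import List


fn scan_left(a: List[Int], lo: Int, hi: Int) -> Int:
    # Inner scan of a partition where a[hi] is the pivot.
    var pivot = a[hi]
    var left = lo
    # Strict `<` lets the pivot act as a sentinel: the scan stops at `hi`
    # at the latest, because `pivot < pivot` is False.
    while a[left] < pivot:
        left += 1
    # With `<=` the loop would also accept elements equal to the pivot
    # (including the pivot itself), so it would need an explicit bound check:
    #     while left < hi and a[left] <= pivot:
    #         left += 1
    return left
```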
When I really get a chance to dig into pdqsort, I will be able to discuss more!
Just watched the video today. I think it's the other way around?