Why is my SIMD code slower than the scalar version?
I wrote the following to learn more about SIMD. It tries to find a substring:
https://paste.mod.gg/jswpvcpgoxgo/0
I ran a benchmark on my machine, which supports AVX-512, and compared it against the scalar path, which I forced by setting DOTNET_EnableHWIntrinsic=0.
In my benchmark I have 2 paragraphs of Lorem Ipsum (1156 chars long) and a search string of a few words (47 chars long).
The vector512 benchmark takes approx 2.8us and the scalar benchmark takes 4.1us which seems like a fairly large difference and indicative that I’ve done something wrong.
Is there any more profiling I can use to work out what went wrong?
3 Replies
On closer inspection of the runtime guide, I need to:
- use load rather than create inside the loop
- install VTune or something similar to work out what's going wrong at the instruction level
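For the first point, here is a minimal sketch of what "load rather than create" could look like. It is not the code from the paste: `FirstCharScan` and `IndexOfChar` are hypothetical names, and it only scans for a single character instead of a whole substring. It assumes .NET 8 or later, where `Vector512` is available. The constant vector is broadcast once before the loop, and each block of input is read with `Vector512.LoadUnsafe` rather than being built with `Vector512.Create` on every iteration.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class FirstCharScan
{
    // Hypothetical helper: index of the first occurrence of `target` in `text`, or -1.
    public static int IndexOfChar(ReadOnlySpan<char> text, char target)
    {
        // char and ushort have the same size, so the span can be reinterpreted.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        ref ushort start = ref MemoryMarshal.GetReference(data);
        int i = 0;

        if (Vector512.IsHardwareAccelerated && data.Length >= Vector512<ushort>.Count)
        {
            // Broadcast the needle once, outside the loop.
            Vector512<ushort> needle = Vector512.Create((ushort)target);
            int lastBlock = data.Length - Vector512<ushort>.Count;

            for (; i <= lastBlock; i += Vector512<ushort>.Count)
            {
                // Load straight from memory instead of creating a vector from a slice.
                Vector512<ushort> block = Vector512.LoadUnsafe(ref start, (nuint)i);
                ulong mask = Vector512.Equals(block, needle).ExtractMostSignificantBits();
                if (mask != 0)
                {
                    // The lowest set bit is the first matching element in this block.
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar tail, and the whole input when Vector512 is not accelerated.
        for (; i < data.Length; i++)
        {
            if (data[i] == target)
            {
                return i;
            }
        }

        return -1;
    }
}
```

Whether this pattern alone accounts for the gap in the benchmark is something only a profiler or a disassembly of the hot loop can confirm.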
"The vector512 benchmark takes approx 2.8us and the scalar benchmark takes 4.1us which seems like a fairly large difference and indicative that I’ve done something wrong."
I don't understand. This means that the vector512 benchmark is faster: it takes fewer microseconds.
Sorry, I mixed those up. Scalar is 2.8us and vector is 4.1us.
I was mucking about on my work computer, where I don’t have Discord, otherwise I’d have pasted the table.