Why is my SIMD code slower than the scalar version?

I wrote the following to learn more about SIMD; it tries to find a substring: https://paste.mod.gg/jswpvcpgoxgo/0 I ran a benchmark on my machine, which supports AVX-512, and compared it against the scalar path, which I forced by setting DOTNET_EnableHWIntrinsic=0. The benchmark input is 2 paragraphs of Lorem Ipsum (1156 chars) and a search string of a few words (47 chars). The vector512 benchmark takes approx 2.8us and the scalar benchmark takes 4.1us which seems like a fairly large difference and indicative that I’ve done something wrong. Is there any more profiling I can do to work out what went wrong?
3 Replies
dreadfullydistinct (OP), 10mo ago
On closer inspection of the runtime guide, I need to:
- use load rather than create in the loop
- install VTune or something to work out what’s going wrong in terms of the instructions
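A minimal sketch of the first point, for anyone reading along. The pasted code is not reproduced here, so this is not the original implementation: `FirstCharSearch` and `IndexOfChar` are hypothetical names, and the example only searches for a single char rather than a substring. It assumes .NET 8 or later, where `Vector512` is available. The idea is to broadcast the needle once with `Vector512.Create` outside the loop and read each block of the haystack with `Vector512.LoadUnsafe` inside it.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class FirstCharSearch
{
    // Hypothetical helper: index of the first occurrence of needle, or -1.
    public static int IndexOfChar(ReadOnlySpan<char> haystack, char needle)
    {
        // Reinterpret the chars as ushort so they can be used as vector elements.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(haystack);
        int i = 0;

        if (Vector512.IsHardwareAccelerated && data.Length >= Vector512<ushort>.Count)
        {
            // Broadcast the needle once, outside the loop.
            Vector512<ushort> target = Vector512.Create((ushort)needle);
            ref ushort start = ref MemoryMarshal.GetReference(data);
            int lastBlock = data.Length - Vector512<ushort>.Count;

            for (; i <= lastBlock; i += Vector512<ushort>.Count)
            {
                // Load the block directly from memory at element offset i,
                // instead of constructing it with a Create overload each iteration.
                Vector512<ushort> block = Vector512.LoadUnsafe(ref start, (nuint)i);

                // One bit per element; a set bit marks a matching char.
                ulong mask = Vector512.Equals(block, target).ExtractMostSignificantBits();
                if (mask != 0)
                {
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar tail, and the fallback when Vector512 is not accelerated.
        for (; i < data.Length; i++)
        {
            if (data[i] == needle)
            {
                return i;
            }
        }

        return -1;
    }
}
```

Calling `FirstCharSearch.IndexOfChar(text, 'x')` takes the vector path only when the input has at least 32 chars and `Vector512.IsHardwareAccelerated` is true; otherwise the scalar loop handles everything. Whether this closes the gap in the benchmark above is untested here.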
reflectronic, 10mo ago
"The vector512 benchmark takes approx 2.8us and the scalar benchmark takes 4.1us which seems like a fairly large difference and indicative that I’ve done something wrong."
i don't understand. this means that the vector512 benchmark is faster. it takes fewer microseconds
dreadfullydistinct (OP), 10mo ago
I muddled those around, sorry. Scalar is 2.8us and vector is 4.1us. I was mucking about on my work computer, where I don’t have Discord, otherwise I’d have pasted the table.