Is there a faster way of splitting strings?
I feel like I've hit a local minimum with my code. Basically it splits strings on certain characters and lowercases the result, eg.
"This fair child of mine\nShall sum my count, and make my old excuse"
into
[this, fair, child, of, mine, shall, sum, my, count, and, make, my, old, excuse]
It processes 100MB of text content in 2.33 seconds, but I feel like it should be able to go faster.
- It's too much of a hassle to use Memory<char>
elsewhere in the code, instead of returning strings
- I've tried string pooling instead of constantly allocating new strings, and it actually ends up being slower somehow
Code:
18 Replies
You're potentially allocating an unbounded amount of stack there, so you're risking a stack overflow
Why not use
string.Create
instead?u are using the
SearchValues
in quite a slow way.
the IndexOfAny*
methods from the memory extension methods are a lot faster because they use vectorization instead of going char by char, so something like the following would be faster:oh, i forgot the to lower case part, but that should be easy to add
A call to
ToLowerInvariant
should be pretty optimal. It'll do fewer copies than you do, and it optimises for if the string contains no upper-case characters
Beyond that, time to get out a profile!well, with something like
could be that it will be possible in .net 9 to pass the ROS directly without these shenanigans tho
actually that one wont work and is a gc hole
or it would pin which is slow D:
string.Create()
and the ReadOnlySpan<char>
from my code u make it at least only 1 allocation for the string,
tho it will be a bit ugly because u have to pass the ROS as nint
😒
interesting, i'm not sure why but this version seems to run about 0.4s slower, and using
ToLowerInvariant
doesn't seem to change much
profiler seems to paint the string allocation as the bad guy, but even checking under another mode it doesn't mention the stackalloc at all
25% is not accounted for either, which is funny. maybe i'm really at the limitthis is the correct one:
note that it produces some warnings, but they can be ignored in this specific case
this is actually slower as well somehow. for a 5MB batch of documents its avg is 500ms, the direct copy to stackalloc and checking each char avg is about 450ms
thats weird
yes i think that sums up the last 4 or so hours i've been staring at this
things that generally feel like they should perform better, don't
only thing i could think of is that
ToLowerInvariant()
has generally more overhead than your asciii letters only to lower.
how does it fare if u throw that into the lambda instead of calling ToLowerInvariant()
there?lets take a look
even more interestingly, it avgs at around
550ms
so it seems that ToLowerInvariant is faster in this context but not in the otherAscii.ToLower Method (System.Text)
Copies text from a source buffer to a destination buffer, converting ASCII letters to lowercase during the copy.
Ascii.ToLowerInPlace Method (System.Text)
Performs in-place uppercase conversion.
for ascii only
interesting ... i added the AsciiToLower and ToLowerInvariant as Deferred creation and ran the benchmarks as well and my results are completely different: