C
C#4mo ago
Bepis

Is there a faster way of splitting strings?

I feel like I've hit a local minimum with my code. Basically it splits strings on certain characters and lowercases the result, eg. "This fair child of mine\nShall sum my count, and make my old excuse" into [this, fair, child, of, mine, shall, sum, my, count, and, make, my, old, excuse] It processes 100MB of text content in 2.33 seconds, but I feel like it should be able to go faster. - It's too much of a hassle to use Memory<char> elsewhere in the code, instead of returning strings - I've tried string pooling instead of constantly allocating new strings, and it actually ends up being slower somehow Code:
static string MemoryToLower(ReadOnlySpan<char> span)
{
Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

static SearchValues<char> searchValues = SearchValues.Create(new[] { ' ', '\r', '\n', ',', '.', '?', '"', '\'', ';', '!', '\t', '(', ')', '[', ']', '<', '>', '+', '-', '*' });

static List<string> Tokenizer(string input)
{
var startIndex = 0;
var list = new List<string>();

for (var i = 0; i < input.Length; i++)
{
var c = input[i];
if (searchValues.Contains(c))
{
if (i - startIndex > 0)
list.Add(MemoryToLower(input.AsSpan(startIndex, i - startIndex)));

startIndex = i + 1;
}
}
if (input.Length - startIndex > 0)
list.Add(MemoryToLower(input.AsSpan(startIndex, input.Length - startIndex)));

return list;
}
static string MemoryToLower(ReadOnlySpan<char> span)
{
Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

static SearchValues<char> searchValues = SearchValues.Create(new[] { ' ', '\r', '\n', ',', '.', '?', '"', '\'', ';', '!', '\t', '(', ')', '[', ']', '<', '>', '+', '-', '*' });

static List<string> Tokenizer(string input)
{
var startIndex = 0;
var list = new List<string>();

for (var i = 0; i < input.Length; i++)
{
var c = input[i];
if (searchValues.Contains(c))
{
if (i - startIndex > 0)
list.Add(MemoryToLower(input.AsSpan(startIndex, i - startIndex)));

startIndex = i + 1;
}
}
if (input.Length - startIndex > 0)
list.Add(MemoryToLower(input.AsSpan(startIndex, input.Length - startIndex)));

return list;
}
18 Replies
canton7
canton74mo ago
You're potentially allocating an unbounded amount of stack there, so you're risking a stack overflow Why not use string.Create instead?
cap5lut
cap5lut4mo ago
u are using the SearchValues in quite a slow way. the IndexOfAny* methods from the memory extension methods are a lot faster because they use vectorization instead of going char by char, so something like the following would be faster:
MODiX
MODiX4mo ago
cap5lut
sharplab.io (click here)
var str = "This fair child of mine\nShall sum my count, an...
var searchValues = SearchValues.Create(new[] { ' ', '\r', ...
var list = new List<string>();
var view = str.AsSpan();
while (true) {
var index = view.IndexOfAny(searchValues);
// skip all search values
if (index == 0) {
index = view.IndexOfAnyExcept(searchValues);
view = view[index..];
// 18 more lines. Follow the link to view.
var str = "This fair child of mine\nShall sum my count, an...
var searchValues = SearchValues.Create(new[] { ' ', '\r', ...
var list = new List<string>();
var view = str.AsSpan();
while (true) {
var index = view.IndexOfAny(searchValues);
// skip all search values
if (index == 0) {
index = view.IndexOfAnyExcept(searchValues);
view = view[index..];
// 18 more lines. Follow the link to view.
React with ❌ to remove this embed.
cap5lut
cap5lut4mo ago
oh, i forgot the to lower case part, but that should be easy to add
canton7
canton74mo ago
A call to ToLowerInvariant should be pretty optimal. It'll do fewer copies than you do, and it optimises for if the string contains no upper-case characters Beyond that, time to get out a profile!
cap5lut
cap5lut4mo ago
well, with string.Create() and the ReadOnlySpan<char> from my code u make it at least only 1 allocation for the string, tho it will be a bit ugly because u have to pass the ROS as nint 😒 something like
static string CreateLowerCaseInvariantString(ReadOnlySpan<Char> value)
{
unsafe
{
fixed (char* ptr = value)
{
return String.Create(value.Length, (Ptr: (nint)ptr, Len: value.Length), static (buffer, info) =>
{
var value = new ReadOnlySpan<char>((char*)info.Ptr, info.Len);
value.ToLowerInvariant(buffer);
});
}
}
}
static string CreateLowerCaseInvariantString(ReadOnlySpan<Char> value)
{
unsafe
{
fixed (char* ptr = value)
{
return String.Create(value.Length, (Ptr: (nint)ptr, Len: value.Length), static (buffer, info) =>
{
var value = new ReadOnlySpan<char>((char*)info.Ptr, info.Len);
value.ToLowerInvariant(buffer);
});
}
}
}
could be that it will be possible in .net 9 to pass the ROS directly without these shenanigans tho actually that one wont work and is a gc hole or it would pin which is slow D:
Bepis
BepisOP4mo ago
interesting, i'm not sure why but this version seems to run about 0.4s slower, and using ToLowerInvariant doesn't seem to change much profiler seems to paint the string allocation as the bad guy, but even checking under another mode it doesn't mention the stackalloc at all 25% is not accounted for either, which is funny. maybe i'm really at the limit
No description
cap5lut
cap5lut4mo ago
this is the correct one:
static string CreateLowerCaseInvariantString(ReadOnlySpan<char> value)
{
unsafe
{
var ptr = &value;
return String.Create(value.Length, (nint)ptr, static (buffer, ptr) =>
{
var value = *(ReadOnlySpan<char>*)ptr;
value.ToLowerInvariant(buffer);
});
}
}
static string CreateLowerCaseInvariantString(ReadOnlySpan<char> value)
{
unsafe
{
var ptr = &value;
return String.Create(value.Length, (nint)ptr, static (buffer, ptr) =>
{
var value = *(ReadOnlySpan<char>*)ptr;
value.ToLowerInvariant(buffer);
});
}
}
note that it produces some warnings, but they can be ignored in this specific case
Bepis
BepisOP4mo ago
this is actually slower as well somehow. for a 5MB batch of documents its avg is 500ms, the direct copy to stackalloc and checking each char avg is about 450ms
cap5lut
cap5lut4mo ago
thats weird
Bepis
BepisOP4mo ago
yes i think that sums up the last 4 or so hours i've been staring at this things that generally feel like they should perform better, don't
cap5lut
cap5lut4mo ago
only thing i could think of is that ToLowerInvariant() has generally more overhead than your asciii letters only to lower. how does it fare if u throw that into the lambda instead of calling ToLowerInvariant() there?
Bepis
BepisOP4mo ago
lets take a look even more interestingly, it avgs at around 550ms so it seems that ToLowerInvariant is faster in this context but not in the other
Petris
Petris4mo ago
Ascii.ToLower Method (System.Text)
Copies text from a source buffer to a destination buffer, converting ASCII letters to lowercase during the copy.
Petris
Petris4mo ago
for ascii only
Bepis
BepisOP4mo ago
| Method | TestString | Mean | Error | StdDev |
|---------------------------- |----------- |----------:|----------:|---------:|
| BareImplementation | benchmarks | 25.90 ns | 4.169 ns | 2.481 ns |
| BareImplementationWithCheck | benchmarks | 16.13 ns | 2.335 ns | 1.389 ns |
| AsciiToLower | benchmarks | 40.56 ns | 1.616 ns | 0.845 ns |
| ToLowerInvariant | benchmarks | 47.25 ns | 0.531 ns | 0.278 ns |
| DeferredCreation | benchmarks | 136.06 ns | 9.650 ns | 5.743 ns |
| BareImplementation | BenchMarks | 21.46 ns | 1.496 ns | 0.890 ns |
| BareImplementationWithCheck | BenchMarks | 26.97 ns | 1.720 ns | 1.023 ns |
| AsciiToLower | BenchMarks | 40.60 ns | 1.062 ns | 0.555 ns |
| ToLowerInvariant | BenchMarks | 47.16 ns | 0.902 ns | 0.472 ns |
| DeferredCreation | BenchMarks | 147.46 ns | 10.854 ns | 5.677 ns |
| Method | TestString | Mean | Error | StdDev |
|---------------------------- |----------- |----------:|----------:|---------:|
| BareImplementation | benchmarks | 25.90 ns | 4.169 ns | 2.481 ns |
| BareImplementationWithCheck | benchmarks | 16.13 ns | 2.335 ns | 1.389 ns |
| AsciiToLower | benchmarks | 40.56 ns | 1.616 ns | 0.845 ns |
| ToLowerInvariant | benchmarks | 47.25 ns | 0.531 ns | 0.278 ns |
| DeferredCreation | benchmarks | 136.06 ns | 9.650 ns | 5.743 ns |
| BareImplementation | BenchMarks | 21.46 ns | 1.496 ns | 0.890 ns |
| BareImplementationWithCheck | BenchMarks | 26.97 ns | 1.720 ns | 1.023 ns |
| AsciiToLower | BenchMarks | 40.60 ns | 1.062 ns | 0.555 ns |
| ToLowerInvariant | BenchMarks | 47.16 ns | 0.902 ns | 0.472 ns |
| DeferredCreation | BenchMarks | 147.46 ns | 10.854 ns | 5.677 ns |
public class LowercaseBenchmark
{
[Params("benchmarks", "BenchMarks")]
public string TestString { get; set; }

[Benchmark]
public string BareImplementation()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

[Benchmark]
public string BareImplementationWithCheck()
{
ReadOnlySpan<char> span = TestString;

bool hasUppercase = false;
for (int i = 0; i < span.Length; i++)
{
char c = span[i];
if (c >= 'A' && c <= 'Z')
{
hasUppercase = true;
break;
}
}

if (!hasUppercase)
return new string(span);

Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

[Benchmark]
public string AsciiToLower()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

Ascii.ToLower(span, chars, out _);

return new string(chars);
}

[Benchmark]
public string ToLowerInvariant()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

span.ToLowerInvariant(chars);

return new string(chars);
}

[Benchmark]
public string DeferredCreation()
{
ReadOnlySpan<char> span = TestString;

unsafe
{
var ptr = &span;
return String.Create(span.Length, (nint)ptr, static (buffer, ptr) =>
{
var span = *(ReadOnlySpan<char>*)ptr;
for (int i = 0; i < span.Length; i++)
{
char c = span[i];
buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}
});
}
}
}
public class LowercaseBenchmark
{
[Params("benchmarks", "BenchMarks")]
public string TestString { get; set; }

[Benchmark]
public string BareImplementation()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

[Benchmark]
public string BareImplementationWithCheck()
{
ReadOnlySpan<char> span = TestString;

bool hasUppercase = false;
for (int i = 0; i < span.Length; i++)
{
char c = span[i];
if (c >= 'A' && c <= 'Z')
{
hasUppercase = true;
break;
}
}

if (!hasUppercase)
return new string(span);

Span<char> chars = stackalloc char[span.Length];

for (int i = 0; i < span.Length; i++)
{
char c = span[i];
chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}

return new string(chars);
}

[Benchmark]
public string AsciiToLower()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

Ascii.ToLower(span, chars, out _);

return new string(chars);
}

[Benchmark]
public string ToLowerInvariant()
{
ReadOnlySpan<char> span = TestString;
Span<char> chars = stackalloc char[span.Length];

span.ToLowerInvariant(chars);

return new string(chars);
}

[Benchmark]
public string DeferredCreation()
{
ReadOnlySpan<char> span = TestString;

unsafe
{
var ptr = &span;
return String.Create(span.Length, (nint)ptr, static (buffer, ptr) =>
{
var span = *(ReadOnlySpan<char>*)ptr;
for (int i = 0; i < span.Length; i++)
{
char c = span[i];
buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
}
});
}
}
}
cap5lut
cap5lut4mo ago
interesting ... i added the AsciiToLower and ToLowerInvariant as Deferred creation and ran the benchmarks as well and my results are completely different:
| Method | TestString | Mean | Error | StdDev |
|--------------------------------- |----------- |---------:|---------:|---------:|
| BareImplementation | benchmarks | 31.64 ns | 0.260 ns | 0.373 ns |
| BareImplementationWithCheck | benchmarks | 23.20 ns | 0.237 ns | 0.198 ns |
| AsciiToLower | benchmarks | 30.00 ns | 0.591 ns | 0.494 ns |
| ToLowerInvariant | benchmarks | 36.13 ns | 0.621 ns | 0.518 ns |
| DeferredCreation | benchmarks | 23.04 ns | 0.209 ns | 0.205 ns |
| DeferredCreationAsciiToLower | benchmarks | 24.07 ns | 0.419 ns | 0.350 ns |
| DeferredCreationToLowerInvariant | benchmarks | 27.06 ns | 0.315 ns | 0.294 ns |
| BareImplementation | BenchMarks | 29.62 ns | 0.627 ns | 0.644 ns |
| BareImplementationWithCheck | BenchMarks | 35.43 ns | 0.705 ns | 0.589 ns |
| AsciiToLower | BenchMarks | 29.81 ns | 0.441 ns | 0.391 ns |
| ToLowerInvariant | BenchMarks | 34.92 ns | 0.681 ns | 0.604 ns |
| DeferredCreation | BenchMarks | 23.86 ns | 0.246 ns | 0.192 ns |
| DeferredCreationAsciiToLower | BenchMarks | 24.32 ns | 0.388 ns | 0.363 ns |
| DeferredCreationToLowerInvariant | BenchMarks | 27.30 ns | 0.530 ns | 0.496 ns |
| Method | TestString | Mean | Error | StdDev |
|--------------------------------- |----------- |---------:|---------:|---------:|
| BareImplementation | benchmarks | 31.64 ns | 0.260 ns | 0.373 ns |
| BareImplementationWithCheck | benchmarks | 23.20 ns | 0.237 ns | 0.198 ns |
| AsciiToLower | benchmarks | 30.00 ns | 0.591 ns | 0.494 ns |
| ToLowerInvariant | benchmarks | 36.13 ns | 0.621 ns | 0.518 ns |
| DeferredCreation | benchmarks | 23.04 ns | 0.209 ns | 0.205 ns |
| DeferredCreationAsciiToLower | benchmarks | 24.07 ns | 0.419 ns | 0.350 ns |
| DeferredCreationToLowerInvariant | benchmarks | 27.06 ns | 0.315 ns | 0.294 ns |
| BareImplementation | BenchMarks | 29.62 ns | 0.627 ns | 0.644 ns |
| BareImplementationWithCheck | BenchMarks | 35.43 ns | 0.705 ns | 0.589 ns |
| AsciiToLower | BenchMarks | 29.81 ns | 0.441 ns | 0.391 ns |
| ToLowerInvariant | BenchMarks | 34.92 ns | 0.681 ns | 0.604 ns |
| DeferredCreation | BenchMarks | 23.86 ns | 0.246 ns | 0.192 ns |
| DeferredCreationAsciiToLower | BenchMarks | 24.32 ns | 0.388 ns | 0.363 ns |
| DeferredCreationToLowerInvariant | BenchMarks | 27.30 ns | 0.530 ns | 0.496 ns |

Did you find this page helpful?