C#•6mo ago

Is there a faster way of splitting strings?

I feel like I've hit a local minimum with my code. Basically it splits strings on certain characters and lowercases the result, eg. "This fair child of mine\nShall sum my count, and make my old excuse" into [this, fair, child, of, mine, shall, sum, my, count, and, make, my, old, excuse] It processes 100MB of text content in 2.33 seconds, but I feel like it should be able to go faster. - It's too much of a hassle to use Memory<char> elsewhere in the code, instead of returning strings - I've tried string pooling instead of constantly allocating new strings, and it actually ends up being slower somehow Code:

static string MemoryToLower(ReadOnlySpan<char> span)
{
    Span<char> chars = stackalloc char[span.Length];

    for (int i = 0; i < span.Length; i++)
    {
        char c = span[i];
        chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
    }

    return new string(chars);
}

static SearchValues<char> searchValues = SearchValues.Create(new[] { ' ', '\r', '\n', ',', '.', '?', '"', '\'', ';', '!', '\t', '(', ')', '[', ']', '<', '>', '+', '-', '*' });

static List<string> Tokenizer(string input)
{
    var startIndex = 0;
    var list = new List<string>();

    for (var i = 0; i < input.Length; i++)
    {
        var c = input[i];
        if (searchValues.Contains(c))
        {
            if (i - startIndex > 0)
                list.Add(MemoryToLower(input.AsSpan(startIndex, i - startIndex)));
                
            startIndex = i + 1;
        }
    }
    if (input.Length - startIndex > 0)
        list.Add(MemoryToLower(input.AsSpan(startIndex, input.Length - startIndex)));
    
    return list;
}

static string MemoryToLower(ReadOnlySpan<char> span)
{
    Span<char> chars = stackalloc char[span.Length];

    for (int i = 0; i < span.Length; i++)
    {
        char c = span[i];
        chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
    }

    return new string(chars);
}

static SearchValues<char> searchValues = SearchValues.Create(new[] { ' ', '\r', '\n', ',', '.', '?', '"', '\'', ';', '!', '\t', '(', ')', '[', ']', '<', '>', '+', '-', '*' });

static List<string> Tokenizer(string input)
{
    var startIndex = 0;
    var list = new List<string>();

    for (var i = 0; i < input.Length; i++)
    {
        var c = input[i];
        if (searchValues.Contains(c))
        {
            if (i - startIndex > 0)
                list.Add(MemoryToLower(input.AsSpan(startIndex, i - startIndex)));
                
            startIndex = i + 1;
        }
    }
    if (input.Length - startIndex > 0)
        list.Add(MemoryToLower(input.AsSpan(startIndex, input.Length - startIndex)));
    
    return list;
}

18 Replies

canton7•6mo ago

You're potentially allocating an unbounded amount of stack there, so you're risking a stack overflow Why not use string.Create instead?

cap5lut•6mo ago

u are using the SearchValues in quite a slow way. the IndexOfAny* methods from the memory extension methods are a lot faster because they use vectorization instead of going char by char, so something like the following would be faster:

MODiX•6mo ago

cap5lut

sharplab.io (click here)

var str = "This fair child of mine\nShall sum my count, an...
var searchValues = SearchValues.Create(new[] { ' ', '\r', ...
var list = new List<string>();
var view = str.AsSpan();
while (true) {
    var index = view.IndexOfAny(searchValues);
    // skip all search values
    if (index == 0) {
        index = view.IndexOfAnyExcept(searchValues);
        view = view[index..];
// 18 more lines. Follow the link to view.

var str = "This fair child of mine\nShall sum my count, an...
var searchValues = SearchValues.Create(new[] { ' ', '\r', ...
var list = new List<string>();
var view = str.AsSpan();
while (true) {
    var index = view.IndexOfAny(searchValues);
    // skip all search values
    if (index == 0) {
        index = view.IndexOfAnyExcept(searchValues);
        view = view[index..];
// 18 more lines. Follow the link to view.

React with ❌ to remove this embed.

cap5lut•6mo ago

oh, i forgot the to lower case part, but that should be easy to add

canton7•6mo ago

A call to ToLowerInvariant should be pretty optimal. It'll do fewer copies than you do, and it optimises for if the string contains no upper-case characters Beyond that, time to get out a profile!

cap5lut•6mo ago

well, with string.Create() and the ReadOnlySpan<char> from my code u make it at least only 1 allocation for the string, tho it will be a bit ugly because u have to pass the ROS as nint 😒 something like
static string CreateLowerCaseInvariantString(ReadOnlySpan<Char> value) { unsafe { fixed (char* ptr = value) { return String.Create(value.Length, (Ptr: (nint)ptr, Len: value.Length), static (buffer, info) => { var value = new ReadOnlySpan<char>((char*)info.Ptr, info.Len); value.ToLowerInvariant(buffer); }); } } }
static string CreateLowerCaseInvariantString(ReadOnlySpan<Char> value) { unsafe { fixed (char* ptr = value) { return String.Create(value.Length, (Ptr: (nint)ptr, Len: value.Length), static (buffer, info) => { var value = new ReadOnlySpan<char>((char*)info.Ptr, info.Len); value.ToLowerInvariant(buffer); }); } } }
could be that it will be possible in .net 9 to pass the ROS directly without these shenanigans tho actually that one wont work and is a gc hole or it would pin which is slow D:

BepisOP•6mo ago

interesting, i'm not sure why but this version seems to run about 0.4s slower, and using ToLowerInvariant doesn't seem to change much profiler seems to paint the string allocation as the bad guy, but even checking under another mode it doesn't mention the stackalloc at all 25% is not accounted for either, which is funny. maybe i'm really at the limit

cap5lut•6mo ago

this is the correct one:

static string CreateLowerCaseInvariantString(ReadOnlySpan<char> value)
{
    unsafe
    {
        var ptr = &value;
        return String.Create(value.Length, (nint)ptr, static (buffer, ptr) =>
        {
            var value = *(ReadOnlySpan<char>*)ptr;
            value.ToLowerInvariant(buffer);
        });
    }
}

static string CreateLowerCaseInvariantString(ReadOnlySpan<char> value)
{
    unsafe
    {
        var ptr = &value;
        return String.Create(value.Length, (nint)ptr, static (buffer, ptr) =>
        {
            var value = *(ReadOnlySpan<char>*)ptr;
            value.ToLowerInvariant(buffer);
        });
    }
}

note that it produces some warnings, but they can be ignored in this specific case

BepisOP•6mo ago

this is actually slower as well somehow. for a 5MB batch of documents its avg is 500ms, the direct copy to stackalloc and checking each char avg is about 450ms

cap5lut•6mo ago

thats weird

BepisOP•6mo ago

yes i think that sums up the last 4 or so hours i've been staring at this things that generally feel like they should perform better, don't

cap5lut•6mo ago

only thing i could think of is that ToLowerInvariant() has generally more overhead than your asciii letters only to lower. how does it fare if u throw that into the lambda instead of calling ToLowerInvariant() there?

BepisOP•6mo ago

lets take a look even more interestingly, it avgs at around 550ms so it seems that ToLowerInvariant is faster in this context but not in the other

Petris•6mo ago

https://learn.microsoft.com/en-us/dotnet/api/system.text.ascii.tolower?view=net-8.0

Ascii.ToLower Method (System.Text)

Copies text from a source buffer to a destination buffer, converting ASCII letters to lowercase during the copy.

Petris•6mo ago

https://learn.microsoft.com/en-us/dotnet/api/system.text.ascii.tolowerinplace?view=net-8.0

Ascii.ToLowerInPlace Method (System.Text)

Performs in-place uppercase conversion.

Petris•6mo ago

for ascii only

BepisOP•6mo ago

| Method                      | TestString | Mean      | Error     | StdDev   |
|---------------------------- |----------- |----------:|----------:|---------:|
| BareImplementation          | benchmarks |  25.90 ns |  4.169 ns | 2.481 ns |
| BareImplementationWithCheck | benchmarks |  16.13 ns |  2.335 ns | 1.389 ns |
| AsciiToLower                | benchmarks |  40.56 ns |  1.616 ns | 0.845 ns |
| ToLowerInvariant            | benchmarks |  47.25 ns |  0.531 ns | 0.278 ns |
| DeferredCreation            | benchmarks | 136.06 ns |  9.650 ns | 5.743 ns |
| BareImplementation          | BenchMarks |  21.46 ns |  1.496 ns | 0.890 ns |
| BareImplementationWithCheck | BenchMarks |  26.97 ns |  1.720 ns | 1.023 ns |
| AsciiToLower                | BenchMarks |  40.60 ns |  1.062 ns | 0.555 ns |
| ToLowerInvariant            | BenchMarks |  47.16 ns |  0.902 ns | 0.472 ns |
| DeferredCreation            | BenchMarks | 147.46 ns | 10.854 ns | 5.677 ns |

| Method                      | TestString | Mean      | Error     | StdDev   |
|---------------------------- |----------- |----------:|----------:|---------:|
| BareImplementation          | benchmarks |  25.90 ns |  4.169 ns | 2.481 ns |
| BareImplementationWithCheck | benchmarks |  16.13 ns |  2.335 ns | 1.389 ns |
| AsciiToLower                | benchmarks |  40.56 ns |  1.616 ns | 0.845 ns |
| ToLowerInvariant            | benchmarks |  47.25 ns |  0.531 ns | 0.278 ns |
| DeferredCreation            | benchmarks | 136.06 ns |  9.650 ns | 5.743 ns |
| BareImplementation          | BenchMarks |  21.46 ns |  1.496 ns | 0.890 ns |
| BareImplementationWithCheck | BenchMarks |  26.97 ns |  1.720 ns | 1.023 ns |
| AsciiToLower                | BenchMarks |  40.60 ns |  1.062 ns | 0.555 ns |
| ToLowerInvariant            | BenchMarks |  47.16 ns |  0.902 ns | 0.472 ns |
| DeferredCreation            | BenchMarks | 147.46 ns | 10.854 ns | 5.677 ns |

    public class LowercaseBenchmark
    {
        [Params("benchmarks", "BenchMarks")]
        public string TestString { get; set; }
        
        [Benchmark]
        public string BareImplementation()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];

            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
            }

            return new string(chars);
        }
        
        [Benchmark]
        public string BareImplementationWithCheck()
        {
            ReadOnlySpan<char> span = TestString;

            bool hasUppercase = false;
            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                if (c >= 'A' && c <= 'Z')
                {
                    hasUppercase = true;
                    break;
                }
            }

            if (!hasUppercase)
                return new string(span);

            Span<char> chars = stackalloc char[span.Length];

            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
            }

            return new string(chars);
        }
        
        [Benchmark]
        public string AsciiToLower()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];
        
            Ascii.ToLower(span, chars, out _);

            return new string(chars);
        }
        
        [Benchmark]
        public string ToLowerInvariant()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];
        
            span.ToLowerInvariant(chars);

            return new string(chars);
        }
        
        [Benchmark]
        public string DeferredCreation()
        {
            ReadOnlySpan<char> span = TestString;

            unsafe
            {
                var ptr = &span;
                return String.Create(span.Length, (nint)ptr, static (buffer, ptr) =>
                {
                    var span = *(ReadOnlySpan<char>*)ptr;
                    for (int i = 0; i < span.Length; i++)
                    {
                        char c = span[i];
                        buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
                    }
                });
            }
        }
    }

    public class LowercaseBenchmark
    {
        [Params("benchmarks", "BenchMarks")]
        public string TestString { get; set; }
        
        [Benchmark]
        public string BareImplementation()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];

            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
            }

            return new string(chars);
        }
        
        [Benchmark]
        public string BareImplementationWithCheck()
        {
            ReadOnlySpan<char> span = TestString;

            bool hasUppercase = false;
            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                if (c >= 'A' && c <= 'Z')
                {
                    hasUppercase = true;
                    break;
                }
            }

            if (!hasUppercase)
                return new string(span);

            Span<char> chars = stackalloc char[span.Length];

            for (int i = 0; i < span.Length; i++)
            {
                char c = span[i];
                chars[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
            }

            return new string(chars);
        }
        
        [Benchmark]
        public string AsciiToLower()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];
        
            Ascii.ToLower(span, chars, out _);

            return new string(chars);
        }
        
        [Benchmark]
        public string ToLowerInvariant()
        {
            ReadOnlySpan<char> span = TestString;
            Span<char> chars = stackalloc char[span.Length];
        
            span.ToLowerInvariant(chars);

            return new string(chars);
        }
        
        [Benchmark]
        public string DeferredCreation()
        {
            ReadOnlySpan<char> span = TestString;

            unsafe
            {
                var ptr = &span;
                return String.Create(span.Length, (nint)ptr, static (buffer, ptr) =>
                {
                    var span = *(ReadOnlySpan<char>*)ptr;
                    for (int i = 0; i < span.Length; i++)
                    {
                        char c = span[i];
                        buffer[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
                    }
                });
            }
        }
    }

cap5lut•6mo ago

interesting ... i added the AsciiToLower and ToLowerInvariant as Deferred creation and ran the benchmarks as well and my results are completely different:

| Method                           | TestString | Mean     | Error    | StdDev   |
|--------------------------------- |----------- |---------:|---------:|---------:|
| BareImplementation               | benchmarks | 31.64 ns | 0.260 ns | 0.373 ns |
| BareImplementationWithCheck      | benchmarks | 23.20 ns | 0.237 ns | 0.198 ns |
| AsciiToLower                     | benchmarks | 30.00 ns | 0.591 ns | 0.494 ns |
| ToLowerInvariant                 | benchmarks | 36.13 ns | 0.621 ns | 0.518 ns |
| DeferredCreation                 | benchmarks | 23.04 ns | 0.209 ns | 0.205 ns |
| DeferredCreationAsciiToLower     | benchmarks | 24.07 ns | 0.419 ns | 0.350 ns |
| DeferredCreationToLowerInvariant | benchmarks | 27.06 ns | 0.315 ns | 0.294 ns |
| BareImplementation               | BenchMarks | 29.62 ns | 0.627 ns | 0.644 ns |
| BareImplementationWithCheck      | BenchMarks | 35.43 ns | 0.705 ns | 0.589 ns |
| AsciiToLower                     | BenchMarks | 29.81 ns | 0.441 ns | 0.391 ns |
| ToLowerInvariant                 | BenchMarks | 34.92 ns | 0.681 ns | 0.604 ns |
| DeferredCreation                 | BenchMarks | 23.86 ns | 0.246 ns | 0.192 ns |
| DeferredCreationAsciiToLower     | BenchMarks | 24.32 ns | 0.388 ns | 0.363 ns |
| DeferredCreationToLowerInvariant | BenchMarks | 27.30 ns | 0.530 ns | 0.496 ns |

| Method                           | TestString | Mean     | Error    | StdDev   |
|--------------------------------- |----------- |---------:|---------:|---------:|
| BareImplementation               | benchmarks | 31.64 ns | 0.260 ns | 0.373 ns |
| BareImplementationWithCheck      | benchmarks | 23.20 ns | 0.237 ns | 0.198 ns |
| AsciiToLower                     | benchmarks | 30.00 ns | 0.591 ns | 0.494 ns |
| ToLowerInvariant                 | benchmarks | 36.13 ns | 0.621 ns | 0.518 ns |
| DeferredCreation                 | benchmarks | 23.04 ns | 0.209 ns | 0.205 ns |
| DeferredCreationAsciiToLower     | benchmarks | 24.07 ns | 0.419 ns | 0.350 ns |
| DeferredCreationToLowerInvariant | benchmarks | 27.06 ns | 0.315 ns | 0.294 ns |
| BareImplementation               | BenchMarks | 29.62 ns | 0.627 ns | 0.644 ns |
| BareImplementationWithCheck      | BenchMarks | 35.43 ns | 0.705 ns | 0.589 ns |
| AsciiToLower                     | BenchMarks | 29.81 ns | 0.441 ns | 0.391 ns |
| ToLowerInvariant                 | BenchMarks | 34.92 ns | 0.681 ns | 0.604 ns |
| DeferredCreation                 | BenchMarks | 23.86 ns | 0.246 ns | 0.192 ns |
| DeferredCreationAsciiToLower     | BenchMarks | 24.32 ns | 0.388 ns | 0.363 ns |
| DeferredCreationToLowerInvariant | BenchMarks | 27.30 ns | 0.530 ns | 0.496 ns |

Gaming

Programming

Is there a faster way of splitting strings?

Did you find this page helpful?