C# · 3y ago
ero

Optimizing some string manipulation

I want to take the part of an input string after the last occurrence of '/' and normalize it so that it contains only alphanumeric characters (a-z, A-Z, 0-9). Characters with diacritics should turn into their non-diacritic versions (ä -> a), and any character that can't be normalized this way should turn into _. Here's what I've got so far:
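if (input.Length == 0)
{
    return "";
}

// Fixed-size scratch buffers: this assumes the decomposed input fits in 128 chars.
Span<char> outBuf = stackalloc char[128];
char* pNorm = stackalloc char[128];

fixed (char* pIn = input, pOut = outBuf)
{
    // 2 = NormalizationD (canonical decomposition), so 'ä' becomes 'a' + U+0308.
    int dLength = NormalizeString(2, (ushort*)pIn, input.Length, (ushort*)pNorm, 128);

    // outBuf is filled from the back; start is always the next free slot.
    int start = 127, length = 0;
    char first = default;

    for (int i = dLength - 1; i >= 0; i--)
    {
        char c = pNorm[i];

        // Skip combining marks left over from the decomposition.
        if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
        {
            continue;
        }

        // Everything before the last '/' is discarded.
        if (c is '/')
        {
            break;
        }

        pOut[start] = first = c switch
        {
            (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z') => c,
            _ => '_'
        };

        start--;
        length++;
    }

    // Prepend '_' when the result would otherwise begin with a digit.
    if (first is >= '0' and <= '9')
    {
        pOut[start--] = '_';
        length++;
    }

    // start points one slot before the first written char, hence start + 1.
    return outBuf.Slice(start + 1, length).ToString();
}

[DllImport("normaliz")]
static extern int NormalizeString(
    int normForm,
    ushort* source,
    int sourceLength,
    ushort* destination,
    int destinationLength);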
However, this is hardly faster than using Substring and Normalize (with some custom code involving CharUnicodeInfo.GetUnicodeCategory). Any ideas?
8 Replies
ero (OP) · 3y ago
Ah, right, and if the final string begins with a digit, it should get a _ prepended as well. As an example: /Foo/123Bar (Bäz) would get turned into _123Bar__Baz_.
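For reference, the rules described so far can be written down as a plain managed baseline. This is only a sketch of the stated behavior, not the code from the post and not an optimized version; the names IdentifierSketch and ToIdentifier are made up for illustration, and it uses string.Normalize instead of the NormalizeString P/Invoke.

```csharp
using System;
using System.Globalization;
using System.Text;

static class IdentifierSketch
{
    // Hypothetical helper name; not from the original post.
    public static string ToIdentifier(string input)
    {
        // Keep only what follows the last '/' (the whole string if there is none).
        string tail = input.Substring(input.LastIndexOf('/') + 1);

        // Form D splits 'ä' into 'a' followed by a combining diaeresis (U+0308).
        string decomposed = tail.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length + 1);
        foreach (char c in decomposed)
        {
            // Drop combining marks so only the base letter remains.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            bool isAsciiAlnum = c is (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z');
            sb.Append(isAsciiAlnum ? c : '_');
        }

        // Prepend '_' when the result would otherwise begin with a digit.
        if (sb.Length > 0 && sb[0] is >= '0' and <= '9')
        {
            sb.Insert(0, '_');
        }

        return sb.ToString();
    }
}
```

Calling IdentifierSketch.ToIdentifier("/Foo/123Bar (Bäz)") should give the _123Bar__Baz_ from the example above, which makes it usable as a correctness check for faster variants.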
Zombie · 3y ago
Sounds like what you're really doing is turning an arbitrary string into a valid identifier (without Unicode). That unbounded stackalloc is a bad idea, btw. Not only is it going to be less efficient than a stackalloc of constant size, it can also hard crash with a stack overflow if you pass in a large string.
ero (OP) · 3y ago
yup, that's basically it. good point on the stackalloc. i know that the max length i'm gonna pass in is 256, so i can just cap it at that.
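The stackalloc advice can be sketched like this: a constant-size stack buffer covers inputs up to the 256-char maximum mentioned here, and anything larger falls back to a pooled array instead of growing the stack. BufferSketch and Sanitize are hypothetical names, and the per-character work is just a stand-in for the real normalization.

```csharp
using System;
using System.Buffers;

static class BufferSketch
{
    // 256 is the maximum input length mentioned in the thread.
    private const int MaxStackChars = 256;

    // Hypothetical stand-in operation: replaces anything outside a-z, A-Z, 0-9 with '_'.
    public static string Sanitize(ReadOnlySpan<char> input)
    {
        char[]? rented = null;

        // Constant-size stackalloc for the common case, pooled array otherwise.
        Span<char> buffer = input.Length <= MaxStackChars
            ? stackalloc char[MaxStackChars]
            : (rented = ArrayPool<char>.Shared.Rent(input.Length));

        try
        {
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                bool keep = c is (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z');
                buffer[i] = keep ? c : '_';
            }

            return buffer.Slice(0, input.Length).ToString();
        }
        finally
        {
            // Only return the array if the heap path was actually taken.
            if (rented is not null)
            {
                ArrayPool<char>.Shared.Return(rented);
            }
        }
    }
}
```

If the input really never exceeds 256 chars, the fallback branch is dead weight, but it turns an unexpected long input into a slower call rather than a stack overflow.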
Zombie · 3y ago
Declaring pNorm inside the fixed is also weird and potentially more expensive than necessary. A stackalloc is already fixed. In that example you're getting a Span and then pinning it unnecessarily. I'd look into this further, but it's late and I'm on my phone, so I don't wanna suffer with that experience lol
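A minimal sketch of the pinning point, assuming the project is compiled with unsafe code enabled (AllowUnsafeBlocks): stackalloc memory can be taken as a pointer directly, with no fixed statement, and a Span can still be wrapped around it afterwards. PinningSketch and Demo are hypothetical names, and the copy loop only stands in for the real work.

```csharp
using System;

static class PinningSketch
{
    public static unsafe string Demo()
    {
        // stackalloc memory lives on the stack and never moves,
        // so both pointers can be used without a fixed statement.
        char* pNorm = stackalloc char[128];
        char* pOut = stackalloc char[128];

        pNorm[0] = 'o';
        pNorm[1] = 'k';

        for (int i = 0; i < 2; i++)
        {
            pOut[i] = pNorm[i];
        }

        // Wrap the pointer in a Span when bounds-checked slicing is convenient.
        var span = new Span<char>(pOut, 128);
        return span.Slice(0, 2).ToString();
    }
}
```

With this shape, only the managed input string would still need a fixed statement before being handed to native code.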
ero (OP) · 3y ago
no worries, you've already helped a good amount
Unknown User · 3y ago
Message Not Public
ero (OP) · 3y ago
yup, as well as taking the substring from the last / onwards and, if the first character is a digit, prepending an underscore