C#•3y ago

Optimizing some string manipulation

I want to take the part of an input string after the last occurrence of '/' and normalize it so that it contains only alphanumeric characters (a-z, A-Z, 0-9). Characters with diacritics should turn into their non-diacritic versions (ä -> a), and any character that still isn't alphanumeric after that should become _. Here's what I've got so far:

if (input.Length == 0)
{
  return "";
}

Span<char> outBuf = stackalloc char[128];
char* pNorm = stackalloc char[128];

fixed (char* pIn = input, pOut = outBuf)
{
  // 2 is NormalizationD (canonical decomposition) in the Win32 NORM_FORM enum.
  int dLength = NormalizeString(2, (ushort*)pIn, input.Length, (ushort*)pNorm, 128);

  int start = 127, length = 0;
  char first = default;

  for (int i = dLength - 1; i >= 0; i--)
  {
    char c = pNorm[i];

    if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
    {
      continue;
    }

    if (c is '/')
    {
      break;
    }

    pOut[start] = first = c switch
    {
      (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z') => c,
      _ => '_'
    };

    start--;
    length++;
  }

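  if (first is >= '0' and <= '9')
  {
    // The result must not start with a digit, so prepend an underscore.
    pOut[start] = '_';
    length++;
  }
  else
  {
    // start sits one slot before the first written char; move onto it.
    start++;
  }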

  return outBuf.Slice(start, length).ToString();
}

[DllImport("normaliz")]
static extern int NormalizeString(
  int normForm,
  ushort* source,
  int sourceLength,
  ushort* destination,
  int destinationLength);

However, this is hardly faster than using Substring and Normalize (with some custom code involving CharUnicodeInfo.GetUnicodeCategory). Any ideas?

8 Replies

eroOP•3y ago

Ah, right, and if the final string begins with a digit, it should prepend _ as well. For example, /Foo/123Bar (Bäz) would get turned into _123Bar__Baz_.
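For comparison, the Substring-and-Normalize baseline mentioned in the question might look roughly like the sketch below. The class and method names (IdentifierSketch, ToIdentifier) are made up for illustration, and this is only a guess at what that baseline does. It allocates freely and is meant as a correctness reference, not as an optimized version.

```csharp
using System;
using System.Globalization;
using System.Text;

static class IdentifierSketch
{
    // Hypothetical helper name; the thread does not show this method.
    public static string ToIdentifier(string input)
    {
        // LastIndexOf returns -1 when there is no '/', so +1 starts at index 0.
        string tail = input.Substring(input.LastIndexOf('/') + 1);

        // Form D splits 'ä' into 'a' plus a combining diaeresis (U+0308).
        string decomposed = tail.Normalize(NormalizationForm.FormD);

        // +1 leaves room for a possible leading underscore.
        var sb = new StringBuilder(decomposed.Length + 1);

        foreach (char c in decomposed)
        {
            // Drop combining marks so only the base letter survives.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            // The result must not start with a digit.
            if (sb.Length == 0 && (c is >= '0' and <= '9'))
            {
                sb.Append('_');
            }

            // Keep ASCII letters and digits, replace everything else.
            sb.Append(c switch
            {
                (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z') => c,
                _ => '_'
            });
        }

        return sb.ToString();
    }
}
```

Calling IdentifierSketch.ToIdentifier("/Foo/123Bar (Bäz)") should give the _123Bar__Baz_ from the example above, which makes it a handy reference to check a faster version against.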

Zombie•3y ago

Sounds like what you're really doing is turning an arbitrary string into a valid identifier (without Unicode). That unbounded stackalloc is a bad idea, btw. It's not only going to be less efficient than a stackalloc of constant size, it can also hard crash with a stack overflow if you pass in a large string.

eroOP•3y ago

yup, that's basically it. good point on the stackalloc. i know that the max length i'm gonna pass in is 256, so i can just cap it at that
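One common way to cap a stackalloc is to use a constant-size stack buffer for small inputs and fall back to a pooled array for anything larger. The sketch below assumes the 256-char cap from this thread. BufferSketch and ReplaceNonAlphanumeric are hypothetical names, and the method only does the character replacement (no normalization) to keep the buffer pattern visible.

```csharp
#nullable enable
using System;
using System.Buffers;

static class BufferSketch
{
    // Cap taken from the thread: inputs are expected to be at most 256 chars.
    const int StackLimit = 256;

    static bool IsAsciiAlphanumeric(char c) =>
        c is (>= '0' and <= '9') or (>= 'A' and <= 'Z') or (>= 'a' and <= 'z');

    // Hypothetical helper: replaces every non-alphanumeric char with '_'.
    public static string ReplaceNonAlphanumeric(ReadOnlySpan<char> input)
    {
        char[]? rented = null;

        // Constant-size stackalloc for the common case, pooled array otherwise.
        Span<char> buffer = input.Length <= StackLimit
            ? stackalloc char[StackLimit]
            : (rented = ArrayPool<char>.Shared.Rent(input.Length));

        try
        {
            for (int i = 0; i < input.Length; i++)
            {
                buffer[i] = IsAsciiAlphanumeric(input[i]) ? input[i] : '_';
            }

            return buffer.Slice(0, input.Length).ToString();
        }
        finally
        {
            // Only return the array if the pooled path was taken.
            if (rented is not null)
            {
                ArrayPool<char>.Shared.Return(rented);
            }
        }
    }
}
```

If inputs really never exceed 256 chars, the fallback branch is just a safety net. Keep in mind that a decomposed (Form D) string can be longer than its input, so a buffer that holds normalized text needs some headroom.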

Zombie•3y ago

Declaring pNorm inside the fixed is also weird and potentially more expensive than necessary. A stackalloc is already fixed. In that example you're getting a Span and then pinning it unnecessarily. I'd look into this further, but it's late and I'm on my phone, so I don't wanna suffer with that experience lol
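To illustrate the pinning point: stack memory never moves, so a pointer taken straight from stackalloc needs no fixed statement, and only the managed string has to be pinned. In the sketch below, CopyChars is a made-up stand-in for a native call such as NormalizeString so that it runs without normaliz.dll. PinningSketch and RoundTrip are hypothetical names, and the code needs AllowUnsafeBlocks enabled.

```csharp
using System;

static unsafe class PinningSketch
{
    // Stand-in for a native call like NormalizeString; it only copies chars.
    static int CopyChars(char* source, int sourceLength, char* destination, int destinationLength)
    {
        int count = Math.Min(sourceLength, destinationLength);
        for (int i = 0; i < count; i++)
        {
            destination[i] = source[i];
        }
        return count;
    }

    public static string RoundTrip(string input)
    {
        const int Capacity = 256;

        // Already a stable pointer: no Span and no fixed statement required.
        char* pDest = stackalloc char[Capacity];

        // The string lives on the managed heap, so it is the only thing to pin.
        fixed (char* pSrc = input)
        {
            int written = CopyChars(pSrc, input.Length, pDest, Capacity);
            return new string(pDest, 0, written);
        }
    }
}
```

In the posted code that would mean declaring the output buffer the same way as pNorm (char* from stackalloc) and keeping only pIn in the fixed statement.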

eroOP•3y ago

no worries, you've already helped a good amount

Unknown User•3y ago

Message Not Public

eroOP•3y ago

yup, as well as substringing it from the last / onwards and, if the first character is a digit, prepending an underscore

Unknown User•3y ago

Message Not Public
