char utf16-length
since utf16 is variable-width, given a char, how can I tell how many bytes it requires when encoded in utf16?
oh okay, so nothing for chars I assume?
nothing i could find with a cursory browse
also, are strings indexed by byte index or by character index?
by character
char is always 2 bytes right, but its encoding in utf16 might not be
i mean the actual C# type has a constant size
yes
so, is string indexing O(n)? AFAIK, since the encoding is variable-width you have to start from the start to count characters
no, that's what i'm saying
C# strings are always internally represented as UTF-16
and utf-16 is variable width
i'm not even sure C# supports characters larger than 2 bytes through "standard" APIs
i haven't really had to care about it personally
but they could be 1 or 2 bytes
no, utf-16 characters are a minimum of 2 bytes
The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
1 or 2 16 bit units
ahhh
yeah
makes sense
utf8 can go all the way down to one byte
and following that pattern utf32 is always 4 bytes
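fwiw you can see those sizes directly with Encoding.GetByteCount — a quick sketch, the sample string is just for illustration:
```cs
using System;
using System.Text;

string s = "aé€🚒"; // 1-, 2-, 3-, and 4-byte characters in UTF-8

Console.WriteLine(s.Length);                         // 5  — UTF-16 code units (🚒 takes two)
Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 10 — 1 + 2 + 3 + 4
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 10 — 2 + 2 + 2 + 4 (Encoding.Unicode is UTF-16)
Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 16 — always 4 per code point
```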
yeah, I got mixed up. But it's still kind of the same problem
what's the context around your need for encoding?
parsing input read from a file
the file is external, meaning I don't know in advance the input in question
are you expecting the file encoding to vary?
it could be anything, but I expect it to be utf8.
if the encoding is invalid/not-supported, the program should report an error and exit.
a char is a UTF-16 code unit. it could be half of a code point
oh i should have scrolled down
i'm not actually sure if/what C# APIs do encoding detection on input streams
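StreamReader does sniff a BOM by default (detectEncodingFromByteOrderMarks), and you can make UTF-8 decoding strict so invalid bytes throw instead of silently becoming U+FFFD. a rough sketch of the report-and-exit idea — the file name is made up:
```cs
using System;
using System.IO;
using System.Text;

// throwOnInvalidBytes: true — decoding throws DecoderFallbackException
// on malformed input instead of substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false,
    throwOnInvalidBytes: true);

try
{
    using var reader = new StreamReader("input.txt", strictUtf8,
        detectEncodingFromByteOrderMarks: true);
    string text = reader.ReadToEnd();
    // ... parse text ...
}
catch (DecoderFallbackException)
{
    Console.Error.WriteLine("error: input is not valid UTF-8");
    Environment.Exit(1);
}
```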
yes, the string indexer is kind of busted for this reason
Does it really do linear seek O_O?
no
it is a direct char offset. if that offset is half of a surrogate pair then that's what you get
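e.g., just to illustrate what the indexer hands back:
```cs
using System;

string s = "🚒"; // U+1F692 FIRE ENGINE: one code point, two UTF-16 code units

Console.WriteLine(s.Length);                   // 2 — Length counts code units, not code points
Console.WriteLine($"{(int)s[0]:X4}");          // D83D — high surrogate, plain O(1) lookup
Console.WriteLine($"{(int)s[1]:X4}");          // DE92 — low surrogate
Console.WriteLine(char.IsHighSurrogate(s[0])); // True
```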
you have to be pretty diligent using char.IsLowSurrogate/char.IsHighSurrogate if you want your code to work with surrogate pairs. (for in-memory strings there are convenient helpers, but not really for TextReader, unfortunately)
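a rough sketch of that pairing dance over a TextReader — the helper name is mine, not a framework API:
```cs
using System;
using System.Collections.Generic;
using System.IO;

// Read code points one at a time from a TextReader, pairing surrogates by hand.
static IEnumerable<int> ReadCodePoints(TextReader reader)
{
    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (char.IsHighSurrogate(c))
        {
            int low = reader.Read();
            if (low == -1 || !char.IsLowSurrogate((char)low))
                throw new InvalidDataException("unpaired high surrogate");
            yield return char.ConvertToUtf32(c, (char)low); // combine the pair
        }
        else if (char.IsLowSurrogate(c))
        {
            throw new InvalidDataException("unpaired low surrogate");
        }
        else
        {
            yield return c; // BMP code point, a single code unit
        }
    }
}

using var reader = new StringReader("a\uD83D\uDE92"); // "a🚒"
foreach (int cp in ReadCodePoints(reader))
    Console.WriteLine($"U+{cp:X4}"); // U+0061, U+1F692
```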
so, if you read one half of a surrogate pair from your TextReader, be prepared to read the other half following it and treat them as one code point
https://learn.microsoft.com/en-us/dotnet/api/system.text.rune?view=net-7.0 represents a full code point and it has some helpers you may find useful
Rune Struct (System.Text)
Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive).
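e.g. Rune can answer the original question directly — how many code units/bytes a given code point needs in each encoding:
```cs
using System;
using System.Text;

foreach (Rune rune in "a🚒".EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4}: " +
        $"{rune.Utf16SequenceLength} UTF-16 code unit(s), " +
        $"{rune.Utf8SequenceLength} UTF-8 byte(s)");
}
// U+0061: 1 UTF-16 code unit(s), 1 UTF-8 byte(s)
// U+1F692: 2 UTF-16 code unit(s), 4 UTF-8 byte(s)
```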
huh
utf16 is a bit strange to me, I don't fully understand what surrogates are for
for a given unicode code point, it can require either one or two UTF-16 code units (chars, 16-bit integers) to encode
if it requires one, then that is sort of self-explanatory
if it requires two, then they come in a surrogate pair. there is first a high surrogate code unit, and then a low surrogate code unit. together they encode one code point
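concretely, the mapping works like this (worked through for U+1F692, the fire engine):
```cs
using System;

int cp = 0x1F692;                 // 🚒, outside the BMP
int v = cp - 0x10000;             // 0x0F692 — leaves 20 bits to split across two units
int high = 0xD800 + (v >> 10);    // 0xD83D — top 10 bits, in the D800–DBFF range
int low  = 0xDC00 + (v & 0x3FF);  // 0xDE92 — bottom 10 bits, in the DC00–DFFF range

Console.WriteLine($"{high:X4} {low:X4}");                       // D83D DE92
Console.WriteLine(char.ConvertFromUtf32(cp) == "\uD83D\uDE92"); // True
```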
and rune is that type that can fully represent a codepoint, is that correct?
right
alright, thanks! I thought char designated a code point, not a code unit. Now things make more sense
This is why I was confused about indexing being O(n) (now I understand why it's not)
It's all legacy. Originally we thought that 65536 different sorts of characters ought to be enough for anyone (at least in comparison to the old code page system, where you have 256 sorts of characters), and UCS-2 was born. Windows (and others) went all-in on UCS-2, which had a fixed 2 bytes per character (I use that term because "codepoint" wasn't invented at the time).
Then Unicode came along and suddenly we needed more than 65535 types of character. But there was no way all of Windows was going to change from 2 bytes to 4 bytes per char. So we hacked it: under special circumstances two 2-byte chars can be combined, and that's enough to encode all of Unicode's codepoints. And so UTF-16 was born.
And really, the reality is that even codepoints aren't a particularly useful way of splitting up text. Codepoints can be combined together (e.g. 'a' next to U+0308 COMBINING DIAERESIS gives ä. Two codepoints yet rendered as a single glyph). That happens more often in non-Western writing systems, and particularly with Emoji.
👩🏿‍🚒 is 4 separate codepoints: U+1F469 Woman 👩, U+1F3FF Emoji Modifier Fitzpatrick Type-6 🏿 (a skin tone modifier), U+200D, a zero-width joiner, which is kind of a glue character, and U+1F692 Fire Engine 🚒. Put all of those in a sequence and you get a single glyph, 👩🏿‍🚒. We call these Extended Grapheme Clusters (EGCs).
So, when navigating and splitting text you actually care about EGCs and not codepoints. Codepoints are essentially just an implementation detail of the Unicode standard.
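in .NET the EGC-aware type is StringInfo ("text elements"); note the counts in this sketch assume .NET 5+, where StringInfo follows Unicode's grapheme cluster rules:
```cs
using System;
using System.Globalization;
using System.Linq;

string s = "👩🏿‍🚒"; // the woman-firefighter sequence from above

Console.WriteLine(s.Length);                               // 7 — UTF-16 code units
Console.WriteLine(s.EnumerateRunes().Count());             // 4 — code points
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 — grapheme cluster
```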
Thanks for explaining!
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.