C# • 10mo ago
__dil__

❔ char utf16-length

since utf16 is variable-width, given a char, how can I tell how many bytes it requires when encoded in utf16?
37 Replies
__dil__ • 10mo ago
oh okay, so nothing for chars I assume?
Jimmacle • 10mo ago
nothing i could find with a cursory browse
__dil__ • 10mo ago
also, are strings indexed by byte index or by character index?
Jimmacle • 10mo ago
by character. char is always 2 bytes
__dil__ • 10mo ago
right, but its encoding in utf16 might not be
Jimmacle • 10mo ago
i mean the actual C# type has a constant size
__dil__ • 10mo ago
yes, so is string indexing O(n)? AFAIK, since the encoding is variable-width, you have to count characters from the start
Jimmacle • 10mo ago
no, that's what i'm saying. C# strings are always internally represented as UTF-16
__dil__ • 10mo ago
and utf-16 is variable width
Jimmacle • 10mo ago
i'm not even sure C# supports characters larger than 2 bytes through "standard" APIs. i haven't really had to care about it personally
__dil__ • 10mo ago
but they could be 1 or 2 bytes
Jimmacle • 10mo ago
no, utf-16 characters are a minimum of 2 bytes
__dil__ • 10mo ago
The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
Jimmacle • 10mo ago
1 or 2 16-bit units
__dil__ • 10mo ago
ahhh yeah makes sense
Jimmacle • 10mo ago
utf8 can go all the way down to one byte, and following that pattern utf32 is always 4 bytes
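To make those byte counts concrete, here is a minimal sketch that asks .NET's built-in encodings how many bytes the same text needs in each form. `ByteCounts` is a hypothetical helper name, and the sample strings are arbitrary picks that happen to cover each UTF-8 length.

```csharp
using System;
using System.Text;

// Bytes needed to encode the same text in UTF-8, UTF-16 and UTF-32 (no BOM counted).
static (int Utf8, int Utf16, int Utf32) ByteCounts(string text) =>
    (Encoding.UTF8.GetByteCount(text),
     Encoding.Unicode.GetByteCount(text),   // Encoding.Unicode is UTF-16 little-endian
     Encoding.UTF32.GetByteCount(text));

// One code point per sample: 'A', e-acute, euro sign, grinning face emoji.
foreach (string sample in new[] { "A", "\u00E9", "\u20AC", "\U0001F600" })
{
    var (utf8, utf16, utf32) = ByteCounts(sample);
    int codePoint = char.ConvertToUtf32(sample, 0);
    Console.WriteLine($"U+{codePoint:X4}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32} bytes");
}
// U+0041: UTF-8 1, UTF-16 2, UTF-32 4 bytes
// U+00E9: UTF-8 2, UTF-16 2, UTF-32 4 bytes
// U+20AC: UTF-8 3, UTF-16 2, UTF-32 4 bytes
// U+1F600: UTF-8 4, UTF-16 4, UTF-32 4 bytes
```

The last sample is the interesting one for this thread: it needs two 16-bit code units in UTF-16, so it occupies two `char`s in a C# string.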
__dil__ • 10mo ago
yeah, I got mixed up. But it's still kind of the same problem
Jimmacle • 10mo ago
what's the context around your need for encoding?
__dil__ • 10mo ago
parsing input read from a file. the file is external, meaning I don't know the input in advance
Jimmacle • 10mo ago
are you expecting the file encoding to vary?
__dil__ • 10mo ago
it could be anything, but I expect it to be utf8. if the encoding is invalid/not-supported, the program should report an error and exit.
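One way to get the "report an error on invalid UTF-8" behaviour is to decode with a `UTF8Encoding` constructed with `throwOnInvalidBytes: true`, which raises `DecoderFallbackException` instead of silently substituting U+FFFD. The sketch below assumes the input is meant to be UTF-8 and does not attempt any encoding detection; `ReadAllStrictUtf8` is a hypothetical helper name, and the in-memory streams stand in for a real file stream.

```csharp
using System;
using System.IO;
using System.Text;

// Reads the whole stream as UTF-8; malformed byte sequences throw DecoderFallbackException.
static string ReadAllStrictUtf8(Stream input)
{
    var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    using var reader = new StreamReader(input, strict, detectEncodingFromByteOrderMarks: false);
    return reader.ReadToEnd();
}

byte[] valid = { 0x68, 0x69 };          // "hi"
byte[] invalid = { 0x68, 0xFF, 0x69 };  // 0xFF never occurs in well-formed UTF-8

Console.WriteLine(ReadAllStrictUtf8(new MemoryStream(valid)));

try
{
    ReadAllStrictUtf8(new MemoryStream(invalid));
}
catch (DecoderFallbackException ex)
{
    // A real program would report this and exit with a non-zero status.
    Console.Error.WriteLine($"input is not valid UTF-8: {ex.Message}");
}
```

For a file on disk, the same helper could be fed the result of `File.OpenRead(path)`.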
reflectronic • 10mo ago
a char is a UTF-16 code unit. it could be half of a code point
oh i should have scrolled down
Jimmacle • 10mo ago
i'm not actually sure if/what C# APIs do encoding detection on input streams
reflectronic • 10mo ago
yes, the string indexer is kind of busted for this reason
__dil__ • 10mo ago
Does it really do linear seek O_O?
Jimmacle • 10mo ago
no
reflectronic • 10mo ago
it is a direct char offset. if that offset is half of a surrogate pair then that's what you get
you have to be pretty diligent using char.IsLowSurrogate/char.IsHighSurrogate if you want your code to work with surrogate pairs. (for in-memory strings there are convenient helpers, but not really for TextReader, unfortunately)
so, if you read one half of a surrogate pair from your TextReader, be prepared to read the other half following it and treat them as one code point
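The "read the other half" advice could look roughly like the sketch below. `ReadCodePoints` is a hypothetical helper, not a framework API, and throwing on an unpaired surrogate is just one possible policy (substituting U+FFFD would be another).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Yields Unicode code points from a TextReader, combining each surrogate pair into one value.
static IEnumerable<int> ReadCodePoints(TextReader reader)
{
    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (char.IsHighSurrogate(c))
        {
            // A high surrogate must be followed immediately by a low surrogate.
            int low = reader.Read();
            if (low == -1 || !char.IsLowSurrogate((char)low))
                throw new InvalidDataException("high surrogate without a following low surrogate");
            yield return char.ConvertToUtf32(c, (char)low);
        }
        else if (char.IsLowSurrogate(c))
        {
            throw new InvalidDataException("low surrogate without a preceding high surrogate");
        }
        else
        {
            yield return c; // a single code unit is the code point itself
        }
    }
}

// 'a' followed by U+1F600, which is stored as two chars in the string.
foreach (int codePoint in ReadCodePoints(new StringReader("a\U0001F600")))
    Console.WriteLine($"U+{codePoint:X4}");
// U+0061
// U+1F600
```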
reflectronic • 10mo ago
System.Text.Rune represents a full code point and it has some helpers you may find useful: https://learn.microsoft.com/en-us/dotnet/api/system.text.rune?view=net-7.0
Rune Struct (System.Text)
Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive).
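As a rough answer to the original question, `Rune.Utf16SequenceLength` reports how many UTF-16 code units (chars) a code point needs, and each code unit is 2 bytes. The sketch below assumes .NET Core 3.0 or later, where `Rune` and `string.EnumerateRunes()` are available; `Utf16ByteLength` is a hypothetical helper name.

```csharp
using System;
using System.Text;

// Bytes one code point occupies in UTF-16: 2 bytes per 16-bit code unit.
static int Utf16ByteLength(Rune rune) => rune.Utf16SequenceLength * sizeof(char);

string text = "a\u20AC\U0001F600"; // 'a', euro sign, grinning face emoji

// Length counts chars (code units), not code points: 1 + 1 + 2.
Console.WriteLine($"text.Length = {text.Length}");

// The indexer is a direct char offset, so text[2] is only half of the emoji.
Console.WriteLine($"text[2] is a high surrogate: {char.IsHighSurrogate(text[2])}");

foreach (Rune rune in text.EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4}: {rune.Utf16SequenceLength} code unit(s), {Utf16ByteLength(rune)} bytes");
}
// text.Length = 4
// text[2] is a high surrogate: True
// U+0061: 1 code unit(s), 2 bytes
// U+20AC: 1 code unit(s), 2 bytes
// U+1F600: 2 code unit(s), 4 bytes
```

A single `char` on its own is always exactly one code unit, so the question only has a varying answer at the code point level.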
__dil__ • 10mo ago
huh, utf16 is a bit strange to me. I don't fully understand what surrogates are for
reflectronic • 10mo ago
a given unicode code point can require either one or two UTF-16 code units (chars, 16-bit integers) to encode
if it requires one, then that is sort of self-explanatory
if it requires two, then they come in a surrogate pair. there is first a high surrogate code unit, and then a low surrogate code unit. together they encode one code point
__dil__ • 10mo ago
and rune is the type that can fully represent a codepoint, is that correct?
reflectronic • 10mo ago
right
__dil__ • 10mo ago
alright, thanks! I thought char designated a code point, not a code unit. Now things make more sense 🙂
This is why I was confused about indexing being O(n) (now I understand why it's not)
canton7 • 10mo ago
It's all legacy. Originally we thought that 65,536 different sorts of characters ought to be enough for anyone (at least in comparison to the old code page system, where you have 256 sorts of characters), and UCS-2 was born. Windows (and others) went all-in on UCS-2, which had a fixed 2 bytes per character (I use that term because "codepoint" wasn't invented at the time).

Then Unicode outgrew 16 bits and suddenly we needed more than 65,536 types of character. But there was no way all of Windows was going to change from 2 bytes to 4 bytes per char. So we hacked it: under special circumstances two 2-byte chars can be combined, and that's enough to encode all of Unicode's codepoints. And so UTF-16 was born.

And really, the reality is that even codepoints aren't a particularly useful way of splitting up text. Codepoints can be combined together (e.g. 'a' next to U+0308 COMBINING DIAERESIS gives ä: two codepoints, yet rendered as a single glyph). That happens more often in non-Western writing systems, and particularly with Emoji. 👩🏾‍🚒 is 4 separate codepoints: U+1F469 Woman 👩, U+1F3FE Emoji Modifier Fitzpatrick Type-5 🏾 (a skin tone modifier), U+200D, a zero-width joiner, which is kind of a glue character, and U+1F692 Fire Engine 🚒. Put all of those in a sequence and you get a single glyph, 👩🏾‍🚒. We call these Extended Grapheme Clusters (EGCs).

So, when navigating and splitting text you actually care about EGCs and not codepoints. Codepoints are essentially just an implementation detail of the Unicode standard.
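The three levels described here (chars, codepoints, grapheme clusters) can be compared side by side. This is a sketch under two assumptions: `Measure` is a hypothetical helper, and `System.Globalization.StringInfo` is used for the grapheme count, where it calls them "text elements". .NET 5 and later segment text elements by the extended grapheme cluster rules; older runtimes may split the emoji sequence differently, so no output is claimed for that last number.

```csharp
using System;
using System.Globalization;
using System.Text;

// Counts chars (UTF-16 code units), code points, and text elements in a string.
static (int Chars, int CodePoints, int TextElements) Measure(string s)
{
    int codePoints = 0;
    foreach (Rune rune in s.EnumerateRunes()) codePoints++;
    return (s.Length, codePoints, new StringInfo(s).LengthInTextElements);
}

string combining = "a\u0308";                                // 'a' + COMBINING DIAERESIS
string firefighter = "\U0001F469\U0001F3FE\u200D\U0001F692"; // woman, skin tone, ZWJ, fire engine

Console.WriteLine(Measure(combining));   // (2, 2, 1)
Console.WriteLine(Measure(firefighter)); // 7 chars, 4 code points; text elements depend on the runtime
```

To walk the clusters themselves rather than count them, `StringInfo.GetTextElementEnumerator` yields each text element as a substring.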
__dil__ • 10mo ago
Thanks for explaining!