char utf16-length
since utf16 is variable-width, given a char, how can I tell how many bytes it requires when encoded in utf16?
oh okay, so nothing for chars I assume?
nothing i could find with a cursory browse
also, are strings indexed by byte index or by character index?
by character
char is always 2 bytes right, but its encoding in utf16 might not be
i mean the actual C# type has a constant size
yes
so, is string indexing O(n)? AFAIK, since the encoding is variable-width you have to start from the start to count characters
no, that's what i'm saying
C# strings are always internally represented as UTF-16
and utf-16 is variable width
i'm not even sure C# supports characters larger than 2 bytes through "standard" APIs
i haven't really had to care about it personally
but they could be 1 or 2 bytes
no, utf-16 characters are a minimum of 2 bytes
The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
1 or 2 16 bit units
ahhh
yeah
makes sense
utf8 can go all the way down to one byte
and following that pattern utf32 is always 4 bytes
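fwiw you can see those sizes directly with Encoding.GetByteCount — a quick sketch, the sample string is just for illustration:
```cs
using System;
using System.Text;

string s = "aé€🚒"; // 1-, 2-, 3-, and 4-byte characters in UTF-8

Console.WriteLine(s.Length);                         // 5  — UTF-16 code units (🚒 takes two)
Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 10 — 1 + 2 + 3 + 4
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 10 — 2 + 2 + 2 + 4 (Encoding.Unicode is UTF-16)
Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 16 — always 4 per code point
```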
yeah, I got mixed up. But it's still kind of the same problem
what's the context around your need for encoding?
parsing input read from a file
the file is external, meaning I don't know in advance the input in question
are you expecting the file encoding to vary?
it could be anything, but I expect it to be utf8.
if the encoding is invalid/not-supported, the program should report an error and exit.
a char is a UTF-16 code unit. it could be half of a code point
oh i should have scrolled down
i'm not actually sure if/what C# APIs do encoding detection on input streams
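StreamReader does sniff a BOM by default (detectEncodingFromByteOrderMarks), and you can make UTF-8 decoding strict so invalid bytes throw instead of silently becoming U+FFFD. a rough sketch of the report-and-exit idea — the file name is made up:
```cs
using System;
using System.IO;
using System.Text;

// throwOnInvalidBytes: true — decoding throws DecoderFallbackException
// on malformed input instead of substituting U+FFFD.
var strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false,
    throwOnInvalidBytes: true);

try
{
    using var reader = new StreamReader("input.txt", strictUtf8,
        detectEncodingFromByteOrderMarks: true);
    string text = reader.ReadToEnd();
    // ... parse text ...
}
catch (DecoderFallbackException)
{
    Console.Error.WriteLine("error: input is not valid UTF-8");
    Environment.Exit(1);
}
```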
yes, the string indexer is kind of busted for this reason
Does it really do linear seek O_O?
no
it is a direct char offset. if that offset is half of a surrogate pair then that's what you get
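e.g., just to illustrate what the indexer hands back:
```cs
using System;

string s = "🚒"; // U+1F692 FIRE ENGINE: one code point, two UTF-16 code units

Console.WriteLine(s.Length);                   // 2 — Length counts code units, not code points
Console.WriteLine($"{(int)s[0]:X4}");          // D83D — high surrogate, plain O(1) lookup
Console.WriteLine($"{(int)s[1]:X4}");          // DE92 — low surrogate
Console.WriteLine(char.IsHighSurrogate(s[0])); // True
```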
you have to be pretty diligent using char.IsLowSurrogate/char.IsHighSurrogate if you want your code to work with surrogate pairs. (for in-memory strings there are convenient helpers, but not really for TextReader, unfortunately)
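a rough sketch of that pairing dance over a TextReader — the helper name is mine, not a framework API:
```cs
using System;
using System.Collections.Generic;
using System.IO;

// Read code points one at a time from a TextReader, pairing surrogates by hand.
static IEnumerable<int> ReadCodePoints(TextReader reader)
{
    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (char.IsHighSurrogate(c))
        {
            int low = reader.Read();
            if (low == -1 || !char.IsLowSurrogate((char)low))
                throw new InvalidDataException("unpaired high surrogate");
            yield return char.ConvertToUtf32(c, (char)low); // combine the pair
        }
        else if (char.IsLowSurrogate(c))
        {
            throw new InvalidDataException("unpaired low surrogate");
        }
        else
        {
            yield return c; // BMP code point, a single code unit
        }
    }
}

using var reader = new StringReader("a\uD83D\uDE92"); // "a🚒"
foreach (int cp in ReadCodePoints(reader))
    Console.WriteLine($"U+{cp:X4}"); // U+0061, U+1F692
```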
so, if you read one half of a surrogate pair from your TextReader, be prepared to read the other half following it and treat them as one code point
https://learn.microsoft.com/en-us/dotnet/api/system.text.rune?view=net-7.0 represents a full code point and it has some helpers you may find useful
Rune Struct (System.Text)
Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive).
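e.g. Rune can answer the original question directly — how many code units/bytes a given code point needs in each encoding:
```cs
using System;
using System.Text;

foreach (Rune rune in "a🚒".EnumerateRunes())
{
    Console.WriteLine($"U+{rune.Value:X4}: " +
        $"{rune.Utf16SequenceLength} UTF-16 code unit(s), " +
        $"{rune.Utf8SequenceLength} UTF-8 byte(s)");
}
// U+0061: 1 UTF-16 code unit(s), 1 UTF-8 byte(s)
// U+1F692: 2 UTF-16 code unit(s), 4 UTF-8 byte(s)
```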
huh
utf16 is a bit strange to me, I don't fully understand what surrogates are for
for a given unicode code point, it can require either one or two UTF-16 code units (chars, 16-bit integers) to encode
if it requires one, then that is sort of self-explanatory
if it requires two, then they come in a surrogate pair. there is first a high surrogate code unit, and then a low surrogate code unit. together they encode one code point
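concretely, the mapping works like this (worked through for U+1F692, the fire engine):
```cs
using System;

int cp = 0x1F692;                 // 🚒, outside the BMP
int v = cp - 0x10000;             // 0x0F692 — leaves 20 bits to split across two units
int high = 0xD800 + (v >> 10);    // 0xD83D — top 10 bits, in the D800–DBFF range
int low  = 0xDC00 + (v & 0x3FF);  // 0xDE92 — bottom 10 bits, in the DC00–DFFF range

Console.WriteLine($"{high:X4} {low:X4}");                       // D83D DE92
Console.WriteLine(char.ConvertFromUtf32(cp) == "\uD83D\uDE92"); // True
```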
and rune is that type that can fully represent a codepoint, is that correct?
right
alright, thanks! I thought char designated a code point, not a code unit. Now things make more sense
This is why I was confused about indexing being O(n) (now I understand why it's not)
It's all legacy. Originally we thought that 65536 different sorts of characters ought to be enough for anyone (at least in comparison to the old code page system, where you have 256 sorts of characters), and UCS-2 was born. Windows (and others) went all-in on UCS-2, which had a fixed 2 bytes per character (I use that term because "codepoint" wasn't invented at the time).
Then Unicode came along and suddenly we needed more than 65535 types of character. But there was no way all of Windows was going to change from 2 bytes to 4 bytes per char. So we hacked it: under special circumstances two 2-byte chars can be combined, and that's enough to encode all of Unicode's codepoints. And so UTF-16 was born.
And really, the reality is that even codepoints aren't a particularly useful way of splitting up text. Codepoints can be combined together (e.g. 'a' next to U+0308 COMBINING DIAERESIS gives ä. Two codepoints yet rendered as a single glyph). That happens more often in non-Western writing systems, and particularly with Emoji.
👩🏿‍🚒 is 4 separate codepoints: U+1F469 Woman 👩, U+1F3FF Emoji Modifier Fitzpatrick Type-6 🏿 (a skin tone modifier), U+200D, a zero-width joiner, which is kind of a glue character, and U+1F692 Fire Engine 🚒. Put all of those in a sequence and you get a single glyph, 👩🏿‍🚒. We call these Extended Grapheme Clusters (EGCs).
So, when navigating and splitting text you actually care about EGCs and not codepoints. Codepoints are essentially just an implementation detail of the Unicode standard.
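in .NET the EGC-aware type is StringInfo ("text elements"); note the counts in this sketch assume .NET 5+, where StringInfo follows Unicode's grapheme cluster rules:
```cs
using System;
using System.Globalization;
using System.Linq;

string s = "👩🏿‍🚒"; // the woman-firefighter sequence from above

Console.WriteLine(s.Length);                               // 7 — UTF-16 code units
Console.WriteLine(s.EnumerateRunes().Count());             // 4 — code points
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 — grapheme cluster
```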
Thanks for explaining!
Was this issue resolved? If so, run /close - otherwise I will mark this as stale and this post will be archived until there is new activity.