Week 122 — How can one specify a charset when reading from/writing to a text file?
When a file is written, it is stored as binary data on the file system.
To store text, the textual content has to be converted to a binary format before writing and converted back after reading. The character encoding (charset) determines how this conversion is performed. When converting between strings and binary data, the same charset should be used both for converting the text to binary and for converting it back.
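As a minimal sketch of that round trip (the string and charset here are chosen for illustration), converting text to bytes and back with `String.getBytes` and the `String` constructor looks like this:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // Convert text to binary data using a specific charset...
        byte[] data = "Hello".getBytes(StandardCharsets.UTF_8);
        // ...and convert it back using the same charset
        String text = new String(data, StandardCharsets.UTF_8);
        System.out.println(text); // Hello
    }
}
```

Using a different charset in the second step is exactly the mistake discussed later in this post.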
One such charset is ASCII, which represents each character using only 7 bits (typically padded so that each character occupies exactly one byte) at the cost of only being able to represent a small set of characters. In contrast to that, Unicode allows storing many more characters, where different characters may require different amounts of space. It comes in different encodings like UTF-8 (which requires at least 8 bits = 1 byte per character) or UTF-16 (which requires at least 16 bits = 2 bytes per character). Characters that can be represented in ASCII are encoded the exact same way in UTF-8, with the first bit set to `0`. To indicate that multiple bytes are needed for a "code point", the first bits are set to `1`. Unicode also supports many other features, like combining (joining) multiple code points into a single displayable character.
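The differing space requirements can be observed by encoding a string and checking the byte count. This is a small illustration (`UTF_16BE` is used here instead of `UTF_16` only to avoid the extra byte-order mark that `UTF_16` prepends):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        // An ASCII letter needs 1 byte in UTF-8 but 2 bytes in UTF-16
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // 2
        // The 💻 emoji (U+1F4BB) is outside ASCII; it needs 4 bytes
        // in UTF-8 and a surrogate pair (4 bytes) in UTF-16
        System.out.println("💻".getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println("💻".getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```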
Java provides the `java.nio.charset.Charset` class to represent charsets and includes the `java.nio.charset.StandardCharsets` class with constants for commonly used charsets like `US_ASCII` and `UTF_8`. To write data to a text file encoded with a given charset, the JDK provides overloads of many methods that accept a `Charset` parameter:
This file can then be read using the same charset:
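A corresponding sketch using `Files.readString` with a `Charset` (the write is repeated here so the snippet is self-contained; the file name is the same assumed `hello.txt`):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        // Create the file first so the snippet runs on its own
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        // Read it back, decoding with the same charset it was written with
        String content = Files.readString(file, StandardCharsets.UTF_8);
        System.out.println(content); // Hello World 💻
    }
}
```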
If the console is able to display emojis and is correctly configured to use UTF-8, executing the two code snippets one after the other should display `Hello World` and the 💻 emoji.
However, if the file is read using a charset that doesn't match the one it was written with, the output can get messed up:
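For instance, reading the UTF-8-encoded file as UTF-16 pairs up unrelated bytes into characters (again self-contained, writing the assumed `hello.txt` first):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatch {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as UTF-16 combines unrelated byte pairs into characters
        String garbled = Files.readString(file, StandardCharsets.UTF_16);
        System.out.println(garbled); // unrelated characters, not "Hello World 💻"
    }
}
```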
As UTF-8 and UTF-16 don't use the same byte representations for characters, the output consists of characters that have little in common with the original text. When attempting to read the text file using ASCII instead, an exception is thrown because the emoji is encoded with bytes that are not valid ASCII:
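Concretely, `Files.readString` reports invalid input by throwing a `java.nio.charset.MalformedInputException`, which the following sketch catches (again writing the assumed `hello.txt` first):

```java
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class InvalidAscii {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        try {
            // The emoji's UTF-8 bytes have the high bit set,
            // which is invalid in 7-bit ASCII
            Files.readString(file, StandardCharsets.US_ASCII);
        } catch (MalformedInputException e) {
            System.out.println("Caught: " + e);
        }
    }
}
```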
📖 Sample answer from dan1st