Week 122 — How can one specify a charset when reading from/writing to a text file?
When a file is written, it is stored as binary data on the file system.
To store text, the textual content has to be converted to a binary format before writing and converted back after reading. The character encoding (charset) determines how this conversion is performed. When converting between strings and binary data, the same charset should be used both for converting the text to binary and for converting it back.
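As a minimal sketch of that round trip (the string and charset here are chosen for illustration), converting text to bytes and back with `String.getBytes` and the `String` constructor looks like this:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // Convert text to binary data using a specific charset...
        byte[] data = "Hello".getBytes(StandardCharsets.UTF_8);
        // ...and convert it back using the same charset
        String text = new String(data, StandardCharsets.UTF_8);
        System.out.println(text); // Hello
    }
}
```

Using a different charset in the second step is exactly the mistake discussed later in this post.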
One such charset is ASCII, which represents each character using only 7 bits (typically padded so that each character occupies exactly one byte) at the cost of only being able to represent a small set of characters. In contrast to that, Unicode allows storing many more characters, where different characters may require different amounts of space. It comes in different encodings like UTF-8 (which requires at least 8 bits = 1 byte per character) or UTF-16 (which requires at least 16 bits = 2 bytes per character). Characters that can be represented in ASCII are encoded the exact same way in UTF-8, with the first bit set to `0`. To indicate that multiple bytes are needed for a "code point", the first bits are set to `1`. Unicode also supports many other features, like combining (joining) multiple code points into a single displayable character.
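The differing space requirements can be observed by encoding a string and checking the byte count. This is a small illustration (`UTF_16BE` is used here instead of `UTF_16` only to avoid the extra byte-order mark that `UTF_16` prepends):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        // An ASCII letter needs 1 byte in UTF-8 but 2 bytes in UTF-16
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // 2
        // The 💻 emoji (U+1F4BB) is outside ASCII; it needs 4 bytes
        // in UTF-8 and a surrogate pair (4 bytes) in UTF-16
        System.out.println("💻".getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println("💻".getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```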
Java provides the `java.nio.charset.Charset` class to represent charsets and includes the `java.nio.charset.StandardCharsets` class with constants for commonly used charsets like `US_ASCII` and `UTF_8`. To write data to a text file encoded with a given charset, the JDK provides overloads of many methods that accept a `Charset` parameter:
This file can then be read using the same charset:
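A corresponding sketch using `Files.readString` with a `Charset` (the write is repeated here so the snippet is self-contained; the file name is the same assumed `hello.txt`):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        // Create the file first so the snippet runs on its own
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        // Read it back, decoding with the same charset it was written with
        String content = Files.readString(file, StandardCharsets.UTF_8);
        System.out.println(content); // Hello World 💻
    }
}
```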
If the console is able to display emojis and is correctly configured to use UTF-8, executing the two code snippets one after the other should display `Hello World` and the 💻 emoji.
However, if the file is read using a charset that doesn't match the one it was written with, the output can get messed up:
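For instance, reading the UTF-8-encoded file as UTF-16 pairs up unrelated bytes into characters (again self-contained, writing the assumed `hello.txt` first):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatch {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as UTF-16 combines unrelated byte pairs into characters
        String garbled = Files.readString(file, StandardCharsets.UTF_16);
        System.out.println(garbled); // unrelated characters, not "Hello World 💻"
    }
}
```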
As UTF-8 and UTF-16 don't use the same byte representations for characters, the output consists of characters that have little in common with the original text. When attempting to read the text file using ASCII instead, an exception is thrown because the emoji is encoded with bytes that are not valid ASCII:
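Concretely, `Files.readString` reports invalid input by throwing a `java.nio.charset.MalformedInputException`, which the following sketch catches (again writing the assumed `hello.txt` first):

```java
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class InvalidAscii {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("hello.txt");
        Files.writeString(file, "Hello World 💻", StandardCharsets.UTF_8);
        try {
            // The emoji's UTF-8 bytes have the high bit set,
            // which is invalid in 7-bit ASCII
            Files.readString(file, StandardCharsets.US_ASCII);
        } catch (MalformedInputException e) {
            System.out.println("Caught: " + e);
        }
    }
}
```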
📖 Sample answer from dan1st