Linux wc vs. C# Character Count: BOM Causes Off-by-One Error

Hi everyone, I’m working on building a custom wc tool to mimic the behavior of the wc command from Linux. I’m currently implementing the function that counts characters in a file. However, I noticed that my program’s result is consistently 1 character less than the result from the Linux wc tool for some files. After debugging, I realized that the discrepancy happens when the file contains a BOM (Byte Order Mark) at the start. My current method doesn’t account for the BOM, and I’m wondering how I should handle this if I want my tool to accurately mimic Linux’s wc. Here’s the code I’m using for counting characters:
private static void CountCharInFile(string filePath)
{
    var lines = File.ReadLines(filePath);
    var numChar = lines.Sum(line => line.Length + Environment.NewLine.Length);
    Console.WriteLine($"{numChar} {Path.GetFileName(filePath)}");
}
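For reference, the missing character is the BOM itself: the three UTF-8 BOM bytes decode to the single character U+FEFF, which wc -m counts (in a UTF-8 locale) but which File.ReadLines / File.ReadAllText strip by default. A minimal check, assuming using System.Text:

// The UTF-8 BOM is three bytes but decodes to one char, U+FEFF;
// that's the one character wc sees and the .NET readers drop.
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
var decoded = new UTF8Encoding(false).GetString(bom);
Console.WriteLine(decoded.Length);                   // 1
Console.WriteLine(((int)decoded[0]).ToString("X4")); // FEFF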
5 Replies
canton7
Note that your count will also be wrong if the file uses CRLF line endings. Actually, since you're doing wc -c, why do you care about lines at all? Just use a StreamReader. This seems to work:
var bytes = new byte[] { 0xEF, 0xBB, 0xBF, (byte)'H', (byte)'e', (byte)'l', (byte)'l', (byte)'o' };

// No BOM preamble on the encoding and no BOM-based encoding detection,
// so the StreamReader leaves the BOM in the stream and it gets counted.
using var sr = new StreamReader(new MemoryStream(bytes), new UTF8Encoding(encoderShouldEmitUTF8Identifier: false), detectEncodingFromByteOrderMarks: false);

int total = 0;
Span<char> buffer = new char[1024];
while (sr.ReadBlock(buffer) is var count and > 0)
{
    total += count;
}

Console.WriteLine(total); // 6: U+FEFF plus "Hello"
You need both: the UTF8Encoding with encoderShouldEmitUTF8Identifier = false stops the StreamReader from recognising the BOM as the encoding's preamble and discarding those bytes, and constructing the StreamReader with detectEncodingFromByteOrderMarks = false stops it from trying to work out the file's encoding from the BOM (which also consumes the BOM). Note that this obviously won't work for UTF-16/UTF-32: you might have to do your own BOM detection.
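A rough sketch of that manual detection (DetectEncodingKeepingBom is just a made-up helper name, and it only covers UTF-8/UTF-16; UTF-32 BOMs like FF FE 00 00 / 00 00 FE FF would need extra checks):

// Sniff the BOM ourselves and hand the StreamReader a preamble-free encoding,
// so the BOM stays in the stream and gets counted. Assumes a seekable stream.
static Encoding DetectEncodingKeepingBom(Stream stream)
{
    Span<byte> prefix = stackalloc byte[4];
    int read = stream.Read(prefix);
    stream.Position = 0; // rewind so the reader still sees (and counts) the BOM

    return prefix[..read] switch
    {
        [0xFE, 0xFF, ..] => new UnicodeEncoding(bigEndian: true, byteOrderMark: false),  // UTF-16 BE
        [0xFF, 0xFE, ..] => new UnicodeEncoding(bigEndian: false, byteOrderMark: false), // UTF-16 LE
        _ => new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),                   // UTF-8 / no BOM
    };
}

// Usage, same idea as above:
// using var fs = File.OpenRead(filePath);
// using var sr = new StreamReader(fs, DetectEncodingKeepingBom(fs), detectEncodingFromByteOrderMarks: false);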
Quest o()xx[{:::::::::::::::>
@canton7 Hey, thanks a lot for your help, I will take a closer look when I'm back home and get back to you!
Quest o()xx[{:::::::::::::::>
I was actually working on -m. Check out the screenshot for the exact task I’m trying to code. I was using File.ReadLines, which works with an IEnumerable, so it’s more memory-friendly for big files compared to something like ReadAllText. I’m sure there are better ways to do this, but I don’t know them yet. That’s why I’m building this tool—mainly to learn. As for why I didn’t use StreamReader, I honestly didn’t even know it existed until you mentioned it. Since yesterday, I’ve refactored my code a bit, and it’s giving me decent results. That said, I’m pretty sure it doesn’t handle everything perfectly. Anyway, this is what I’ve got so far, and I’m about to try your approach now.
private static bool HasBOM(string filePath)
{
    var bytes = File.ReadAllBytes(filePath);

    if (bytes is [0xEF, 0xBB, 0xBF, ..]) return true; // UTF-8 BOM
    if (bytes is [0xFF, 0xFE, ..]) return true; // UTF-16 LE BOM
    if (bytes is [0xFE, 0xFF, ..]) return true; // UTF-16 BE BOM

    return false;
}

// ******TODO: handle UTF-8 with BOM files******
private static void CountCharInFile(string filePath)
{
    var text = File.ReadAllText(filePath);
    var numChar = text.Length;

    // ReadAllText silently strips a detected BOM, so add it back to match wc's count.
    Console.WriteLine(HasBOM(filePath)
        ? $"{numChar + 1} {Path.GetFileName(filePath)}"
        : $"{numChar} {Path.GetFileName(filePath)}");
}
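One possible tweak to HasBOM above (just a sketch along the same lines, not exhaustively tested): check only the first few bytes instead of ReadAllBytes, so a big file isn't loaded whole just to look at its BOM:

private static bool HasBOM(string filePath)
{
    // A BOM can only be in the first few bytes, so read just those.
    using var fs = File.OpenRead(filePath);
    Span<byte> prefix = stackalloc byte[3];
    int read = fs.Read(prefix);
    var head = prefix[..read];

    if (head is [0xEF, 0xBB, 0xBF, ..]) return true; // UTF-8 BOM
    if (head is [0xFF, 0xFE, ..]) return true; // UTF-16 LE BOM
    if (head is [0xFE, 0xFF, ..]) return true; // UTF-16 BE BOM

    return false;
}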
Quest o()xx[{:::::::::::::::>
Oh, I see that your code handles the memory issue with this part, interesting 🤔
Quest o()xx[{:::::::::::::::>
Your solution works perfectly! I need a moment to digest what's going on though, it's a bit too advanced for me 😂 Thanks a lot mate
