Linux wc vs. C# Character Count: BOM Causes Off-by-One Error

Hi everyone, I’m working on building a custom wc tool to mimic the behavior of the wc command from Linux. I’m currently implementing the function that counts characters in a file. However, I noticed that my program’s result is consistently 1 character less than the result from the Linux wc tool for some files. After debugging, I realized that the discrepancy happens when the file contains a BOM (Byte Order Mark) at the start. My current method doesn’t account for the BOM, and I’m wondering how I should handle this if I want my tool to accurately mimic Linux’s wc. Here’s the code I’m using for counting characters:
private static void CountCharInFile(string filePath)
{
    var lines = File.ReadLines(filePath);
    var numChar = lines.Sum(line => line.Length + Environment.NewLine.Length);
    Console.WriteLine($"{numChar} {Path.GetFileName(filePath)}");
}
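For reference, the missing character is the BOM itself: the three UTF-8 BOM bytes decode to the single character U+FEFF, which wc -m counts (in a UTF-8 locale) but which File.ReadLines / File.ReadAllText strip by default. A minimal check, assuming using System.Text:

// The UTF-8 BOM is three bytes but decodes to one char, U+FEFF;
// that's the one character wc sees and the .NET readers drop.
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
var decoded = new UTF8Encoding(false).GetString(bom);
Console.WriteLine(decoded.Length);                   // 1
Console.WriteLine(((int)decoded[0]).ToString("X4")); // FEFF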
5 Replies
canton7
Note that your count will also be wrong if the file uses CRLF line endings. Actually, since you're doing wc -c, why do you care about lines at all? Just use a StreamReader. This seems to work:
var bytes = new byte[] { 0xEF, 0xBB, 0xBF, (byte)'H', (byte)'e', (byte)'l', (byte)'l', (byte)'o' };

// No BOM preamble on the encoding and no BOM-based encoding detection,
// so the StreamReader leaves the BOM in the stream and it gets counted.
using var sr = new StreamReader(new MemoryStream(bytes), new UTF8Encoding(encoderShouldEmitUTF8Identifier: false), detectEncodingFromByteOrderMarks: false);

int total = 0;
Span<char> buffer = new char[1024];
while (sr.ReadBlock(buffer) is var count and > 0)
{
    total += count;
}

Console.WriteLine(total); // 6: U+FEFF plus "Hello"
You need both: the UTF8Encoding with encoderShouldEmitUTF8Identifier = false stops the StreamReader from recognising the BOM as the encoding's preamble and discarding those bytes, and constructing the StreamReader with detectEncodingFromByteOrderMarks = false stops it from trying to work out the file's encoding from the BOM (which also consumes the BOM). Note that this obviously won't work for UTF-16/UTF-32: you might have to do your own BOM detection.
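A rough sketch of that manual detection (DetectEncodingKeepingBom is just a made-up helper name, and it only covers UTF-8/UTF-16; UTF-32 BOMs like FF FE 00 00 / 00 00 FE FF would need extra checks):

// Sniff the BOM ourselves and hand the StreamReader a preamble-free encoding,
// so the BOM stays in the stream and gets counted. Assumes a seekable stream.
static Encoding DetectEncodingKeepingBom(Stream stream)
{
    Span<byte> prefix = stackalloc byte[4];
    int read = stream.Read(prefix);
    stream.Position = 0; // rewind so the reader still sees (and counts) the BOM

    return prefix[..read] switch
    {
        [0xFE, 0xFF, ..] => new UnicodeEncoding(bigEndian: true, byteOrderMark: false),  // UTF-16 BE
        [0xFF, 0xFE, ..] => new UnicodeEncoding(bigEndian: false, byteOrderMark: false), // UTF-16 LE
        _ => new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),                   // UTF-8 / no BOM
    };
}

// Usage, same idea as above:
// using var fs = File.OpenRead(filePath);
// using var sr = new StreamReader(fs, DetectEncodingKeepingBom(fs), detectEncodingFromByteOrderMarks: false);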
Quest o()xx[{:::::::::::::::>
@canton7 Hey, thanks a lot for your help, I will take a closer look when I'm back home and get back to you!
Quest o()xx[{:::::::::::::::>
I was actually working on -m. Check out the screenshot for the exact task I’m trying to code. I was using File.ReadLines, which works with an IEnumerable, so it’s more memory-friendly for big files compared to something like ReadAllText. I’m sure there are better ways to do this, but I don’t know them yet. That’s why I’m building this tool—mainly to learn. As for why I didn’t use StreamReader, I honestly didn’t even know it existed until you mentioned it. Since yesterday, I’ve refactored my code a bit, and it’s giving me decent results. That said, I’m pretty sure it doesn’t handle everything perfectly. Anyway, this is what I’ve got so far, and I’m about to try your approach now.
private static bool HasBOM(string filePath)
{
    var bytes = File.ReadAllBytes(filePath);

    if (bytes is [0xEF, 0xBB, 0xBF, ..]) return true; // UTF-8 BOM
    if (bytes is [0xFF, 0xFE, ..]) return true; // UTF-16 LE BOM
    if (bytes is [0xFE, 0xFF, ..]) return true; // UTF-16 BE BOM

    return false;
}

// ******TODO: handle UTF-8 with BOM files******
private static void CountCharInFile(string filePath)
{
    var text = File.ReadAllText(filePath);
    var numChar = text.Length;

    // ReadAllText silently strips a detected BOM, so add it back to match wc's count.
    Console.WriteLine(HasBOM(filePath)
        ? $"{numChar + 1} {Path.GetFileName(filePath)}"
        : $"{numChar} {Path.GetFileName(filePath)}");
}
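One possible tweak to HasBOM above (just a sketch along the same lines, not exhaustively tested): check only the first few bytes instead of ReadAllBytes, so a big file isn't loaded whole just to look at its BOM:

private static bool HasBOM(string filePath)
{
    // A BOM can only be in the first few bytes, so read just those.
    using var fs = File.OpenRead(filePath);
    Span<byte> prefix = stackalloc byte[3];
    int read = fs.Read(prefix);
    var head = prefix[..read];

    if (head is [0xEF, 0xBB, 0xBF, ..]) return true; // UTF-8 BOM
    if (head is [0xFF, 0xFE, ..]) return true; // UTF-16 LE BOM
    if (head is [0xFE, 0xFF, ..]) return true; // UTF-16 BE BOM

    return false;
}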
Quest o()xx[{:::::::::::::::>
Oh, I see that your code handles the memory issue with this part, interesting 🤔
Quest o()xx[{:::::::::::::::>
Your solution works perfectly! I need a moment to digest what's going on though, it's a bit too advanced for me 😂 Thanks a lot mate
