Linux wc vs. C# Character Count: BOM Causes Off-by-One Error
Hi everyone,
I’m working on building a custom wc tool to mimic the behavior of the wc command from Linux. I’m currently implementing the function that counts characters in a file. However, I noticed that my program’s result is consistently 1 character less than the result from the Linux wc tool for some files.
After debugging, I realized that the discrepancy happens when the file contains a BOM (Byte Order Mark) at the start. My current method doesn’t account for the BOM, and I’m wondering how I should handle this if I want my tool to accurately mimic Linux’s wc.
Here’s the code I’m using for counting characters:
5 Replies
Note that your count will also be wrong if the file uses CRLF line endings
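To illustrate the CRLF point, here is a small sketch. It assumes the count was built by summing line lengths from File.ReadLines and adding one per line for the terminator, which is a guess since the original code isn't shown. LineEndingDemo and Compare are made-up names for this example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class LineEndingDemo
{
    // Hypothetical demo: compares a per-line count with the real character count
    // for a small file that uses CRLF line endings.
    public static (long FromLines, long Actual) Compare()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Two lines, each terminated by \r\n, written without a BOM.
            File.WriteAllText(path, "a\r\nb\r\n", new UTF8Encoding(false));

            // File.ReadLines strips the terminator, so adding 1 per line
            // accounts for \n but misses the \r.
            long fromLines = File.ReadLines(path).Sum(line => (long)line.Length + 1);

            // Reading the whole text keeps every \r and \n.
            long actual = File.ReadAllText(path).Length;

            return (fromLines, actual);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

With this input the per-line count comes out as 4 while the file actually holds 6 characters, one short for every CRLF.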
Actually, since you're doing wc -c, why do you care about lines at all? Just use a StreamReader.
This seems to work:
You need the combination of two things: a UTF8Encoding constructed with encoderShouldEmitUTF8Identifier = false, to stop the StreamReader from recognising and discarding the bytes of the BOM, and a StreamReader constructed with detectEncodingFromByteOrderMarks = false, to stop the StreamReader from trying to work out what the encoding of the file is from the BOM (which also consumes the BOM).
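A minimal sketch of that combination follows. It is not the code originally posted in the thread. CharCounter, CountChars and the 4096-char buffer size are assumptions made for illustration. It counts UTF-16 code units, so a character outside the Basic Multilingual Plane would count as two here.

```csharp
using System;
using System.IO;
using System.Text;

static class CharCounter
{
    // Hypothetical helper: counts decoded chars in a UTF-8 stream,
    // keeping a leading BOM as one character (U+FEFF).
    public static long CountChars(Stream stream)
    {
        // An encoding with no preamble, so StreamReader has no BOM bytes to match and skip.
        var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

        // Turn off BOM-based encoding detection, which would also consume the BOM.
        using var reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false);

        // Read in fixed-size chunks so the whole file is never held in memory.
        var buffer = new char[4096];
        long count = 0;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            count += read;
        }
        return count;
    }
}
```

It could be called with something like `using var file = File.OpenRead(path); Console.WriteLine(CharCounter.CountChars(file));`, where path is whatever file is being counted. Line endings are never stripped, so CR and LF are each counted as they appear in the file.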
Note that this obviously won't work for UTF-16/UTF-32: you might have to do your own BOM detection.
@canton7 Hey, thanks a lot for your help. I will take a closer look when I’m back home and get back to you!
I was actually working on -m.
Check out the screenshot for the exact task I’m trying to code.
I was using File.ReadLines, which works with an IEnumerable, so it’s more memory-friendly for big files compared to something like ReadAllText. I’m sure there are better ways to do this, but I don’t know them yet. That’s why I’m building this tool, mainly to learn.
As for why I didn’t use StreamReader, I honestly didn’t even know it existed until you mentioned it.
Since yesterday, I’ve refactored my code a bit, and it’s giving me decent results. That said, I’m pretty sure it doesn’t handle everything perfectly.
Anyway, this is what I’ve got so far, and I’m about to try your approach now.
Oh, I see that your code handles the memory issue with this part, interesting 🤔
Your solution works perfectly! I need a moment to digest what's going on though, it's a bit too advanced for me 😂
Thanks a lot mate