C#•3y ago
bookuha

File comparison

I want to compare two files very fast. I've implemented a full comparison (comparing size, then 1024-byte spans), but I feel like there might be some unique info in the file headers that I can use for my comparison. What do you think?
17 Replies
toddlahakbar•3y ago
If you want a full comparison, then do a full comparison. Also, last-first is usually a really quick algorithm to search.
mtreit•3y ago
@bookuha what do you mean by unique info in file headers? What kind of files are we talking about?
Susko3•3y ago
If you want to do it really fast, then look into vectorization with e.g. the Vector256 class. I suppose you're comparing for (in)equality?
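A minimal sketch of the vectorization idea, assuming the two buffers are already in memory. Rather than hand-writing `Vector256` code, it leans on `MemoryExtensions.SequenceEqual`, which, as far as I know, the .NET runtime already vectorizes for byte spans. `BufferCompare` and `BuffersEqual` are hypothetical names used only for illustration.

```csharp
using System;

static class BufferCompare
{
    // Hypothetical helper: true when both buffers hold exactly the same bytes.
    public static bool BuffersEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        // SequenceEqual returns false for different lengths,
        // otherwise it compares the contents element by element.
        return a.SequenceEqual(b);
    }
}
```

This only covers the in-memory comparison step; getting the bytes off disk is a separate cost.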
Jester•3y ago
The bottleneck is probably just reading the files.
bookuhaOP•3y ago
Any, I guess. There must be some info that describes the file to the OS in some of these bytes. Yes.
Kouhai•3y ago
What do you mean by info that describes the file?
bookuhaOP•3y ago
Headers that the OS uses to identify it. I am just guessing.
Kouhai•3y ago
If you're doing full file comparison, why would the file headers matter 😅 ?
bookuhaOP•3y ago
I don't really want to do a full comparison; I want to do it as fast as possible. I need to compare them, but going through the full size is more expensive, so I decided to find some array of bytes that can be compared instead of the entire file.
Susko3•3y ago
Compare sections and stop when you find a non-matching section. Makes sense. The important question for you may be: "Where do the files usually differ?" Maybe at the beginning, maybe at the end?
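The "compare sections and stop at the first mismatch" approach could look roughly like the sketch below. `ChunkedCompare`, `StreamsEqual`, `ReadFull`, and the 64 KiB default chunk size are all assumptions made for illustration; nothing here comes from the original poster's code.

```csharp
using System;
using System.IO;

static class ChunkedCompare
{
    // Reads until the buffer is full or the stream ends; returns the bytes read.
    // Needed because Stream.Read may return fewer bytes than requested.
    static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    // Hypothetical helper: compares two streams chunk by chunk and
    // returns false as soon as one chunk differs.
    public static bool StreamsEqual(Stream a, Stream b, int chunkSize = 64 * 1024)
    {
        var bufA = new byte[chunkSize];
        var bufB = new byte[chunkSize];
        while (true)
        {
            int readA = ReadFull(a, bufA);
            int readB = ReadFull(b, bufB);
            if (readA != readB) return false; // one stream ended earlier
            if (readA == 0) return true;      // both ended, no mismatch found
            if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                return false;                 // early exit on first differing chunk
        }
    }
}
```

For files, the streams would come from something like `File.OpenRead(path)`. If the files tend to differ near the end, the same loop could be run from the tail backwards instead.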
bookuhaOP•3y ago
Yes! I want to find out.
Kouhai•3y ago
The header for two files can be the same while the rest is different, even for a simple BMP file.
bookuhaOP•3y ago
hmm, got it
Susko3•3y ago
well.... you're the one with the files
Jester•3y ago
It's possible to listen for file changes in a folder. Maybe make a hash for every changed file so you can compare hashes? Oh yeah, you can also compare the file name first and then the last modified date.
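A rough sketch of the "hash every changed file" idea. `FileHashCache` is a hypothetical class; the choice of SHA-256 is an assumption (any stable hash would do). The folder-watching part is left out: `Update` is meant to be called from whatever change notification is used, for example a `FileSystemWatcher.Changed` handler.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical cache of content hashes, keyed by file path.
sealed class FileHashCache
{
    private readonly Dictionary<string, string> _hashes = new Dictionary<string, string>();

    // Recomputes and stores the hash of one file.
    // Intended to be called whenever that file is reported as changed.
    public void Update(string path)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(path);
        _hashes[path] = Convert.ToBase64String(sha.ComputeHash(stream));
    }

    // True only when both files have a stored hash and the hashes match.
    public bool SameContent(string pathA, string pathB) =>
        _hashes.TryGetValue(pathA, out var a) &&
        _hashes.TryGetValue(pathB, out var b) &&
        a == b;
}
```

The comparison itself is then a dictionary lookup, but every change still costs one full read of the changed file to recompute its hash.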
mtreit•3y ago
Files are completely arbitrary sequences of bytes. Unless you are restricted to a very specific file format, you can't rely on headers: many file formats have no headers. (Think of a text file.)

File names and things like the modified time, which are file system metadata and not part of the file itself, tell you nothing about the file contents and should never be relied upon.

If you have two files and want to compare them for equality, hashing both of them is often far more expensive than doing a byte-by-byte comparison. (If you pre-hash and store the hashes for later comparison, it can be useful.)

Comparing file sizes first will immediately tell you that the files are different when the sizes don't match, without reading any file contents. If the file sizes are the same, you can proceed with a byte-wise comparison.

If you do precompute hashes, consider using a very fast hash (crc32, murmurhash, etc.) on portions of the file (first 1K, last 1K, etc.) and store that for fast invalidation of files that don't match.

Note that I can easily pad files with arbitrary extra bytes to make your diff check fail. This is why detection by exact match for things like malware is fragile and not generally a great approach.
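A sketch of the precomputed "size plus partial hash" idea, under a few assumptions: `Fingerprints`, `Compute`, and the tuple layout are hypothetical names, the sample size is the 1K mentioned above, and FNV-1a is swapped in for crc32/murmurhash purely because it fits in a few lines without extra packages.

```csharp
using System;
using System.IO;

static class Fingerprints
{
    const int SampleSize = 1024; // "first 1K, last 1K"

    // 32-bit FNV-1a, standing in here for crc32 or murmurhash.
    static uint Fnv1a(ReadOnlySpan<byte> data)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (byte b in data)
            {
                hash ^= b;
                hash *= 16777619u;
            }
            return hash;
        }
    }

    // Reads until the buffer is full or the stream ends; returns the bytes read.
    static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }

    // Hypothetical fingerprint: file length plus hashes of the first and last 1K.
    public static (long Length, uint HeadHash, uint TailHash) Compute(string path)
    {
        using var stream = File.OpenRead(path);
        long length = stream.Length;
        var buffer = new byte[SampleSize];

        int headLen = ReadFull(stream, buffer);
        uint head = Fnv1a(buffer.AsSpan(0, headLen));

        // For files shorter than 1K the tail sample is the whole file again.
        stream.Seek(Math.Max(0, length - SampleSize), SeekOrigin.Begin);
        int tailLen = ReadFull(stream, buffer);
        uint tail = Fnv1a(buffer.AsSpan(0, tailLen));

        return (length, head, tail);
    }
}
```

Two files with different fingerprints are certainly different. Two files with equal fingerprints may still differ somewhere in the middle, so a match would still need the byte-wise comparison to confirm equality.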
bookuhaOP•3y ago
Thank you
