File comparison
I want to compare two files very fast. I've implemented a full comparison (comparing size, then 1024-byte spans),
but I feel like there might be some very unique info in file headers that I can use for my comparison.
What do you think?
If you want a full comparison, then do a full comparison
Also, last-first is usually a really quick algorithm to search
@bookuha what do you mean by unique info in file headers? What kind of files are we talking about?
if you want to do it really fast, then look into vectorization with e.g. the Vector256 class
I suppose you're comparing for (in)equality?
the bottleneck is probably just reading the files
any i guess, there must be some info that describes the file
to the OS
in some of these bytes
yes
what do you mean by "info that describes the file"?
Headers that the OS uses to identify it.
I am just guessing
If you're doing full file comparison, why would the file headers matter 😅 ?
I don't really want to do a full comparison, I want to do it as fast as possible
I need to compare them, but going full size is more expensive
so I decided to find some array of bytes that can be compared
instead of the entire file
compare sections and stop when you find a non-matching section, makes sense
the important question for you may be: "where do the files usually differ?"
maybe in the beginning, maybe in the end?
yes! i want to find out
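One way to find out is to measure it on your own files. Here is a rough sketch in Python (the thread shows no code, so the function name `first_difference_offset` and the 64 KiB block size are made up for illustration). It reports the byte offset where two files first differ, so you can run it over a sample of your real files and see whether the differences cluster near the start or the end:

```python
def first_difference_offset(path_a, path_b, chunk_size=64 * 1024):
    """Return the byte offset of the first difference, or None if identical.

    Hypothetical helper for illustration; chunk_size is an assumed default.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                # Locate the exact byte inside the mismatching block.
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file is a prefix of
                # the other, so they differ where the shorter one ends.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None  # both files ended without any difference
            offset += len(block_a)
```

If most offsets come back near zero, checking the first block is enough to reject most pairs early; if they come back near the file size, checking the last block first may pay off.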
The header for two files can be the same while the rest is different, even for a simple bmp file.
hmm, got it
well.... you're the one with the files
it's possible to listen for file changes in a folder. maybe make a hash for every changed file so you can compare hashes?
oh yea you can also compare the file name first and then the last modified date
Files are completely arbitrary sequences of bytes. Unless you are restricted to a very specific file format you can't rely on headers - many file formats have no headers. (Think of a text file)
File names and things like the modified time, which are file system metadata and not part of the file itself, tell you nothing about the file contents and should never be relied upon.
If you have two files and want to compare them for equality, hashing both of them is much of the time far more expensive than doing a byte-by-byte comparison. (If you pre-hash and store the hashes for later comparison it can be useful.)
Comparing file size first will immediately tell you if the files are different without doing any file I/O.
If the file sizes are the same you can proceed with a byte-wise comparison.
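As a minimal sketch of the size-then-bytes approach just described, here it is in Python (the name `files_equal` and the 64 KiB block size are assumptions for illustration, not something from the original poster's code). The size check needs only file system metadata, and the block loop stops at the first mismatch instead of reading both files to the end:

```python
import os

CHUNK_SIZE = 64 * 1024  # assumed block size; tune for your storage


def files_equal(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Return True if both files have identical contents."""
    # Different sizes mean different contents; no file data is read here.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False  # stop at the first mismatching block
            if not block_a:
                return True  # both files ended without a mismatch
```

Python's standard library offers `filecmp.cmp(path_a, path_b, shallow=False)` for the same kind of content comparison, which is likely the better choice outside of a demonstration.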
If you do precompute hashes consider using a very fast hash (crc32, murmurhash, etc) on portions of the file (first 1K, last 1K, etc) and store that for fast invalidation of files that don't match.
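A hedged sketch of that precomputed partial hash idea, in Python with `zlib.crc32` from the standard library (the names `quick_fingerprint` and `definitely_different` and the choice to include the file size in the fingerprint are my assumptions for illustration). The key point is that the fingerprint only works in one direction: a mismatch proves the files differ, while a match still requires a full comparison.

```python
import os
import zlib

SAMPLE_SIZE = 1024  # "first 1K, last 1K" as suggested above


def quick_fingerprint(path, sample_size=SAMPLE_SIZE):
    """Return (size, crc32 of first bytes, crc32 of last bytes) for a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(sample_size)
        # Seek to the last sample_size bytes (or the start for small files).
        f.seek(max(size - sample_size, 0))
        tail = f.read(sample_size)
    return size, zlib.crc32(head), zlib.crc32(tail)


def definitely_different(fp_a, fp_b):
    """Differing fingerprints prove the files differ; equal ones prove nothing."""
    return fp_a != fp_b
```

You would compute `quick_fingerprint` once per file, store the result, and only fall back to a byte-wise comparison for pairs where `definitely_different` returns False.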
Note that I can easily pad files with arbitrary extra bytes to make your diff check fail - this is why detection by exact match for things like malware is fragile and not generally a great approach.
Thank you