C#•3y ago

File comparison

I want to compare two files very fast. I've implemented a full comparison (comparing size, then 1024-byte spans), but I feel like there might be some unique info in the file headers that I can use for my comparison. What do you think?

17 Replies

toddlahakbar•3y ago

If you want a full comparison, then do a full comparison. Also, last-first is usually a really quick algorithm to search.

mtreit•3y ago

@bookuha what do you mean by unique info in file headers? What kind of files are we talking about?

Susko3•3y ago

If you want to do it really fast, then look into vectorization with e.g. the Vector256 class. I suppose you're comparing for (in)equality?
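A rough sketch of what vectorized equality could look like once both buffers are already in memory. It uses the portable `Vector<byte>` type from `System.Numerics` rather than `Vector256` itself, which is a substitution made here for brevity; `VectorEquals` is a hypothetical helper name, not something from the thread.

```csharp
using System;
using System.Numerics;

// Hypothetical helper: compares two byte arrays one SIMD-width block at a time.
// Vector<byte> is used as a portable stand-in for the Vector256 API.
static bool VectorEquals(byte[] left, byte[] right)
{
    if (left.Length != right.Length)
        return false;

    int width = Vector<byte>.Count; // bytes per vector, depends on the hardware
    int i = 0;

    // Compare full vector-sized blocks and stop at the first mismatch.
    for (; i <= left.Length - width; i += width)
    {
        var blockLeft = new Vector<byte>(left, i);
        var blockRight = new Vector<byte>(right, i);
        if (blockLeft != blockRight)
            return false;
    }

    // Compare the remaining bytes that do not fill a whole vector.
    for (; i < left.Length; i++)
    {
        if (left[i] != right[i])
            return false;
    }

    return true;
}
```

In practice, calling `SequenceEqual` on two spans is probably enough, since recent .NET runtimes are generally understood to vectorize that call internally for primitive types such as `byte`.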

Gooster•3y ago

The bottleneck is probably just reading the files.

bookuhaOP•3y ago

Any, I guess. There must be some info that describes the file to the OS in some of these bytes. And yes, I'm comparing for equality.

Kouhai•3y ago

What do you mean by info that describes the file?

bookuhaOP•3y ago

Headers that the OS uses to identify it. I am just guessing.

Kouhai•3y ago

If you're doing a full file comparison, why would the file headers matter? 😅

bookuhaOP•3y ago

I don't really want to do a full comparison; I want to do it as fast as possible. I need to compare them, but going through the full size is more expensive, so I decided to look for some array of bytes that can be compared instead of the entire file.

Susko3•3y ago

Compare sections and stop when you find a non-matching section, makes sense. The important question for you may be: "where do the files usually differ?" Maybe in the beginning, maybe in the end?
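The "compare sections and stop at the first mismatch" idea could be sketched roughly like this. `StreamsEqual` and `FillBuffer` are hypothetical helper names, and the 1024-byte default just mirrors the span size mentioned in the original question.

```csharp
using System;
using System.IO;

// Hypothetical helper: reads both streams in fixed-size sections and
// returns false as soon as one section differs.
static bool StreamsEqual(Stream first, Stream second, int chunkSize = 1024)
{
    var bufferFirst = new byte[chunkSize];
    var bufferSecond = new byte[chunkSize];

    while (true)
    {
        int readFirst = FillBuffer(first, bufferFirst);
        int readSecond = FillBuffer(second, bufferSecond);

        if (readFirst != readSecond)
            return false; // one stream ended before the other
        if (readFirst == 0)
            return true;  // both streams ended with no mismatch found

        var sectionFirst = new ReadOnlySpan<byte>(bufferFirst, 0, readFirst);
        var sectionSecond = new ReadOnlySpan<byte>(bufferSecond, 0, readSecond);
        if (!sectionFirst.SequenceEqual(sectionSecond))
            return false; // early exit on the first differing section
    }
}

// Stream.Read may return fewer bytes than requested, so keep reading
// until the buffer is full or the stream ends.
static int FillBuffer(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0)
            break;
        total += read;
    }
    return total;
}
```

For files on disk, the two streams would typically come from `File.OpenRead(path)`. If the files are known to differ mostly near the end, the same loop could be run from the last section backwards instead.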

bookuhaOP•3y ago

Yes! I want to find out.

Kouhai•3y ago

The header of two files can be the same while the rest is different, even for a simple BMP file.

bookuhaOP•3y ago

hmm, got it

Susko3•3y ago

well.... you're the one with the files

Gooster•3y ago

It's possible to listen for file changes in a folder. Maybe make a hash for every changed file so you can compare hashes? Oh yeah, you can also compare the file name first and then the last modified date.

mtreit•3y ago

Files are completely arbitrary sequences of bytes. Unless you are restricted to a very specific file format you can't rely on headers - many file formats have no headers. (Think of a text file.)

File names and things like the modified time, which are file system metadata and not part of the file itself, tell you nothing about the file contents and should never be relied upon.

If you have two files and want to compare them for equality, hashing both of them is much of the time far more expensive than doing a byte-by-byte comparison. (If you pre-hash and store the hashes for later comparison it can be useful.)

Comparing file size first will immediately tell you if the files are different without doing any file I/O. If the file sizes are the same you can proceed with a byte-wise comparison.

If you do precompute hashes, consider using a very fast hash (crc32, murmurhash, etc.) on portions of the file (first 1K, last 1K, etc.) and store that for fast invalidation of files that don't match.

Note that I can easily pad files with arbitrary extra bytes to make your diff check fail - this is why detection by exact match for things like malware is fragile and not generally a great approach.
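A minimal sketch of the precomputed "size plus first 1K and last 1K" fingerprint described above, assuming a seekable stream. FNV-1a is used here only as a simple, dependency-free stand-in for crc32 or murmurhash, and `Fingerprint`, `Fnv1a`, and `ReadFully` are hypothetical names. A mismatching fingerprint proves the files differ; a matching one proves nothing and still needs a byte-wise comparison.

```csharp
using System;
using System.IO;

// 64-bit FNV-1a, standing in for a fast non-cryptographic hash.
static ulong Fnv1a(byte[] data)
{
    unchecked
    {
        ulong hash = 14695981039346656037UL; // FNV offset basis
        foreach (byte value in data)
        {
            hash ^= value;
            hash *= 1099511628211UL;         // FNV prime
        }
        return hash;
    }
}

// Reads exactly buffer.Length bytes or throws if the stream ends early.
static void ReadFully(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0)
            throw new EndOfStreamException();
        total += read;
    }
}

// Hypothetical fingerprint: length plus hashes of the first and last window.
// Requires a seekable stream (for example one returned by File.OpenRead).
static (long Length, ulong Head, ulong Tail) Fingerprint(Stream stream, int window = 1024)
{
    long length = stream.Length;
    int size = (int)Math.Min(window, length);
    var buffer = new byte[size];

    stream.Position = 0;
    ReadFully(stream, buffer);
    ulong head = Fnv1a(buffer);

    stream.Position = length - size;
    ReadFully(stream, buffer);
    ulong tail = Fnv1a(buffer);

    return (length, head, tail);
}
```

Stored fingerprints can then be compared with `==` as tuples. The length component covers the "compare file size first" step, and a change in the middle of a file goes unnoticed by design, which is why this only works as a quick rejection test.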

bookuhaOP•3y ago

Thank you
