Find file duplicates | Optimize
How can I optimize this?
https://gist.github.com/bookuha/707027336b40663b14c3cb64b039e9ec
FindDuplicatesInDirectory
Avoid LINQ, perhaps
Use pointers, stackallocs
LINQ will create and go through an enumerator. If you can avoid that, you've already made up some time
I don't really see how the file length matters. Two files can have the same length but have different content
That's a given
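On the enumerator point, the difference is roughly this — a LINQ chain allocates iterator objects and delegates, while a plain loop does the same work without them. A minimal sketch (the names and the tuple shape are just illustrative, not from the gist):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqVsLoop
{
    // LINQ version: allocates iterator objects for Where/Select plus the delegates.
    public static List<long> SizesLinq(IEnumerable<(string Path, long Length)> files) =>
        files.Where(f => f.Length > 0).Select(f => f.Length).ToList();

    // Plain loop: same result, no intermediate enumerators or lambdas.
    public static List<long> SizesLoop(IEnumerable<(string Path, long Length)> files)
    {
        var sizes = new List<long>();
        foreach (var f in files)
            if (f.Length > 0)
                sizes.Add(f.Length);
        return sizes;
    }
}
```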
There are even cases with .NET 7 now where an explicit implementation would be slower
I dunno
I don't see how making this async has anything to do with the perf of it
Hm, not a bad idea
How does that work? You'd obviously not wanna read the entire file at once
My code isn't great either, going byte by byte
Ideally you'd read in chunks
Then fix those byte arrays and for over the pointers
That'd be the most performant i think
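The chunked comparison could look something like this — a sketch, with the buffer size and helper name as assumptions. Note you don't even need pointers: `Span<byte>.SequenceEqual` is already vectorized internally.

```csharp
using System;
using System.IO;

static class FileCompare
{
    // Compares two files chunk by chunk instead of byte by byte.
    public static bool ContentsEqual(string pathA, string pathB, int bufferSize = 81920)
    {
        using var a = File.OpenRead(pathA);
        using var b = File.OpenRead(pathB);
        if (a.Length != b.Length) return false; // cheap early exit

        var bufA = new byte[bufferSize];
        var bufB = new byte[bufferSize];
        while (true)
        {
            int readA = Fill(a, bufA);
            int readB = Fill(b, bufB);
            if (readA == 0 && readB == 0) return true; // both streams exhausted
            // SequenceEqual over spans uses vectorized comparison under the hood.
            if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                return false;
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the stream ends.
    private static int Fill(Stream s, byte[] buf)
    {
        int total = 0;
        while (total < buf.Length)
        {
            int n = s.Read(buf, total, buf.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```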
Actually the hashing approach is pretty decent
I'm not sure where that bottlenecks
Would you even need to read in chunks when you use the hash?
Like if you use sha512
Wouldn't that mean the files are equal?
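Practically yes — an SHA-512 collision is astronomically unlikely, so equal digests can be treated as equal files (a paranoid final byte-compare is still an option). The per-file hash could be as simple as this sketch; `ComputeHash` streams the file internally, so it's never loaded whole into memory:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHash
{
    // Streams the whole file through SHA-512 and returns the digest as hex.
    public static string Sha512Hex(string path)
    {
        using var sha = SHA512.Create();
        using var stream = File.OpenRead(path);
        byte[] digest = sha.ComputeHash(stream); // reads the stream in chunks internally
        return Convert.ToHexString(digest);      // .NET 5+
    }
}
```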
Hm
The sequence equal approach can probably also be improved
For files that are almost identical. But probably not worth the time. Maybe there are already optimizations happening tho
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs,998a36a55f580ab1
Looks already pretty good
Pepega ass code
Don't know if that one using directive is legal
On my phone lol
Ah yeah that's fair
Obviously
Since it's smaller
But just to make sure
I can't really dive into the solutions right now, driving home. Thank you
Will check a bit later
I am currently grouping things up by their size
And then comparing files in the resulting buckets
But my comparison approach is not good
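Putting the size-bucket idea together with hashing, the whole pipeline might look like this — a sketch under the assumptions above (SHA-512 for the content check, names illustrative), not the gist's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Groups files by length first (free metadata check), then hashes only
    // files that share a length, and reports groups of 2+ identical hashes.
    public static List<List<string>> FindDuplicates(string directory)
    {
        var bySize = new Dictionary<long, List<string>>();
        foreach (var path in Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories))
        {
            long length = new FileInfo(path).Length;
            if (!bySize.TryGetValue(length, out var bucket))
                bySize[length] = bucket = new List<string>();
            bucket.Add(path);
        }

        var duplicates = new List<List<string>>();
        using var sha = SHA512.Create();
        foreach (var bucket in bySize.Values)
        {
            if (bucket.Count < 2) continue; // unique size => unique content
            var byHash = new Dictionary<string, List<string>>();
            foreach (var path in bucket)
            {
                using var stream = File.OpenRead(path);
                string hash = Convert.ToHexString(sha.ComputeHash(stream));
                if (!byHash.TryGetValue(hash, out var group))
                    byHash[hash] = group = new List<string>();
                group.Add(path);
            }
            duplicates.AddRange(byHash.Values.Where(g => g.Count > 1));
        }
        return duplicates;
    }
}
```

The size bucketing means most files are never read at all, which is usually the biggest win here.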
Try to make my approaches async as far as possible. I don't use async programming enough to know what's possible and what's good. Perhaps WhenAll the BinaryReader reads? And cache the hashes
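For what it's worth, async won't speed up a single read, but it lets several files be hashed concurrently so the disk I/O overlaps. A `Task.WhenAll` sketch (helper names are made up, and `ComputeHashAsync` needs .NET 5+):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class AsyncHashing
{
    // Hashes one file without blocking a thread on the disk I/O.
    public static async Task<(string Path, string Hash)> HashFileAsync(string path)
    {
        await using var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            bufferSize: 81920, useAsync: true);
        using var sha = SHA512.Create();
        byte[] digest = await sha.ComputeHashAsync(stream);
        return (path, Convert.ToHexString(digest));
    }

    // Starts all hashes and awaits them together.
    public static Task<(string Path, string Hash)[]> HashAllAsync(IEnumerable<string> paths) =>
        Task.WhenAll(paths.Select(HashFileAsync));
}
```

Unbounded `WhenAll` over thousands of files can thrash the disk, so capping concurrency (e.g. with a `SemaphoreSlim`) may be worth it for large directories.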
Rare
Oh, yes. Thank you
Hah, yeah
I won't be home for another 5 hours so it's difficult for me
Ah i was assuming the steam directory
Also different for Linux imagine
You guys have like a billion empty config files
Thank you guys!