C
C#3y ago
bookuha

Find file duplicates | Optimize

Gist
FindDuplicatesInDirectory
FindDuplicatesInDirectory. GitHub Gist: instantly share code, notes, and snippets.
10 Replies
ero
ero3y ago
Avoid linq, perhaps Use pointers, stackallocs
perhaps
Linq will create and go through an enumerator. If you can avoid that, you've already made up some time I don't really see how the file length matters 2 files can have the same length but have different content That's a given
qqdev
qqdev3y ago
There are even cases with .NET 7 now where an explicit implementation would be slower kekw
ero
ero3y ago
void FindDuplicateFiles(string directory)
{
var dirInfo = new DirectoryInfo(directory);
var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

var skips = 1;
using var eInner = fileInfos.GetEnumerator();

while (eInner.MoveNext())
{
var fiInner = eInner.Current;

var skipped = 0;
using var eOuter = fileInfo.GetEnumerator();

while (eOuter.MoveNext())
{
if (skipped++ < skip)
continue;

var fiOuter = eOuter.Current;

if (FileContentsEqual(fiInner, fiOuter))
Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
}

i++;
}
}

bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
if (fi1.Length != fi2.Length)
return false;

using var br1 = new BinaryReader(fi1.OpenRead());
using var br2 = new BinaryReader(fi2.OpenRead());

while (br1.BaseStream.Position != br1.BaseStream.Length)
{
var b1 = br1.ReadByte();
var b2 = br2.ReadByte();

if (b1 != b2)
return false;
}

return true;
}
void FindDuplicateFiles(string directory)
{
var dirInfo = new DirectoryInfo(directory);
var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

var skips = 1;
using var eInner = fileInfos.GetEnumerator();

while (eInner.MoveNext())
{
var fiInner = eInner.Current;

var skipped = 0;
using var eOuter = fileInfo.GetEnumerator();

while (eOuter.MoveNext())
{
if (skipped++ < skip)
continue;

var fiOuter = eOuter.Current;

if (FileContentsEqual(fiInner, fiOuter))
Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
}

i++;
}
}

bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
if (fi1.Length != fi2.Length)
return false;

using var br1 = new BinaryReader(fi1.OpenRead());
using var br2 = new BinaryReader(fi2.OpenRead());

while (br1.BaseStream.Position != br1.BaseStream.Length)
{
var b1 = br1.ReadByte();
var b2 = br2.ReadByte();

if (b1 != b2)
return false;
}

return true;
}
I dunno I don't see how making this async has anything to do with the perf of it Hm, not a bad idea How does that work? You'd obviously not wanna read the entire file at once My code isn't great either, going byte by byte Ideally you'd read in chunks Then fix those byte arrays and for over the pointers That'd be the most performant i think Actually the hashing approach is pretty decent I'm not sure where that bottlenecks Would you even need to read in chunks when you use the hash? Like if you use sha512 Wouldn't that mean the files are equal? Hm
qqdev
qqdev3y ago
The sequence equal approach can probably also be improved For files that are almost similiar. But probably not worth the time. Maybe there are already optimizations happening tho https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs,998a36a55f580ab1 Looks already pretty good
ero
ero3y ago
void FindDuplicateFiles(string directory)
{
var dirInfo = new DirectoryInfo(directory);
var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

var skips = 1;
using var eOuter = fileInfos.GetEnumerator();

while (eOuter.MoveNext())
{
var fiOuter = eOuter.Current;

var skipped = 0;
using var eInner = fileInfo.GetEnumerator();

while (eInner.MoveNext())
{
if (skipped++ < skip)
continue;

var fiInner = eOuter.Current;

if (FileContentsEqual(fiOuter, fiInner))
Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
}
}
}

unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
if (fi1.Length != fi2.Length)
return false;

var (s1, s2) = (fi1.OpenRead(), fi2.OpenRead());

if (ComputeHash(s1) == ComputeHash(s2))
return true;

s1.Position = s2.Position = 0;

using (BinaryReader br1, BinaryReader br2) = (new(s1), new(s2));

const int CHUNK_SIZE = 256;

while (br1.BaseStream.Position != br1.BaseStream.Length)
{
var b1 = br1.ReadBytes(CHUNK_SIZE);
var b2 = br2.ReadBytes(CHUNK_SIZE);

fixed (byte* pB1 = b1, pB2 = b2)
{
for (int i = 0; i < b1.Length; i++)
{
if (pB1[i] != pB2[i])
return false;
}
}
}

return true;
}

string ComputeHash(Stream stream)
{
using var sha512 = SHA512.Create();

var hash = sha512.ComputeHash(stream);
return Convert.ToBase64String(hash);
}
void FindDuplicateFiles(string directory)
{
var dirInfo = new DirectoryInfo(directory);
var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

var skips = 1;
using var eOuter = fileInfos.GetEnumerator();

while (eOuter.MoveNext())
{
var fiOuter = eOuter.Current;

var skipped = 0;
using var eInner = fileInfo.GetEnumerator();

while (eInner.MoveNext())
{
if (skipped++ < skip)
continue;

var fiInner = eOuter.Current;

if (FileContentsEqual(fiOuter, fiInner))
Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
}
}
}

unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
if (fi1.Length != fi2.Length)
return false;

var (s1, s2) = (fi1.OpenRead(), fi2.OpenRead());

if (ComputeHash(s1) == ComputeHash(s2))
return true;

s1.Position = s2.Position = 0;

using (BinaryReader br1, BinaryReader br2) = (new(s1), new(s2));

const int CHUNK_SIZE = 256;

while (br1.BaseStream.Position != br1.BaseStream.Length)
{
var b1 = br1.ReadBytes(CHUNK_SIZE);
var b2 = br2.ReadBytes(CHUNK_SIZE);

fixed (byte* pB1 = b1, pB2 = b2)
{
for (int i = 0; i < b1.Length; i++)
{
if (pB1[i] != pB2[i])
return false;
}
}
}

return true;
}

string ComputeHash(Stream stream)
{
using var sha512 = SHA512.Create();

var hash = sha512.ComputeHash(stream);
return Convert.ToBase64String(hash);
}
Pepega ass code Don't know if that one using directive is legal On my phone lol Ah yeah that's fair Obviously Since it's smaller But just to make sure
bookuha
bookuhaOP3y ago
I can’t really dive into the solutions right now, driving home. Thank you Will check a bit later I am currently grouping things up by their size And then comparing files in the resulting buckets But my comparison approach is not good
ero
ero3y ago
Try to make my approaches async as far as possible. I don't use async programming enough to know what's possible and good. Perhaps WhenAll the binaryreader reads? And cache the hashes Rare
bookuha
bookuhaOP3y ago
Oh, yes. Thank you
ero
ero3y ago
Hah, yeah I won't be home for another 5 hours so it's difficult for me Ah i was assuming the steam directory Also different for Linux imagine You guys have like a billion empty config files
bookuha
bookuhaOP3y ago
Thank you guys!

Did you find this page helpful?