C#
2y ago
bookuha

Find file duplicates | Optimize

Gist: FindDuplicatesInDirectory
10 Replies
ero
2y ago
Avoid LINQ, perhaps. Use pointers, stackallocs, perhaps.
LINQ will create and go through an enumerator; if you can avoid that, you've already made up some time.
I don't really see how the file length matters. Two files can have the same length but different content. That's a given.
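(A minimal sketch of the stackalloc suggestion, not from the thread: compare two streams chunk by chunk through stack-allocated buffers with a plain index loop, no LINQ. Stream.ReadAtLeast is .NET 7+; buffer size is arbitrary.)

using System;
using System.IO;

static bool StreamsEqual(Stream s1, Stream s2)
{
    // Two 4 KiB buffers on the stack; no per-chunk heap allocation.
    Span<byte> buf1 = stackalloc byte[4096];
    Span<byte> buf2 = stackalloc byte[4096];

    while (true)
    {
        // throwOnEndOfStream: false lets the final, short chunk through.
        int read1 = s1.ReadAtLeast(buf1, buf1.Length, throwOnEndOfStream: false);
        int read2 = s2.ReadAtLeast(buf2, buf2.Length, throwOnEndOfStream: false);

        if (read1 != read2)
            return false;
        if (read1 == 0)
            return true; // both streams exhausted, no mismatch found

        for (int i = 0; i < read1; i++)
            if (buf1[i] != buf2[i])
                return false;
    }
}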
qqdev
2y ago
There are even cases with .NET 7 now where an explicit implementation would be slower kekw
ero
2y ago
void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

    // Naive O(n^2) pairwise comparison; "skips" ensures each unordered
    // pair of files is only compared once.
    var skips = 1;
    using var eOuter = fileInfos.GetEnumerator();

    while (eOuter.MoveNext())
    {
        var fiOuter = eOuter.Current;

        var skipped = 0;
        using var eInner = fileInfos.GetEnumerator();

        while (eInner.MoveNext())
        {
            if (skipped++ < skips)
                continue;

            var fiInner = eInner.Current;

            if (FileContentsEqual(fiOuter, fiInner))
                Console.WriteLine($"{fiOuter.FullName} and {fiInner.FullName} are equal.");
        }

        skips++;
    }
}

bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    // Cheapest check first: files of different length can never be equal.
    if (fi1.Length != fi2.Length)
        return false;

    using var br1 = new BinaryReader(fi1.OpenRead());
    using var br2 = new BinaryReader(fi2.OpenRead());

    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadByte();
        var b2 = br2.ReadByte();

        if (b1 != b2)
            return false;
    }

    return true;
}
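(The skip counter above is just emulating a triangular pair loop. A minimal sketch of the same pairing written index-based, not from the thread; ToList needs System.Linq and trades memory for avoiding repeated directory enumeration. It reuses FileContentsEqual from above.)

using System;
using System.IO;
using System.Linq;

// Each unordered pair (i, j) with j > i is compared exactly once.
void FindDuplicateFilesPairwise(string directory)
{
    var files = new DirectoryInfo(directory)
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .ToList();

    for (int i = 0; i < files.Count; i++)
        for (int j = i + 1; j < files.Count; j++)
            if (FileContentsEqual(files[i], files[j]))
                Console.WriteLine($"{files[i].FullName} and {files[j].FullName} are equal.");
}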
I dunno, I don't see how making this async has anything to do with the perf of it.
Hm, not a bad idea. How does that work? You'd obviously not wanna read the entire file at once.
My code isn't great either, going byte by byte. Ideally you'd read in chunks, then pin those byte arrays with fixed and loop over the pointers. That'd be the most performant, I think.
Actually, the hashing approach is pretty decent. I'm not sure where that bottlenecks.
Would you even need to read in chunks when you use the hash? Like, if you use SHA-512 and the hashes match, wouldn't that mean the files are equal? Hm
qqdev
2y ago
The SequenceEqual approach can probably also be improved for files that are almost similar, but probably not worth the time. Maybe there are already optimizations happening, though: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs,998a36a55f580ab1
Looks pretty good already.
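(For reference, a sketch of leaning on that vectorized SequenceEqual path for a chunked comparison; an illustration rather than code from the thread. Buffer size is arbitrary and Stream.ReadAtLeast is .NET 7+.)

using System;
using System.IO;

static bool StreamsEqualVectorized(Stream s1, Stream s2)
{
    var buf1 = new byte[64 * 1024];
    var buf2 = new byte[64 * 1024];

    while (true)
    {
        int read1 = s1.ReadAtLeast(buf1, buf1.Length, throwOnEndOfStream: false);
        int read2 = s2.ReadAtLeast(buf2, buf2.Length, throwOnEndOfStream: false);

        if (read1 != read2)
            return false;
        if (read1 == 0)
            return true;

        // Span<byte>.SequenceEqual dispatches into the vectorized
        // SpanHelpers.Byte code linked above.
        if (!buf1.AsSpan(0, read1).SequenceEqual(buf2.AsSpan(0, read2)))
            return false;
    }
}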
ero
2y ago
void FindDuplicateFiles(string directory)
{
    var dirInfo = new DirectoryInfo(directory);
    var fileInfos = dirInfo.EnumerateFiles("*", SearchOption.AllDirectories);

    var skips = 1;
    using var eOuter = fileInfos.GetEnumerator();

    while (eOuter.MoveNext())
    {
        var fiOuter = eOuter.Current;

        var skipped = 0;
        using var eInner = fileInfos.GetEnumerator();

        while (eInner.MoveNext())
        {
            // Skip the pairs we already compared in earlier passes.
            if (skipped++ < skips)
                continue;

            var fiInner = eInner.Current;

            if (FileContentsEqual(fiOuter, fiInner))
                Console.WriteLine($"{fiInner.FullName} and {fiOuter.FullName} are equal.");
        }

        skips++;
    }
}

unsafe bool FileContentsEqual(FileInfo fi1, FileInfo fi2)
{
    if (fi1.Length != fi2.Length)
        return false;

    using var s1 = fi1.OpenRead();
    using var s2 = fi2.OpenRead();

    if (ComputeHash(s1) == ComputeHash(s2))
        return true;

    // If the hashes differ the files must differ, so strictly this could
    // just return false; the chunked compare below doubles as a fallback
    // and finds the first mismatching byte.
    s1.Position = s2.Position = 0;

    // (A tuple-style `using (BinaryReader br1, BinaryReader br2) = ...`
    // isn't legal C#; two separate using declarations are.)
    using var br1 = new BinaryReader(s1);
    using var br2 = new BinaryReader(s2);

    const int CHUNK_SIZE = 256;

    while (br1.BaseStream.Position != br1.BaseStream.Length)
    {
        var b1 = br1.ReadBytes(CHUNK_SIZE);
        var b2 = br2.ReadBytes(CHUNK_SIZE);

        fixed (byte* pB1 = b1, pB2 = b2)
        {
            for (int i = 0; i < b1.Length; i++)
            {
                if (pB1[i] != pB2[i])
                    return false;
            }
        }
    }

    return true;
}

string ComputeHash(Stream stream)
{
    using var sha512 = SHA512.Create();

    var hash = sha512.ComputeHash(stream);
    return Convert.ToBase64String(hash);
}
Pepega ass code. Don't know if that one using declaration is legal; on my phone lol.
Ah yeah, that's fair. Obviously, since it's smaller. But just to make sure.
bookuha
2y ago
I can’t really dive into the solutions right now, driving home. Thank you, will check a bit later.
I am currently grouping things up by their size and then comparing files in the resulting buckets, but my comparison approach is not good.
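(A minimal sketch of that size-bucket idea; names are illustrative, not bookuha's actual code. Only files sharing a length can be duplicates, so group by Length and run the triangular pairing from earlier inside each bucket, reusing FileContentsEqual from above.)

using System;
using System.IO;
using System.Linq;

void FindDuplicateFilesBySize(string directory)
{
    var buckets = new DirectoryInfo(directory)
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .GroupBy(fi => fi.Length)    // bucket by file size
        .Where(g => g.Count() > 1);  // singleton buckets can't hold duplicates

    foreach (var bucket in buckets)
    {
        var files = bucket.ToList();

        for (int i = 0; i < files.Count; i++)
            for (int j = i + 1; j < files.Count; j++)
                if (FileContentsEqual(files[i], files[j]))
                    Console.WriteLine($"{files[i].FullName} and {files[j].FullName} are equal.");
    }
}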
ero
2y ago
Try to make my approaches async as far as possible. I don't use async programming enough to know what's possible and good. Perhaps WhenAll the BinaryReader reads? And cache the hashes. Rare
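(A sketch of what the async-plus-cached-hashes idea could look like, assuming .NET 7 for SHA512.HashDataAsync and treating equal SHA-512 hashes as equal files, as discussed above; a collision is astronomically unlikely. Not from the thread.)

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

async Task FindDuplicateFilesAsync(string directory)
{
    var files = new DirectoryInfo(directory)
        .EnumerateFiles("*", SearchOption.AllDirectories);

    // Hash every file exactly once, concurrently; the dictionary is the cache.
    var hashes = new ConcurrentDictionary<string, string>();

    await Parallel.ForEachAsync(files, async (fi, ct) =>
    {
        await using var stream = fi.OpenRead();
        var hash = await SHA512.HashDataAsync(stream, ct);
        hashes[fi.FullName] = Convert.ToBase64String(hash);
    });

    // Any two paths sharing a hash are duplicates of each other.
    foreach (var group in hashes.GroupBy(kvp => kvp.Value).Where(g => g.Count() > 1))
        Console.WriteLine(string.Join(" and ", group.Select(kvp => kvp.Key)) + " are equal.");
}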
bookuha
2y ago
Oh, yes. Thank you
ero
2y ago
Hah, yeah. I won't be home for another 5 hours, so it's difficult for me.
Ah, I was assuming the Steam directory. Also different for Linux, imagine. You guys have like a billion empty config files.
bookuha
2y ago
Thank you guys!