C# · 3mo ago
Daiko Games

✅ Getting duplicates out of a list

Hey, I have some C# code with a list of byte arrays. They are extracted from the images in a folder, and I don't know how many there are (there are a lot of them; the program is a Windows Forms app). The list contains duplicates, but I don't know how to get the duplicates out of the list dynamically. I couldn't find a really good solution that does what I want. Thanks for the help!
36 Replies
Keswiik · 3mo ago
What are you considering a duplicate? If you have the exact same image visually, but at a different resolution, is that a duplicate? How about the same image in two different formats?
Daiko Games (OP) · 3mo ago
No, no, what I mean is the byte array itself. The byte arrays are sometimes the same, which means everything in those images is the same, doesn't it?
Keswiik · 3mo ago
Sure, in that specific case. But if the goal is to detect duplicate images, then that kind of comparison will miss a lot of stuff.
Daiko Games (OP) · 3mo ago
Yes, I know, but that would make the code larger, and this converter should be simple. Do you know a solution?
Keswiik · 3mo ago
There are a ton of different methods for finding duplicate images, they're not exactly simple though
Keswiik · 3mo ago
Stack Overflow
Algorithm to compare two images
Given two different image files (in whatever format I choose), I need to write a program to predict the chance if one being the illegal copy of another. The author of the copy may do stuff like rot...
Daiko Games (OP) · 3mo ago
OK, if that's so, then I think I would have to do it the complicated way too.
ero · 3mo ago
your approach would only really work if the images are genuinely identical like 1:1 copy and paste and even then i'm not sure (what about time stamps?)
Daiko Games (OP) · 3mo ago
They are, that's the point. That's why I need help: is there an easy way to do that?
ero · 3mo ago
so you have a List<byte[]>?
Daiko Games (OP) · 3mo ago
Yes
ero · 3mo ago
that won't entirely be enough, you have the file names somewhere? unless you only care about the indices
Daiko Games (OP) · 3mo ago
Yes, I have them. Isn't the hash code normally enough?
MODiX · 3mo ago
ero
REPL Result: Success
byte[] arr = [];
arr.GetHashCode()
Result: int
48389476
Compile: 293.479ms | Execution: 54.281ms
ero · 3mo ago
no :p or, to be more clear:
MODiX · 3mo ago
ero
REPL Result: Success
byte[] arr1 = [];
byte[] arr2 = [];

(arr1.GetHashCode(), arr2.GetHashCode())
Result: ValueTuple<int, int>
{
  "item1": 27554885,
  "item2": 27554885
}
Compile: 364.339ms | Execution: 27.222ms
ero · 3mo ago
:ReallyMad: i doubt you can compare arrays with their hashcode
Daiko Games (OP) · 3mo ago
OK, sorry, I don't understand the topic that well yet. I'm trying to understand it.
ero · 3mo ago
they're reference types, but i don't know their hashcode implementation
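A minimal sketch of what is going on here, assuming a current .NET runtime: `byte[]` does not override `GetHashCode`, so the call falls back to the identity-based hash from `System.Object`, which says nothing about the array's contents. The two REPL results above probably matched only because an empty `[]` can be served from a shared cached instance (as `Array.Empty<byte>()` is), so `arr1` and `arr2` may have been the same object. The variable names below are illustrative only.

```csharp
using System;
using System.Linq;
using System.Runtime.CompilerServices;

byte[] a = new byte[] { 1, 2, 3 };
byte[] b = new byte[] { 1, 2, 3 };

// Same contents, but two separate objects.
bool sameReference = ReferenceEquals(a, b);
bool sameContents = a.SequenceEqual(b);

// Arrays inherit GetHashCode from System.Object, so the value is the
// identity hash of the instance, not a hash of the bytes inside it.
bool hashIsIdentityBased = a.GetHashCode() == RuntimeHelpers.GetHashCode(a);

// Array.Empty<T>() hands out one cached instance per element type.
bool emptyIsShared = ReferenceEquals(Array.Empty<byte>(), Array.Empty<byte>());

Console.WriteLine($"same reference: {sameReference}");
Console.WriteLine($"same contents: {sameContents}");
Console.WriteLine($"identity-based hash: {hashIsIdentityBased}");
Console.WriteLine($"empty array shared: {emptyIsShared}");
```

Because the hash depends on the instance, `a.GetHashCode()` and `b.GetHashCode()` will usually differ even though the bytes are identical, which is why the default hash code cannot be used to spot duplicate arrays.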
Daiko Games (OP) · 3mo ago
OK, how would you do it if it were simple? I mean filtering out the duplicate byte arrays.
ero · 3mo ago
really good question honestly
i mean, it's not hard or anything, just hard to get right
the easiest and slowest way would of course be a nested loop, comparing each item one by one with all other items
you would use SequenceEqual on the arrays to make sure they're identical
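The nested-loop idea described above could look roughly like this. It is a sketch, not the code from the thread: the `images` and `unique` names and the sample data are made up for illustration, and the only library call it relies on is LINQ's `SequenceEqual`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the image bytes.
var images = new List<byte[]>
{
    new byte[] { 1, 2, 3 },
    new byte[] { 4, 5, 6 },
    new byte[] { 1, 2, 3 }, // duplicate of index 0
    new byte[] { 4, 5, 6 }, // duplicate of index 1
};

var unique = new List<byte[]>();
foreach (byte[] candidate in images)
{
    bool alreadyKept = false;

    // Compare the candidate against everything kept so far.
    foreach (byte[] kept in unique)
    {
        if (kept.SequenceEqual(candidate))
        {
            alreadyKept = true;
            break;
        }
    }

    if (!alreadyKept)
    {
        unique.Add(candidate);
    }
}

Console.WriteLine(unique.Count); // prints 2
```

The cost grows with the square of the number of images, since every array may be compared with every array kept before it. For a one-off run over a modest folder that may still be acceptable.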
Daiko Games (OP) · 3mo ago
But that is time-consuming and also power-consuming if there are many images.
ero · 3mo ago
are you doing this more than once?
Daiko Games (OP) · 3mo ago
No
ero · 3mo ago
so?
Daiko Games (OP) · 3mo ago
Isn't HashSet the thing that does that better?
ero · 3mo ago
hashset still needs a comparer
Daiko Games (OP) · 3mo ago
Ohhhhhhh, that is why it didn't work. Now I understand, thanks. I tested things before I wrote here.
OK, what else is possible instead of SequenceEqual, or is there a way to dynamically compare two byte arrays from my list with SequenceEqual?
ero · 3mo ago
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Linq;

public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[]? x, byte[]? y)
    {
        // Same instance (or both null) counts as equal.
        if (ReferenceEquals(x, y))
        {
            return true;
        }

        if (x == null || y == null)
        {
            return false;
        }

        return x.SequenceEqual(y);
    }

    public int GetHashCode([DisallowNull] byte[] obj)
    {
        // Hash the contents. obj.GetHashCode() is identity-based, so equal
        // arrays would get different hashes and never be grouped together.
        var hash = new HashCode();
        hash.AddBytes(obj); // requires .NET 6 or later
        return hash.ToHashCode();
    }
}
var dups = list
    .Select((bytes, i) => (bytes, i))
    .GroupBy(t => t.Item1, new ByteArrayComparer())
    .Where(g => g.Count() > 1)
    .SelectMany(g => g.Select(t => t.Item2));
this should return an enumerable of duplicate indices
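If the goal is to drop the duplicates rather than list their indices, one possible variant is to pass a content-based comparer to LINQ's `Distinct`. This is a sketch under assumptions: `ContentComparer` is a hypothetical name, the sample data is made up, and `HashCode.AddBytes` needs .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the image bytes.
var list = new List<byte[]>
{
    new byte[] { 1, 2, 3 },
    new byte[] { 1, 2, 3 }, // duplicate of index 0
    new byte[] { 7, 8, 9 },
};

// Distinct keeps the first occurrence of each distinct array.
List<byte[]> withoutDuplicates = list.Distinct(new ContentComparer()).ToList();

Console.WriteLine(withoutDuplicates.Count); // prints 2

// Hypothetical comparer: equality and hash are both based on contents.
sealed class ContentComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[]? x, byte[]? y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(byte[] obj)
    {
        // Equal contents must produce equal hashes, otherwise
        // Distinct, GroupBy and HashSet never call Equals on them.
        var hash = new HashCode();
        hash.AddBytes(obj);
        return hash.ToHashCode();
    }
}
```

The same comparer can be handed to `new HashSet<byte[]>(comparer)`, which is the piece a plain `HashSet<byte[]>` was missing earlier in the thread.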
MODiX · 3mo ago
ero
sharplab.io (click here)
var b1 = GetBytes(50, 1337);
var b2 = GetBytes(50, 1337);
var b3 = GetBytes(50, 42);
Console.WriteLine(b1 == b2); // not reference equal
var list = new List<byte[]>([b1, b2, b3]);
var dups = list
    .Select((bytes, i) => (bytes, i))
    .GroupBy(t => t.Item1, new ByteArrayComparer())
    .Where(g => g.Count() > 1)
    .SelectMany(g => g.Select(t => t.Item2));
// 34 more lines. Follow the link to view.
ero · 3mo ago
here's some fixed code
Daiko Games (OP) · 3mo ago
OK, thanks for the help!
ero · 3mo ago
the reason the hashcode alone isn't enough is that hashcodes can still collide
it's meant as a pre-check before doing the actual equality check
Daiko Games (OP) · 3mo ago
👍🏻 k
ero · 3mo ago
if the hashcodes already don't line up, we don't need to check for equality
there's certainly many ways to optimize this, but that's a topic for #allow-unsafe-blocks
$close
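To make the pre-check idea concrete, here is a small sketch with a comparer that counts how often each method is called. Everything in it is illustrative: `CountingComparer` is a hypothetical name, and the hash is a simple FNV-1a loop chosen only because it is deterministic. The behaviour described is what current `HashSet<T>` implementations do; it is not a documented guarantee.

```csharp
using System;
using System.Collections.Generic;

var comparer = new CountingComparer();
var set = new HashSet<byte[]>(comparer);

set.Add(new byte[] { 1, 2, 3 });
set.Add(new byte[] { 1, 2, 3 }); // same hash, so Equals runs and rejects it
set.Add(new byte[] { 1, 2, 4 }); // different hash, so Equals can be skipped

Console.WriteLine($"items: {set.Count}");
Console.WriteLine($"GetHashCode calls: {comparer.HashCalls}");
Console.WriteLine($"Equals calls: {comparer.EqualsCalls}");

// Hypothetical comparer that records how often it is consulted.
sealed class CountingComparer : IEqualityComparer<byte[]>
{
    public int HashCalls;
    public int EqualsCalls;

    public bool Equals(byte[]? x, byte[]? y)
    {
        EqualsCalls++;
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(byte[] obj)
    {
        HashCalls++;
        unchecked
        {
            // FNV-1a over the bytes: cheap, deterministic, content-based.
            uint hash = 2166136261u;
            foreach (byte b in obj)
            {
                hash = (hash ^ (uint)b) * 16777619u;
            }
            return (int)hash;
        }
    }
}
```

Every `Add` computes a hash, but the full byte-by-byte comparison should only run when two hashes match. A matching hash still has to be confirmed with `Equals`, because different contents can collide on the same 32-bit value.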
MODiX · 3mo ago
If you have no further questions, please use /close to mark the forum thread as answered
