Finding duplicate pictures in a database quickly.
Hi! I'm working with a MongoDB database in C#. My web app has millions of pictures, and I want to add the ability to find possible duplicates and similar pictures. I tried pHash, but it works badly on cropped pictures. I also read about CBIR (content-based image retrieval) and the Fast Hough Transform, but I don't truly understand how to implement them. I use OpenCV to check pictures for duplicates, and that works. The hardest problem is finding duplicates fast, without checking every picture for similarity. Is there a way to solve this in C# without machine learning? I read about k-means and LSH, but I also don't understand how to implement them with my database and my pictures' descriptors. Thanks in advance!
8 Replies
so you want to solve the duplication problem beforehand by having a "simple" hash that sort of represents the content of the image
that's a pretty tough problem, and if you are not familiar with heavy math-based algorithms it's even tougher
i don't know much about this, but i think there are many ways to approach the problem
for example i wouldn't think of a single algorithm to detect every duplicate, i would instead try to see if there are really heterogeneous smaller groups of image categories that i could be certain are independent (and we could say lsh could be part of this), so that in the end you have many small and more tractable problems instead of a single giant one
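to make the "many small problems" idea a bit more concrete, here is a rough sketch of LSH with random hyperplanes. it assumes every picture can be summarized as one fixed-length feature vector (that summary is an assumption, it is not something from this thread), and the names `make_hyperplanes`, `lsh_key` and `build_buckets` are made up for illustration. it is plain Python for brevity, just loops and a dictionary, so it should port to C# fairly directly.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    key = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key

def build_buckets(vectors, hyperplanes):
    """Group image ids by LSH key; only ids sharing a key get compared later."""
    buckets = {}
    for image_id, vec in vectors.items():
        buckets.setdefault(lsh_key(vec, hyperplanes), []).append(image_id)
    return buckets
```

vectors that point in nearly the same direction tend to get the same key, so the expensive comparison only has to run inside one bucket. the key is a plain integer, so it could be stored as an extra indexed field next to each picture. in practice people usually keep several independent tables with different hyperplanes, because two near-duplicates can still end up on opposite sides of one plane.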
i can think of many ways to do this, for example analysis on colors, contrast, shapes
but then again it depends a lot on your data, without analysis you can't predict much
for example if you have to tell images apart from their negatives, that can exclude some techniques (for example if the image doesn't carry metadata that would tell you it's a negative)
i guess i would try quadtrees to get a sort of compression of the image, it could be a start for getting a hash of each one, and then compare the hashes within a certain α of certainty, say 95%, maybe less
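a minimal sketch of what that could look like, under some loud assumptions: the image is already a grayscale 2D list of numbers, its width and height are at least `2 ** depth`, and the quadtree is simply split to a fixed depth (a real quadtree would usually stop splitting uniform regions early). the function names are hypothetical, and it is plain Python only to keep it short.

```python
def region_mean(pixels, x0, y0, x1, y1):
    """Mean brightness of the rectangle [x0, x1) x [y0, y1)."""
    total = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            total += pixels[y][x]
    return total / ((x1 - x0) * (y1 - y0))

def quadtree_leaves(pixels, x0, y0, x1, y1, depth):
    """Split the region into four quadrants `depth` times, return leaf means."""
    if depth == 0:
        return [region_mean(pixels, x0, y0, x1, y1)]
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    quadrants = ((x0, y0, mx, my), (mx, y0, x1, my),
                 (x0, my, mx, y1), (mx, my, x1, y1))
    leaves = []
    for ax, ay, bx, by in quadrants:
        leaves += quadtree_leaves(pixels, ax, ay, bx, by, depth - 1)
    return leaves

def quadtree_hash(pixels, depth=3):
    """One bit per leaf: is the leaf brighter than the average leaf?"""
    height, width = len(pixels), len(pixels[0])
    leaves = quadtree_leaves(pixels, 0, 0, width, height, depth)
    overall = sum(leaves) / len(leaves)
    return [1 if mean > overall else 0 for mean in leaves]

def similarity(hash_a, hash_b):
    """Fraction of matching bits (1.0 means identical hashes)."""
    same = sum(1 for a, b in zip(hash_a, hash_b) if a == b)
    return same / len(hash_a)

def is_probable_duplicate(hash_a, hash_b, threshold=0.95):
    """The 95% cut-off from the message above; tune it on real data."""
    return similarity(hash_a, hash_b) >= threshold
```

a uniformly brightened copy keeps the same hash, while a negative of the same picture flips nearly every bit and scores close to 0, so negatives would need separate handling. like pHash, a global hash of this kind is probably still sensitive to cropping.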
but it would require manual work for sure
i don't even know how you could approach this without being an expert or at least knowledgeable in the field
I thought about clustering the descriptors of each picture with k-means to create about 1000 groups, and storing every picture with one more field, its group, which I'll use as an index. Then when I add a new picture, my algorithm will compute its descriptors and use k-means to find its group. I'll take all pictures of this group and compare their descriptors with the descriptors of the new one, like I did in the screenshot above.
But yeah... all in all it's a hard task, and it's hard to categorize them by size or colors or shapes, because in my application users can add any kind of picture with different colors and shapes
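The group-as-index plan could be sketched roughly like this. It assumes each picture is reduced to a single fixed-length vector before clustering (how to get that vector from a variable number of descriptors is left open here), and the names `kmeans`, `nearest` and `candidates` are hypothetical. It is plain Python to keep it short; the same loops should translate to C# without special libraries.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the closest centroid; this is the picture's group id."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm, starting from k randomly chosen vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for v in vectors:
            members[nearest(v, centroids)].append(v)
        for i, group in enumerate(members):
            if group:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def candidates(new_vec, centroids, index):
    """Ids stored under the new picture's group; only these get compared."""
    return index.get(nearest(new_vec, centroids), [])
```

Here `index` stands in for the database: a mapping from group id to picture ids, which in MongoDB would be the extra indexed group field. One thing to watch: two near-duplicates close to a cluster border can land in different groups, so it may be worth also checking the two or three nearest centroids.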
I am not experienced in this topic but Azure AI services might help https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-image-analysis
i wouldn't really say strictly "categorizing" by shapes or colors, more like indexing, or finding common properties to separate images by, to reduce the size of the giant mixed set of images you have to process
azure doesn't work in my country🫠
anyway somehow i need to find something in common between duplicates😵‍💫
that's why i think you have to analyze the data and get an idea of what you are working with
try extracting some properties and see if they can work as indexes (if they are selective enough), then decide
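one cheap way to check whether an extracted property is any good as an index is to look at how it spreads the pictures over buckets. this is only a sketch: `keys` is assumed to be the list of property values computed for a sample of pictures (one value per picture), and `index_selectivity` is a made-up helper name.

```python
from collections import Counter

def index_selectivity(keys):
    """Summarize how well a candidate index key splits the collection."""
    counts = Counter(keys)
    n = len(keys)
    return {
        # how many distinct key values appear
        "buckets": len(counts),
        # share of all pictures that sit in the single largest bucket
        "largest_share": max(counts.values()) / n,
        # average bucket size seen by a picture picked at random
        "expected_candidates": sum(c * c for c in counts.values()) / n,
    }
```

if `largest_share` is close to 1, or `expected_candidates` is close to the size of the sample, the property barely narrows anything down and is not worth indexing on by itself.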
I tried some manipulations with the descriptors, like finding some middle (average) descriptor or something else to feed k-means, but none of it works. I sent an email to TinEye, maybe they'll help
purchase a US-based VPS, expose the API outside 😄
payment could be hard anyways :/