
Not an expert, but I think you can just hash using an image hashing algorithm and bucket by hash. Should be linear in time/space.

You can then either have a human check collisions, or just accept a false positive rate and move on.

This is a decent write-up of someone doing this on a (smaller) dataset:

https://towardsdatascience.com/detection-of-duplicate-images...



This doesn't really work if the image is a meme that has been scaled and converted millions of times.


Perceptual hashes are reasonably robust to small perturbations. Scaling and converting shouldn’t be a problem, though applying filters that change the colors or blur the lines might.

If it’s been changed enough that it hashes to a different value, then it might be reasonable to treat it as a different image. At some point a human is also going to say “that’s not the same.” You can always change your hashing algorithm if you find it’s missing too many dupes.

Regardless, for the domain we’re talking about (deduping training data), a few false negatives should be acceptable.


If it's more than a few, though, you end up with overfitting.


Why not? Many meme variants end up in the same bucket, all colliding. They are just as much duplicates as any random rare images that collide.

A large bucket can also be inspected by humans, or split up by applying more perceptual hash functions and decreasing the tolerance, though in this case that would be counterproductive cheating.
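One way to sketch the "cut up a large bucket" idea: re-cluster the bucket's members greedily under a stricter bit-distance. The function name and the `(name, hash)` input shape are hypothetical, assuming 64-bit perceptual hashes as above.

```python
def split_bucket(items, max_distance=2):
    """Greedy re-clustering of one coarse bucket: each item joins the first
    sub-cluster whose representative hash is within max_distance bits,
    otherwise it starts a new sub-cluster. `items` is a list of
    (name, hash_int) pairs (an assumed shape for this sketch)."""
    clusters = []  # list of (representative_hash, [names])
    for name, h in items:
        for rep, names in clusters:
            if bin(rep ^ h).count("1") <= max_distance:
                names.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [names for _, names in clusters]
```

Decreasing `max_distance` splits the bucket more finely; at distance 0 it degenerates back to exact-hash dedup.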



