Perceptual hashes are reasonably robust to small perturbations. Scaling and converting shouldn’t be a problem, though applying filters that change the colors or blur the lines might.
If it’s been changed enough that it hashes to a different value, then it might be reasonable to treat it as a different image. At some point a human is also going to say “that’s not the same.” You can always change your hashing algorithm if you find it’s missing too many dupes.
Regardless, for the domain we’re talking about (deduping training data), a few false negatives should be acceptable.
Why not? Many meme variants end up in the same bucket, all colliding, even though they're no more duplicates of each other than random unrelated images are. A large bucket can also be inspected by humans, or split up further by applying additional perceptual hash functions and tightening the tolerance, but in this case that would be counterproductive.
You can then either have a human check collisions, or just accept a false positive rate and move on.
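To make the bucket-and-tolerance idea concrete, here's a minimal sketch of dedup via an average hash. It's pure stdlib and assumes images arrive as 2D grayscale pixel grids (lists of lists of 0–255 ints); a real pipeline would use a library like `imagehash` on decoded files, and the `tolerance=4` cutoff is an arbitrary illustrative choice, not a recommendation.

```python
def average_hash(pixels, size=8):
    """Downsample to size x size by block means, then threshold on the mean.

    Returns a 64-bit int (for size=8); similar images give nearby hashes.
    Assumes len(pixels) and row length are divisible by `size`.
    """
    h, w = len(pixels), len(pixels[0])
    bh, bw = h // size, w // size
    blocks = [
        sum(pixels[y][x]
            for y in range(r * bh, (r + 1) * bh)
            for x in range(c * bw, (c + 1) * bw)) / (bh * bw)
        for r in range(size) for c in range(size)
    ]
    mean = sum(blocks) / len(blocks)
    bits = 0
    for b in blocks:
        bits = (bits << 1) | (1 if b > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe(images, tolerance=4):
    """Keep one image per near-duplicate group.

    Hashes within `tolerance` bits of a kept image count as duplicates;
    raise the tolerance to catch more dupes (and more false positives).
    """
    kept = []  # list of (hash, image)
    for img in images:
        h = average_hash(img)
        if all(hamming(h, kh) > tolerance for kh, _ in kept):
            kept.append((h, img))
    return [img for _, img in kept]
```

Small perturbations (brightness shifts, mild recompression) barely move the hash, so they land within tolerance, while a structurally different image lands many bits away; filters that redraw lines or recolor heavily can push a true dupe past the cutoff, which is the false-negative case discussed above.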
This is a decent write-up of someone doing this on a (smaller) dataset:
https://towardsdatascience.com/detection-of-duplicate-images...