
Not an expert, but I think you can just hash using an image hashing algorithm and bucket by hash. Should be linear in time/space.

You can then either have a human check collisions, or just accept a false positive rate and move on.

This is a decent write-up of someone doing this on a (smaller) dataset:

https://towardsdatascience.com/detection-of-duplicate-images...



This doesn't really work if the image is a meme that has been scaled and converted millions of times.


Perceptual hashes are reasonably robust to small perturbations. Scaling and converting shouldn’t be a problem, though applying filters that change the colors or blur the lines might.

If it’s been changed enough that it hashes to a different value, then it might be reasonable to treat it as a different image. At some point a human is also going to say “that’s not the same.” You can always change your hashing algorithm if you find it’s missing too many dupes.

Regardless, for the domain we’re talking about (deduping training data), a few false negatives should be acceptable.


If it's more than a few, though, you end up with overfitting.


Why not? Many meme variants end up in the same bucket, all colliding. They are just as much duplicates as any random rare images that collide.

A large bucket can also be inspected by humans, or split up by applying more perceptual hash functions and decreasing the tolerance, though in this case that would be counterproductive cheating.
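One way to sketch the "cut up a large bucket" idea: re-cluster the bucket's members greedily under a stricter bit-distance. The function name and the `(name, hash)` input shape are hypothetical, assuming 64-bit perceptual hashes as above.

```python
def split_bucket(items, max_distance=2):
    """Greedy re-clustering of one coarse bucket: each item joins the first
    sub-cluster whose representative hash is within max_distance bits,
    otherwise it starts a new sub-cluster. `items` is a list of
    (name, hash_int) pairs (an assumed shape for this sketch)."""
    clusters = []  # list of (representative_hash, [names])
    for name, h in items:
        for rep, names in clusters:
            if bin(rep ^ h).count("1") <= max_distance:
                names.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [names for _, names in clusters]
```

Decreasing `max_distance` splits the bucket more finely; at distance 0 it degenerates back to exact-hash dedup.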



