If by "computationally heavy" you're talking about the financial overhead of hir...

londons_explore · on Feb 1, 2023

At the sort of scales of these datasets, there is no way to filter by hand.

But there are lots of ways to identify near identical images algorithmically. Typically the process is to download each image, run it through a neural net to make an image embedding vector (a list of a few hundred floats). Save all those in a database. Then, for each image, if it is too close in 'embed space' to another in the database, then it is a duplicate, and should be removed.

This algorithm might catch 'duplicates' that it shouldn't, like multiple people taking photos of the eiffel tower from the same public viewpoint.

It might miss real duplicates such as an image failing to match with a collage containing the same image.

But it's still better than not removing duplicates at all...

ben_w · on Feb 1, 2023

That's an unreasonable presumption.

Deduplication, to the best of my knowledge, requires every image be compared to every other image. This is necessarily O(n^2) on n images.

IIRC the training set is 2.3 billion images, if so that's 0.5 * 5.29e18 comparisons[0], which, if done by humans, would require employing literally all humans for approximately a year even if we compared 12 images per second 24/7.

This has to be done computationally, not by humans.

[0] half because (a = b) <=> (b = a)

yojo · on Feb 1, 2023

Not an expert, but I think you can just hash using an image hashing algorithm and bucket by hash. Should be linear in time/space.

You can then either have a human check collisions, or just accept a false positive rate and move on.

This is a decent write up of someone doing this on a (smaller) dataset.

https://towardsdatascience.com/detection-of-duplicate-images...

bombolo · on Feb 1, 2023

This doesn't really work if the image is a meme that has been scaled and converted millions of times.

yojo · on Feb 1, 2023

Perceptual hashes are reasonably robust to small perturbations. Scaling and converting shouldn’t be a problem, though applying filters that change the colors or blur the lines might.

If it’s been changed enough that it hashes to a different value, then it might be reasonable to treat it as a different image. At some point a human is also going to say “that’s not the same.” You can always change your hashing algorithm if you find it’s missing too many dupes.

Regardless, for the domain we’re talking about (deduping training data), a few false negatives should be acceptable.

bombolo · on Feb 1, 2023

If it's more than a few you end up with overfitting though.

HelloNurse · on Feb 1, 2023

Why not? Many meme variants end up in the same bucket, all colliding. They are as duplicates as random rare images.

A large bucket can also be inspected by humans or cut up by applying more perceptual hash function and decreasing tolerance, but it would be counterproductive cheating in this case.

babel_ · on Feb 1, 2023

You can lower the bounds significantly. Images do not require comparison to all others, and the presented approach to detecting and deduplicating images in the paper is easily adapted to be near linear (trading for storage or cleverness).

Put simply, do you expect google image search to compare your image to every other possible image? No, they're going to embed it to a vector (512d in the paper) and only compare to probable matches; in the paper they start by brute forcing pairwise comparison of the vectors for the dataset, and then use clique finding to go faster when checking their generated images.

raincole · on Feb 1, 2023

You can roughly classify the images first (we already have very good AI model for this) and use humans as an additional layer. If two images don't get the same top-3 labels from the classifier, the chance that they're duplicates is neglectable.