At the scale of these datasets, filtering by hand is out of the question.
But there are plenty of ways to identify near-identical images algorithmically. The typical process: download each image, run it through a neural net to produce an image embedding vector (a list of a few hundred floats), and save all of those in a database. Then, for each image, check whether it sits too close in 'embed space' to another image in the database; if so, it is a duplicate and should be removed.
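A minimal sketch of that dedup loop, assuming the embeddings have already been computed and are stored as rows of a NumPy array (the 0.95 similarity threshold is an illustrative choice, not a recommendation):

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.95):
    """Return indices of images to keep, dropping any whose embedding has
    cosine similarity above `threshold` with one already kept.

    This is an O(n^2) pairwise scan for clarity; real pipelines at this
    scale use an approximate nearest-neighbor index instead.
    """
    # Normalize rows so a plain dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(normed)):
        if all(np.dot(normed[i], normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy example: four 4-d "embeddings"; index 1 is a near duplicate of index 0.
embs = np.array([
    [1.00, 0.00, 0.0, 0.0],
    [0.99, 0.01, 0.0, 0.0],  # near duplicate of the first
    [0.00, 1.00, 0.0, 0.0],
    [0.00, 0.00, 1.0, 0.0],
])
print(dedup_by_embedding(embs))  # → [0, 2, 3]
```

The greedy "keep the first, drop later near-matches" policy means the result depends on the order images are scanned; that is usually acceptable for dedup, since either copy of a pair is fine to keep.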
This algorithm might catch 'duplicates' that it shouldn't, like multiple people photographing the Eiffel Tower from the same public viewpoint.
It might also miss real duplicates, such as when an image fails to match a collage that contains it.
But it's still better than not removing duplicates at all...