I work at Ghostery. Yes, Cliqz bought Ghostery for the Human Web data, since we ...

ThePhysicist · on March 4, 2021

I'm not saying it's not anonymous, just that it's impossible to assert the anonymity.

Also, I've seen a lot of "anonymous" clickstream data offered by other companies that was often trivial to de-anonymize. We did a DEF CON 25 talk about it; just google "Dark Data DEF CON 25". Robustly anonymizing high-dimensional data like user clickstreams is practically impossible: knowing a combination of 4-7 websites a user regularly visits is often enough to identify them in a pool of millions of users (see the talk for details). So I'm highly doubtful about any company that claims it can robustly anonymize such data. If you're confident your data is anonymous, why not release a large sample and have researchers look at it?
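To make the "4-7 websites" point concrete, here is a minimal sketch that counts how many users own at least one combination of k sites that nobody else shares. The data and the function name `unique_fraction` are made up for illustration; this is not the analysis from the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: pseudonymous id -> sites the user regularly visits.
visits = {
    "u1": {"news.example", "mail.example", "bank-a.example", "forum.example"},
    "u2": {"news.example", "mail.example", "bank-b.example", "forum.example"},
    "u3": {"news.example", "mail.example", "bank-a.example", "shop.example"},
    "u4": {"news.example", "video.example", "bank-b.example", "shop.example"},
}

def unique_fraction(visits, k):
    """Fraction of users with at least one k-site combination no other user has."""
    counts = Counter()
    for sites in visits.values():
        # Sort so the same set of sites always yields the same tuple key.
        counts.update(combinations(sorted(sites), k))
    identifiable = sum(
        any(counts[combo] == 1 for combo in combinations(sorted(sites), k))
        for sites in visits.values()
    )
    return identifiable / len(visits)

for k in (1, 2):
    print(k, unique_fraction(visits, k))
```

In this toy set a single site singles out one user in four, while any two known sites are already enough to single out everyone. Real clickstreams have far more dimensions, which is why uniqueness grows so quickly.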

So while I'm not saying Ghostery is also doing that, I don't have a lot of faith in these data collection practices in general (also, I think Ghostery collected a lot of data like cookies from its users before Cliqz acquired it). Again, it's a smart way to collect data, but I wouldn't call it very privacy-friendly.

solso · on March 4, 2021

De-anonymization is trivial if records are linkable, which is the case in the Dark Data DEF CON 25 talk you mention. Another famous case was the de-anonymization of the Netflix data set.

However, you are assuming that the HumanWeb data collection is record-linkable. It is not, precisely to avoid this attack.

Suppose what is collected is linkable, e.g. (user_id, url_1), ... (user_id, url_n). No matter how you anonymize user_id, it will eventually leak: a single URL containing personally identifiable information, e.g. a username, compromises the whole session, however sophisticated the user_id generation is. The real problem, privacy-wise, is that records can be linked to the same origin, so an attacker (or the collector) can tell whether two records have the same origin.
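A minimal sketch of that attack, assuming made-up records and a hypothetical `user=` query parameter as the leaking field. One URL that names the user is enough to attach a real name to every record sharing the same pseudonymous id.

```python
import re

# Hypothetical linkable records: every URL carries the same pseudonymous user_id.
records = [
    ("a91f", "https://social.example/settings?user=jane.doe"),
    ("a91f", "https://clinic.example/appointments"),
    ("a91f", "https://jobs.example/search?q=python"),
    ("7c02", "https://news.example/"),
]

def deanonymize(records):
    """Find ids that leak a username in any URL, then expose all URLs with that id."""
    leaked = {}
    for uid, url in records:
        match = re.search(r"[?&]user=([^&]+)", url)
        if match:
            leaked[uid] = match.group(1)
    exposed = {}
    for uid, url in records:
        if uid in leaked:
            exposed.setdefault(leaked[uid], []).append(url)
    return exposed

exposed = deanonymize(records)
```

Here `exposed` maps `jane.doe` to all three URLs of id `a91f`, including the two that contain no personal data on their own. How the id was generated never enters the attack; only the join on it does.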

The anonymization of HumanWeb, however, ensures that there is no linkability across data points, so an attacker cannot know whether two records come from the same origin. As a consequence, a URL that gives away user data, for instance a username, does not compromise all the other URLs sent by that person.
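As a rough illustration of the unlinkable case (a toy sketch, not the actual HumanWeb implementation), each URL below is sent as its own message with no shared identifier. The function names and the `user=` parameter are hypothetical, and a real deployment would also have to hide network-level signals such as IP address and timing, which this does not model.

```python
import random
import re

def send_unlinkable(urls, rng):
    """Emit each URL as an independent message with no id field, in random order."""
    messages = [{"url": url} for url in urls]
    rng.shuffle(messages)
    return messages

def leaking_messages(messages):
    """Messages whose URL itself reveals a username."""
    return [m for m in messages if re.search(r"[?&]user=", m["url"])]

rng = random.Random(0)
client_a = [
    "https://social.example/settings?user=jane.doe",
    "https://clinic.example/appointments",
]
client_b = ["https://news.example/", "https://jobs.example/search?q=python"]

# What the collector sees: a pool of messages from all clients.
pool = send_unlinkable(client_a, rng) + send_unlinkable(client_b, rng)
rng.shuffle(pool)
leaks = leaking_messages(pool)
```

The username URL still leaks by itself, but `leaks` contains only that one message, and no field in the pool says which of the other messages came from the same client.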

If you are interested in more details, I recommend this article: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

[Disclaimer: I'm one of the authors]

ThePhysicist · on March 4, 2021

I still see a lot of ways in which users could be de-anonymized: sometimes a single URL is already sufficient, and side channels like the quorum mechanism might leak information as well. Maybe it's really anonymous, but personally I don't trust any mechanism that doesn't come with a statistical anonymity guarantee. Differential privacy is my preferred one, as it's the only anonymity model that hasn't been broken yet.
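For anyone unfamiliar with what such a guarantee looks like in practice, here is a minimal sketch of the Laplace mechanism applied to a counting query (say, how many users visited a given URL). The function name `dp_count` and the numbers are illustrative assumptions, not something Cliqz or Ghostery is known to use.

```python
import random

def dp_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Assumes one user changes the count by at most 1 (sensitivity 1),
    which gives epsilon-differential privacy for this single query.
    """
    # The difference of two i.i.d. Exponential(rate=epsilon) draws
    # is Laplace-distributed with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

rng = random.Random(42)
noisy = dp_count(100, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and a stronger guarantee. The guarantee also degrades as more queries touch the same user's data, so a real system would need to track a privacy budget, which this sketch leaves out.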

Anyway, it's great that Cliqz did this work and I don't want to diminish it. I'm just very cautious when companies claim they're only collecting anonymous data; there have been too many cases in which such promises were broken.