
There's still a valid reason for concern here. For example, this implies that building a diffusion model on private data can leak that data. If I can generate a whole bunch of prompts like "MRI Joe Biden brain tumor" and one in a million times I get a consistent result, that's unacceptable.

GitHub Copilot could be leaking private code as well.
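
As a rough sketch of that kind of probe (in the spirit of the paper's clique test, not their exact method): sample the same prompt many times and flag it if the generations keep converging on near-identical images. `generate` here is a hypothetical hook for whatever model you're testing, and the thresholds are arbitrary:

    import numpy as np

    def looks_memorized(generate, prompt, n=64, thresh=0.05):
        """Sample `prompt` n times; flag it if many of the
        generations are near-duplicates of each other (a
        memorization signal)."""
        # generate(prompt) is assumed to return an image as a
        # float array in [0, 1]; flatten before comparing.
        samples = [np.asarray(generate(prompt), dtype=float).ravel()
                   for _ in range(n)]
        close_pairs = 0
        for i in range(n):
            for j in range(i + 1, n):
                # mean absolute pixel distance between two generations
                d = np.mean(np.abs(samples[i] - samples[j]))
                if d < thresh:
                    close_pairs += 1
        # many near-duplicate pairs => the model keeps producing
        # (roughly) the same image for this prompt
        return close_pairs > n  # crude cutoff; tune for your setting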



It seems the only way they can get an extraction is if the training image is highly duplicated (relative to the average), which cuts against the privacy concern.
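
One way to sanity-check that on your own training set is to count near-duplicates with a perceptual hash before training. A minimal sketch using the imagehash library (the directory and glob pattern are placeholders):

    from collections import Counter
    from pathlib import Path

    import imagehash
    from PIL import Image

    def duplication_counts(image_dir):
        """Perceptual-hash every image and count how often each
        hash repeats; heavily duplicated images are the ones the
        paper finds easiest to extract."""
        counts = Counter(
            imagehash.phash(Image.open(p))
            for p in Path(image_dir).glob("*.jpg")
        )
        # hashes seen more than once mark near-duplicate clusters
        return {h: c for h, c in counts.items() if c > 1}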


This thread shows that there are outliers: images that are not duplicated in the training set but still show up in results.

https://twitter.com/alexjc/status/1620466058565132288

Specifically, this post confirms cases of a single image in the training set being "memorized":

https://twitter.com/Eric_Wallace_/status/1620475626611421186...


Ah, I see the section on this in the paper now: the second half of 7.1, on page 14.



