Extracting training data from diffusion models (arxiv.org)
163 points by ericwallace_ucb on Jan 31, 2023 | 309 comments


See also https://twitter.com/Eric_Wallace_/status/1620449934863642624. (Thanks to all who posted that. We merged the threads now.)


The last author's tweet thread and replies have some interesting tidbits: https://twitter.com/Eric_Wallace_/status/1620449934863642624

* "We propose to extract memorized images by generating many times with the same prompt and flagging cases where many of the generations are the same."

* "- Diffusion models memorize more than GANs - Outlier images are memorized more - Existing privacy-preserving methods largely fail"

* "Stable Diffusion is small relative to its training set (2GB of weights and many TB of data). So, while memorization is rare by design, future (larger) diffusion models will memorize more."

* "It only memorizes a very small subset of the images that it trains on."

* "our goal is to show that models can output training images when generating in the same fashion that normal users do."


> * "It only memorizes a very small subset of the images that it trains on."

An interesting question here would be: why does it memorise these images over others? Can the other images still be synthesised, with some loss, via a suitable prompt? If so, are the memorised images important for this? Can this set be reduced further?


100 images out of 350,000 that they looked at were memorized.

This seems to mostly happen when an image appears frequently (more than 100 times) in the training data, and/or the dataset is small relative to the model.


Oh come on. I'm excited about new technologies and I think that image generation can be a net positive for society, but can we not do that? First, we had people confidently asserting that stuff of this sort absolutely can't happen. Now, we're moving the goalposts to "well, not a legitimate criticism because it doesn't happen often".

The point is that basically all Stable Diffusion / DALL-E / MidJourney output is some shade of this; the only new data is that contrary to prior assertions, in some cases, it goes all the way to a verbatim copy.

I think there are some defensible stances one can take. One is to reject the idea of intellectual property. Another is to advocate for some specific legal or technical bar that the models would have to pass for it to qualify as "not stealing". Yet another is to argue it's a morally-agnostic technology like VHS or a photocopier, and the burden of using it in a socially acceptable way rests with the user.


> Oh come on. I'm excited about new technologies and I think that image generation can be a net positive for the society, but can we not do that?

What, summarize the submission? This is straight quoted from the link.

> The point is that basically all Stable Diffusion / DALL-E / MidJourney output is some shade of this

Yes, and that point is mistaken, or so generic as to be worthless. The network memorizes art for the same reason humans memorize art: because there are some art pieces we see so often that we can recall them easily.

Ask an artist to duplicate Starry Night or Scream from memory, and you'll probably get at least a passable imitation. The more capable the artist, the more faithful it will be.

We know that SD can be made to plagiarize, given repeated training on a specific image. (This is just to say that a neural network can learn to regurgitate a sample, a capability that was not ever in question.) This is a far cry from the assertions that its art is generally plagiarized.


A human artist is, however, not a machine built and owned by a corporation, who will draw anything you tell them to.

A human artist has been trained in the ethics and laws of their craft along with the skills required to make images.

A human artist, asked to clone Starry Night, will ask you what you are doing, and knows where the lines are between "a tribute", "plagiarism", and "outright forgery".

A human artist, asked to do work in the style of another artist, will have a certain respect for the other artist's ownership of their style. This is not a thing that is at all protected by intellectual property law but it is still a thing artists are trained to respect. There are exceptions - drawing just like your boss may be your job, drawing just like a living artist for a couple pieces is a useful way to break down their style and take a few parts of it to influence later work without going over the "style swipe" line, building your own work on the obvious foundation of an influential, dead artist's style is fine - but there are lines professionals will be very reluctant to cross.

For a relatively recent example of what happens if you break these unwritten laws, check out what happened when the American cartoonist Keith Giffen started doing a wholesale style swipe from the Argentinean cartoonist Muñoz: https://en.wikipedia.org/wiki/Keith_Giffen#Controversy

Neural networks know none of these unwritten rules. Neither do the people who are training them. Feed it a bunch of work generated by a living artist and start making a profit off of that? Sure, no problem! Bonus points if your response to them getting pissed off about this is to call them a luddite who is resisting the inevitable, and should throw away a lifetime of passionate training and go get a new job.


Well we’re in luck! The human artist is the person who is coming up with the prompt and then deciding what to do with the resulting image based on their own ethics. Stable Diffusion is just a tool for artists, not a “computer artist”.

Funny enough, this notion in terms of liability pairs very well with our legal system!

Even funnier, this notion in terms of authorship pairs very well with modern and contemporary art theory!

And here’s some very relevant precedent in both a legal and artistic sense:

https://www.artnews.com/art-in-america/features/richard-prin...


Richard Prince is a very rich jerkass who can afford to hire some very high-priced lawyers to argue that his work is right over the line of "sufficiently transformative" in the eyes of the law. He's well over the lines of the unwritten rules of professional artists that I'm talking about.

The corporations engaging in massive abuse of the grey areas of fair use to build these systems are, functionally, also very rich jerkasses who can afford to hire some very high-priced lawyers to make similar arguments.


The human giving the prompt is shopping, not creating. The only way a computer can compute an image is by copying the vectors of imagery made by humans and then labeled and plotted and fed to said computer. Humans paint, draw, do physical things to make art. Computers are fed plotline statistics that they can pull up to fulfill a shopping list. Humans generated the imagery, uploaded the imagery, labeled the imagery, and wrote the code to manipulate the imagery. Now, I'm not a big proponent of copyright. If someone can paint just like Rembrandt, then I applaud their skill. At the end of the day, 'AI' is not 'learning', it is replicating input data by filtering a large number of data sets with appropriate labels (affixed by humans) and using a statistical algorithm to approximate somebody's shopping list of art they wish they had the skill to create. I'll take the Rembrandt forgery, thank you.


Sure, I guess we can make the argument that someone who makes music out of samples or loops is just shopping as well. How about programming a drum machine? Aren't they just shopping for snare hits and quantization patterns?

So fine, whatever, call it all shopping! Dr. Dre is just out there shopping. I don't think he minds if that's what you call it.

So here's the thing. Using a drum machine and sampling old funk songs from the 70s doesn't mean you end up with The Chronic. It's more likely that the average person ends up making something pretty mediocre. Hey, it would sure have sounded really impressive if it had been released in the 1940s, but with mass produced commercial music hardware a funky drum beat is just not that special any more.

The same applies to any kind of commodity tool. It's what the artist does with the tool in the context of a world filled with an audience and other artists.

Ok, time for my opinions! I think that DeviantArt-style digital paintings are total trash. I don't care about the skill in rendering cliches hanging off of comically large breasts. Oh, it took a long time? I'm sure it would take a long time to dig a 10 foot deep hole in your backyard and then fill it up again as well, something I'd much rather experience as art than some video game hallucination... but you know what? Just because I don't like it doesn't mean that it isn't art, that it wasn't made by a "real" artist, or that there isn't some audience that appreciates it (even if they're only two more YouTube videos away from becoming full-blown incels and driving trucks through high school track meets).

Richard Prince speaks to me about authorship and what it means to make art in a world completely saturated with commercial imagery and part of that meaning comes from the fact that he had one of his assistants draw on top of someone else's photograph.

Listen, I have my issues with the world of fine art, the market manipulation, collusion, and the general fact that the art is primarily being made for the people who could afford it and not like, my neighbor. Regardless, I've found a lot of intellectual stimulation and new ways of appreciating aesthetic beauty through the works of 20th century modernists and postmodernists. Deeper meaning as opposed to a big shiny sword and a short skirt.

My favorite form of visual art is the watercolor, done quickly and out in public, capturing what the artist sees in the moment. It's the visual equivalent of the folk song played on an acoustic guitar. I like when the artist is a friend. I don't care if it isn't Rembrandt. I care that it moves me.

Stable Diffusion can easily render cliches hanging off of comically large breasts. In fact, I think that's what 90% of SD GPU cycles are currently working on. So to me Stable Diffusion is good at the part of the art that I'm not really that interested in. I'm interested in why the person chose the image that they did given these tools. That's where the meaning comes from! I mean, these tools run into the same problems as drum machines and samplers... pretty soon their mediocre outputs become trite and unexpressive. I would imagine that artists that use SD do so in ways and using techniques that are not just the click of a button.


Goddamn, dude, you sure have a big hate-on for the work of the honest pornographer. Who is, in fact, primarily making art for, like, your neighbor.


> The point is that basically all Stable Diffusion / DALL-E / MidJourney output is some shade of this; the only new data is that contrary to prior assertions, in some cases, it goes all the way to a verbatim copy.

This is absolutely not the point of the linked paper. It may be something you believe, but you're on the hook for providing evidence for it; this paper does not.


> and the burden of using it in a socially acceptable way rests with the user.

Or, you know, legislation. I'm kind of sick of everything being offloaded as a responsibility of the end-user as an excuse to externalize costs.

Plus, in this case it's not even like VHS or a photocopier, it's more like the printing press or the Jacquard loom: those with capital to invest in it benefit the most, at the expense of individuals being exploited.


I’d prefer the liability being on the end-user.

Tools that are a burden to use, like tools that produce too many infringing works, are not going to sell as well as those that are not a burden to use.

This means that if someone makes a tool like this that also alerts the user of likely infringement it would perform better in a corporate, risk-averse marketplace.


> Yet another is to argue it's a morally-agnostic technology like VHS or a photocopier, and the burden of using it in a socially acceptable way rests with the user.

There is a lot of case law that supports this interpretation, Sony v Universal being the most important as it introduced the notion of “commercially significant non-infringing uses”, of which there will be many by the time this hits trial and the appeals process.

Lawyers for these companies know this and the faster that can get people building and buying tools that are clearly non-infringing the more likely that these models are seen as fair-use.

However, if they want to keep the lawyers at their customer’s firms happy they will really need to come up with a way to show users if a work is likely to infringe on an existing work.


Isn't the solution just to check for existing images - train an AI that can tell if an image came from your dataset, and don't use the generated picture if it matches?

Generating copyright images isn't a problem. Using them to make money is.


If the model contains a copy of Mickey Mouse inside it, distributing the model is likely against the US copyright law as lobbied by Disney.


Seems like it would be very simple to prevent this from happening by just adding a perceptual hash check[1] for collisions vs the training set before emitting an image. I'm sure someone would be smart enough to make a "perceptual cuckoo hash" type thing so it would be very fast at the expense of sometimes erring on the side of caution and not emitting an image that actually isn't the same as the training data.

I don't know why they didn't do this tbh.

[1] https://en.wikipedia.org/wiki/Perceptual_hashing
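As a rough sketch of what that check could look like (a minimal sketch, assuming the training-set hashes are precomputed offline; the imagehash library is real, but the threshold and training_image_paths are placeholders of mine, not anything Stability has published):

  import imagehash
  from PIL import Image

  # One-off, offline: perceptual hash of every training image.
  # training_image_paths is a hypothetical list of file paths.
  training_hashes = {imagehash.phash(Image.open(p)) for p in training_image_paths}

  def too_close_to_training(generated, max_distance=8):
      """True if a generated PIL image perceptually matches training data."""
      h = imagehash.phash(generated)
      # imagehash overloads '-' as the Hamming distance between two hashes.
      # Linear scan shown for clarity; a real system would index the hashes.
      return any(h - th <= max_distance for th in training_hashes)

  # At generation time: if too_close_to_training(img), don't emit it.

Lowering max_distance errs on the side of caution, as described above, at the cost of occasionally suppressing images that were never in the training data.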


Yeah it seems so obvious I can't help but wonder if there's something about it that doesn't work in practice.

Like if you set it to a threshold where it's not routinely catching false positives, it doesn't catch the original images either because they're still too subtly different or something.

Maybe they really did just not bother though. Heck, maybe they want to test it in court or something.


Two possible reasons, off the top of my head.

First, it probably wouldn't work. Stable Diffusion is going to be a lot better at telling whether two images are the same than a perceptual hash check. E.g. I bet if you removed all pictures of the Mona Lisa from the training data, it could still produce a pixel perfect copy of it, just from the many times it appeared in the background of pictures, or under weird adjustments or lighting that fooled the perceptual hash.

Second, I would guess they wanted common images to appear in the training data more than once. It should be trained on the Mona Lisa more than someone's snapchat of their dinner. Common images are more salient.

Reminds me of asking ChatGPT for an obfuscated C contest entry. It produced one verbatim, though it couldn't explain what it did (it looked like syntactic nonsense, but it printed the Twelve Days of Christmas when run). I can only imagine it saw that inscrutable sequence of characters enough times that it memorized it.


> Stable Diffusion is going to be a lot better at telling whether two images are the same than a perceptual hash check.

Why would it be? Stable Diffusion is a text-to-image model; it's not at all focused on determining whether images are the same.

Secondly, I'm not proposing deduplication of the training set (I know some sibling comments have proposed this). I'm proposing a perceptual hash or similar check on the way out so that if a "generated image" is too similar to an image in the training set it gets dropped rather than returned.


I'd say the vast majority of time when working with people programming/testing I hear the statement "Hmm, I didn't think of that".


The reason it’s not implemented yet is because none of these products are being monetized yet, so there’s no incentive.

Google can legally return a copyrighted image in Image search (it’s not selling the image, it’s selling ads against search results) and this would probably fall under the same protection.

Now if stable diffusion was sold as SaaS in the way Dall-E is…


Correlations in the math domain get weird. It wouldn't surprise me in the slightest that those slight distortions you see in the "memorized" image versus the original ones, while small to humans, turn out to be staggeringly large to your perceptual hash. See the things like the little stickers you can stick to stop signs to make some neural nets decide they are people and such.

And it could go the other way too; it could be that the perceptual hashes are even "better" than humans at seeing past those distortions.

My point is that this all gets more complicated than you may think when you're trying to apply a hash designed for real images against the output of an algorithm.

And even if the hash worked perfectly, the false positive and false negative landscape would still be very likely to contain very surprising things.


You can of course just build a classifier to detect training set images and it would do very well even with subtly different images.


Maybe this would make sense for a commercial service, but I don't see why it would be shipped as part of the model itself. Maybe the training of the model can be tweaked to accommodate a new step to prevent reproductions, but I don't know enough to understand the repercussions on performance from doing that.


Because they don't care. See above comments about legislation.


Removing duplicates and very near duplicates ought to be one of the first things done to any training dataset...

I wonder why this wasn't done? Too computation heavy?


The paper itself actually goes to show that this is insufficient (but should still be done; they went from regenerating 1 in 819 to 1 in 1063 images in a small-corpus litmus test), and that diffusion is weaker to image extraction attacks than GANs (and seems to be weaker the better it gets).

As to why? Lack of care and vigilance, most likely. If they're willing to spend millions of GPU hours training some of these big models, then this wouldn't be a huge cost, and it has near-linear complexity over the dataset and parallelises trivially. Sadly, vigilance is a finite resource, and that kind of data cleaning is often dismissed in favour of doubling down on the training and assuming it'll be big enough to handle it (however, as has been shown, that's not the case).

(Edit: thanks FeepingCreature. The deduplication test used for the numbers above was a separate test in the paper for just this, not indicative of their extraction attack in general, with a small corpus and compared a diffusion model trained on both. So I liken it to a litmus test for the efficacy of deduplication.)


SD2's dataset was deduplicated prior to training. That's why the paper is about SD1, which was a prototype model completed before Stability even had VC raises.


Note: they went from regenerating 1 in 819 to 1 in 1063 on a relatively small corpus.


Could be an overly enthusiastic data augmentation bug


The open source community work in ML tends to be terrible at the science and rigor parts of it. You see tons of lazy “we recreated an open version of X” that fails to be remotely as well done as the peer reviewed version, even from the larger community groups.


If by "computationally heavy" you're talking about the financial overhead of hiring another layer of human filters to filter duplicates out of the dataset.


At the sort of scales of these datasets, there is no way to filter by hand.

But there are lots of ways to identify near-identical images algorithmically. Typically the process is to download each image and run it through a neural net to make an image embedding vector (a list of a few hundred floats). Save all those in a database. Then, for each image, if it is too close in 'embed space' to another in the database, it is a duplicate and should be removed.

This algorithm might catch 'duplicates' that it shouldn't, like multiple people taking photos of the Eiffel Tower from the same public viewpoint.

It might miss real duplicates such as an image failing to match with a collage containing the same image.

But it's still better than not removing duplicates at all...
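As a minimal sketch of that loop (assuming an embed() function, e.g. a CLIP image encoder, that returns unit-normalized vectors; the 0.97 threshold is an arbitrary placeholder):

  import numpy as np

  def dedupe(images, embed, threshold=0.97):
      """Keep only images whose embedding isn't too close to one already kept."""
      kept, kept_vecs = [], []
      for img in images:
          v = embed(img)  # unit-normalized embedding, shape (d,)
          # For unit vectors, the dot product is the cosine similarity.
          # Brute force for clarity; at billions of images you'd use an
          # approximate nearest-neighbour index instead.
          if not kept_vecs or float(np.max(np.stack(kept_vecs) @ v)) < threshold:
              kept.append(img)
              kept_vecs.append(v)
      return kept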


That's an unreasonable presumption.

Deduplication, to the best of my knowledge, requires every image be compared to every other image. This is necessarily O(n^2) on n images.

IIRC the training set is 2.3 billion images; if so, that's 0.5 * 5.29e18 ≈ 2.6e18 comparisons[0], which, if done by humans, would require employing literally all humans for approximately a year even if we compared 12 images per second 24/7.

This has to be done computationally, not by humans.

[0] half because (a = b) <=> (b = a)


Not an expert, but I think you can just hash using an image hashing algorithm and bucket by hash. Should be linear in time/space.

You can then either have a human check collisions, or just accept a false positive rate and move on.

This is a decent write up of someone doing this on a (smaller) dataset.

https://towardsdatascience.com/detection-of-duplicate-images...
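For the bucketing itself, a toy sketch (the hash choice and the keep-one-representative policy are my assumptions):

  from collections import defaultdict
  import imagehash
  from PIL import Image

  buckets = defaultdict(list)
  for path in image_paths:  # hypothetical list of training image paths
      key = str(imagehash.phash(Image.open(path)))  # bucket by perceptual hash
      buckets[key].append(path)

  # Buckets with more than one member are candidate duplicate groups:
  # keep one representative, or hand the bucket to a human reviewer.
  dupes = {k: v for k, v in buckets.items() if len(v) > 1}

One pass over the data, so linear time as described; the trade-off is that near-duplicates whose hashes differ slightly land in different buckets and are missed.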


This doesn't really work if the image is a meme that has been scaled and converted millions of times.


Perceptual hashes are reasonably robust to small perturbations. Scaling and converting shouldn’t be a problem, though applying filters that change the colors or blur the lines might.

If it’s been changed enough that it hashes to a different value, then it might be reasonable to treat it as a different image. At some point a human is also going to say “that’s not the same.” You can always change your hashing algorithm if you find it’s missing too many dupes.

Regardless, for the domain we’re talking about (deduping training data), a few false negatives should be acceptable.


If it's more than a few you end up with overfitting though.


Why not? Many meme variants end up in the same bucket, all colliding. They are as duplicates as random rare images.

A large bucket can also be inspected by humans or cut up by applying more perceptual hash functions and decreasing tolerance, but it would be counterproductive cheating in this case.


You can lower the bounds significantly. Images do not require comparison to all others, and the presented approach to detecting and deduplicating images in the paper is easily adapted to be near linear (trading for storage or cleverness).

Put simply, do you expect google image search to compare your image to every other possible image? No, they're going to embed it to a vector (512d in the paper) and only compare to probable matches; in the paper they start by brute forcing pairwise comparison of the vectors for the dataset, and then use clique finding to go faster when checking their generated images.
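A sketch of that "compare only to probable matches" step with a vector index (faiss here; the 512-d size matches the paper's description, but everything else, including the similarity cutoff, is a placeholder of mine):

  import faiss
  import numpy as np

  d = 512
  # all_embeddings is a hypothetical (n, d) matrix of image embeddings.
  vecs = np.ascontiguousarray(all_embeddings, dtype=np.float32)
  faiss.normalize_L2(vecs)       # unit vectors: inner product = cosine

  index = faiss.IndexFlatIP(d)   # exact index, shown for clarity; at
  index.add(vecs)                # billions of rows use e.g. IndexIVFPQ

  # Each image only gets compared against its k nearest neighbours:
  sims, ids = index.search(vecs, 10)
  # sims[i, 1:] above some cutoff (say 0.95) flags probable duplicates of i.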


You can roughly classify the images first (we already have very good AI models for this) and use humans as an additional layer. If two images don't get the same top-3 labels from the classifier, the chance that they're duplicates is negligible.


Seems to be no different than how people memorize "facts".

I'm watching my daughter learn shapes right now. When I see a square, I call it a square and look for a matching "square" piece. It seems when my daughter sees a square, she is still looking at its properties, labeling individual attributes (pointy thing here), then trying to find the same attributes on the corresponding piece. Eventually, my daughter will see a square enough that the concept will be reinforced in her mind as a square.

I wonder if a similar behavior is happening here. This specific image was presented enough times that it essentially "burned into" the AI. Rather than having to attempt to generate this image from scratch, the AI will simply pull out the canonical reference.


A hundred copyright law violations baked into a photocopier would be a lot, right? Shouldn’t we guarantee the number is zero, and not dismiss the problem even if it seems like it should be uncommon? Since the network trainers can’t control what images users produce with the network once it’s public, there’s no ability to predict how often it reproduces a copyrighted image.

> This seems to mostly happen

Sounds like duplicate images caused a minority, roughly 23%, of the cases? (My calculation: 1 - 819/1063 ≈ 0.23.) If so, then the network is memorizing single images, which we know is theoretically possible when images are sparse in the network's latent space. What needs to happen to prevent this type of memorization?


Duplicate images caused a minority of cases only in their deduplication experiment. All memorized images that were found in SD were of repeated images.

"In contrast, we failed to identify any memorization when applying the same methodology to Stable Diffusion - even after attempting to extract the 10,000 most-outlier samples."

Read: SD does not (seem to) memorize unusual and unique images (even though Imagen does).


That's all true, you're right, but what does that mean we should do? Shouldn't the models actually guarantee they can't reproduce inputs, given that in the U.S. all images are copyrighted by default? The only images that are legal to copy for training, and legal to distribute in the form of a neural network that can spit them out, are the images explicitly licensed, for example, by Creative Commons.

The Imagen outlier results validate the accidental memorization of images, even if it’s a small number. And it might be premature to conclude that one paper’s inability to find memorized outliers in SD means that it doesn’t happen. It might be true, but 10k images is less than one ten-thousandth of the training data, and it’s certainly possible that more successful attack methodologies could exist. This represents a single attempt performed under many assumptions and run on a tiny fraction of the inputs, and nothing more.

Even if Stable Diffusion doesn’t memorize outliers, or any non-duplicate images, does that matter? SD will be out of date pretty soon and replaced by another network. If they didn’t take care to prevent memorization, if SD’s memorization behavior (or lack thereof) is accidental, then how do we know it won’t happen more often in the next network? Isn’t this a problem that needs to be explicitly addressed, and not just claim it’s uncommon?


Humans don't even guarantee that. Composers not-uncommonly accidentally reproduce a melody they've heard somewhere. The only thing that can be done is diligence.


So? When humans do this they are subject to copyright laws, and they can and have been sued for accidentally reproducing a melody.

That’s also not very relevant here. We’re talking about making duplicating machines that effectively memorize pixels, not the same kind of “accident” you’re referring to.


This makes perfect sense and is expected. But you have to know, none of the images are memorized. It's a coincidence. I can explain.

Think of the analogy of linear regression. You have X,Y 2D space. X is the input (the English sentence), Y is the output (the image).

You use training data, which is a bunch of coordinates in this 2D space (each coordinate representing an <English, image> pair) to generate a best fit line (aka the machine learning model).

The properties of that line fit exactly what's going on here. That trend line will not touch most of the coordinates, but it likely will touch a few. When you take an X from the coordinate that touches the line and feed it into the equation of that line (aka model) you will get a Y (an image) that matches the coordinate Y exactly because the coordinate and the line intersect. Makes sense.

This is why an image appears to be memorized. But really it's not memorized. The equation of that line is the ONLY thing memorized here.

Why does it happen more when the image appears multiple times in the training data?

Well this is exactly what happens in least squares analysis. The line is generated from an equation that involves the averages of all the training data (aka coordinates) and if you have many samples of the same coordinate (aka image) you will skew the average and therefore the line towards touching that specific coordinate.

There is no complete technical fix for this issue. With enough training data, EVEN when you get rid of duplicates, the line will likely touch or be very close to a coordinate.

If you think about it, for a 2D coordinate system you can select a bunch of coordinates that generate a line that never touches a single coordinate. But this defeats the purpose of machine learning. You're supposed to sample data you have no knowledge about and derive a model from it.

If you can already pre-select images that form a model that never touches a coordinate, it means you already have an understanding of the model and can likely just manually tune all the weights by hand to get what you want. As regular humans, we don't have the super-human ability to do this. We can only do it for the simplest 2D example and all the higher dimensional stuff can only be understood by analogy and random sampling.

One thing you can do is just add more data to the training set. That will move the trendline and could shift it away from something it touches. But at the same time it could move it towards touching a new coordinate.
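The skewing effect described above is easy to see numerically; a toy numpy version of the analogy (all numbers arbitrary):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.uniform(0, 10, 20)
  y = 2 * x + 1 + rng.normal(0, 3, 20)      # noisy "training data"

  # Duplicate one sample 100 times, like an image repeated in the dataset.
  x_dup = np.concatenate([x, np.full(100, x[0])])
  y_dup = np.concatenate([y, np.full(100, y[0])])

  for xs, ys, label in [(x, y, "no duplicates"), (x_dup, y_dup, "duplicated")]:
      slope, intercept = np.polyfit(xs, ys, 1)   # least-squares line
      miss = abs((slope * x[0] + intercept) - y[0])
      print(f"{label}: line misses the repeated point by {miss:.2f}")

The duplicated point drags the line toward itself, so the fit reproduces it almost exactly, even though only the line's two coefficients are stored.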


Because the dual of a structure is clearly not another representation of the structure. Isomorphism means nothing. /s

Your intuition is correct, fitting for the data will lead to this arising. However, you misunderstand the interpretation. What is being fit is exactly an encoding of the dataset (a dual of it, like switching from edges of a cube to faces), so the "coincidence" is which specific images show up with extremely high similarity after similar prompting; likely meaning nothing related showed up in the testing, so it never got pushed around and settled nearby. For many images, that encoding was lossy and sufficiently moved that they are not matches to the training data, however for some they have ended up essentially encoded in these "lines" (as your analogy goes) and can therefore be reconstructed with not that much effort.


>Because the dual of a structure is clearly not another representation of the structure. Isomorphism means nothing. /s

But it's not an isomorphism. I have a function that has a range that covers every single number in existence.

Does that mean that function is A copy of every single number in existence? No. Is that function an isomorphism of every number in existence? No mathematician would use the term "isomorphism" in such a way.

You could say that if F(3) = 4, then (4, F(3)) are two things that are isomorphic. But, again, no mathematician would call F(X) isomorphic to 4.

Technically in terms of law, what this would mean is that USING the model to generate a COPY would be illegal. The model itself does not hold a copy until function application is executed. Therefore existence of the model itself IS not illegal.

To say it's illegal is to say that a copy of the Mona Lisa exists in paint and paint brushes. F(X) is the paint and paint brush. F(3) is the painting.

I will say this, I don't think most people in the jury will be technical enough to get this distinction and the technical part isn't even really important in my opinion. The laws should be made based off of what's better for society/humanity not what's technically valid.

Still my comment here is on what is technically happening which SHOULD remain completely separate from the politics surrounding it.


I mentioned isomorphism in reference to representation, which was kinda the point.

And, not to belabour the point, but a function F that has an isomorphism to the real numbers means that F does indeed encode all numbers (and, when defined as a set, exactly contains a copy of each). That's literally the point of isomorphism. To demonstrate simply, the integers have an isomorphism to the natural numbers: you count them in an alternating pattern. That's an isomorphism between them (and it need not be relation preserving). So, with that, you can interpret all integers as natural numbers; we say they're the same up to their isomorphism. Isomorphisms also exist for the real numbers with other structures, via bijective maps, and so forth.

I'm not going to respond to the other stuff since it's nothing to do with what I said or responded to (even when it was a misinterpretation).


Ok sure. This makes sense. Let's be clear so I don't get off track again. Your point is that an isomorphism means an encoding of the picture exists in the network and is therefore a copy.

My point of contention is that it's not a copy.

But imagine this scenario. The model produces a picture that is identical to a picture in the real world, but that picture was NOT part of the training data. This is certainly possible.

It's fuzzy but there comes a point where the layers of abstraction between two entities that are isomorphic become big enough such that they don't violate copyright law.


Outside the scope of the paper, but sure, it's a valid and interesting question.

If a model were to produce a pre-existing image not in its dataset, then we can strongly suspect that two behaviours have occurred (for the sake of establishing provenance).

First, it was just the ever so small, non-zero chance that it was produced by a random process that was not based on anything learnt, and within the finite resolution and some accuracy of colour, happens to match up to something already existing which, for example, may have been produced independently by a human artist - say, after the dataset was collected, but before the model was trained, such that it is emphatically a "coincidence" and nothing else, no collusion, no learning, no intelligence. Just chance.

Second, and I believe far more likely, this image has been encoded into the model by indirect learning and proxy. For example, say some notable and famous work of art is not part of your dataset (which is believable). However, say that there exists art that references this famous work in some way (as artists sometimes do), which may be up to the point of parody (or a gag), or may be some small aspect (colour scheme, notable style, outfits, lighting & composition, certain people, etc). Especially if it is what we would literally call "influential", is it not possible to reconstruct the famous work (that was not in the dataset) by indirectly learning about it from other pieces? Now, exact (or near-exact) matches are unlikely, for the most part it would probably tend more towards similarly repeating the references and parody of its dataset, but I feel this would strongly increase the likelihood of randomly producing it as in the first behaviour, and we're inflating the chances certain works are reconstructed entirely by "coincidence"; at least, as according to all observers that merely look at the binary presence of the original in the dataset.

Is this a copy? Is it not a copy? In all honesty, the notion of copy or not is insufficiently defined, because there's always some angle that can be part of it, or may not be, as most copyright law is handled on a case by case basis. Is it a copy because its provenance was an attempt to copy? Was it a copy by cosmic accident or just an independent work? Was it a copy by inferred reconstruction, thus not literal copying from the source, even if a verbatim reproduction is made? Is a vector version of a raster image a copy, and vice versa? What is "sufficiently" transformative to no longer be a copy? Does a copy of the Mona Lisa exist in the mind of a person that experiences it, while as a subject in a photograph it has grounds to not be? I'm trying to avoid anthropomorphising it and state whichever way what this is, I'd rather just illustrate how what we learn, what we experience, what we produce replicas of, and all of this is generally part of an unsolved problem/field: epistemology. Do we understand diffusion models fully? No. Therefore, weighing on one side or another should merely be for the purpose of a leading research question, or to exist on either side of a dialectic; such as in the courtroom. Perhaps we're still just too immature on this topic to really say anything about copying.

Really, the only thing we do know about copying, is that you should never get caught doing it. (For legal reasons, that was a joke.)


Paintbrush and paint function like an NN model, while brush strokes operate as input into the "paintbrush" function to output a painting.

Could you say that paint and paintbrushes encode every single painting on the face of the earth into it?

I don't think this meshes with our intuition of the word. The model requires input for the encoding to be complete. If we don't think about it this way then practically anything on the face of the earth "memorizes" everything else. A pencil encodes every sketch that has ever been drawn and will be drawn.

This is a semantic issue. What do we mean by the word "memorize" or "encoding"? Eventually the gap between two isomorphic entities becomes so big that the words memorize and encoding no longer apply. NN's are right at the border of this demarcation, but if we want to be consistent then the answer should be that NN's do not encode this information.


I believe you're mixing your metaphors, as it were. A paintbrush is not like a neural network, nor is it a function. You may describe a paintbrush in a model, and that may be as a mathematical function, but a paintbrush is not itself a function; it is a paintbrush, a real object that's part of the physical world.

On the other hand, a neural network is a function, because that is what it is constructed as, and what it is intended to perform as, and indeed what it "does". It passes the duck test for a function, and then some. A function has a domain and codomain/image, two sets it draws its inputs from and then produces output accordingly, which for the purpose of mathematical definition, are an intrinsic part of the function.

Therein, it may contain and encode exactly every single painting not merely on the Earth but in all possible existence. The rub is that you still need to construct examples by providing input from that domain to have an output in the real world --- whatever your belief in the Platonic ideal, it is agreeable that the infinite or absurdly large finite sets of input and output that are described are not fully reachable/depictable in the real world until physically performed, which requires the cause and effect of first providing input into a machine performing that function (let alone the literal encoding of a dataset into parameters as per my prior comments).

So, in a sense, with a model that is defined as such, we can indeed state that a paintbrush does encode every painting into it, albeit merely as possibilities that must be realised. Once picked up, it then requires cause and effect to move from one (the input motions) to the other (a resultant image), which has then indeed made one of those possibilities real. The issue with such a metaphor is that the implied model is begging the question, it circularly supposes that the paintbrush is responsible for this, such that it can serve as a function analogue for the metaphor, such that the paintbrush is responsible. Hence why I say you're mixing metaphors, because you've combined two incompatible metaphors (a neural network is a function, and a paintbrush is a "function" due to the cause-effect analogy to input-output) using shared language, that being the word function; which I believe I've shown has very different meanings here in both contexts.

The end result is that this "intuition" of the word/world is correct, but only partially, because it has not yet recognised how these two uses are divergent as per above. Indeed, such a model of a paintbrush would imply everything is also functions that ties together their input and output as possibilities which are retained, with any inversion providing a way to reconstruct inputs from outputs, such as figuring out how specific brushstrokes may have been performed. The critical thing is, again, to not mix the metaphors of the real world and the mathematical world, because that mathematical model can only describe possibilities for our real world. After all, it does not do us much good to state that a pencil contains every sketch as a possibility, because until we make good on it, it's essentially just a truism from the model that hardly makes it any easier to produce such a sketch - the same can be said for every drop of water, and so forth. Therefore, this is not a semantic issue, but a conflation of Platonic idealism with reality such that much is rendered meaningless, when reality has not provided incontrovertible evidence for this idealism.

For example, the paintbrush might not have, as possibilities, every single painting. One approach would be to say it is only going to be used for one, and so a more appropriate model would be needed such that "no man stands in the same river twice" and so forth, and so this function metaphor would be demonstrably inappropriate. In contrast, the neural network (or a diffusion model) is still a function in a world that does this, because its parameters have been frozen and its random seed coupled to its input, with no recurrence from any prior usage as part of its operation. Therefore, its model as a function is exact and clear.

This means that "memorise" and "encode" can have very precise definitions for neural networks or diffusion models (as in the paper, memorised images being those reconstructable by given methods to within a known measure by some metric of accuracy), whereas with a paintbrush these do not have such precise definitions (indeed, beyond being reliant upon certain world models that may or may not be incompatible, they at the very least do not have only one obvious interpretation of meaning, unlike a neural network as a function).

As to the final points, it would be more appropriate to invoke the notion of transformative derivation and ask whether neural networks or diffusion models are "de facto" justifiable as "always producing derivatives" due to size. I would say your statement can be reconsidered in light of this, the above, and my prior comments, given that they empirically do not always produce derivatives (we can reconstruct input data from prompts alone), and so the determination of whether there is sufficient transformation to be a derivative work would have to be handled on a case by case basis for each image produced, as with all things - you would not say that an artist be given the carte blanche right to declare all their works wholly original or transformatively derivative when they've clearly just produced a copy of the Mona Lisa.

Additionally, if we were to "be consistent", it would not follow that this implies there is no encoding of such, only that they do not encode images they cannot produce. There is indeed exactly an encoding of possible (and memorised) images, because this is consistent with the mathematical model of a function that the neural network represents. We can state that neural networks do not directly encode all this information in their real world instance, as they do not, their parameters are only directly correlated to the dataset that is a subset of its input domain. However, they do still indirectly encode a much larger dataset than is explicitly given, due to the fact that humans do not produce art completely randomly, and like the paintbrush, we have created art (now used in a dataset) that is not simply a function but a real continuous object that has recurrence and relation to prior and simultaneous art and thought, and this is what gives rise to all their possible outputs that are not exact copies of their inputs (and by combination they should outnumber those memorised works as well). This is as per my previous comments, however to state explicitly in combination with the above, this implies that neural networks do exactly (if indirectly) encode every possible image they may produce (and not every image that may be possible) as the real world instance/approximation of the function it is depicting (which would encode every possible image, not merely those realisable, supposing its codomain/image was an uncomputable set), of which some are memorised by more direct encoding and others are inferred reconstruction. Either way, these images are producible by the model, therefore the model does encode it, either from the perspective of it as a function, or it as a realised machine performing that function with fitted parameters to enable it (within some bounds and margin of error). So, I would say that you may wish to reconsider the statement that a desire to be consistent implies this information is not encoded, when consistency actually implies the exact opposite - I believe that statement to likely be borne from the same conflation as above, so perhaps this paragraph is a clearly redundant summary by the time you finish reading it.


Now imagine you're an expert witness on the stand in the case that eventually settles this. I don't think this will be persuasive even if it's technically sound. How you end up with a copyrighted work doesn't seem relevant: the model still produced something eerily similar to a copyrighted work.

"Copyright laundering" will be the phrase of the era. Throw Picasso in with a thousand reproductions, wash it with DeviantArt, get Picasso back out. Does it matter than it's algorithmically derived rather than a stroke-by-stroke reproduction? There's probably already ample case law around reproductions to deal with this.


It's technically sound though. Let me give you an example.

   y = f(x) = x + 2
This is what's memorized. But with that equation you can plug in a specific x to get:

   3 = f(1) = 1 + 2
The training set here would be (1,3). What is memorized is y = f(x) = x + 2. You can literally see that (1,3) is NOT in the model EVEN though that model CAN produce a (1,3) given the right input.

I think the technical part of this is sound. People will just take the face-value explanation, which is: I see a 3! Therefore a THREE was memorized. But technically a 3 was NOT memorized. This is categorically true.

You are right. The persuasive part of this argument is not very good, though, as you can see by this thread.


It kind of sounds like you're just giving a detailed technical explanation of how the network memorized some of the images.


How so? I have an equation that makes a number:

  y = 2x + 3
And I have

  (x,y) = (1,5)
I'm saying the y = 2x + 3 is the ONLY thing stored in memory. The (1,5) is not stored anywhere.

(1,5) is training data. y=2x+3 is the model. This makes sense.


If the model captures (almost) all information in the training sample, that is memorization. In your example 2x + 3 memorized the point (1,5) because you can recall 5 as f(1).


This doesn't make sense. F(X) itself is not (1,5). You have to apply F for it to work.

Paint and paint brushes can be thought of as functions. You apply the right inputs (brush strokes) and you get an output that is a painting.

Therefore could you say that all paintings in the world are encoded into the paintbrush? No. You can't. Not until you use the paintbrush to copy something.


If I ask you to recite a poem you're a function f("recite poem") which returns the memorized poem. It's not even possible to have memorisation without a function.


But it's not memorized. The "poem" in the case of the machine learning model is constructed from a general algorithm. Similar to how 2x+1 has no concept of 3.

When you input a 1, it does a *2 operation and a +1 operation and produces a 3. Nothing is memorized. 3 arises as a side effect of unrelated transformation operations.


There's a joke programming language, HQ9+, whose instructions are each a single character. One outputs "Hello world!", one prints the program's own source code (a quine), one prints "99 Bottles of Beer...", and the fourth (+) merely increments an accumulator.

Does the interpreter have 99 Bottles memorized? After all, it is a function (with a four-element domain).


"*2" and "+1" is stored information that's used produce 5. The memory of a poem is stored information that's used to vibrate your vocal cords and produce a recited poem.


The memory of a poem is stored information but your vocal cord is NOT stored information.

"*2" and "+1" are vocal cords. They operate on any number of inputs just like how your vocal cord operates on any number of nerve signals and air blowing through it.

If my vocal cords are never used to record a copyrighted song then no violation of the law occurred. If I never use the machine learning model to generate an existing picture then no law violation occurred.


Well you can get around the whole memorisation problem by only using unique training samples in your dataset, same for Copilot. But yes, I do agree that in this case the user is doing the violation, not the network.


Unique training samples only makes the problem less frequent, it does not eliminate the problem.

The curve can still intersect a unique point.


It seemed to me all the examples in the article (and in the Copilot examples as well) are repeated training points, so I think it would solve the problem.


That just means the problem was fixed only for those examples.

Logically, the mathematics behind it says that it can happen. You present evidence, which is refutable by new evidence. Logic, if correct, cannot be refuted.


The logic is also completely wrong: the space of possible images is very large. Let's just say there are 2^(32*32) = 2^1024 possible images (if it was a 1-bit 32x32 pixel image). That's a probability of 5.56×10^-309 of converging to a given sample by coincidence, which is to say it's completely impossible.


The logic is not wrong. First off your statistics showed that it CAN happen. Just with low probability.

Second, your way of measuring the probability is incorrect. IF we were to randomly pick an image out of ALL possible 32x32 1-bit images THEN your probability would apply, but clearly machine learning doesn't do this.

The actual calculations behind the probability is actually highly, highly dependent on the dataset.

To take the 2D line analogy further: if we perform linear regression on the points (0,0), (1,1), (2,2), (5,5), (329,329), ... and a bunch of points with the same x,y coordinates, the line would touch ALL training set data points 100% of the time, as the equation of the line would be y = x. This isn't even called overfitting; y = x is actually the ONLY solution available. There's no way to prevent this if the data is just really clean.


The set of natural images is going to be far smaller than 2^1024, but it's still going to be astronomically large. You're simply not going to get an accidental hit on a 512x512 image, even with training. The fact that in the paper they only managed to reproduce photos from Stable Diffusion that are likely common in the training set proves this and conclusively disproves your hypothesis. In addition, they observe that unique photos are extractable from a larger model (Imagen), making it obvious that this reproduction is due to memorisation from the larger capacity of the model and is definitely not due to an accident (otherwise the probability to reproduce would not correlate with model capacity).


> The fact that in the paper they only managed to reproduce photos from Stable Diffusion that are likely common in the training set proves this and conclusively disproves your hypothesis.

There's no hypothesis. My conclusion is not scientific. My conclusion is derived from logic. It cannot be disproved.

>The set of natural images is going to be smaller than 5.56×10^-309, but it's still going to be astronomically large. You're simply not going to get an accidental hit on a 512x512 image, even with training

It might be low probability to get an exact pixel-perfect match. But a match that is more or less indistinguishable from the original to the human eye (or even a different but obvious reproduction) has a significantly larger probability.



If I have 10 points in 2D space, I can always hit them all exactly so long as I have a polynomial of degree 9 or higher (n points with distinct x values need degree n-1). This is overfitting, which is indistinguishable from memorization.
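A quick numpy check of that claim (the points are random; any 10 points with distinct x values work, since a degree-9 polynomial has 10 coefficients, one per point):

  import numpy as np

  rng = np.random.default_rng(42)
  x = np.sort(rng.uniform(0, 1, 10))   # 10 distinct x values
  y = rng.uniform(0, 1, 10)

  coeffs = np.polyfit(x, y, 9)         # degree-9 fit: exact interpolation
  residual = np.max(np.abs(np.polyval(coeffs, x) - y))
  print(residual)                      # ~0 up to floating-point error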


Over-fitting is a known problem and deliberately avoided. This does not prevent the curve from intersecting actual training points.

I mean technically, if there are many copies of the same data set pair, THEN you can call it over-fitting. But removing that does not fully remove the problem. The curve will still intersect datapoints EVEN without over fitting.

Also over-fitting is not memorization.


That just sounds like memorization with extra steps.



That doesn't sound right at all. In 2D space a line bisects the plane. There's no way of moving from one side to the other without touching it. The possibility space of images is much, much larger than 2D.


Machine learning is an extension of the concept of linear regression in 2D space to 9999999-D space. They also make the line more than just a "line": the equation produces more of a best-fit complex curve in multi-dimensional space. This is the actuality of what programmers in this area do.

Much of that space consists of pairs of data that aren't relevant. So to understand it in terms of the analogy of the line in 2D space... the only things that are relevant are points near the line.

For example, if we use the line to represent housing costs (Y) over time (X), you can take a bunch of housing prices sold over time and plot them on the graph as dots. Linear regression will form a line that best fits those dots. Not EVERY single section on the plane matters though. Only the dots, and the space very near the trend line, hold anything meaningful in terms of data.


It's work like this that makes me frustrated at the popular discourse around generative models (especially here). There's a ton we don't know about these models, and yet you get tons of people arguing that these models absolutely don't memorize, or that they learn like we do and so their learning should be treated like ours (legally and ethically). Then you get work like this showing that yes they actually do some memorization and regurgitation. There's still so much we don't know here.

My fear is that when things like this come up for lawsuits, overconfident experts are going to talk out of their asses about how these models do or don't work, and that's going to determine how automation affects our society.

On a technical level, I'd love to see a patch-wise version of this investigation. This shows whole images being regurgitated near-exactly rarely. I expect that small part-of-the-image patches are regurgitated even more often. But is it simple stuff like edges being regurgitated or are larger parts regurgitated frequently too? Given the architectures generally used, I'd guess that it's significant.


The thing I gather is a lot of these people were never experts. Just like crypto, a lot of grifters and hype men gather around the tech because they see it as a get rich quick scheme, and being the type who are grifters, they are far too lazy to learn any of the actual details. They instead just harp on the popular narratives, one among them "your brain is a neural net!" that I've heard repeated ad nauseam even here on HN for almost ten years now.


I'm not sure you can do "small part-of-the-image patches" when comparing against 175,000,000 images. And I don't mean that from a scale/processing perspective I mean it would seem you'd always get tons of small patch false positives from any realistic looking image.


In the limit of 1x1 patches, for sure. For 256 by 256 patches, who knows? I want to see how regurgitation varies as you vary N in NxN patches.


Artists who copy another's style well surely memorize many of their works too.


I can memorize your patented design too. Doesn't make it less illegal for me to recreate it from memory.


The relevant law here isn't patent law. It is copyright and trademark law. If you memorize a famous (recent) painting and recreate that as closely as you can, is it copyright violation or a transformative derivative work? I guess it depends on your process, intent, and fidelity. Oh, and on local laws.


> is it copyright violation or a transformative derivative work?

Short answer: for any country participating in WIPO, yes it is.

Edit: And if you don't believe me, let's replace the word "painting" with the word "song":

> If you memorize a famous (recent) song and recreate that as closely as you can, is it copyright violation or a transformative derivative work?

Hopefully everyone here knows the answer is "absolutely yes!" which is why artists need permission to cover other people's music.

Now, yes, intent factors in, insofar as it affects a potential fair use defense. But that doesn't affect the status of the work; it only determines whether the act of violating copyright is defensible.

This is how Google gets away with returning images in their search result, and why I can't just copy an image they return and use it without myself violating copyright.


It does if it's over 20 years from the application date. It does if you match on a design published before the application date. It does if you change the design just enough not to read onto a single claim.

Honestly, training a generative model on the patent database could be very useful for inventing and invalidating patents. I wouldn't be surprised if examiners are using one in 5+ years. Given their required response counts and deadlines, I wouldn't be surprised if they started using it just to speed up their own work.


This study was organized by Google (technically DeepMind).

I wouldn't be surprised if Google wants the lawsuit to succeed. It would block open-source models like these from existing and potentially give Google a competitive advantage, since they can afford whatever compliance is mandated. They'd be able to offer services that comply, while open-source models would only have access to lower-quality data and would be stunted.


I think there's a concern also that smaller companies can gain a competitive edge via resistance to reputational damage. As Yann LeCun recently tweeted:

'By releasing public demos that, as impressive & useful as they may be, have major flaws, established companies have less to gain & more to lose than cash-hungry startups.

If Google & Meta haven't released chatGPT-like things, it's not because they can't. It's because they won't.' (https://twitter.com/ylecun/status/1617908306420600833)


To your point on reputation: didn't Meta recently unpublish an LLM after it was shown to hallucinate wrong answers? Smaller AI companies would have stuck to their guns, but Meta has internal controls to guard against damaging the "brand". IIRC, the tweet announcing the retraction was particularly salty; it sounded like someone whose hand was forced.


Yes, the Galactica LLM by Meta. Though LeCun isn't an author of the paper, he is "Chief AI Scientist for Facebook AI Research (FAIR)"[0], and he was quite angry about the closing of the Galactica demo[1].

[0] https://ai.facebook.com/people/yann-lecun/

[1] https://twitter.com/ylecun/status/1593293058174500865


Their extraction: (1) assumes the attacker knows the caption for some training images, and (2) primarily works on images duplicated 100x-3000x in the training dataset. Their attack does not succeed for any singleton images. Deduplicating can be challenging on internet-scale datasets, but their work as presented does not appear to be a major concern for releasing diffusion models trained on other smaller datasets.
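For intuition, the generate-and-flag style of attack can be sketched in a few lines. Everything here is a stand-in - `generate` for a diffusion sampler, and the thresholds and counts are invented, not the paper's actual parameters:

    import numpy as np

    def flag_memorized(prompt, generate, k=500, d_thresh=0.1, clique=10):
        """Sample k images for one training caption; flag the prompt as
        likely memorized if many generations are near-identical to each
        other. In practice you'd compare downsampled copies or embeddings
        to keep the pairwise comparison cheap."""
        samples = np.stack([generate(prompt) for _ in range(k)])
        flat = samples.reshape(k, -1).astype(float)
        flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8
        # Pairwise distances without a (k, k, D) intermediate.
        sq = (flat ** 2).sum(1)
        d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * flat @ flat.T, 0))
        # For each sample, count how many *other* samples sit very close.
        neighbors = (d < d_thresh).sum(axis=1) - 1
        if neighbors.max() >= clique:
            return samples[neighbors.argmax()]  # candidate memorized image
        return None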

On memorization - I suspect this is a great thing for downstream performance, and a positive indicator that diffusion models are actually better generative models than prior methods (VAEs, GANs, etc). This mirrors the finding that feedforward neural networks can memorize randomly labeled data very well. Intuitively, memorization feels like a quantifiable, foundational behavior in information processing - one kind of optimal use of observed data - that supercharges downstream performance.


> are actually better generative models than prior methods (VAEs, GANs, etc).

Diffusion models are VAEs and follow the same variational framework. You could imagine that VAEs are diffusion models with a single step in the forward and backward processes ;). They actually optimize the same VLB objective, but with diffusion models the objective is over a trajectory instead of a single step. Even so, when training we optimize single-step transitions, which is possible because the objective ends up being a sum of logarithms, so there is no dependence between terms.

In practice we solve a simplified objective which looks a lot like the one we use with standard AutoEncoders ;)

The key component that differentiates the two is what we expect of the underlying neural network. It is far easier to parameterize small changes than large ones: with VAEs you ask the decoder to produce one large change from the latent variable, whereas with diffusion we generally split it into 4000 smaller changes, assuming you are using the DDPM approach and not the DDIM one.

Because we are improving with very small steps, we avoid the blurriness of VAEs, and we don't go out of distribution when sampling random noise. VAEs are often difficult to sample from because even with the KLD term in the objective, the encoder produces a low-variance distribution, so when we sample noise from a higher-variance Gaussian, we go out of distribution rather quickly.
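For reference, here's the trajectory-level bound as it's usually written in the DDPM literature (my transcription; check the DDPM paper for exact conventions). Each term involves only one transition, which is why single-step training works, and with T = 1 it collapses back to a one-step ELBO, i.e. a plain VAE:

    L_{\mathrm{vlb}} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)
        + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)
        - \log p_\theta(x_0 \mid x_1) \Big]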


I agree it's illuminating to understand diffusion models in relation to VAEs, but I personally consider them different models; the line in the sand is definitely subjective.

I think this because (reasons I'm sure you're familiar with)

- Diffusion model is closest to a hierarchical VAE, but hierarchical VAEs were significantly less popular than regular VAEs

- The variational objective in diffusion models in practice is weighted

- Diffusion models require unchanging latent dimension while VAEs aren't restricted to this

- Historically, diffusion models grew out of score-based approaches, not from VAEs


You raise good points. If anything, it would probably have been more accurate of me to say that DDPMs and their probabilistic variants fit within the same Bayesian framework as VAEs, but with the posterior and likelihood functions being Markov chains instead.

This allows us to separate non probabilistic diffusion models e.g. cold diffusion. But then again, what's the difference between a deterministic model and sampling from a delta function? ;)


Super interesting post. Tyvm! Which three sources would you recommend for someone fluent in ML to read up on to arrive at your conclusions presented here (or their own)?


The Variational Auto Encoder paper [1] and the DDPM paper [2] are pretty much all you need for this; [6] is also good but covered by [2]. Going through the derivations helped solidify things for me. I haven't read [9] but it looks very promising; authors include Jonathan Ho and D. Kingma, who authored [2] and [1] respectively.

From there [3,4] show improvements to DDPMs, [5] shows that diffusion models can be very general. [7,8] show diffusion models from the view of score matching.

[1] AutoEncoding Variational Bayes

[2] Denoising Diffusion Probabilistic Models

[3] Denoising Diffusion Implicit Models

[4] Improved Denoising Diffusion Probabilistic Models

[5] Cold Diffusion

[6] Deep Unsupervised Learning using Nonequilibrium Thermodynamics

[7] Generative Modeling by Estimating Gradients of the Data Distribution

[8] Score-Based Generative Modeling through Stochastic Differential Equations

[9] Variational Diffusion Models


Only 109 retrievable images out of the 350,000 most-duplicated is fewer than I expected. Maybe it's just the stringent definition of retrieval, but I would have expected many famous works of art like the Mona Lisa and Girl with a Pearl Earring to be readily extractable. Maybe these just aren't quite pixel-perfect enough?


So wait: they only found 109 matches after generating 175 million images with SD v1.4, using the prompts from the most-duplicated samples in the dataset? Also, almost all of those images have more than 300 copies in the dataset, so with a model of the same size trained on a deduplicated dataset like SD 2.0/2.1 there would be almost no matches, even after generating 175 million images and knowing the prompts used in the dataset. Finally, Google et al. need to explain how an attacker who wants to extract images from a trained model somehow has the prompts for the top X duplicated images but not the images themselves, and is thus going to spend an incredible amount of money generating something like 175 million samples and testing them against each other to find the matches.

Edit: I also want to add that Google seems to try really hard to cast themselves as the good guys by not releasing their models because they're "not safe enough", but the incredible amount of computation they spent on this paper shows me otherwise.


The argument here is more that this attack serves to just "find" these, and while it uses significant compute to do so, it's also not using many of the more complex and efficient extraction approaches that have been used on GANs and such in prior research. They're using a simple method and exploring what it does; most of their 175 million images come from effectively "retrying" each prompt 500 times, which could easily be cut down when others attempt attacks, since real attacks are usually much more targeted at what they're trying to extract.

Also, "almost no matches" is still problematic, because they still occur, and since they don't seem prevalent, they're unlikely to have been accounted for in any other way (SD 2 being finally based on a deduplicated dataset). So, in a sense, they're "still in there", it's just harder to find by a casual or simple approach as above. Again, one of the central themes is that diffusion seems to be less resistant to attacks than, say, GANs, and that may mean more efficient and complex extraction attacks may translate over even more effectively.


>it's also not using many of the complex and more efficient extraction approaches that have been used on GANs and such in prior research

Some links?

>most of their 175 million images comes from effectively "retrying" each prompt 500 times

Prompts from the most-duplicated samples in the dataset - a really important caveat if you actually want to use this method in the wild. This is also one of the reasons why I said this attack seems so implausible.

>they're usually much more targetted than this

Even if you target specific images, you would still need an absurd amount of luck. If the most-duplicated samples only yield 109 matches, we can be generous and assume the whole dataset would yield something like 200; the probability of finding an image with a direct attack is still less than one in a million (even if you know the prompt), and we're not even talking about a model trained on a deduplicated dataset.


Is it implausible if they've done it in this paper?

This paper seems to answer the question of, "can SD, even just in theory, produce copyright-infringing work?" with "yes, it can."

For other images that are a product of thousands - if not millions - of source images, it becomes murkier.


>Is it implausible if they've done it in this paper?

Extracting images in the wild, yes. The authors of the paper have access to the dataset, so they could sort prompts and images by their frequency in it, and they have an incredible amount of computation to throw at the problem; generating 175 million images with a diffusion model is an extremely resource-intensive task.


I believe everyone has access to their dataset, no? https://laion.ai/blog/laion-5b/

Anyway, I don't think the point of this was to indicate that people can stumble on these incidents, but rather that it is possible. It's hard to see how this won't affect the ongoing suit.


In the case of Stable Diffusion, yes, the dataset is publicly available, but these types of attacks would make much more sense if the attacker wanted to extract private data.


Ah, I understand what you're getting at. True.


(I've merged the extra bit from https://news.ycombinator.com/item?id=34612005 into this comment so people can read everything you wrote. If you don't like that, let me know and I'll reopen the comment so you can change it.

I will move the replies here as well.

But please see https://news.ycombinator.com/item?id=34614429. This sort of surgery is time-consuming!)


The tweet/paper co-author posted the paper (https://arxiv.org/abs/2301.13188) yesterday on HN (https://news.ycombinator.com/item?id=34596187), and ironically the top comment there references this exact tweet thread (which was posted yesterday as well). Evidence for the metacircular evaluation of HN comments?

I think the paper is well worth the read, it's not particularly long (much is references and appendices), and nicely written, with at least a quick bit on most things I would think to test as part of something like this. Good stuff.


We'll merge the threads. Thanks!

(Normally I wouldn't have moved your comment as part of the merge but (a) you said something about the paper, and (b) after name-dropping metacircularity how could I not)


I know it's not cool to say "I told you so", but...

This was entirely predictable, and is one prong of the primary arguments that these ML models, trained on datasets including copyrighted images taken without permission, infringe on the copyright of those images' creators.

Train the damn things on public domain images and images you have explicit permission for, and you'll be fine. Stop acting like you have a right to just vacuum up every image ever created because it's "AI".


My problem with this line of thinking is that it implies that humans remembering or using copyrighted things like normal humans do is wrong. E.g. can you remember a scene from a recent movie? Copyright infringement. Did you quote a recent book in conversation or a presentation? Absolutely not authorized!


> it implies that humans remembering or using copyrighted things like normal humans do is wrong

I don't agree that this is the resulting implication. I'd argue this is only implied if you believe that the software is so similar to humans that making judgements about the software is equivalent to making judgements about humans.

Put another way, when a human does those things it is not infringement, because we have already determined that it's not, and the laws are based on humans interacting with the content. A computer doing those things is arguably a new thing entirely, and requires new rules. This does not imply that the old rules would or should stop applying to humans the way they do now.


ML models are not comparable to human memory. ML programs are not comparable to a human brain, not even by claiming that they use some of the same conceptual structures.

Stable Diffusion has no agency, cannot think, and cannot create on its own. A human brain, even if it has been given no art to learn about, can still create. The art influences the form of its creations, true, but in ways that are fundamentally different than ML art generators, if only because of the presence of a conscious human will actively directing it. (And no, a human writing a prompt is not the same thing either.)


My issue is not that ML models are comparable, it's that the issue in both cases should be a matter of what is done with output, not the ability to create it.

People can doodle Mickey on napkins or in notebooks or into their Van Gogh poster all they want. In many ways ML is just making that easier. The problems are all in what you do with it, given it being so much easier, not with the capability.


...Well, you're certainly entitled to your opinion, but that's not actually how copyright law works.

Regardless, people are already, right now, using these ML art generators to cut real human artists out of the loop while producing products for commercial sale.


I mean the systemic effects are going to be rough.

But however copyright law works, if little Timmy draws a Pokemon it's pretty normal and nobody gets fined or goes to jail.

The systemic effects from this are going to be because you can get good enough "original" output without needing an actual artist, not because the good enough output will inherently violate copyright.


Part of the problem right now is not just people using generators to produce "good enough" art, but using them to produce art in the style of specific, living, active artists, who rely on being able to sell their art for their livelihoods.

The only reason it is able to produce art in their style reliably is because it has been trained on their work without their (or, in fact, anyone's) permission.


The difference between humans and computer algorithms is that humans don't run on software.

Computers are tools and neural networks only resemble human brains on a surface level. The human learning process is a lot more involved than just interconnections and weights, there is an entire array of biological concepts that neural networks don't even try to simulate in their approach.

The comparison between neural networks and the human brain is the same as the comparison between a hard drive and the human brain: the recollection may be automatic and similar, but the concepts behind it and the legal implications aren't.


I don't think these algorithms are much like brains. The issue I have is that I view infringement as a "what you do with output" issue, not a "what output can be produced" issue, so demonizing ML models as a whole over people who are irresponsible with them seems sketchy to me. It's like blaming human brains for dreaming of Disney and Looney Tunes characters at the same time, instead of blaming someone stupid enough to put the two together outside of parody and attempt to trademark it.


I have no problem with ML models or AI tech in general. The problem is and will always be the data set that the current models are trained on.

You can generate cartoon characters without training on pictures of Mickey Mouse. Just use pictures that don't carry any copyright requirements. The tech and its many possibilities won't change. If the code is any good, the generated model will be just as good as the current one.


Not sure if you're aware, but people are actually not allowed to misrepresent other people's creations as their own.


Sure, but you're getting in trouble for the act of misrepresenting, not because you drew a doodle for your friend or had a dream with Twilight vampires in Harry Potter or something. More so than your brain's ability to output ideas, it's what you do with that output that matters. Whereas with these models, people are acting like it's having the capability that's the bad part, not the usage.


It's an interesting legal question.

If a human cannot publicly use a copyrighted image without a license, why / how a non-human can?

If some images are free to use with attribution, how can an ML model track and provide such attribution?


It’s not interesting or complicated in any way: because a non-human has no free will and is operated by a human, the human (or in this case OpenAI/Microsoft/etc.) is ultimately infringing.

If and when the non-human is granted human rights, this can be revisited.


> If some images are free to use with attribution, how can an ML model track and provide such attribution?

Easy - if they cannot provide attribution, they cannot use the image to train an ML model.


> If some images are free to use with attribution, how can an ML model track and provide such attribution?

That's a problem for the people who create the models to solve.

This is what's so frustrating about the ML/AI community, they think the onus is on everybody else to overcome problems created by their products.


Exactly. I'm interested in approaches to a solution.


Humans use copyrighted images without licenses all the time. Even trivial things like sharing a photo of a book cover can qualify.


Yes but do you pass it off as your own original artwork that you created all by yourself, or do you present the book cover in a way that makes it obvious that you are taking a photo of somebody else's book?


So it's what you do with the output and not the output itself that matters, right?


No, it's the part where you misrepresent somebody else's creation as your own. If it's not alright for a person to do, then it's not alright to automate it. Stop trying to play dumb semantic games; it's not nearly as clever as you think it is.


I don't think it's being clever. If someone prompts an AI generator and acts like they created the output, that's on them, not on the AI generator.


It would be on them if the AI generator was forthcoming about how it "created" the image. If a company like MS is advertising that they have a program that creates new content, but that program actually has a known tendency to output content from its training set, then they bear responsibility, especially if that program is being run remotely on their servers.


They literally have such a right because there is literally such a right written into German law.

And UK law is adding that right even for commercial models.


So as long as you can find (or bribe) one country that will let you skate by, you think it's fair game to just declare that your model's country of residence, then take all the images you want and put out your trained model to the whole world?

Somehow, I don't think that one's likely to fly.


Yeah. That's how countries work, they're sovereign except as bound by treaties. This is a general EU right as well, not just Germany.


Which laws are you referring to? It's the first I've heard of countries adding explicit rights for training models.


https://www.clarin.eu/content/clic-copyright-exceptions-germ...

https://www.linklaters.com/en/insights/blogs/digilinks/2022/...

(Note, it's looking less likely the UK one will actually get extended that way now.)


"... for non-commercial purposes only ..." seems an important limitation in the first link.


If Google Books is public domain and considered fair use (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....), these ML models definitely are.


Nope. Totally different case based on a totally different set of circumstances and outcomes.

This is a great example of coming to a conclusion about copyright based on how you think the system should work vs how it actually works.

Google was able to convince the court their actions constituted fair use.

My guess is training a generative AI will also be fair use. The question is, what about the output of the resulting model? And that is a question it'll take a court to answer.


> My guess is training a generative AI will also be fair use. The question is, what about the output of the resulting model? And that is a question it'll take a court to answer.

I suspect that you are right, and that it will place legal responsibility on the individual operating the software for any infringement.

I think this will force the companies building these models to behave as if training the model was also infringement, because I cannot imagine a scenario where the average end-user has enough understanding/awareness of the implications of their prompts to avoid generating infringing work, and end-users getting sued would create an instant chilling effect on the use of such software.


I strongly suspect you're right about that.

My bet is the court will determine that whether the output of a model is or isn't subject to copyright isn't a black-and-white answer, but rather depends on the work.

Fundamentally, the test as to whether a work represents a copyright violation is about "substantial similarity" (https://en.wikipedia.org/wiki/Substantial_similarity):

> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation. Under the doctrine of substantial similarity, a work can be found to infringe copyright even if the wording of text has been changed or visual or audible elements are altered

Okay, so let's say I take a thousand copyrighted images, average their pixels, and produce a single uniform grey output. No jury is going to conclude that work has "substantial similarity" with any of the original works, and I'm clear.

But now suppose I do the same, but weight it so 99% of the pixel colour comes from one image, and the remaining 1% comes from the rest.

Well, in that case, odds are very good a jury would find me guilty of violating the copyright of that original work that represents 99% of the image.
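That thought experiment in code, with random arrays standing in for the copyrighted images:

    import numpy as np

    # Stand-ins for 1000 copyrighted images, each (256, 256, 3).
    imgs = np.random.randint(0, 256, size=(1000, 256, 256, 3)).astype(float)

    # Uniform average: a featureless mush, no substantial similarity.
    grey = imgs.mean(axis=0)

    # Skewed average: 99% of the weight on a single source image.
    w = np.full(1000, 0.01 / 999)
    w[0] = 0.99                      # weights still sum to 1
    derived = (w[:, None, None, None] * imgs).sum(axis=0)
    # `derived` is visually almost indistinguishable from imgs[0].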

So my bet is the courts will conclude that the model, itself, doesn't in any way violate copyright, nor did the training itself run afoul of the law, but that any given output might, depending on the substantial similarity test.

And that means every single work is suspect and a potential target for litigation.


The thing is most models for Stable Diffusion aren't created by companies but rather by end-users. There are literally hundreds of models for stable diffusion that you can download, from landscapes, to animals, to (of course) porn. A few of them were created by Stability or Huggingface, but most are trained by end-users. It isn't hard at all to train a model with existing tools -- you don't have to be an AI expert to do it.


I believe those user-created models are still based on the core Stable Diffusion models though, and bring with them all of the same issues.

My understanding is that it's not difficult to tune existing models as an end-user, but to start from zero would be impossible for most individuals financially and technically.


It's not clear what you mean by "based on". For example, the model Anything is trained on the Danbooru anime image site. These images aren't in the standard Stable Diffusion model. The issue with that model is the legality of including those images, which the standard Stable Diffusion model does not.


But isn't that model still mixed with the core SD model? I was under the impression that all of these specialized models are created by training against a particular image type/dataset, and then mixing the result with the core SD model.

This is how those specialized models can still generate just about anything. Without the core model mixed in, the specialized model would be nearly useless.


Train one on Disney cartoons, share it on the internet, and see what happens!


Could you expand on why you feel that’s the case?


The paper shows that Stable Diffusion and Google's Imagen regenerate individual images from their training sets. They show it is very rare, but can be found reliably.


Seems relevant to the Getty Images lawsuit against Stable Diffusion.


No doubt, but how relevant? If you could somehow go through my brain you'd also find the occasional piece of art and literature - I've memorised a few good poems and songs. And I can recognise certain paintings on sight which would be difficult if they weren't accurately encoded in my mind somewhere. The fact that I have memorised them doesn't mean that I'm violating anyone's copyright if I attempted to compose poems and songs.


The question is, will the legal system see this software as equivalent to your brain?

Anthropomorphizing this software seems problematic.


Why do people bring up humans and their brains, as if it's gonna affect how models create and store image data/derivative data in a very definite form of bits on a disk? Just as a distraction?


I can't load your brain into RAM with a Python library yet, though.


Absolutely, but that is what makes the relevancy of the fact interesting. If you could load my brain into Python, would it then become copyright infringement, despite the content being logically unchanged?

Memorising a picture using a fleshy model is fine, so the raw fact that art has been found in a black box model here isn't necessarily relevant. Might be. Might not be.


I understand what you’re saying, but if you could load Python into your brain, I suspect the laws would quickly change to reflect that new reality.

The biological/evolutionary limits of humans are core assumptions of current laws, and I’d argue that the operating environment has changed enough to make those assumptions outdated.

> Memorising a picture using a fleshy model is fine

I imagine the fleshy model is fine not because it’s fleshy, but because it’s the model that people were targeting when writing current laws.

Even if no memorization occurred, there are still big questions about why such a model should be treated like anything other than just another computer program from a legal perspective.


Eh... The mp3 decoder can generate copyrighted music if you feed it the right inputs...

Likewise, in this work they prime the pump by using exact training prompts of highly duplicated training images. And then you have to generate 500 images from that prompt to find 10 duplications. You've really gotta want to find the duplicates, which indicates that these are going to be extremely rare in practice, and even more rare once the training data is hardened against the attack by deduplication.


> Eh... The mp3 decoder can generate copyrighted music if you feed it the right inputs...

Good analogy! An MP3 decoder takes an input and produces an output. If the output is copyrighted material, it's well understood that the input is simply a transformed version of that same copyrighted material and is similarly copyrighted.

The SD model is very much analogous. The prompt causes the algorithm to extract some output from the input model. If the output is copyrighted material then similarly the input model must carry a transformed version of that same copyrighted material and is therefore also subject to copyright.

Right?

By the way, I pose this, but I highly doubt this is actually how the courts will rule. I think they'll find the model itself is fine, that the training is subject to a fair use defense, but that the outputs may be subject to copyright if there's substantial similarly to an existing work in the training set.


The probability of synthesizing something similar enough to a training sample, given that the dataset does not contain duplicates, is astronomically small. In this work they purposefully manipulated the dataset, used specific prompts, gave all sorts of advantages to the adversary, and created many images per prompt in order to find such cases.


What do you mean by the last sentence?


I imagine reliability means that with the exact prompt, seed, cfg scale, model checkpoint, blah blah and on the same hardware, they can continue to get an image that they consider close enough to the original.


Is there any reason we shouldn’t view diffusion models as any other tool? I can infringe copyright with photoshop too… even accidentally. If I generate original work with either that seems like fair game.

I imagine with the right prompt one could coax out a copyrighted image even if the model had never seen it before.


There's still valid reason to be concerned here. For example, this implies building a diffusion model based on private data can leak it. If I can generate a whole bunch of prompts like "MRI Joe Biden brain tumor" and 1 out of a million times I get a consistent result, that's unacceptable.

Github Copilot could be leaking private code as well.


It seems the only way they can get an extraction is if a highly duplicated (vs. the average) training image is used, which runs counter to the privacy concern.


This thread shows that there are outliers that are non-duplicated in the training set that still show up in results.

https://twitter.com/alexjc/status/1620466058565132288

Specifically, this post confirms cases of a single image in the training set being "memorized"

https://twitter.com/Eric_Wallace_/status/1620475626611421186...


Ah I see the section on this in the paper now, the 2nd half of 7.1 on page 14.


To me this is kind of like being shocked that people who've seen the Starry Night can remember what it looks like.


People can remember Starry Night, but if you asked people to reproduce it from memory the reproductions would be all over the place.

People are generally not capable of creating realistic copies of anything straight from memory. Realistic paintings from the Renaissance could only be done from still life, and photorealistic paintings only became possible after color photography was widespread and artists could base their paintings on full-scale photographs.


> People are generally not capable of creating realistic copies of anything straight from memory

The average person can't even draw a bicycle from memory[0]. GP is overestimating human capacity for reproduction.

However, whatever humans are capable of is a red herring. If humans memorize and reproduce a copyrighted work, it counts as a performance or a copy. Someone typing Dan Brown's novel on their blog from photographic memory does not get a pass, and neither should the AI models.

0. If you have 10 seconds, sketch a bicycle before opening the link https://twistedsifter.com/2016/04/artist-asks-people-to-draw...


That's because most people can't draw.

A better example would be: could the average person accurately imagine a bicycle? Can their brain generate the correct image, even if they lack the skills to draw it?


The link I provided shows that people can't visualize a bicycle accurately; people were asked to doodle how they imagined a bicycle is laid out, and most of the drawings were structurally unsound, to say the least.


IMO if you asked someone to think of a picture of a bicycle and had a way to turn their mental image into a JPG, you'd get 1000x better output than asking them to doodle one, because drawing is much more difficult than visualizing. Most people know they have trouble drawing, so they kind of goof their doodles as an almost unconscious self-deprecation to express their self-consciousness about it.


People often have highly realistic mental images and just lack the ability to get them onto paper, screen, or another medium, because moving stuff from your head to an image is a skill that takes a lot of effort and time to learn. But if you could load an image from someone's brain onto a screen, I think many images people think up could easily be quite good.


I don't know. I have a few friends who thought they were spinning gold only to realize they reverse engineered the tabs for Pinball Wizard or Sweet Home Alabama. Certainly their work was infringing even if a little different.

That said, I don't think it's an excuse for these AI tools


This kind of anthropomorphism isn’t generally accurate in the ML world. We make analogies on how it’s like people, then later work shows that actually no it just finds some weird shortcut. Over and over again.


I don't think trying to keep "what you do with output" a distinct thing from "has the ability to produce output," "produced output," or "requested output" categorically is really anthropomorphism.


I think until now this has been a complaint about generative AI applications that has been handwaved away ("what if co-pilot generates GPL code?" -> "Well it will generate a blend of many pieces of code")

If there is now evidence that it does memorize and emit training data, that argument had new life.

That said, I imagine it's easy enough to add a post-process step to ensure your result isn't a member of the training set.


I don't know. It's not like remembering GPL'd code you've read makes your brain subject to the GPL, or makes code you write that merely uses a similar structure or variable names for a loop GPL'd.


And the reaction among many here seems to be shock or disbelief that someone who memorized Starry Night and then proceeded to reproduce the work would run afoul of copyright law.


I don't know, people training themselves by trying to emulate well known or successful artists and styles has long been a thing if my memory of Deviant Art 20 years ago is accurate. And most every comic con I've attended is overrun with examples of the same kind of thing even now. Or tattoo artists giving people tattoos of movies characters and scenes. Or the multiple times in school that an art assignment involved more or less copying or reinterpretation of an existing work.

That said, with humans it's not like anyone tries to prevent us from having the capability, it's just that usually people know better. Otherwise going to art galleries would be problematic.


> I don't know, people training themselves by trying to emulate well known or successful artists and styles has long been a thing

Which is not the same as memorizing and reproducing a work, which is equivalent to, say, memorizing and covering/performing a song, something which is well understood to be a violation of copyright law.

> Or tattoo artists giving people tattoos of movies characters and scenes.

Which violates copyright.

> Or the multiple times in school that an art assignment involved more or less copying or reinterpretation of an existing work.

Also violates copyright, but is probably covered by a fair use defense due to the scholarship aspect of the work.


Run afoul of copyright law how is the question.

My wife works in design. It's amazingly easy to try to come up with a new logo design that somehow nearly exactly matches other existing logos that are in use by other companies. They have to spend a huge amount of time making sure their 'original work' doesn't violate someone else's copyright/trademark.

Is 'creation' the act of violating copyright? I wouldn't think so.

We tend to talk about copyright violation in light of distribution. Is the file transfer from the SD server to your browser an act of distribution? Who knows what the law thinks at this point.


> My wife works in design. It's amazingly easy to try to come up with a new logo design that somehow nearly exactly matches other existing logos that are in use by other companies. They have to spend a huge amount of time making sure their 'original work' doesn't violate someone else's copyright/trademark.

> Is 'creation' the act of violating copyright? I wouldn't think so.

It absolutely can be. There are numerous cases of courts finding copyright violations due to accidental copying, particularly in the music industry.

I know, that might sound unintuitive, but the law is the law:

https://www.law.uci.edu/faculty/full-time/reese/reese_innoce...

> But since 1931, a defendant’s mental state has clearly not been relevant under U.S. copyright law to the question of liability for direct copyright infringement. As the Supreme Court stated that year, “[i]ntention to infringe is not essential under the Act.” So innocent infringers are just as liable as those who infringe knowingly or recklessly.

As an aside, logo design, which you mentioned, actually comes with a whole other set of considerations, as you're less concerned about copyright and more about trademark, which is a whole different branch of IP law.


I’m disappointed in all the anthropomorphizing in this thread. Time and time again, we make analogies for how black box ML algos must work like people, only for researchers to come along and show that they actually just use shortcuts that don’t remotely resemble human learning/thinking.

When will we learn to stop being overconfident about how these things work? Just say “we don’t know yet.” Anthropomorphism and overconfidence are dangerous in that we could set the wrong precedents (culturally and legally) for how these are used and how automation affects society.


That only answers half of the question. Maybe ML uses some weird shortcut, but how do we know the human brain doesn't use the same shortcut? If it's possible to use some simple hack to do something, why didn't we evolve to work that way?


Here's a concrete famous example. I'm sure I'll remember the details wrong, but the gist is the kind of thing that keeps happening.

A while back, people built a model for medical imaging that learned to distinguish between images of patients with vs without some disease (can't remember that detail). It did well, but failed in the real world. It turned out that instead of learning to recognize features of the disease at hand, it learned to recognize some tiny feature of whether the image came from a specific hospital that collected part of the dataset, or something stupid like that.

Saying "maybe model X does the same thing as humans" is proven wrong for X after X after X. At this point, the default assumption should be that ML techniques are different from humans unless proven otherwise.


I can't say with 100% certainty, but I don't think our brains turn words into numbers and perform mathematical calculations to make images. It's a bit of a cop-out to say "we don't know how the brain works". The same applies to ChatGPT: we don't use math calculations to write words. If anything, GPUs are not using a shortcut; they use brute computation to get to what our brains perceive as similar.


>Maybe ML uses some weird shortcut,

You do realize that there are people in this thread who can explain to you in fine-grained detail how an ML model actually comes to conclusions, without speculating about abstract "weird shortcuts".


Brains are radically different from GPUs.


The same calculations can be performed by an abacus. What is doing the calculation is irrelevant. The question is what the calculation is.


This argument would be somewhat more compelling in a world where little bits of silicon were not on the order of quadrillions of times faster than we are at arithmetic, whereas those bits of silicon struggle to do at all things we casually do every waking second.

The calculation is theoretically unimportant. Practically, it is of great importance.


That's kind of a philosophical question no? Is being able to model an analog behavior accurately the same thing as the analog behavior itself? While we know from math there's an equivalence / bounded error rate, it certainly seems like emulation in the digital domain is far more power intensive which would indicate that it's not the thing itself. A clearer example is photon collisions. Simulating that behavior on a computer is not the same thing as colliding the photons. Could be wrong though.


Does it matter if the simulations of photons is on an abacus or using a GPU? I think that's the question. Neither of those are "reality", just a simulation.


>>>> Maybe ML uses some weird shortcut, but how do we know the human brain doesn't use the same shortcut? If it's possible to use some simple hack to do something, why didn't we evolve to work that way?

>>> Brains are radically different from GPUs.

>> The same calculations can be performed by an abacus. What is doing the calculation is irrelevant. The question is what the calculation is

> Does it matter if the simulations of photons is on an abacus or using a GPU? I think that's the question. Neither of those are "reality", just a simulation.

I think so, yes, specifically with respect to the question of "how do we know the human brain doesn't use the same shortcut?". Simulations likely use very different shortcuts because they're optimizing for the structural design of a man-made machine that exists today and uses numerical and CS tricks to cheapen the computation cost while maintaining error rates on training data. The brain uses physical shortcuts to minimize energy expenditure, for survival of the host, and resiliency of the species (i.e. OK if flaws exist sometimes as long as the species survival is improved long-term). So not only is ML a fun-house mirror image of a brain (our model is extremely imperfect today), the optimization process is totally alien to how the brain figured out all its shortcuts.


Neurons do not fire like logic gates. Stop it.


I’d just like to point out that if neurons do or don’t fire like logic gates that it basically doesn’t matter at all, not for this argument about Stable Diffusion or even in a deeper philosophical sense. It’s a silly question much like asking if a submarine can swim is a silly question.

The irony here being we're a few layers deep in a thread started as a critique on this kind of pointless anthropomorphism.


> I’m disappointed in all the anthropomorphizing

It's not anthropomorphizing; it's a description and an analogy.

Neural networks work similarly to a brain, and it's easier to describe them that way because, again, they were modelled that way.

It’s not a perfect analogy, but your offence would indicate you have a lack of understanding in communication, neural nets, or you’re trying to blow things out of proportion for some reason.


> you have a lack of understanding in communication, neural nets, or you’re trying to blow things out of proportion for some reason

Surely there's a more appropriate way to say that, and a more charitable reading of my comment. As a researcher in the field, I think it's safe to say I understand the models, and maybe I am overly sensitive at people jumping to wrong conclusions because I'm so tired of it.

The issue isn't the communication aspect of the analogy, it's the reasoning aspect. For example, people who understand these things say "these work like people" (a useful analogy) and then people who don't understand them say "well if they work like people then they should be legislated like people" (not useful reasoning because the assumption in the "if" was just an analogy). The game of telephone is the danger.

You can see lay people in this very thread taking the analogies literally and extrapolating based on literal interpretations of model-brain analogies.


It’s a fantastic critique because these discussions immediately descend into some kind of debate over “what it means to learn”, which has nothing to do with copyright infringement or authorship.

Eg, Someone will naively state that something is or isn’t copyright infringement “because the tool learns like humans do”… which again, is not a question that a court would ask about the tool and since copyright is a legal invention it is kind of pointless to drift off into philosophical oblivion…


I wouldn't necessarily say it learns how a human would, but rather that it learns how an arbitrary brain would.

There's all kinds of brains in this world from all kinds of life. A dog can learn. A cat can learn. And they don't learn like a human would.


>>A dog can learn. A cat can learn. And they don't learn like a human would.

Yes, but they learn a LOT MORE like a human would vs these machine models. Cats & dogs share the same underlying structure, from the neuron/synapse/neurotransmitter system, up to the brainstem/cerebellum/midbrain/cerebrum architecture, as well as being inextricably integrated into a living body and sensory system, and growth pattern.

And, as you say, there are big differences in how we all learn. But those differences are utterly trivial compared to the differences between humans and ML.


Figure 2 doesn't fill me with confidence in the ability to detect similar images. The best example is the bottom-right match, which hits because the collar is in the same position and a bunch of white in the same place outweighs a lot of more meaningful data.

This probably means there are far more matches to be found that humans would consider clear copies. SSIM might be a bit heavy for the task, but a simple comparison of the gradients of neighboring pixels might match quite a lot more.
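Something like this, perhaps (purely a sketch; the normalization and the use of cosine similarity on gradient fields are my guesses at what the parent has in mind):

    import numpy as np

    def grad_field(img):
        """Vertical/horizontal finite differences of a (H, W) grayscale array."""
        gy, gx = np.gradient(img.astype(float))
        return np.stack([gx, gy])

    def grad_similarity(a, b):
        """Cosine similarity between the gradient fields of two same-size
        images. High values suggest matching edge structure even when
        absolute colors or brightness differ."""
        ga, gb = grad_field(a).ravel(), grad_field(b).ravel()
        denom = np.linalg.norm(ga) * np.linalg.norm(gb) + 1e-8
        return float(ga @ gb / denom)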


When I experiment with Stable Diffusion, I quite often come across blurred "Getty Images" labels.


If Getty Images has a problem with that, they should try having a working robots.txt. Instead they have one that's seemingly been copied from a porn site.

https://www.gettyimages.com/robots.txt


Unfortunately copyright law doesn't contain a "you didn't tell me not to take your images" affirmative defense.


What would a Getty "working" robots.txt look like? For their images to not be permitted to be indexed? Being able to find their images is the entire point of the service. As is the case with almost every content creator/platform ever.

Signaling that you allow indexing is not an opt-in for any use or abuse.


Ah yes, because robots.txt is definitely a valid deterrent for preventing people from stealing your images and using them for their own purpose!


These image sites might have a strong case. AI art can be considered a competitor and their copyrighted material was scraped and used to harm their business.


A watermark is the other good way around it, and it looks like that is working.


This could be an attempt to keep logs and scrapes clear of explicit material, especially if Getty offers upload features to paying customers (I don’t know, but I did run a whiteboard site where one of the features allowed people to upload images, and saw first hand what some people do.) This doesn’t look anything like porn hub’s robots.txt…


Do these image sets not contain images found only via robots.txt-excluded URLs?


LAION respects robots.txt exclusions. Just think of it as Google Image Search.


don't worry, all the VC founders will exit with their startup funds long before the lawsuits come in!


What if the prompt explicitly asks for an image from an image bank? For an image with a particular kind of watermark? I wonder if the dataset was marked up for that.

Also interesting: a prompt asking for a well-known image, without a watermark.


I don't understand what is so surprising here. Training the model consists of adding noise to training samples and learning to denoise the resulting noisy samples back into training samples. If you have one training sample, you can find the optimized random sample that reproduces the training sample.


An alternative framing of the article is that reproduction of training examples is /rare/. It works for highly replicated images and, in larger models trained for longer periods of time, really unusual training examples. Both of these failure modes have pretty obvious mitigations: deduplicate the training data, and trim the outliers.


After adding noise to a training sample, there's still some information about the original contained in the noisy image, so the model only needs to fix it up a little with the help of the prompt. But when you start from 100% noise, all of the information needs to come from the model and the prompt.

Seen from another angle, if you take a random value for each pixel, you're unlikely to generate anything resembling a picture, let alone any given training picture you're trying to reproduce. That there's an input that makes the model output a training sample doesn't necessarily mean that it's easy to find. But the paper shows that you can find several by guessing randomly.
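Concretely, the closed-form DDPM forward process makes the "information remaining" point easy to see. A minimal sketch (the linear schedule below is a common choice, assumed rather than taken from any specific model):

    import numpy as np

    T = 1000
    betas = np.linspace(1e-4, 0.02, T)       # common linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)

    def noisy_sample(x0, t, rng=np.random.default_rng()):
        """q(x_t | x_0): scaled original plus Gaussian noise."""
        eps = rng.standard_normal(x0.shape)
        return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

    # Early steps keep almost all of x0; by the last step alpha_bar is
    # nearly zero, so essentially no signal from x0 remains and generation
    # from pure noise must recover everything from the model (and prompt).
    print(alpha_bar[0], alpha_bar[-1])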


Research isn't always about finding surprising things. Some of the most important research ever conducted is proving things that seem intuitive, especially when it proves the opposite.


So some images were overrepresented in the dataset, and subsequently, the network overfitted. Known problem, known solution.


The problem and solution are independent of the implications of this finding and how those implications are likely to influence the legal landscape.

To me, the implication is that these models cannot be seamlessly exchanged for a human brain when considering their impact and compatibility with current laws.


Human brains can produce near duplicates of things they have seen in the past.

Of course, like a human, an NN can be used to generate copyright-infringing images, but it doesn't follow that any generated image is infringing.


> Human brains can produce near duplicates of things they have seen in the past.

Most humans cannot, and this is important, because the fact that most humans cannot arguably played a role in the formulation of all current rules.

So can cameras. And there are laws that restrict where cameras can be pointed in places that do not restrict human sight, because the two kinds of "seeing" are distinctly different.

Layering AI into the mix doesn't suddenly remove or mitigate the technical realities of the software.

> but it doesn't follow that any generated image is infringing.

I agree, and this is not my claim.


Laws very rarely restrict where cameras can be pointed, and most of those laws focus on commercial redistribution. They almost never stop individuals from doing so.

Furthermore the laws tend to be a complete wreck of logical paradoxes that fall apart in situations like this, hence forcing more case law to be generated.


I think the reason those restrictions exist is more important here than the rarity of the restriction. I argue this because I think the magnitude of the implications of unrestricted ingestion of public and private data is similar to the magnitude of the implications of unrestricted camera use without limit, i.e. as good a candidate for a restriction as those camera use cases currently restricted by law.

The reason you cannot record what's going on in a bathroom has as much to do with the implications of the recording as it does with the implications of being observed in the first place.

I don't disagree about the ensuing mess of laws, but I don't think we have a reason to believe Stable Diffusion will be spared from it.


Why?

Are you saying humans are incapable of accidentally reproducing previously seen work?


Many of the core arguments that claim Stable Diffusion training is exercising fair use do so by claiming that this computer program is similar enough to a human brain to qualify it for the protections described by copyright law, and they base that claim of similarity on the idea that the software doesn't copy, it just learns, and that all output is fundamentally new.

I'm arguing that a finding like this harms such core arguments, and highlights just one of many ways these models are entirely unlike humans.

> Are you saying humans are incapable of accidentally reproducing previously seen work?

To conclude that would be a propositional fallacy, and possibly an equivocation fallacy, IMO.

While I acknowledge that it's possible someone might "accidentally remember" someone else's work and then create a piece that is very similar to it, this hardly seems likely as a general case, and such a possibility was baked into the current rules as written. To take this further and claim that a human could do so with precision and in a reproducible manner seems questionable.

The Stable Diffusion equivalent of this kind of "remembering" and resulting duplication is again different in context and contents than a human exposed to the same images.

The fact that it can be replicated systematically is the most distinctly non-human part, and is a strong hint that we're comparing very different things.


> I'm arguing that a finding like this harms such core arguments

Why? If a really good artist studies a single piece long enough, he could be able to reproduce it to a degree where it takes expert analysis to determine which is the original. It's not as if there have never been forgeries of expensive artworks.

The difference between a human studying a certain piece intensely, and a model overfitting to it, is that to the model, it happens by accident. Overfitting to the training set is not a desired outcome, it's something ML techniques are trying to actively avoid.


> Why? If a really good artist studies a single piece long enough, he could be able reproduce it

Part of the answer is in your question. A really good artist is rare, and forgery is a specialized skill.

The point is not that humans are capable of forgery, but that the model/software is capable of it with minimal effort, and enables anyone regardless of skill to achieve similar outcomes.

Setting aside the resulting privacy issues, this has major implications. Two are:

1. The risk of forgery previously accepted by a copyright framework that assumed human actors has drastically changed.

2. The claim that Stable Diffusion only produces new, transformative works and never verbatim copies was at the heart of the argument that SD model training is fair use, and regardless of why, we now know this is not true.

> The difference between a human studying a certain piece intensely, and a model overfitting to it, is that to the model, it happens by accident.

That is one difference in the factors surrounding the initial creation of the output.

The differences continue to stack up as you examine the process of learning, the algorithms that produce new output, the computational context (both hardware and software), the agency of the entity creating the output, etc.

The software doesn’t do anything by accident. It does exactly what it has been instructed to do.

The humans training the model are at the core of any accident. This might seem obvious, but I think it bears restating because it highlights the nature of the situation. The model is not an independent actor.

Whether or not it’s a desired outcome is not relevant. The fact that it is possible - and that this possibility has further implications about the nature of the software itself - is what I argue hurts the standard arguments claiming Stable Diffusion is not infringing by virtue of its similarity to human processes of thinking and expression.


> A really good artist is rare, and forgery is a specialized skill.

People who could machine very small cogwheels precisely enough used to be rare, which is, as far as I know, part of the reason Charles Babbage never completed the Analytical Engine (prohibitive costs for the parts). Today we mass produce tiny cogwheels and other mechanical parts due to automation.

> The claim that Stable Diffusion only produces transformed output, never the original images themselves, was at the heart of the argument that says SD model training is fair use, and regardless of why, we now know this is not true.

Brushes, canvas and pigments can be used to produce non-derivative works as well. So can pencils, or photocopiers. There are thousands of tools that could be used to infringe copyrights or make forgeries of other people's works. We don't blame the car when bank robbers use it as a getaway vehicle.


> People who could machine very small cogwheels precisely enough used to be rare

I don't think an analogy that compares the manufacture of machinery with the creation of artwork is a good one. I understand the similarity you're teasing at, but I think this is a category error.

> Brushes, canvas and pigments can be used to produce non-derivative works as well.

None of these tools require the original artwork of other artists to function. They are primitive tools, and I don't think it's reasonable to include them in the same category as an AI that was explicitly trained for a particular purpose.

> We don't blame the car when bank robbers use it as a getaway vehicle.

We might blame the car manufacturer though if the car was autonomous and had been trained on a dataset of all drivers and driving styles, and had "learned" to be a getaway car, replete with situational awareness of a typical robbery scene, knowledge of how to avoid cops, etc.


> I don't think an analogy that compares the manufacture of machinery with the creation of artwork is a good one.

Why not? The analogy just shows that there are many prior examples of skills that only comparatively few humans could do, and/or that humans couldn't do consistently, cheaply, quickly, or at scale, until suddenly the tasks were automated. My point from this is: if this happened in the past, regularly, why would it suddenly be a problem now?

> None of these tools require the original artwork of other artists to function

They do if someone deliberately wants to infringe copyright and/or forge someone else's work, because without an example, how would they do that?

> an AI that was explicitly trained for a particular purpose.

That purpose being "translate an input prompt into a consistent image".

Yes, people can use them to do bad things, same as they can use cars as getaway vehicles. Neither is the fault of the tool, but the fault of people using them for nefarious purposes.

> We might blame the car manufacturer though if the car was autonomous and had been trained on a dataset of all drivers and driving styles, and had "learned" to be a getaway car, replete with situational awareness of a typical robbery scene, knowledge of how to avoid cops, etc.

Why would we wait to assign blame until the car is autonomous? Why don't we start with cars that have, e.g., strong engines? I'd guess a strong engine is a desirable property for a getaway vehicle. So, why don't we blame strong engines?

For the same reason it doesn't make sense to blame generative AI for people using it for copyright infringement: strong engines have a ton of legitimate, useful properties. The fact that some people use them for bad things doesn't outweigh their usefulness to the many more people using them for good things.

The hypothetical car that was trained on tons of extreme driving techniques might also use that training to avoid a crash in an emergency situation, or to detect dangerous behaviour of other vehicles sooner.

Again, we don't blame the tools, we blame the people using them for bad things.


Most humans are not capable of exactly reproducing previously seen work even on purpose. Every human artist has a unique style that shows even when they try to imitate. That's why perfect forgery is a form of art in itself :)


Correct


The thread specifically says that deduplication (based on their own experiments with smaller models) helps but is not sufficient to prevent this.
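
For intuition, here's a minimal sketch of what near-duplicate detection can look like, assuming the third-party Pillow and imagehash packages; the paper's actual deduplication pipeline is more sophisticated than this.

    # Minimal sketch of near-duplicate detection via perceptual hashing.
    # Assumes the third-party Pillow and imagehash packages; the paper's
    # actual deduplication pipeline is more sophisticated than this.
    from PIL import Image
    import imagehash

    def near_duplicate_pairs(paths, max_distance=4):
        # phash is robust to resizing/re-encoding; subtracting two
        # hashes returns the Hamming distance between them.
        hashes = [(p, imagehash.phash(Image.open(p))) for p in paths]
        pairs = []
        for i, (p1, h1) in enumerate(hashes):
            for p2, h2 in hashes[i + 1:]:
                if h1 - h2 <= max_distance:
                    pairs.append((p1, p2))
        return pairs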


Some interesting commentary by AI expert Alex J. Champandard: https://twitter.com/alexjc/status/1620466058565132288


I am highly sceptical of the "expertise" of the "expert" you cite, speaking as a current Ph.D. student in a relevant field. The "hot take" also seems highly technically inaccurate, presenting the model as a "lossy database" when it's very clear that the vast majority of the images (well over 90 percent) could not be reproduced. In what way is this a "lossy db"? That, combined with some snark and sound bites, thoroughly turned me off from continuing to read the thread.

I am generally concerned, these days, because it seems like the days when I could just keep my head down and do science are over, and now I also have to defend myself against opportunist hype-masters who were probably jumping on the crypto bandwagon six months ago and are now self-proclaimed AI experts.


> I am highly sceptical of the "expertise" of the "expert" you cite, speaking as a current Ph.D. student in a relevant field.

Will you expand on why? This is a big accusation to just throw out without exposition, your disagreement with the notion of a "lossy db" notwithstanding.

You seem to be projecting quite a lot with this comment, re: snark.

A few moments searching quickly show Alex isn't some crypto bandwagoner, and the implication that he is is distinctly unhelpful when trying to judge the value of each participant's comments in this conversation, here and on Twitter.


See also another previous study that has Stable Diffusion (SD) emitting images from its training set [0].

It is now clear that SD is treading on thin ice: training on watermarked and copyrighted images without their authors' permission, then attempting to commercialize the result even when the model emits images bearing a high similarity to the original training data, including watermarks and copyrighted material (Mickey Mouse, Getty Images watermarks, the Bloodborne cover art, etc.).

This weakens their fair use argument, especially with Getty Images also threatening to sue SD for the same reason. If OpenAI was able to get permission to train on Shutterstock images [1], then SD could have done the same, but chose not to.

Perhaps SD thought they could get away with it and launch their grift (DreamStudio) on the backs of digital images and artists. SD has since created an opt-out system, and artists can already find out whether their images are in the training set [2].

[0] https://arxiv.org/pdf/2212.03860.pdf

[1] https://www.prnewswire.com/news-releases/shutterstock-partne...

[2] https://haveibeentrained.com/


Note that the authors of this study believe that deduplication should be effective with SD, where there is a large ratio of dataset size to model size. (See section 7.1.)


There would have to be a new law, because Stable Diffusion is a type of search-space algorithm. Previous rulings said watermarked images are fair use for search engines.


Couldn't a lawsuit 'simply' declare that what Stable Diffusion is doing is not fair use, and thus falls foul of existing laws?


Of course it could.


Hmmm, and why should an artist have to keep track of any and all companies that may be scraping their online portfolio (which any professional or even semi-professional artist is expected to have) to opt out, just in case their work has been used? The burden should be on the company profiting from them.


I wonder whether the "data dimension" from https://transformer-circuits.pub/2023/toy-double-descent/ind... could be used to identify the model parameters involved in memorization and remove them without having to retrain from scratch on a cleaned-up dataset.


I expect to see this paper in many lawsuits soon as evidence of copyright infringement.


Already there are hundreds of 'fine-tuned' or merged models, made from Stable Diffusion base models with easy-to-use inference and training tools like this one [2].

I wonder whether extraction attacks are easier if you have many ancestral models?

[2] https://github.com/AUTOMATIC1111/stable-diffusion-webui#stab...


In case you’d rather not suffer Twitter’s abysmal UI in a mobile web browser: https://nitter.net/Eric_Wallace_/status/1620449934863642624


Ironically, this almost makes it more human.

It's a surprisingly common experience for music students to excitedly tell everyone they know about a new piece of music they've been composing, saying it is probably the best thing they've ever written, and then a friend or teacher has to say, "I don't know how to break it to you, but you've 'composed' the Xth movement of Beethoven's Yth symphony."

And sometimes they will say, "I have? I don't think I've ever heard Beethoven's Yth symphony." But of course they have, just without realizing it. It was in the background of some movie they watched or something like that.

Unlike humans, AIs presumably have no beliefs about whether their work is original or not, but it's the same type of error. And with similar legal consequences: people have been sued for stealing a melody (presumably not always consciously). The difference with AIs is that they can produce much more output than humans, and it's muddier what is actually doing the creating (AI authors? users?).


>Ironically, this almost makes it more human.

Not really, no. It's more the case that a NN is an encoding of a large dataset space, and if you go far enough away from the average you may find yourself on an island with an individual outlier. This is why you simply get back the training example: it's regurgitating what it was fed.

For example, try to train a model on a single image: with a data space no wider than a single point, all you get back is that point.
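
A toy illustration of that collapse (my own sketch, not from the paper): minimize a loss against a single training point, and gradient descent drives the model's output to exactly that point, no matter what you later ask of it.

    # Toy sketch (mine, not the paper's): "training" on a single example.
    # With one target, the loss is minimized by memorizing it outright.
    import numpy as np

    target = np.array([2.0, 1.0])   # the lone "training image"
    w = np.zeros(2)                 # the model's output parameters

    for _ in range(500):
        w -= 0.1 * 2 * (w - target)  # gradient step on squared error

    print(w)  # ~[2. 1.]: the model can only give back its training point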

A human being re-inventing without remembering the source is distinct from remembering the source and returning it when lacking other data.


Looks like it was funded by DeepMind for the purpose of fighting more open models in the legal arena. I don't think they are just "protecting the artists".


What will happen if SD loses the court case? The cat is out of the bag and the data set can be downloaded by anyone today.


It will prevent VCs from investing in a similar company, and without VCs we won't have enough money to do massive training like SD. The community will keep fine-tuning the original SD, sure, but there won't be SD 3.0.

Of course, then a company in China (or another country that is rich enough and doesn't care much about U.S. lawsuits) will make a better SD 3.0.


AI companies will have to sort through their data sets to find copyright-free, CC0, or public-domain images to train their algorithms on, or buy the necessary rights for commercially produced works.

If the neural network tech itself is good enough, this shouldn't be too much of a problem. Finding these images and verifying their copyright status is going to take more work.

Most likely, AI giants will buy digital art websites like DeviantArt or Flickr and use the images already collected after changing around the terms a bunch of times.


A lot of new paid software uses SD under the hood. Such a ruling would likely make such apps exist in a very nebulous space.

To be honest, I already feel quite strange about folks profiting off of SD results.


Nothing, no one can stop this now. It's like horses being angry at cars.


Cars don't have tiny horses inside them making them work, so the analogy doesn't apply. These AI models can't do anything without first gorging on existing content.

A lot of that existing content is copyrighted, and demonstrably present in the resulting product; so these companies have an uphill battle _not_ to be stopped by the courts.


> A lot of that existing content is copyrighted, and demonstrably present in the resulting product; so these companies have an uphill battle _not_ to be stopped by the courts.

These lawsuits will go exactly like the war against piracy: impossible to win. Cut one head off, ten more appear.


Piracy lawsuits won constantly, enough to push all piracy to the dark web. They even sued little old grandmas and won. No legitimate business today is built on top of piracy.

I'm not arguing that generative art will cease to exist; I'm just arguing that the path to legal revenue will involve paying copyright holders, and the real gold in this industry is in licensing training datasets to businesses that just want 'AI Magic' in their product marketing.


But they did win the war on piracy. They're still making tons of money, the piracy community is the same size or smaller than it was 10 years ago, and most of the people inside it are either committed cheapskates who would never have been customers, or committed collectors who while pirating are also buying twice as much as their neighbors.


This is really bad news for the community, especially in the context of the Copilot lawsuit. Soon lawyers will terrorize network creators, startups and users.


Is the bad news that they were right? Because that's only bad news if you didn't like the truth.


That's like saying burglary laws are bad for the burglary community, and soon the lawyers will terrorise all the people just breaking into homes and stealing people's belongings.


Bad news for big tech conglomerates that want to take the work of individuals and sell it back to them.

It's good news for the artists and programmers whose work was being copied, though.


People profiting from techniques meant to infringe copyright will be prosecuted for copyright infringement; "bad" news at 11.


If you copyrighted the point (2, 1), then is the equation y=3x-5 infringing on that copyright?


No, it is not: the latent vector and/or weights of the model do not infringe copyright. However, people don't typically consume y=3x-5; they consume the output points (images). If you output (2, 1), you have infringed the holder's copyright (even though the means of production could include any arbitrary set of functions).

Let me put it another way. Suppose you authored and copyrighted two images. I then train a model that can interpolate between those images using a "latent space" from 0 to 1.0. For most of that space, the image might look totally different, or even like noise. But if I set the value to 1.0, generate your exact image (pixel-for-pixel), and then sell it, would you say "well, since the model could have generated different images, the fact that it generated my image isn't infringement"?

This is about like saying that taking a digital photograph and then selling the photograph isn't infringement because the pixels are slightly different, and the digital camera is merely a model that approximates the true colors of the atoms by sampling photons, which it maps onto a regular grid to approximate colors, further down-sampled by compression algorithms. Truly, the digital photo and the real-life image can't be considered the same, can they? The digital approach only reproduces it using "heuristics" and "algorithms", and could theoretically reproduce any image given a slightly different set of input bytes to the decompression, rendering, and printing algorithms.
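
To make the interpolation argument concrete, a minimal sketch (an illustration of the hypothetical above, not how Stable Diffusion actually works): a "generator" that linearly blends two stored images looks novel over most of its latent range, yet returns image B pixel-for-pixel at t = 1.0.

    # Minimal sketch of the interpolation argument (an illustration, not
    # how Stable Diffusion works): a "generator" blending two images.
    import numpy as np

    image_a = np.random.rand(64, 64, 3)  # stand-ins for two
    image_b = np.random.rand(64, 64, 3)  # copyrighted images

    def generate(t):
        # Map a one-dimensional "latent" t in [0, 1] to an image; most
        # values of t produce a blend resembling neither original.
        return (1.0 - t) * image_a + t * image_b

    # At the edge of the latent space, the output is an exact copy.
    assert np.array_equal(generate(1.0), image_b)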


What if I have the model on my computer and I generate (2,1)? Do I still infringe?

After all, that's exactly what my computer does with images it downloaded from the internet, which I later open from the local drive to view again.


If you have a legal right to view it on your computer, then no. However, downloading a movie illegally and watching it (whether it was compressed or reproduced by AI) is not legal. By whatever means of delivery, if you view the movie, you are in legal violation unless you bought it. Similarly, you can view a piece of art (if it's available), but selling it would be illegal, regardless of whether you sell it by encoding the bits a certain way or by providing a "generator function" that creates the art for you, which you then sell.

Take another example: if you try to sell or publicly play a Taylor Swift song without permission, that is illegal (even though it was technically generated by your computer and you used an arbitrary process to create the reproduction); it's owned by Taylor Swift. It's long been the case that the song can be reproduced by dozens of different mediums, storage procedures, algorithms, sound generators, and mechanisms, but ultimately it is still the creative property of Swift.


Yes, if the law considered the point (2, 1) to be copyrightable, and no otherwise.

But more seriously: I think the question is perhaps a bit vague.


That's insane. Then just by copyrighting a single point you make an infinite number of math functions infringe on your copyright.


That’s why a point isn’t copyrightable. Insanity avoided. Pretty simple, wasn’t it?


Nothing is avoided. The universe is just math. Copyright will eventually collide with mathematical reality; software is what brings math closer and closer to copyright, and it will eventually destroy it completely. AI is just the most recent wave of this process.


This border between universal things and copyrightable works has always existed. Some of these technologies might superficially seem to create some sort of crisis in the area (GitHub Copilot, for example), but I think that’s mostly because we haven’t yet seen many businesses, and thus not many court cases.

My amateur legal guess is that it changes nothing: copyright and other protections are unscathed. If you use a copyrighted work or registered trademark (or something close enough to be infringing) you are liable - and then it doesn’t matter where it came from or whether you believed that MagicalAIService would spit out royalty-free original works. They probably won’t, and they will never accept liability; it will always be on the end user.


To be honest, it's a nice trick. Now let me do it with the output of artificial neural networks, which are also points in n-dimensional space.


Person enters "Ann Graham Lotz", image of Ann Graham Lotz appears. Why does this upset people and google image search doesn't


Because Google image search "cites" the source. That's a big difference.

Of course Google image search can show the wrong source, but there's still a big difference between "absolutely nothing" and "a potentially wrong thing".


Probably absolute madness, but it would be cool if these generative tools could cite the training images behind a prompt, and maybe artists could be verified and receive a micropayment when their image is used as an input.


That seems generally antithetical to the goal of the folks making these, which is "don't pay a human for what a computer can do instead".


This is how you unlock productivity and drive human flourishing.


The thing is that folks who want this to be a thing would really prefer that folks like me go along with them in conflating productivity with industrial productization. I find that to be a distasteful kind of hide-the-ball. And yanno, maybe it's just me, but attempting to railroad humans out of the visual arts via economies of scale has no bearing on "human flourishing".

From where I stand, it's a cheap-suit excuse to bleed us of culture and art.


Agreed, and I would add that without human artists putting in the time, effort, and skill to create art, these 'AIs' would have nothing to feed off of. By destroying any living an artist could make, reproducing shopping-list amalgamations for cheap, we are disintegrating the last small platforms for living artists. We're disrespecting the dedication and skill it takes to create art by pretending that a computer can create art as humans do.


Not really. There’s a difference between “don’t pay the human responsible for making this work despite their objections” and “don’t pay a human because we don’t need them.” The former is how these models came about.


If you don't want people seeing your art and learning from it, then why would you post it on the internet in the first place?


Are you saying that they intended for it to be used this way? Because it's straightforward to see whether this outcome matches their intent. Just look at all the uproar in the artist communities that were scraped.


I don't see why intent of the artist matters.

As a human, I have a natural right to take inspiration from content, use facts I learn from books, concepts from art, jokes from movies, etc.

Don't see why using a machine to augment/extend this capability is any different


"these" generative tools can't do that. The transformer-based ML algorithm just can't do that.

Of course maybe in the future we'll have a better method, but it will be a very different thing from what we have today.


Because people here and on other tech sites like to pretend that generative models somehow cannot and do not memorize things, and that therefore pesky things like copyright and attribution should not apply to them.

Compression is prediction.


The way the author summarizes his own study in this thread borders on misinformation. You could actually take their findings and write the opposite headline, which would more accurately reflect their actual research results:

"Critics claim that models such as Stable Diffusion act like modern collage tools, recreating copyrighted and sensitive material.

Yet, our new paper shows that this behaviour is exceedingly rare, recreating copies in less than 0.00006% of 175M test cases."
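
(Back-of-the-envelope check, using the figures in the thread: roughly 100 memorized images out of 175 million generations is 100 / 175,000,000 ≈ 0.00000057, or about 0.00006%.)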


Copyright holders screeching again? Maybe I should copyright a black image and sue anyone who turns off their screen.


Eigenimages


Anybody who knows what the pigeonhole principle is should know that a lot of these fears are complete bunk.


Does that really follow, though? It is absolutely true that there will be some inputs that cannot be losslessly compressed. But that hardly seems relevant here:

* Being unable to recover many or most images does not imply being unable to recover any at all. Being able to recover a nonzero proportion is still a problem, even if the proportion is small.

* The existence of artifacts in the recovered images is not in itself sufficient to prevent legal claims, and yet tolerating artifacts greatly increases the proportion of inputs that can be considered recovered.


Most of the reaction you see to this is to the tune of "The AI people lied to us, the model DOES remember all of the images you train it on, here's proof in that you can get the images back!".

When in reality, the model absolutely cannot store enough data to reproduce every image it's trained on, because it isn't big enough to hold more than a tiny fraction of its training set.
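
(Back-of-the-envelope, using commonly cited figures: roughly 2 GB of weights spread across a couple of billion training images comes to about one byte of model capacity per image, nowhere near enough to store them.)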


> Most of the reaction you see to this is to the tune of "The AI people lied to us, the model DOES remember all of the images you train it on, here's proof in that you can get the images back!".

You make a claim that seems trivially false. Cite some example comments to back it up.


Some. Some images back. I've not seen anyone claim anything like what you suggest. Do you have an example?

I mean, it literally says "some" right there in the title, and if you click through and actually read the paper you see it's a tiny proportion.



That article, much like the OP, is about someone who found that a few specific images were recoverable. I see no claim there that all images are. Am I missing something?


Popular artists are being requested, and the AI is turning out their work. It suffices that you are well known enough for people to request you and that your style is distinctive; then your work will likely turn up. That ups the odds quite a bit.


I know what the pigeonhole principle is. Could you elaborate on how it shows which fears are bunk, and why?


The biggest fears are of the form "AI is theft; the model just remembers what you give it. The proof is that you can reproduce what it was given." When in reality, a model that is smaller than its training data literally cannot reproduce all of the images it was trained on.


The argument isn't that it can reproduce ALL images, but that it might occasionally reproduce one image. And if that is devastating to e.g. the economic feasibility of a product built on the model - then that's an existential threat to that business model.

To make a really dumb example:

I can make a black-box model that I push 1000 bestseller novels through, with a total "storage size" of 1 novel. The pigeonhole principle says my model can't possibly contain all 1000 novels. If you ask it (anything), it will respond with that one novel, verbatim. It never reproduces anything else.

Does it now matter whether I trained it on 1000 novels? Does it matter that my "black box" is just that, a literal black cardboard box containing just a copy of a novel?
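
The thought experiment is easy to make literal (a sketch of the hypothetical above, nothing more):

    # The black-box thought experiment made literal (just a sketch of the
    # hypothetical above). "Trained" on many novels, with the capacity of
    # one, it satisfies the pigeonhole principle and still plagiarizes.
    class BlackBox:
        def __init__(self, training_set):
            self.weights = training_set[0]  # room for exactly one novel

        def generate(self, prompt):
            return self.weights             # verbatim, for any prompt

    model = BlackBox(["full text of novel 1...", "full text of novel 2..."])
    print(model.generate("write me something original"))  # novel 1, verbatim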


Actually a really great example. Wouldn't be surprised if some enthusiastic future corporate-friendly circuit court ruling accidentally made this a legal copyright-removing machine.



