Hacker News

If it was about this, why do OpenAI and Anthropic lose their minds when people train off their output or try to scrape their systems?

I actually don't have an issue with training off the mass of everyone's work if the models are open and free to build upon. It's locking them away and then throwing your toys out of the pram when people try to do the same thing that bothers me.



Good question. I actually have a technical answer, believe it or not.

Pre-training is: training a model from scratch on cheap data that sets the foundation of a model's capabilities. It produces a base model.

Post-training is: training a base model further, using expensive specialized data, direct human input and elaborate, compute-heavy methods to refine the model's behavior, and imbue it with the capabilities that pre-training alone has failed to teach it. It produces the model that's actually deployed.

When people perform distillation attacks, they take an existing base model and try to post-train it using the outputs of a different, proprietary model.

They're not aiming to imitate the cheap bulk pre-training data - they're aiming to imitate the expensive in-house post-training steps. Ones that the frontier labs have spent a lot of AI-specialized data, compute, labor and hours of R&D work on.
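To make that concrete, here's a minimal sketch of what the data collection side of distillation looks like. Everything in it is an assumption for illustration: `teacher_model` is a hypothetical stand-in for a proprietary model's API (not a real endpoint), and the prompt/completion JSONL layout is just one common way to store fine-tuning examples.

```python
import json


def teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for a proprietary model; a real setup would call an API."""
    return f"Step-by-step answer to: {prompt}"


def collect_distillation_data(prompts, teacher):
    """Query the teacher once per prompt and keep (prompt, completion) pairs."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]


def to_jsonl(examples):
    """Serialize examples one JSON object per line, ready for a fine-tuning job."""
    return "\n".join(json.dumps(e) for e in examples)


prompts = ["Explain binary search.", "Prove that sqrt(2) is irrational."]
dataset = collect_distillation_data(prompts, teacher_model)
print(to_jsonl(dataset))
```

The resulting file would then be used to post-train a separate base model. The point is that the student never sees the teacher's post-training data, only the behavior that data produced.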

This is probably not "fair use", because it directly tries to take and replicate a frontier lab's competitive edge, but that hasn't been tested in court. And a lot of the companies caught doing that for their own commercial models are in China. So the path to legal recourse is shaky at best. But what's on the table is restricting access to the full chain of thought, and banning the suspected distillation attackers from the inference API. Which is a bit like trying to stop a sieve from leaking - but it may slow the competitors down at least.


>Ones that the frontier labs have spent a lot of AI-specialized data, compute, labor and hours of R&D work on.

Granted, that's time and money, but it's an absolutely minuscule amount of human hours compared to the scraped data.

We know this for a fact because of parallelization: it's the work of hundreds of millions of people versus the work of 20-100. Even if OpenAI's team worked for the entire lifetimes of the current team, and the lifetimes of their offspring, and the lifetimes of their offspring's offspring, they still wouldn't have made a dent in recreating that initial scraped training data.
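A rough back-of-envelope version of that comparison, as a sketch only. Every number here is an illustrative assumption, not measured data: I'm picking 200 million contributors, one hour of writing each, and a 40-year career at 2,000 hours a year for a team of 100.

```python
# All figures below are illustrative assumptions, not measured data.
CONTRIBUTORS = 200_000_000   # "hundreds of millions" of people; 200M is an assumed value
HOURS_PER_CONTRIBUTOR = 1    # assumed average time each person spent writing
TEAM_SIZE = 100              # upper end of the 20-100 range
CAREER_YEARS = 40            # assumed length of a working lifetime
HOURS_PER_YEAR = 2_000       # assumed full-time working hours per year

crowd_hours = CONTRIBUTORS * HOURS_PER_CONTRIBUTOR
team_hours_per_lifetime = TEAM_SIZE * CAREER_YEARS * HOURS_PER_YEAR
lifetimes_needed = crowd_hours / team_hours_per_lifetime

print(f"crowd hours: {crowd_hours:,}")
print(f"team hours per working lifetime: {team_hours_per_lifetime:,}")
print(f"working lifetimes needed: {lifetimes_needed:.1f}")
```

Under those assumptions the team would need about 25 full working lifetimes to match the crowd's hours, and that's with a deliberately low one hour per contributor.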


This is like trying to apply "labor theory of value" to datasets. It doesn't work any better there than it does in economics in general.

It doesn't matter how many human hours went into making a Twitter shitpost. What matters is: how much value does it add to a pre-training run, and how easy is it to substitute another data source for it.

"Cheap data" has low training value and is easy to replace. Twitter shitposts are worthless except in aggregate. "Expensive data" is what has high training value and is hard to replace. Things like SFT traces, domain expert RLHF guidance, RLVR bits - that's what the "moat" is.



