Hidden datasets can be replaced with model predictions collected from a public API, so in effect they can be "exfiltrated" from the trained model. And we have already maxed out the accessible online text and the good-quality sources.
What is going to make a difference is running models to generate more text for training, because relying on humans alone doesn't scale. For example, we could use LLMs to do brute-force problem solving and then fine-tune on the solutions.
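To make the brute-force idea concrete, here is a rough sketch of the generate-and-verify loop I have in mind. Everything in it is a made-up stand-in: `propose_solution` plays the role of sampling from an LLM (here it just guesses random integers), the toy task is factoring a number, and `collect_verified_solutions` is a hypothetical helper name. The only point is that an automatic checker lets you keep the correct attempts as fine-tuning data.

```python
import random


def propose_solution(problem, rng):
    """Hypothetical stand-in for sampling an LLM: guess a random pair of integers."""
    target = problem["target"]
    return rng.randint(1, target), rng.randint(1, target)


def verify(problem, candidate):
    """Automatic checker: accept only non-trivial factorizations of the target."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == problem["target"]


def collect_verified_solutions(problems, samples_per_problem=2000, seed=0):
    """Sample many candidates per problem and keep the first one that verifies."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = propose_solution(problem, rng)
            if verify(problem, candidate):
                # Store the verified attempt as a (prompt, completion) pair.
                dataset.append({
                    "prompt": f"Factor {problem['target']} into two integers greater than 1.",
                    "completion": f"{candidate[0]} * {candidate[1]}",
                })
                break  # one verified solution per problem is enough for this sketch
    return dataset


if __name__ == "__main__":
    # 13 is prime, so no candidate can ever pass the checker for it.
    toy_problems = [{"target": 15}, {"target": 13}, {"target": 21}]
    for row in collect_verified_solutions(toy_problems):
        print(row)
```

The resulting list of prompt/completion pairs is what you would feed to whatever fine-tuning setup you use. Problems the sampler never solves just drop out, so the data skews toward what the model can already reach with enough tries.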
AlphaZero is the shining example of a model trained on its own generated data and surpassing us at our own game. The self-generated data approach has the potential to reach superhuman levels of performance.
How about illegal datasets, like all the phone calls the NSA has been collecting domestically? Someone is going to train a private ChatGPT on that and run queries against it.