This works great for software, math, and games, where validation is cheap. But what about messy real-world tasks? I think hindsight learning from chat logs could fit the bill. What do I mean?
Imagine a long conversation. It is hard to judge immediately whether an AI response was useful, but if you know the following 20 messages, it might be easy to infer. Not only can you see how it went, but sometimes you also get real-world validation.
For example, a user comes to an LLM with a task, takes an idea, and tries it in reality. Later they return, maybe in a new chat session, and continue iterating. You get real-world testing of LLM responses through people.
This can be used to generate "preference scores" and train a preference model, with which you can do RLHF. Since training runs against the preference model rather than raw chat logs, user privacy is protected.
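The hindsight idea above can be sketched in code. This is a toy illustration, not any real pipeline: the `Turn` class, the keyword lists, and `hindsight_score` are all hypothetical names I made up, and a production system would use an LLM judge rather than keyword matching. The point is only the shape of the signal: score an assistant turn by what the user says in the messages that follow it.

```python
# Toy sketch of hindsight scoring: judge an assistant turn by the
# user feedback that appears in the next `window` messages.
# All names here are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

# Crude stand-ins for a learned feedback classifier.
POSITIVE = ("that worked", "thanks, it works", "this fixed it")
NEGATIVE = ("that didn't work", "still broken", "same error")

def hindsight_score(convo: list[Turn], i: int, window: int = 20) -> float:
    """Score assistant turn i by scanning up to `window` later messages
    for real-world feedback, discounting feedback that arrives later."""
    assert convo[i].role == "assistant"
    score, weight = 0.0, 1.0
    for turn in convo[i + 1 : i + 1 + window]:
        if turn.role != "user":
            continue
        lowered = turn.text.lower()
        if any(p in lowered for p in POSITIVE):
            score += weight
        if any(n in lowered for n in NEGATIVE):
            score -= weight
        weight *= 0.9  # later messages count for less
    return score
```

Scores like these, aggregated over many conversations, are what would feed the preference model; the raw transcripts never need to leave the scoring stage.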
I call this the human-AI experience flywheel. Of course, the larger the user base, the more experience the model collects. OpenAI currently has around 500M users, who probably generate on the order of 0.5T interactive tokens per day. Those tokens go both into human brains and into LLM logs.
It’s not about environment engineering anymore; it’s about consequence harvesting. Meaningful validation emerges from systems actually being used by humans for real purposes.
I worked through this for a tax company. They had a huge pile of artifacts from tax questions worked up for clients. We "reverse engineered" the process: the questions that would lead to each tax memo, and the research steps needed to find the sources and conclusions. It worked well, and we were able to replicate the process by which the SMEs created these memos.
For a given tax question, could you come up with the same memo, quoting the same sources and reaching the same conclusion?
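That replication question is itself a cheap validator, and can be sketched as a check against the original artifact. Everything below is hypothetical: the function names, the 0.8 overlap threshold, and the exact string comparison are placeholders (in practice, conclusion matching would use an LLM judge or embedding similarity, not string equality).

```python
# Toy sketch: validate a regenerated memo against the reference artifact
# by comparing cited sources and the final conclusion.

def sources_overlap(generated: set[str], reference: set[str]) -> float:
    """Fraction of the reference memo's sources that the generated memo cites."""
    if not reference:
        return 1.0
    return len(generated & reference) / len(reference)

def memo_matches(gen_sources, ref_sources, gen_conclusion, ref_conclusion,
                 min_overlap: float = 0.8) -> bool:
    # Exact lowercase comparison is a stand-in for a proper
    # semantic-equivalence check on the conclusion.
    return (sources_overlap(set(gen_sources), set(ref_sources)) >= min_overlap
            and gen_conclusion.strip().lower() == ref_conclusion.strip().lower())
```

Each historical memo then becomes a labeled test case: the model either reconstructs the same sources and conclusion from the question, or it doesn't.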