The article feels, I don't know… maybe like someone calmly sitting in a rocking chair staring at the sea. Then the camera turns, and there's an erupting volcano in the background.
> If it was a life or death decision, would you trust the model? Judgement, yes, but decision? No, they are not capable of making a decision, at least important ones.
A self-driving car with a vision-language-action model inside buzzes by.
> It still fails when it comes to spatial relations within text, because everything is understood in terms of relations and correspondences between tokens as values themselves, and apparent spatial position is not a stored value.
A large multimodal model listens to your request and produces a picture.
> They'll always need someone to take a look under the hood, figure out how their machine ticks. A strong, fearless individual, the spanner in the works, the eddy in the stream!
Language enables higher-level, longer-term planning and better generalization than purely vision-action models allow. Waymo uses a VLM in its Driver[1]. Tesla decided to add some kind of language model to its vision-action stack[2]. Nvidia's Alpamayo uses a VLA model[3].
Python is not a natural language. A natural language communicates things that are not here and now, which is essentially everything a language does naturally. Moreover, language can convey abstractions that exist only because of language itself.
Is that the hang-up? Are people so unimaginative that they can't see that none of this was here five years ago, and that this machine is now -- if still only in part -- assembling itself?
And the details involved in closing some of the rest of that loop do not seem THAT complicated.
You don't know how involved it was. I would imagine it helped debug some of the tools that they used to create it. Getting it to actually produce, end to end, a more capable model without any human help absolutely is that complicated.
> They'll always need someone to take a look under the hood, figure out how their machine ticks. A strong, fearless individual, the spanner in the works, the eddy in the stream!
GPT‑5.3‑Codex helps debug its own training.