Hacker News

I don't find this very compelling. If you look at the actual graph they reference but never show [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they only count PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT-5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the trend matches what you would expect from a logistic model: improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...
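To make the logistic-model point concrete, here is a toy sketch (all parameters are hypothetical, not fitted to any real benchmark): on a logistic curve, generation-over-generation gains shrink after the midpoint but never drop to zero.

```python
import math

# Toy logistic capability curve (hypothetical parameters, not fitted
# to any real benchmark): f(t) = L / (1 + exp(-k * (t - t0)))
def logistic(t, L=1.0, k=1.2, t0=3.0):
    return L / (1.0 + math.exp(-k * (t - t0)))

# Past the midpoint t0, each new "generation" still improves,
# but by less than the one before: slowed down, not stopped.
for t in range(7):
    gain = logistic(t + 1) - logistic(t)
    print(f"gen {t}: score {logistic(t):.3f}, next-gen gain {gain:+.3f}")
```

Every gain in the printout is positive, and each one after the midpoint is smaller than the last, which is exactly "slowed down, but not stopped."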




Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
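As a toy illustration of that breakdown (per-step accuracy being the hypothetical smaller component here): if a task needs, say, 50 consecutive error-free steps, the all-or-nothing pass rate stays near zero while per-step accuracy climbs, then jumps once errors are nearly eliminated.

```python
# Toy model (hypothetical numbers): a task with 50 independent steps,
# each succeeding with probability p, passes end-to-end at rate p**50.
steps = 50
for p in [0.90, 0.95, 0.99, 0.995, 0.999]:
    print(f"per-step accuracy {p:.3f} -> whole-task pass rate {p**steps:.3f}")
```

Per-step accuracy improving from 0.90 to 0.95 barely moves the headline metric (about 0.005 to 0.077), while 0.99 to 0.999 sends it from roughly 0.605 to 0.951: the "sudden" jump is a smooth underlying improvement viewed through an all-or-nothing lens.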


  > until all potential sources of error are close to being eliminated
This is what PSP/TSP did: one has to continually review one's own work to identify the most frequent sources of (user-facing) defects.

  >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
This is also one of the tenets of PSP/TSP. If you have a task estimated at longer than a day (8 hours), break it down.

This is fascinating. The LLM community is discovering PSP/TSP rules that were laid down more than twenty years ago.

What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

What I see is LLM users trying to harness LLMs against what they perceive as errors. It's not that the LLMs are learning; it's that the users of LLMs are trying to rein in these LLMs with prompts.


I don't know that it's fair to characterize the LLM community as ignorant and rediscovering PSP/TSP. I in fact see this as programmers rediscovering survival analysis, and most LLM folks I know have learned these perspectives from that lens. I could be wrong about PSP; maybe things are more nuanced. But what is there that isn't already covered by foundational statistics?

What is PSP/TSP?

Personal Software Process / Team Software Process: one of many ways people have branded the idea of process improvement for software engineering.

That's how the public perceive it though.

It's useless and never gets better, until it suddenly, unexpectedly gets good enough.


My robo-chauffeur kept crashing into different things until one day he didn't.

A robot vacuum is allowed to crash into things and is still quite useful. You add bumpers, maybe some sort of proximity sensors, to make the crashes less damaging. It is safe by construction: it can't harm humans because it is too small.

Things have improved a bit? Now robot shelves become a possibility. Map everything, use more sensors, restrict humans to a particular area. Still quite useful. It is safe by the design of the areas, where humans rarely walk among robots.

Improved further? Now we can do a food delivery robot. Slow down a bit, use many more sensors, think extra hard about how to make it safer. Add a flag on a flagpole. A rounded body. Collisions are probably going to happen, so make the robot lighter than humans, so that the robot takes more damage than the human in a collision. Humans are vulnerable to falling over, so make the robot's height just right to grab onto to regain balance, somewhere near waist height.

Something like that... Now I wish this were an actual progression a robotaxi company had to follow before releasing robotaxis onto our streets. But at least we do it as mankind: algorithm improvements and safety solutions still benefit the whole chain, and the benefit to humanity grows even while the technology is not quite good enough for one particular task.


I don't know; that graph, to me, shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code that is in a different style from my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.



