Hacker News

I don't find this very compelling. If you look at the actual graph they reference but never show [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they only count PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT-5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the trend matches what you would expect from a logistic model: improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...
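To make the logistic-model point concrete, here is a toy sketch (all parameters are hypothetical, not fitted to any real benchmark): on a logistic curve, generation-over-generation gains shrink after the midpoint but never drop to zero.

```python
import math

# Toy logistic capability curve (hypothetical parameters, not fitted
# to any real benchmark): f(t) = L / (1 + exp(-k * (t - t0)))
def logistic(t, L=1.0, k=1.2, t0=3.0):
    return L / (1.0 + math.exp(-k * (t - t0)))

# Past the midpoint t0, each new "generation" still improves,
# but by less than the one before: slowed down, not stopped.
for t in range(7):
    gain = logistic(t + 1) - logistic(t)
    print(f"gen {t}: score {logistic(t):.3f}, next-gen gain {gain:+.3f}")
```

Every gain in the printout is positive, and each one after the midpoint is smaller than the last, which is exactly "slowed down, but not stopped."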




Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
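As a toy illustration of that breakdown (per-step accuracy being the hypothetical smaller component here): if a task needs, say, 50 consecutive error-free steps, the all-or-nothing pass rate stays near zero while per-step accuracy climbs, then jumps once errors are nearly eliminated.

```python
# Toy model (hypothetical numbers): a task with 50 independent steps,
# each succeeding with probability p, passes end-to-end at rate p**50.
steps = 50
for p in [0.90, 0.95, 0.99, 0.995, 0.999]:
    print(f"per-step accuracy {p:.3f} -> whole-task pass rate {p**steps:.3f}")
```

Per-step accuracy improving from 0.90 to 0.95 barely moves the headline metric (about 0.005 to 0.077), while 0.99 to 0.999 sends it from roughly 0.605 to 0.951: the "sudden" jump is a smooth underlying improvement viewed through an all-or-nothing lens.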


  > until all potential sources of error are close to being eliminated
This is what PSP/TSP did: one has to continually review one's own work to identify the most frequent sources of (user-facing) defects.

  >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
This is also one of the tenets of PSP/TSP. If you have a task estimated at longer than a day (8 hours), break it down.

This is fascinating. The LLM community is discovering PSP/TSP rules that were laid down more than twenty years ago.

What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

What I see is LLM users trying to harness LLMs against what they perceive as errors. It's not that the LLMs are learning; it's that the users of LLMs are trying to rein in these LLMs with prompts.


I don't know that it's fair to characterize the LLM community as ignorant and rediscovering PSP/TSP. I in fact see this as programmers rediscovering survival analysis, and most LLM folks I know have learned these perspectives from that lens. I could be wrong about PSP; maybe things are more nuanced. But what is there that isn't already covered by foundational statistics?

What is PSP/TSP?

Personal Software Process / Team Software Process: one of many ways people have branded the idea of process improvement for software engineering.

That's how the public perceive it though.

It's useless and never gets better, until it suddenly, unexpectedly gets good enough.


My robo-chauffeur kept crashing into different things until one day he didn't.

A robot vacuum is allowed to crash into things and is still quite useful. You add bumpers, maybe some sort of proximity sensors, to make the crashes less damaging. It is safe by construction: it can't harm humans because it is too small.

Things have improved a bit? Now robot shelves become a possibility. Map everything, use more sensors, restrict humans to a particular area. Still quite useful. It is safe by the design of the areas, where humans rarely walk among robots.

Improved further? Now we can do a food delivery robot. Slow down a bit, use many more sensors, think extra hard about how to make it safer. Add a flag on a flagpole. A rounded body. Collisions are probably going to happen, so make the robot lighter than humans, so that the robot takes more damage than the human in a collision. Humans are vulnerable to falling over, so make the robot's height just right to grab onto to regain balance, somewhere near waist height.

Something like that... Now I wish this were an actual progression a robotaxi company had to follow before releasing robotaxis onto our streets. But at least we do it as mankind: algorithm improvements and safety solutions still benefit the whole chain, and the benefit to humanity grows even while the technology is not quite good enough for one particular task.


I don't know; that graph, to me, shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code that is in a different style from my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.



