isege's comments

> Claude Code is the best autonomous coding agent.

If you look at the terminal-bench@2.0 leaderboard, you'll quickly see it's actually one of the weakest agentic harnesses. Anthropic's own models score lower with Claude Code than with virtually any other harness.

So it's quite the opposite. Claude Code is arguably the worst harness to run models with.


Okay, but not all results on there are valid. ForgeCode, for instance, has been cheating in the past:

https://debugml.github.io/cheating-agents/#sneaking-the-answ...


Those benches are completely and totally meaningless when it comes down to real world work tasks, and everyone knows it.

Then the benchmarks are wrong.

One I noticed with gemini, especially 3 flash: "this is the classic _____".

Isn't that what terminal-bench does?

Christmas has come early! Thank you for sharing this


This comment allows ycombinator to steal ideas from their users' comments, using their huge mass news platform. Tremendous overlap indeed.


This is not just about timestamps but how the traditional chat UI is simply not a good interface for information retrieval and organization.


I've had this exact experience. I used GNOME for just one week before getting a MacBook, and after 3+ years of macOS I still find its multi-desktop handling absurd and unintuitive.

What makes this worse is Apple's refusal to expose any public APIs to control workspace behavior, so you can't even work around their shitty choices.

Instead of iterating on existing functionality, they launch flashy additions like Stage Manager only to abandon them immediately.


I'm also developing a similar branching interface though mine is structured differently. I hope we can make a dent in the LLM space, best of luck!


Nice! Excited to see what you come up with. Best of luck


The chat interface has regrettably become the universal mold for LLM interaction. There are no dissenters. Every provider has the exact same experience. Just off the top of my head I can think of more than a dozen different features that would make LLM interactions infinitely more intuitive and efficient.


a) He has an “I use the web in a very niche way that nobody cares about, and your browser sucks if it doesn’t meet those exact needs” way of thinking

b) He is an investor in helium

