M3 Ultra w/512 GB was released 1 year ago for $9500. I bought one (with a friend's Apple Employee Discount) and originally had a bit of buyer's remorse, because performance was lower than what some of the cloud providers offered - but recent releases of the quantized GLM 5.2 models are actually pretty speedy and are probably as good as or better than any online model I had a year ago - and the discontinuation of the M3 512 has finally erased that remorse.
With the one caveat that it does not (yet) support copy/paste buffers or copying from the scroll back buffer. It’s also adamant that you use a mouse to select/copy - anything more complex and you need to pipe everything into neovim (or whatever text editor you use) and do your work there. I love zellij - it feels like the future - but hard to give up keyboard based select/copy and pulling from my scrollback.
As someone who has spent the last 10+ years working in Tmux - but is entirely comfortable on Mac, Windows and Linux desktop environments - here are the key reasons why the terminal experience is superior for me.
- I work a lot with data - and streaming data through text tools is twitch fast. If someone has a question about data - before anybody else can log in to their superset, or analytics database, and try and work through the SQL queries or charts to get the answer - I've already jammed the data through awk and got an answer.
- As an SRE - I work with a lot of systems that have pretty rich APIs - so being able to send a request, get the answer back in json, dump it into jq, select the parts I care about - maybe -c to compress it and ripgrep a subset out - is just fast.
- I work in a lot of contexts with a lot of different systems, datacenters, applications - tmux lets me keep all of them cleanly organized in separate windows and subpanes. I'll have 15-20 windows open per week, and maybe 5-6 panes in each - keeping 100+ different contexts (and scroll backs, bash history) all nicely organized is really useful.
- I'm also a systems guy - and there is no other way to dig into a system but the terminal - netstat, ps, dmesg, /proc - these are all components that have only one credible path to investigation and discovery. If you aren't super comfortable in the terminal - zero way to learn about this stuff.
- Working remotely - means ssh. So - once again - terminal.
The focus on the terminal comes down to this: it's the best tool (and in some cases the only tool) for so many of these tasks - and by performing these tasks a lot - you learn about systems - so the people who spend a lot of time in the terminal tend to know a lot more about systems than people who don't.
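For anyone who hasn't watched the request / jq / ripgrep flow from the SRE bullet in action, here is a rough Python sketch of the same select, compact and filter steps. The sample payload and its field names (`services`, `name`, `dc`, `status`) are made up for illustration, and the shell one-liner is still the faster path in practice; this only shows what each stage does.

```python
import json
import re

# Hypothetical API response; in the shell this would arrive from an HTTP request.
RAW = """
{"services": [
  {"name": "api-gateway", "dc": "us-east", "status": "passing", "checks": 4},
  {"name": "billing", "dc": "eu-west", "status": "critical", "checks": 2},
  {"name": "search", "dc": "us-east", "status": "critical", "checks": 3}
]}
"""


def select_fields(doc, keys):
    """Roughly `jq -c '.services[] | {name, dc, status}'`:
    emit one compact JSON line per record, keeping only `keys`."""
    for svc in doc["services"]:
        yield json.dumps({k: svc[k] for k in keys}, separators=(",", ":"))


def grep(lines, pattern):
    """Roughly `rg PATTERN`: keep only the lines that match the regex."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


doc = json.loads(RAW)
hits = grep(select_fields(doc, ["name", "dc", "status"]), r'"status":"critical"')
for line in hits:
    print(line)
```

Compacting each record onto one line is what makes the last step work: a line-oriented filter can then pick out whole records instead of fragments.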
It's a terminal emulator with a command prompt that also gives you access to LLMs. But the interesting bit for people who work on a lot of machines is that this also works when you ssh to any remote machine.
So you can bring your agents to any remote system, it even works somewhat well for network devices.
I use cursor 8+ hours/day at work, and have full (and effectively unlimited) access to Claude Code and Codex - tools which I also use personally. I suspect that your "constant popups" were when you were using the editor - a mode that I'll confess I haven't touched in 3+ months.
Workflow in Cursor is actually awesome - I'm a little outdated in how I use it - I still establish goals/objectives, rather than managing the loop which does so - but if you can think broadly enough - I find it's pretty efficient.
Key things I like about Cursor (and I recognize I'm dating myself a bit here):
- Plan Mode is really solid - I shift-tab, have it go create the plan using whatever insanely expensive SOTA model is available - I will usually spend 5-10 minutes on the Plan - review it, maybe even tweak it a little. (though 90% of the time it's fine out of the gate)
- Ability to select any model for every task - I'll switch between Opus 4.8 High/xHigh/... I'll even switch to 1M context for the planning phase upfront.
- It does an *excellent* job managing permissions and looping the agents and spinning up sub-agents for you - you set the goal, run the plan mode - and then let it churn for however long is required - pretty common to have a 30-45 minute run and come back to a fully created/tested product.
The nice thing about Cursor (and honestly Claude Code, Codex) - there isn't really any "prompt engineering" involved. You just say, "Go Build me x - it should have y,z features - and build it in golang for me" - and that's it - the 3-4 page Plan comes back - usually pretty credible - and then you click "build.".
> there isn't really any "prompt engineering" involved
You should run an experiment: take someone who has never used any LLMs or agents, have them use one for the first time in front of you, and tell them to build something like a calculator program or whatnot. Bonus points if they're ICs or at least not-managers.
I think there is a lot we engineers take for granted when it comes to communicating via text: how to state things clearly, and what we think/reason when we read things. A lot of people don't have those "skills" innately, and the first time they use LLMs, they basically don't know how to interact with them, until they realize what they're able to do and not. Then they also learn what to say to steer the model the right way; this is quite literally a "prompt engineering" skill they're now learning.
You don't even have to go outside engineers. I have teammates that get very little out of Claude Code because the way they integrate their own knowledge doesn't allow them to think of what Claude might not know. They'd say a task was impossible with the tooling, and I'd get instant answers, because I understand what is weird internal business logic sitting 6 repos away, and what is knowledge Claude has by default. I can commit Claude.md files for them, but I have to include EVERYTHING, because otherwise they'll let Claude make assumptions and waste minutes, if not hours.
It's a big part of what, in my experience, is separating the very good engineer from the iffy one: Do you have a good mental model, and can you put yourself in the shoes of people sitting in a different mental model? It makes you a better dev, and even more so when it comes to AI tools, which have their own kind of alien brain.
Coding LLMs are distilling developers. It's like the old experiment where you have someone write down the steps to make pancakes and they don't tell you to crack the eggs before adding them to the batter: it takes a particular mindset to be able to make a model of what is supposed to happen and deconstruct that to the level appropriate for implementation.
Until now, the actual act of writing code (terminology, syntax, etc.) was a significant hurdle, and that underlying mindset was a very useful skill, but one missing in a surprisingly large number of developers.
Now with LLMs doing the work of "translate this into code," increasingly the only thing that matters is that exact ability. And developers that don't have it or can't develop it won't be developers for long.
Thanks for putting into words what I have been seeing a lot at work and haven't been able to put my finger on. We tend to have quite diverse _workflows_ between devs at my company, and success seems to correlate with injecting better context earlier in the process.
I like to chat with Claude about how to approach a given problem, bring in extra context, etc, before even really drafting up a plan, while other people dive into implementation immediately and go on wild goose chases.
90% of the time we end up in the same place in roughly the same amount of time, and there are obviously tradeoffs to spending more time planning vs implementing. I'm oversimplifying as well.
I couldn't agree more. Socratic methodology, domain modelling, systems thinking, pipes-and-arrows problem solving etc. These are the skills that get real work done in coding agents these days.
I'm sure that explains some of it but I really don't think it explains most of the people who have been AI-pilled in the last nine months. There was no amount of context I could give GPT-4o that would make it a net benefit to use that for agentic development. I tried it with quite sophisticated prompt systems and much simpler ones, compendiums of code & business analysis and sparser ones. Yet it just wasted my time - still there were people using Cursor with that model and saying it was life changing. I didn't have that experience until Opus 4.5 - it's possible I could have had it earlier but that was when I happened to try it again.
I think many of the people who have become "AI Pilled" (I'll include myself here) had it happen in the last 3 months. Even over the Christmas break, when the Wiggums loop got so much coverage - I still wasn't that blown away going into January/February - 50%+ of the time I'd just write the code myself. I like coding.
But - I don't know if it was April, or May - but very recently - the coding harnesses paired with decent SOTA models like Opus 4.8/GPT 5.5 - just started showing a lot more consistency, and completeness, and sometimes downright clever behavior - that they started to become way more useful.
Just one out of a hundred+ examples - I gave Claude Code (Opus 4.8 High) a complex task that involved consul, vault - but I had neglected to give it sandbox permission to download from hashicorp.com. So - it created an entire test harness that simulated the behavior of both Vault and Consul - created all its test cases, verified that they passed - and when I came back 40 minutes later said that it was all done.
Its test harness so accurately simulated the behavior of Vault/Consul - that on first try - no refactoring whatsoever - all of the protobuf/AESGCM/API behavior (that has varied significantly between versions) - worked.
This was something that would have taken me, someone super super familiar with the code and tools and APIs - a minimum of 3 solid days of work - and that would likely involve hundreds of attempts and refactors as I unwound all the weird encryption and packaging layers. It zero-shotted a full solution without having an API to test against.
If these agents actually have a real test harness - it's honestly hard to imagine what they can't do - subject only to imagination and budget at this point.
Speaking personally - something changed between January and, let's say, May - in which instead of seeing these things as mostly interesting technology demonstrations, in which the flaws outweighed the benefits - I now genuinely think they are the future of programming. I'm dubious that I'll write much software manually in the future - beyond what I do for personal pleasure.
Asked to write a driver for macOS for some thing that didn't have macOS support, GPT-5.5 found Linux OS firmware on the vendor's site, downloaded it, ran binwalk, extracted out the driver, got halfway to reimplementing it on macOS with barely any help from me. I did need to dive into it somewhat to get it across the line, but it showed some ingenuity along the way.
Some people "got" LLMs back in 2022, others needed it to evolve a bit.
It's not unlike computers. I started using them back in the 90s and absolutely nobody I knew was interested, while today everyone carries one in their pockets...
By that same logic (and I’m agreeing with you as of now), engineers shouldn’t get too comfortable treating “being good at text communication” as a lasting edge. With how quickly agentic coding is evolving, it’s worth considering the possibility that many of the prompting and steering skills we view as valuable today could become far less important in a matter of weeks or months.
Recently I have the SEO guy governing the mostly static, public site with Claude Code. He loves it but you would never imagine the level of mental illness Claude comes up with. If it were an employee I’d literally throw him out the front door, labor laws be damned. And as always, every insane thing it does is some direct echo of its concept and training.
But what's the $60B differentiator here? There are so many similar tools out there. I generally use Opencode, but also Claude Code, Antigravity and sometimes Kilo Code on VS Code. How can Cursor be worth even 10% of 60B?
I don't know what Cursor's market share is but it feels like 20-25% to me. That is not worth nothing. Then:
1) The data they have flowing through the system that enabled them to build composer (which is much better than stock Kimi K2.5) and is presumably allowing the training of a new model on SpaceX's compute.
2) Cursor's new 'GitHub' replacement.
3) Enterprise sales/traction
If you look at all of these together, it's not implausible that they end up mostly 'owning' coding in 5 years' time, provided they replace GitHub with something more compatible with agentic coding and bring it into their whole ecosystem: cloud and local agents, PR review and their own frontier coding model.
It's specialised vs 'borg' isn't it. One way of thinking is that the world is owned by Anthropic/OpenAI and coding is just one of many things their model and software does. Another view is we have a 'coding with LLMs' company that specialises in this field of endeavour. Hard to say which wins, but I think they have a shot.
Personally my only objection to Cursor is that it's more expensive. That's it, otherwise it is great to be able to choose say GPT-5.5 when I want to work on backend and Opus when I want to work on front end. Great to have PR review built in. If they were able to get composer 3 to be as good as GPT5.5 / fable at the price of composer 2.5 they'd be winning on price again.
> If you look at all of these together, it's not implausible that they end up mostly 'owning' coding
They really need to change their trajectory then?
And regardless being owned by xAI, a failed AI company which turned into a datacentre operator probably won't help them to achieve that.
> Hard to say which wins, but I think they have a shot.
The market for "coding harnesses" and "AI IDEs" is already oversaturated and they are effectively a commodity at this point, you can use any of them with any provider more or less interchangeably.
> They really need to change their trajectory then?
They need to step up progress sure.
> And regardless being owned by xAI, a failed AI company which turned into a datacentre operator probably won't help them to achieve that.
I think near unlimited access to compute is exactly what they need to train a frontier level coding model and serve it cheaply and profitably.
> The market for "coding harnesses" and "AI IDEs" is already oversaturated
I think my entire point was that it's not just an AI IDE. It's a coding focused model (currently Composer 2.5, soon hopefully something better), a Github Replacement, PR review/Bug Bot, Cloud Agents and so on and so forth. It's an ecosystem. An enterprise signs an MSA with you and gets everything they need all in one place.
Yes because Grok failed and they now have "unlimited" compute they can sell to others. I mean you are right that if they did X, Y and Z they could be very successful, but there is no indication that might happen. In any meaningful way, it seems like Cursor peaked a while ago.
> An enterprise
Well either they are the type of company which just buys whatever Microsoft is selling OR they let their developers mostly pick what they feel is the best tool for the job on their own. I don't think there is that much in between (and it's a cutthroat market e.g. GitLab)
> a Github Replacement, PR review/Bug Bot, Cloud Agents
Those things are a dime a dozen, you can vibe code them in weeks/months and there are plenty of options on the market already. Well not Github of course, but there are various reasons for that which have little to do with product quality and features (not that I think there are many companies which could build a meaningful GH replacement in a realistic time period despite its many flaws).
I just don't really see a huge income stream for dev tools companies (just like there never was). They can skim something off the top by reselling AI models (generally at zero or negative margins..) but that's not the most lucrative business model when you have no real moat.
By not succeeding? It's an also-ran, a closed proprietary model which is behind Anthropic, OpenAI, Google and a bunch of Chinese companies. How do you make money with a product like that? (besides the absurd IPO of course...)
For a lot of people, Grok is the first AI they got to use through Twitter. Grok does get quite a lot of usage, and isn't out of the game - coding tools aren't the only use case for AI.
Google glass has been discontinued? Besides, many people use it on Twitter every day. Usage is not limited to what you can see on the Openrouter dashboard.
these users are probably losing the company money.
the failure is in converting regular people into actual ai product consumers. Companies are realising that the money is not in regular consumers but in enterprise and they are not considering grok as a serious alternative.
if anything, the name, the branding and the x/twitter affiliation has hurt adoption from money makers rather than help it.
so yes, people know it, but no one is willing to pay for it
Depends, Grok stimulates engagement and pushes to stay on the platform and feed it data. If anything, it helped justify a massive valuation for SpaceX, which is a metric of success for most corpos.
It helped the valuation, but so did SpaceX's hallucinations about the space data centres. Doesn't mean it's not a crappy low end model itself. Btw is Twitter even making any money?
There's a HN article and discussion about Anthropic expanding to use Colossus 2. https://news.ycombinator.com/item?id=48214017 I think it's fairly clear that grok isn't using as much compute as expected.
So far seems like none of those use cases have generated meaningful income streams? The consumer/non-developer market is mostly dominated by OpenAI and Google anyway...
> The market for "coding harnesses" and "AI IDEs" is already oversaturated and they are effectively a commodity at this point, you can use any of them with any provider more or less interchangeably.
Yes and no. I've used a few different harnesses with closed and open models and there is definitely something going on that makes some harnesses work better than others. Many of the differences are hard to pin down and some are things people don't care about. But I wouldn't say they are commodified just yet.
1. Memory use. I have colleagues complaining that Claude Code uses several GB of memory. Meanwhile I haven't heard about that regarding codex or goose, or even opencode for that matter.
2. Suitability for local models. When you use Anthropic models, you use Anthropic as a provider. They can have software between the model and your harness that will fix issues with the model. One notable thing that even the best open weights models struggle with is broken tool calls. There is a lot that a harness can do to fix broken tool calls when working with a straight up ollama running a raw GGUF file.
3. Ease of use with non-mainstream models. OpenCode has GREAT coverage of models/providers. Goose, less so, as it relies on people to set up their own Anthropic or OpenAI compatibility settings. Zed, for example, doesn't let you use Z.ai (which, if you speak British English, sounds ironic because "zed ai" isn't directly supported by Zed the editor).
4. Worktree support. Opencode and probably all the TUI harnesses work in a local directory - so you need the terminal to be in the worktree. Zed, however, works centrally on your git repo and tracks the worktrees so you can bounce around your work in a single window.
Of these, '2' is maybe the most important one but also the hardest to pin down as a feature. '3' is a one time cost. Of course '1' could be a blocker for someone using a macbook air or neo.
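To make point 2 a bit more concrete, here is one kind of repair a harness could attempt on a malformed tool call. This is an illustration only, not how OpenCode, Goose, Zed or any other harness actually does it; the `{"name": ..., "arguments": ...}` shape and the function name `repair_tool_call` are assumptions.

```python
import json
import re


def repair_tool_call(raw):
    """Best-effort parse of a model-emitted tool call (hypothetical shape:
    {"name": str, "arguments": object}). Returns a dict, or None if unrecoverable."""
    # Keep only the outermost {...} span; this drops chatter or code fences
    # the model may have wrapped around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    text = raw[start:end + 1]

    # Try the text as-is, then with trailing commas stripped. The comma regex
    # is a heuristic and could also alter a string value containing ",}".
    for candidate in (text, re.sub(r",\s*([}\]])", r"\1", text)):
        try:
            call = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if not isinstance(call, dict) or not isinstance(call.get("name"), str):
            continue
        # Some models double-encode the arguments as a JSON string.
        args = call.get("arguments", {})
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except json.JSONDecodeError:
                return None
        call["arguments"] = args
        return call
    return None


messy = 'Sure, calling the tool now: {"name": "read_file", "arguments": {"path": "README.md",},} Done.'
print(repair_tool_call(messy))
```

A hosted provider can apply fixes like this server-side before the harness ever sees the output; with a raw local model the harness has to do it, or feed the parse error back to the model and retry.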
I agree, Composer Fast 2.5 is getting really good. I started using it for a personal project after I had to switch from Sonnet because I hit the API limits, and I was surprised by how good it has become.
I believe they have some very good training data because of all the data generated by people using the service.
This is the same data they used to finetune Kimi K2.5 to make their newer Composer models, which benchmark substantially better than Kimi K2.5.
I've heard they also want to build their own base models, which will also benefit from their large amount of high-quality training data. Which will solve Grok's model quality problem.
This is all unsourced conjecture of course. But it's what I've heard.
Also from what I understand (not my day job) we're now at the point where the post-training tuning (RLHF etc.) is increasingly important since pre-training no longer scales.
So it's not really fair to call it "fine tuning", it's an important part of building a coding model in 2026, and Cursor has done a pretty good job with Composer.
they are paying for marketshare/customer base. Cursor has a good chunk of it.
xAI overbuilt their data centers - they can't find paying customers for them, that's the reason they made deals with other companies like Google to use their own datacenters.
Cursor has the opposite problem of not having enough capacity. So this works well for them together.
Whether it's worth it - if you believe that AI will solve every problem then having a piece of the pie early on might be worth it.
Remember how when Google bought YouTube for 1.65 billion, people thought they were crazy? Or when Facebook bought Instagram.
60B is a crazy number but might be worth it for someone fighting for world dominance :)
xai is on the line to deliver capacity they already sold to Google and most analysts think they are 50/50 on actually meeting it.
the only proof they have capacity is that musk claims all the money they are burning is going to datacenters and gpu (mostly because if he put it on anything else the lie would be obvious)
Or are they paying for talent? It seems like xAI is sorely lacking in talent, most likely due to the CEO and folks' aversion to him. By throwing around some SpaceX monopoly money he can trap some talent with retention clauses and try to invigorate his failed AI business.
I think the argument for Cursor is that it's the dominant tool that enterprises are using for coding, so the theory is Cursor wins that as the "model agnostic", it has a phenomenal Enterprise Sales Team.
From a valuation model - $4B ARR with rapid growth, and the ability to shift traffic to internal models (honestly, massive amount of the time "composer" - their internal model is fine, and obviously going to get better). Say 17x Multiple which isn't unheard of for a rapidly growing Startup with solid future structural profit elements (moving to internal model) - that gets you to $68B.
> so the theory is Cursor wins that as the "model agnostic"
But there are many model agnostic harnesses out there: OpenCode, Roo, Cline, and many others. And even Claude Code can be set up to use non-Anthropic models.
As a Cursor user, I don't have to think about the providers behind the compute - I get name brand Claude, or cheap Kimi, or Grok, and it's all got roughly the same agentic experience, and only one bill. Enterprises love this.
If you resell something worth $5 for $5 while having to pay for R&D and operating expenses that's not exactly comparable with a company that's selling actual products.
> Say 17x Multiple
On an extremely low margin business it is, yet again that wouldn't be the stupidest thing in today's market.
It can't as long as there is plenty of AI without it.
The real differentiator is that if $60B today turns out to be all thrown away in a worst-case scenario, it would be easily more affordable, and there would be less negative impact, than $47B would have been at the time if it had all been thrown away on Twitter.
We’re in the new era where startups boast about, and are bought based on, revenue rather than just a number of users with an unclear path to monetization, as had been the case for the previous couple of decades.
We can also note that we see Thrive Capital (Kushner) again in a win.
Where else are you going to get access to a real-time fresh high quality stream of human intelligence to grow your baby AGI? You can’t buy Codex, Claude, Copilot, so what’s left?
How are you switching between like 5 different editors lol. Bro sloppers will do anything to get their fix. Like the old people at the casino switching slot machines all day based on some occulted understanding that only they think they have.
There is most certainly still prompt engineering involved. How can there be responsivity to different cues like "plan this", "write this", "analyze this", "defend this", "poke holes in this", but no responsivity to the various terminology you provide in your explanations of "this": where to get information about specs/standards/requirements, what details I care about and therefore can't compromise on, vs what details where I'm willing to accept whatever the top reddit post from 4 years ago recommends?
I don't see how these systems can have the ability to be effectively expressive about all of the minutiae, and not have all of the various different possible expressions lead to vastly different outcomes.
I think all of the cues that you just described are in the plan.
For example - I might (real world example from this morning):
"Create a script that installs hashicorp vault and consul, store the data on consul. Then create a helper script that will fill the vault server with sample data. Add HTTPS support. Now write a framework that reads and decrypts the encrypted data in consul. Support old (pre 1.3) and new (post 1.3 vault)."
That generates a 6 page plan using Opus 4.8 w/1mm context, including notes on what to prioritize, what format to create the scripts in, etc... (My cursor guidance already has a couple months of hints as to what I want in terms of scaffolding unit tests, canonical linux, performance, security, etc...)
That 6 page plan is the "Prompt" - but it's entirely generated by Cursor/Opus. It's there to tweak if you want to emphasize, or provide some taste - but, honestly - it probably does a better job than I would - so ~90% of the time I just accept the plan as is.
I would say prompt engineering, in the sense of people claiming you need to include in every prompt magic incantations like "You are a senior engineer from a superintelligent alien species" and "take a deep breath and make no mistakes", doesn’t really do that much for everyday work, I feel, or maybe those are all already included in the system prompt. I reckon it can still eke out a few percentage points in automation.
What actually matters is the ability to communicate well in general, not anything LLM-specific. Being able to state what you want clearly and unambiguously, and having a sense for what additional information you need to dump, even when the other side claims they already have everything they need.
Yes, I tried to use Cursor as an editor. Terrible idea in hindsight.
So your workflow now looks like mine except I prefer a different editor and only use the latest and greatest model so Cursor basically offers nothing over Codex.
I disagree about prompt engineering, but it's one of those things that probably varies because of what language you use, what problems you solve, and the degree to which you care about the output. Unless I'm writing tests, I keep AI on a very short leash because I'm writing critical code used by a very large number of users. I have noticed big differences in output quality depending on how I steer AI. Without steering, it will happily leave in dead code, change the use of variables so they need to be renamed, assume or fail to assume invariants, etc. As I said in another comment, I think we won't need to do that for very much longer, but right now it seems essential.
> You just say, "Go Build me x - it should have y,z features - and build it in golang for me" - and that's it - the 3-4 page Plan comes back - usually pretty credible - and then you click "build.".
What you're describing seems like a workflow for building toys only. There's currently no reality in which someone would actually know what the y,z features are before making them. A plan generated in 5min would likely suggest a suboptimal solution compared to what a good solution would look like (which might take a year or two to figure out, for a human, so still a week or so for SOTA models if at all possible). Building something in golang is cute, but hard to be convinced until more novel applications are being generated from prompts.
The data submitted by Cursor's users tho, that seems to be very valuable.
You nailed it - in fact, most of Anthropic's early revenue came from Cursor - many of Claude Code's programming components are essentially a feature copy of Cursor, so it makes sense they are similar.
Cursor does have its own model - it's a heavily reworked version of Kimi K2, called "composer" - that I use a lot of the time when I have fairly straightforward tasks that don't require a lot of exploration or independent thought. Lot cheaper - the Input/CacheWrite/CacheRead/Output costs of Opus 4.8 are $5/$6.25/$0.5/$25 per mm tokens, vs $0.5/-/$0.2/$2.5 for composer.
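A quick back-of-the-envelope using the per-million-token prices quoted above. The token mix in `workload` is a made-up example (a long agent run dominated by cache reads), and the blank cache-write price for composer is kept as unknown rather than guessed.

```python
# USD per million tokens, as quoted in the comment above.
# None marks the cache-write price that the comment leaves blank ("-").
PRICES = {
    "opus": {"input": 5.00, "cache_write": 6.25, "cache_read": 0.50, "output": 25.00},
    "composer": {"input": 0.50, "cache_write": None, "cache_read": 0.20, "output": 2.50},
}


def cost_usd(model, tokens):
    """Cost of a run, given token counts per category (e.g. {"input": 1_000_000})."""
    total = 0.0
    for kind, count in tokens.items():
        price = PRICES[model][kind]
        if price is None:
            if count:
                raise ValueError(f"no {kind} price quoted for {model}")
            continue
        total += price * count / 1_000_000
    return total


# Hypothetical workload: token counts for one long agent session.
workload = {"input": 2_000_000, "cache_read": 10_000_000, "output": 500_000}

opus_cost = cost_usd("opus", workload)
composer_cost = cost_usd("composer", workload)
saving = 1 - composer_cost / opus_cost

print(f"opus:     ${opus_cost:.2f}")
print(f"composer: ${composer_cost:.2f}")
print(f"saving:   {saving:.0%}")
```

Input and output are 10x cheaper on composer but cache reads only 2.5x, so the more a session leans on cache reads, the further the saving drops below 90%.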
Not trying to be funny but seriously, if these tools can produce a tested 'product' in 45m, shouldn't we be seeing millions of them out there? I mean how far are we from a fully AI built Oracle ERP or even a notepad or helix?
- Note - that my "product" was about 3,000 lines of code - so tiny. But https://metr.org/ should give you some insight into the complexity the models are capable of.
- you have to be able to imagine the product. If I have the time, and energy, to imagine what I want - the model will build it. Here is an example of a much better programmer than I and something he wanted built - https://www.boatbomber.com/blog/claude-fable-5
- These are the first drafts. On average - any complex system needs about 10 years and at least 1000 active users who are enthusiastic about reporting to really get robust code. Writing it via LLM doesn't (at least so far in my experience) help that much in reducing bugs if you were previously following any semblance of TDD. Lots of bugs in the code - the products you listed above have literally tens of millions of years of user experiences and bug reports that got them to where they are today. No silver bullet yet - just faster, less effort - and it enables non-technical people to create (still buggy) products.
Have you ever heard "I can do that in a weekend"? And they usually can. The difficult part is not building the product, it's selling and marketing, the business part. It's quite a common business tactic to outright copy someone else's product or business.
Millions of produced verified software engineered products in 45 minutes in the likeness of Oracle ERP or notepad++, helix are small potatoes when you see the unbounded ambitions of SpaceX in full.
The end point may squeeze quality of operations at the subminute time span for ground control environment seriously launching Starship rockets one an hour, for example.
You absolutely don't. I use all three products. My preference is Claude Code for my personal project. The one at work is kind of sandboxed off - but does have the benefit of an MCP for every enterprise service we have (Kibana, Victoria Metrics, Grafana, Jira, etc...) - which is nice.
Over time - I expect Composer will be cheaper than Opus 4.8 - but the nice thing about Cursor - you can flick between models.
And (this is purely a personal thing) - I really like the extensive collection of "Plans" that cursor tracks - there isn't really a similar thing in Claude Code - but I really like the Claude.AI interface for everything else. It's also a much better general knowledge agent - the Cursor Chat interface isn't as nice.
I’m not sure what you’re on about. I had Claude doing swarm engineering using different models. It would write specs that haiku would implement, it would check itself etc etc. with a simple phrase it goes into planning, multi agent mode, and chews on a problem until it’s done. It’s pretty autonomous.
Maybe you haven’t looked deeper into what modern Claude can do?
The Different Model approach is where, from task to task, I can switch between Opus 4.8, GPT 5.5 and (very often) Composer 2 at 1/10th the cost.
It's not perfect, btw - to some degree you are at the mercy of which models they support - currently only 27 from Gemini, OpenAI, Anthropic, Grok, and Kimi (Just K2.5) - presumably because they have commercial arrangements with them. The "Bring your own Model" model requires you plunk in your API key - which sucks. And only one at a time.
To the best of my knowledge, Claude Code only supports one model at a time if it's not one from Anthropic (which will use the entire suite of Anthropic models depending on the task) - and you have to override it to a single model with an environment variable at startup - no ability to flick between models from task to task.
Depending on your workflow - you can save 70-90% on costs just by choosing a reasonable model for really extensive tasks that don't require thinking, max context, etc.
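As a rough sanity check on that range, here is a minimal sketch of the arithmetic. It assumes the cheaper model costs 1/10th as much for the same work (the ratio mentioned above) and that spend scales linearly with the share of work routed to it. The function name and the sample fractions are made up for illustration, not measured numbers.

```python
def blended_savings(cheap_fraction: float, cheap_cost_ratio: float = 0.10) -> float:
    """Fraction of spend saved when `cheap_fraction` of the work moves to a cheaper model.

    Assumption: the expensive model costs 1.0 per unit of work and the cheaper
    one costs `cheap_cost_ratio` per unit of the same work.
    """
    if not 0.0 <= cheap_fraction <= 1.0:
        raise ValueError("cheap_fraction must be between 0 and 1")
    blended_cost = (1.0 - cheap_fraction) + cheap_fraction * cheap_cost_ratio
    return 1.0 - blended_cost


# Illustrative workload splits, not real usage data.
for fraction in (0.5, 0.8, 1.0):
    saved = blended_savings(fraction)
    print(f"{fraction:.0%} of work on the cheaper model -> about {saved:.0%} saved")
```

Under those assumptions you need roughly 80% of the work on the cheaper model to clear 70% savings, and 90% is the ceiling when everything moves over.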
Different models aren’t subagents - they’re completely orthogonal. I use Gemini subagents for code review in cursor, but mostly use gpt for actual coding.
Worst case it gets access to gmail. And Github. And the Internet. I'm increasingly appreciating the importance of a physical finger-press on Yubikey to trigger the FIDO2 + OIDC Auth. I don't think there is an easy way for it to hack a new session.
How is it going to get access to gmail or github? In any case, what's the probability of it going so completely off the rails that it does something horrendous with gmail/github? What's it going to do? Email my coworkers nudes on my computer? Make my github profile public?
Claude typically recommends .env files for storing secrets. You use one to store a refresh token for the Gmail API or IMAP connection details. Your agent uses an MCP server you configured during a session, but the MCP server has been compromised and directs the agent to do nasty stuff with env dotfiles.
This is one of the things I found so interesting: it was using my system browsers but it wasn't exposing itself to any content from them.
Even when it iterated through all visible windows to find the one it wanted to screenshot it was searching for titles in Python code and returning only the integer window ID.
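To make that pattern concrete, here is a minimal sketch of what "search titles in code, return only the ID" could look like. This is not the code the agent actually ran: `WindowInfo` and `find_window_id` are hypothetical names, and the platform-specific step of enumerating real windows is left out entirely.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WindowInfo:
    """Hypothetical record for one visible window."""
    window_id: int
    title: str  # untrusted text: may contain anything a web page put there


def find_window_id(windows: Iterable[WindowInfo], wanted: str) -> Optional[int]:
    """Return the ID of the first window whose title contains `wanted`.

    The titles are only inspected inside this function; the caller gets back
    an integer (or None), never the title text itself.
    """
    needle = wanted.casefold()
    for window in windows:
        if needle in window.title.casefold():
            return window.window_id
    return None


# Made-up sample data standing in for a real window listing.
sample = [
    WindowInfo(17, "Inbox - Mail"),
    WindowInfo(42, "localhost:8000 - Dev Server"),
]
print(find_window_id(sample, "dev server"))
```

The point of the design is that whatever is in the other window titles never makes it back into the model's context.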
The sites it opened and screenshotted were sites under its own control - either test pages it had created or development servers it was running.
When it did run code that analyzed an open web page (by injecting JavaScript into a template it controlled before loading that in a browser window) that code only returned JSON with measurements from the page.
It's making me wonder if Fable has been trained to take additional steps to avoid accidental exposure to untrusted content.
Depends on the Enterprise - obviously - in the Bay Area - 0% of the tech companies care in the slightest. And I'm willing to wager < 5% of enterprises would send their traffic to OpenRouter. Most of them don't even want to send traffic directly to Anthropic or OpenAI - which is why Bedrock has gotten so much traction lately.
But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.
1) In order to fund research - this stuff costs 10s of billions of dollars - everyone, from Ilya, to Elon, to Sam - all agreed that they would require a profit-arm to raise money. Nobody was going to sponsor that 10s of billions of dollars to a non-profit.
2) The non profit is still there - and controls the commercial element.
I think people are continuing to view these systems as pure LLMs - when that ship sailed 6+ months ago. Between being able to review memory, using agent harnesses and sub agents and skills to go out and discover information - modern systems (Codex, Claude Code, Cursor) - use LLMs - but the LLM is only a small component of it. Compare what you get from sending a request to a chatbot like ChatGPT - to what you can get from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF" - that's a very 2025 way of looking at these systems.
Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.
Will the 10T parameter Mythos model be released this month or next month?
They'd better do it soon, because it is generally accepted that one of the reasons GPT 5.5 is better at hard tasks than Opus is its parameter size - and that Opus 4.8 remains competitive only by scaling test-time compute (see how many more tokens it uses than GPT 5.5).
Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 FLOPs. You can't train 10T to Chinchilla and stay under 10^26.
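For anyone who wants to check the arithmetic, here is a back-of-envelope sketch. It leans on two common rules of thumb, both assumptions on my part rather than anything Anthropic has published: training compute is about 6 FLOPs per parameter per token, and "Chinchilla-optimal" means roughly 20 training tokens per parameter. It also treats the model as dense; for a mixture-of-experts model the same estimate would apply to active parameters instead.

```python
FLOPS_PER_PARAM_PER_TOKEN = 6   # rule of thumb: C ~ 6 * N * D
TOKENS_PER_PARAM = 20           # rule of thumb for Chinchilla-style training
BUDGET_FLOPS = 1e26             # the threshold mentioned above


def chinchilla_flops(params: float) -> float:
    """Approximate training FLOPs for a dense model trained on ~20 tokens per parameter."""
    tokens = TOKENS_PER_PARAM * params
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def max_params_under(budget_flops: float) -> float:
    """Largest dense parameter count that fits the budget under the same rules of thumb."""
    return (budget_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)) ** 0.5


ten_trillion = 10e12
print(f"10T params at Chinchilla: ~{chinchilla_flops(ten_trillion):.1e} FLOPs")
print(f"Largest model under 1e26: ~{max_params_under(BUDGET_FLOPS) / 1e12:.2f}T params")
```

With those assumptions a 10T dense model lands around 10^28 FLOPs, about two orders of magnitude over the threshold, and the budget tops out a little under 1T parameters.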
Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.
I must have confused Mythos with Opus 4.7. One of their recent model cards confirmed that training FLOPs were under the EO reporting requirement of 10^26 FLOPs.
I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like Claude Code and Codex. They would not invest so heavily in such systems if they thought LLMs would deliver AGI soon.
You think someone is, or even should, special case things like estimates? What else deserves that level of intervention so they look less dumb?
Logistics for getting to the car wash next door?
In the meantime, alas, no, we can see from actual prompts sent directly or through sub-agents, and actual replies, that estimates remain LLM generated.
Though, this discussion here could change that, because indeed there is a lot of special casing and context stuffing going on, one of the oldest being today's date for example.
• • •
I did read the Claude Code leak, and use pi, etc. So I disagree with your premise rather strongly. Today's "systems" remain, roughly, piles of markdown and context engineering wrapped in UI affordances, and behave very similarly today to how they did in 2024 for those already engineering context and delegating.
I do a lot of code bisecting with Claude Code - and it spends hours running experiments - looking at experiment results, making guesses as to what to try next for an experiment - until it eventually comes around to a working code pattern. I mean - maybe this is as much a reflection on me as anything else - but its pattern of logic isn't that much different from what I would do. It knows, in general, what tools and APIs it can call - it tries something - observes the result, and then comes back and tries different experiments based on success/failure - mostly efficiently bisecting to a solution.
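For readers who haven't done this by hand, the core of that loop is just a binary search over an ordered list of candidates. The sketch below is a generic illustration, not what Claude Code runs internally: `first_bad` and `run_experiment` are hypothetical names, and it assumes the first candidate is known good, the last is known bad, and that once things break they stay broken.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad candidate.

    Assumptions: candidates[0] is good, candidates[-1] is bad, and every
    candidate after the first bad one is also bad.
    """
    if len(candidates) < 2:
        raise ValueError("need at least one known-good and one known-bad candidate")
    good, bad = 0, len(candidates) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid    # the breakage is at mid or earlier
        else:
            good = mid   # the breakage is after mid
    return bad


# Made-up example: 100 revisions where revision 37 introduced the breakage.
revisions = list(range(100))
experiments_run = []


def run_experiment(revision: int) -> bool:
    """Stand-in for a real build-and-test run; records each revision tried."""
    experiments_run.append(revision)
    return revision >= 37


print(first_bad(revisions, run_experiment), len(experiments_run))
```

Each experiment halves the remaining range, which is why a few hours of patient runs can cover a surprisingly large search space.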
I'm still lower down on the capability scale - as I'm still manually directing agents to do these wiggins loops - obviously the next step up is to direct the code-loops which control the agents. I just haven't got my tooling nailed in place to the point where I find that's more productive.
I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.
> I actually might agree with you that this is mostly just "next token prediction" - if I can concede that's really all I do as well.
Yep. Pretty sure I've got an LLM inside too.
The other replies complaining that my thinking is so 2023 -- on the contrary, what's evolved is my own apprehension of how LLM-like most "responses" from humans prove as well.
To be sure, there are other mechanisms at play as well, significant differentiation in our... Volume of training material? Quantizations/compression? Model architecture? Just-ahead-of-time forward branching with back propagation? Double loop adaptive learning? You know, harnessing the LLM. :-) Dare we call it executive function?
LLM mode becomes particularly apparent when conversing with Alzheimer's patients in the stage where short term memories do not form but they retain access to long term memory up to, say, 5 years ago or so. Fifty years of who they are, and one can trigger nearly identical responses with nearly identical prompts.
But that same person may be able to debate 1950s politics while being unable to complete making a sandwich.
If they didn't know of new shortcuts for a task, they would almost certainly not "estimate" but "intuit", or "instinctively" respond (apply heuristics), largely based on their "priors" aka training material.
If you sit with them and chat a while, you'll even get the kind of looping you get from Qwen trying to think when context is too full.
And if we believe this at all, then ... we should stop scrolling TikTok. Time to read a book. Have an experience. Fine tune. :-)
This used to be my rationalisation, but my understanding is that Shotwell is the driving force behind the commercial and Falcon sides of the business and that there's a quite strong cultural divide between that and the Starship/Starlink side of the business which is driven by Musk. Apparently there's a lot of culture clash there.
Culture clash between Starlink and Falcon has to be the dumbest thing I’ve heard. Falcon only exists in its current form because of Starlink and Starlink only exists because of Falcon. Starlink is by far Falcon’s biggest customer and Starlink enables Falcon to iterate and try things nobody that cares about their payload would.
It's funny because when I realized it was signed by Elon I immediately wished it had been signed by Gwynne instead (although I'm sure she reviewed it anyway). I just knew being signed by Elon would push responses to being (even) more about Elon and divided along partisan political lines.
Which, at this point, has already been beaten to death and is just... tiresome. While discussing the broad concept of space-based compute in general (outside of SpaceX, Elon, etc) can still actually be interesting.
You are getting a bit of grief downthread - but this is cool as all get out.
The best use of these systems would be to combine the various procedures:
First, and foremost - don't leave garbage behind in the first place. Think twice before bringing sequins and feathers in costumes (the biggest culprit in my experience from 2003-2010). Film canisters for cigarette
Second - Every Camp does a combination of complete-grid clean up on their own "lot" - I've done that three times - and it was honestly great - plus an hour of "community time" - where you walk the playa off your lot and clean it up as well. Your camp packs off 99% of the garbage, and then a grid search, plus heavy rake, finds the last 1%. About the only debate my camp ever had was whether it was acceptable to just dump their potable water onto the Playa (I thought it was fine - as long as you didn't just pour it all in one place - within 15 minutes you would be hard pressed to ever find out where it was poured out).
Third - the two-week "walk the line" where the detailed MOOP maps get created. 150 people for an 80,000-person, 7+ day festival seems entirely reasonable - and it's a big part of BRC.
Finally (and I really mean do finally, it's almost a thing that shouldn't be really visible) - show up with the heavy gear to find all the submerged stakes/rebar/MOOP. Just rake the hell out of the Playa (absolutely fine - I've never understood people who think that it's a problem - it really isn't - you sure as hell aren't going to disrupt any ecology - except for a few random sand-fleas - it's entirely devoid of any life) - and the first bit of rain completely and 100% eliminates any trace of what you did.
As a practical matter, that's backwards. One pass with the heavy raking machinery will remove 99% of the trash. That's the heavy lifting. Record GPS-tagged video of what the rakes are picking up. Then make a pass with a strong trash magnet on a pickup truck to get small ferrous metallic junk that made it past the rakes. Then do a foreign object walkdown with the team, to catch sequins, nonmagnetic stainless steel needles, and rebar and lag bolts that need to be pried or dug out. It's the final inspection that needs humans.