Some basic math supports it. A GB300 NVL72 is about $6.5 million. Let's say you need another $6 million worth of cooling and $6 million worth of electricity, for roughly $18.5 million all-in. At current rates, that buys about 720 billion tokens' worth of Claude Opus 4.7. At 100,000 tokens per second, it pays for itself in about 3 months.
Obviously this is an extremely rough calculation. Even if I'm off by a factor of 10, it's still a pretty good return.
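To make the arithmetic explicit, here's a minimal sketch of the same back-of-the-envelope calculation. The blended price of ~$25.70 per million tokens is an assumption backed out of the 720 billion token figure, not a published rate:

```python
# Back-of-the-envelope payback math from the comment above. The blended
# price per million tokens is an assumption backed out of the 720B-token
# figure, not a published rate.
hardware_cost = 6.5e6   # GB300 NVL72, approx USD
cooling_cost = 6.0e6    # assumed
power_cost = 6.0e6      # assumed
total_cost = hardware_cost + cooling_cost + power_cost  # ~$18.5M

blended_price_per_mtok = 25.70  # $/1M tokens, assumed
tokens = total_cost / blended_price_per_mtok * 1e6  # ~7.2e11 tokens

throughput_tps = 100_000  # tokens per second, from the comment
payback_days = tokens / throughput_tps / 86_400
print(f"{tokens / 1e9:.0f}B tokens, payback in ~{payback_days:.0f} days")
# -> 720B tokens, payback in ~83 days (about 3 months)
```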
Unless you're serving Chinese open-weight models, you have to consider training costs. If you're off by 10x, then the amortization period is 30 months - far longer than the useful lifetime of a SoTA model. Frontier model development is a Red Queen's race: you have to run as fast as you can just to maintain your position.
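As a quick illustration of why the error factor matters, here's the same payback math under the 10x-worse scenario. The ~12-month useful life for a frontier model is an assumption about release cadence, not a figure from the thread:

```python
# Payback under the pessimistic 10x scenario, compared against an assumed
# ~12-month useful life for a frontier model (an assumption about release
# cadence, not a figure from the thread).
base_payback_months = 3        # from the rough calculation above
error_factor = 10
payback_months = base_payback_months * error_factor  # 30 months

model_useful_life_months = 12  # assumed: superseded within about a year
print(payback_months > model_useful_life_months)
# -> True: the investment can't be amortized before the model is superseded
```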
It's quite easy to sell something for a profit if you ignore the costs. The ultimate free money hack. I will start selling canned beans for the price of the beans plus a few cents. I will just ignore the cost of the cans, labor, power, machines, maintenance, distribution, storage, and facility space. If I do that, the few cents extra are pure profit.
We don't know the models' sizes, requirements, and optimisations, but we could take a guess using the infrastructure costs of the largest open-weight alternatives that perform slightly worse.
In my opinion, it's a profitable kind of service. They probably don't pay the public prices for cloud GPUs, though.
Just looking at infra cost is not enough. If the token price doesn't cover all the costs, they're losing money and will eventually have to raise prices.
I'm not familiar with that analysis, its accuracy, or its evidence. I'd be surprised by this, given that providers still seem to be in the growth phase.
Typically the burden of proof is on the one making the claim.
They have some of the best publicly available analysis on these topics. The full details and numbers are hidden behind the institutional accounts, which are priced for investors (not something you sign up for personally), but they're generous with what they send out in their newsletter.
If you're not familiar with resources like this, I could understand how you'd assume that the providers are hemorrhaging money on inference costs, because that is the story that gets parroted around spaces like Hacker News.
You could ignore all of that, though, and go check OpenRouter to see how much providers are selling high-parameter-count models for. They're not quite at the level of the SOTA models, but the biggest open-weight models are not that far behind in complexity either, and they're being sold an order of magnitude cheaper than what you pay for the APIs from the major players. We don't know exactly how big the major models are, but based on the leaks we do have, it's unlikely that they're more than 10x more compute-intensive.
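To spell that argument out: if open-weight serving is profitable at OpenRouter-style prices, the compute ratio puts a ceiling on what frontier inference can cost. The dollar figures below are illustrative placeholders, not actual quotes:

```python
# Illustrative bound, not real price quotes: if serving a big open-weight
# model is profitable at ~$2 per million tokens, and a frontier model needs
# at most ~10x the compute, the frontier API price still clears compute cost.
open_weight_price = 2.00    # $/1M tokens, assumed profitable price point
frontier_api_price = 25.70  # $/1M tokens, blended rate assumed earlier
max_compute_ratio = 10      # assumed upper bound from the leaks mentioned

frontier_cost_ceiling = open_weight_price * max_compute_ratio  # $20/1M tokens
print(frontier_api_price > frontier_cost_ceiling)
# -> True under these assumptions: a positive margin on compute
```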
If you’re demanding rigorous proof for only one side of an argument while assuming the other side must be true, you’re not interested in honest debate.
The cost of AI inference has been a heavily analyzed topic. I trust the professional analysts much more than the casual Hacker News commenter claiming providers are losing money per token, who is just repeating what they heard some other Hacker News commenter say.
There is absolutely no evidence to support this.