> Reduce your expectations about speed and performance! Wildly understating this...

andai · 2026-02-05T01:12:06 1770253926

Yeah this is why I ended up getting a Claude subscription in the first place.

I was using GLM on the ZAI coding plan (jury-rigged Claude Code for $3/month), but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

To clarify, the code I was getting before mostly worked, it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.

Aurornis · 2026-02-05T02:10:02 1770257402

> but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

This is a very common sequence of events.

The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.

deaux · 2026-02-05T02:55:18 1770260118

Until last week, you would've been right. Kimi K2.5 is absolutely competitive for coding.

Unless you include it in "frontier", but that has usually been used to refer to "Big 3".

bigiain · 2026-02-05T03:19:32 1770261572

Looks like you need at least a quarter terabyte or so of ram to run that though?

(At today's RAM prices upgrading to that for me would pay for a _lot_ of tokens...)

tkz1312 · 2026-02-05T15:42:51 1770306171

unfortunately running anything locally for serious personal use makes no financial sense at all right now.

4x rtx 6000 pro is probably the minimum you need to have something reasonable for coding work.

deaux · 2026-02-05T16:03:22 1770307402

That's the setup you want for serious work yes, so probably $60kish all-in(?). Which is a big chunk of money for an individual, but potentially quite reasonable for a company. Being able to get effectively _frontier-level local performance_ for that money was completely unthinkable so far. Correct me if I'm wrong, but I think Deepseek R1 hardware requirements were far costlier on release, and it had a much bigger gap to market lead than Kimi K2.5. If this trend continues the big 3 are absolutely finished when it comes to enterprise and they'll only have consumer left. Altman and Amodei will be praying to the gods that China doesn't keep this rate of performance/$ improvement up while also releasing all as open weights.

tracker1 · 2026-02-05T17:39:25 1770313165

I'm not so sure on that... even if one $60k machine can handle the load of 5 developers at a time, you're still looking at 5 years of service to recoup $200/mo/dev and that doesn't even consider other improvements to hardware or the models service providers offer over that same period of time.

I'd probably rather save the capex, and use the rented service until something much more compelling comes along.
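
For anyone who wants to rerun those numbers, here is a minimal sketch of the break-even arithmetic. It uses only the figures quoted in this subthread ($60k hardware, 5 developers, $200/month per developer) and deliberately ignores electricity, maintenance, and depreciation, so treat it as a rough lower bound rather than a real cost model. The function name is just an illustration.

```python
def breakeven_months(hardware_cost: float, developers: int, plan_per_dev_month: float) -> float:
    """Months of avoided subscription fees needed to cover the hardware purchase."""
    monthly_savings = developers * plan_per_dev_month
    return hardware_cost / monthly_savings


# Figures from the comments above; power and upkeep are not included.
months = breakeven_months(hardware_cost=60_000, developers=5, plan_per_dev_month=200)
print(f"{months:.0f} months (~{months / 12:.0f} years)")  # 60 months (~5 years)
```

Changing `developers` or the plan price shows how sensitive the payback period is to how many people actually share the box.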

deaux · 2026-02-06T03:22:40 1770348160

At this point in time, 100% agreed. But what matters is the trend line. Two years ago nothing came close, if you wanted frontier-level "private" hosting you'd need an enterprise contract with OpenAI for many $millions. Then R1 came, it was incredibly expensive and still quite off. Now it's $60k and basically frontier.

tracker1 · 2026-02-06T16:13:25 1770394405

Of course... it's definitely interesting. I'm also thinking that there are times where you insource vs outsource to a SaaS that's going to do the job for you and you have one less thing to really worry about. Comparing cost to begin with was just a point I was curious about, so I ran the numbers. I can totally see a point where you have that power in a local developer workstation (power requirements notwithstanding), good luck getting that much power to an outlet in your home office. Let alone other issues.

Right now, I think we've probably got 3-5 years of manufacturing woes to work through and another 3-5 years beyond that to get power infrastructure where it needs to be to support it... and even then, I don't think all the resources we can reasonably throw at a combination of mostly nuclear and solar will get there as quickly as it's needed.

That also doesn't consider the bubble itself, or the level of poor to mediocre results altogether even at the frontier level. I mean for certain tasks, it's very close to human efforts in a really diminished timeframe, for others it isn't... and even then, people/review/qa/qc will become the bottleneck for most things in practice.

I've managed to get weeks of work done in a day with AI, but then still have to follow-up for a couple days of iteration on following features... still valuable, but it's mixed. I'm more bullish today than even a few months ago all the same.

Aurornis · 2026-02-05T03:28:29 1770262109

> Kimi K2.5 is absolutely competitive for coding.

Kimi K2.5 is good, but it's still behind the main models like Claude's offerings and GPT-5.2. Yes, I know what the benchmarks say, but the benchmarks for open weight models have been overpromising for a long time and Kimi K2.5 is no exception.

Kimi K2.5 is also not something you can easily run locally without investing $5-10K or more. There are hosted options you can pay for, but like the parent commenter observed: By the time you're pinching pennies on LLM costs, what are you even achieving? I could see how it could make sense for students or people who aren't doing this professionally, but anyone doing this professionally really should skip straight to the best models available.

Unless you're billing hourly and looking for excuses to generate more work I guess?

deaux · 2026-02-05T05:10:36 1770268236

I disagree, based on having used it extensively over the last week. I find it to be at least as strong as Sonnet 4.5 and 5.2-Codex on the majority of tasks, often better. Note that even among the big 3, each of them has a domain where they're better than the other two. It's not better than Codex (x-)high at debugging non-UI code - but neither is Opus or Gemini. It's not better than Gemini at UI design - but neither is Opus or Codex. It's not better than Opus at tool usage and delegation - but neither is Gemini or Codex.

ianlevesque · 2026-02-05T08:03:28 1770278608

Yeah Kimi-K2.5 is the first open weights model that actually feels competitive with the closed models, and I've tried a lot of them now.

deaux · 2026-02-05T16:06:03 1770307563

Same, I'm still not sure where it shines though. In each of the three big domains I named, the respective top performing closed model still seems to have the edge. That keeps me from reaching for it more often. Fantastic all-rounder for sure.

VladVladikoff · 2026-02-05T17:59:36 1770314376

What hardware are you running it on?

deaux · 2026-02-06T03:37:36 1770349056

I'm not running it locally, just using cloud inference. The people I know who do use RTX 6000s, picking the quant based on how many of them they've got. Chained M3 ultra setups are fine to play around with but too slow for actual use as a dev.

triage8004 · 2026-02-05T08:11:34 1770279094

Disagree it's behind gpt top models. It's just slightly behind opus

miroljub · 2026-02-05T12:43:07 1770295387

I've been using MiniMax-M2.1 lately. Although benchmarks show it comparable with Kimi 2.5 and Sonnet 4.5, I find it more pleasant to use.

I still have to occasionally switch to Opus in Opencode planning mode, but not having to rely on Sonnet anymore makes my Claude subscription last much longer.

bushbaba · 2026-02-05T05:39:08 1770269948

Many companies would be better off paying $200/month and laying off 1% of the workforce to pay for it.

apercu · 2026-02-05T12:08:05 1770293285

The issue is they often choose the wrong 1%.

undeveloper · 2026-02-05T06:38:07 1770273487

what tools / processes do you use to manage context

PeterStuer · 2026-02-05T08:43:33 1770281013

My very first tests of local Qwen-coder-next yesterday found it quite capable of acceptably improving Python functions when given clear objectives.

I'm not looking for a vibe coding "one-shot" full project model. I'm not looking to replace GPT 5.2 or Opus 4.5. But having a local instance running some Ralph loop overnight on a specific aspect for the price of electricity is alluring.

davidwritesbugs · 2026-02-05T07:55:43 1770278143

Similar experience to me. I tend to let glm-4.7 have a go at the problem then if it keeps having to try I'll switch to Sonnet or Opus to solve it. Glm is good for the low hanging fruit and planning

icedchai · 2026-02-05T03:44:41 1770263081

Same. I messed around with a bunch of local models on a box with 128GB of VRAM and the code quality was always meh. Local AI is a fun hobby though. But if you want to just get stuff done it’s not the way to go.

MuffinFlavored · 2026-02-05T01:22:34 1770254554

Did you eventually move to a $20/mo Claude plan, $100/mo Claude plan, $200/mo, or API based? if API based, how much are you averaging a month?

andai · 2026-02-05T02:16:50 1770257810

The $20 one, but it's hobby use for me, would probably need the $200 one if I was full time. Ran into the 5 hour limit in like 30 minutes the other day.

I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)

I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but sold out...

(My fallback is I got a whole year of GLM from ZAI for $20 for the year, it's just a bit too slow for interactive use.)

lostmsu · 2026-02-05T15:13:32 1770304412

Try Codex. It's better (subjectively, but objectively they are in the same ballpark), and its $20 plan is way more generous. I can use gpt-5.2 on high (prefer overall smarter models to -codex coding ones) almost nonstop, sometimes a few in parallel before I hit any limits (if ever).

holoduke · 2026-02-05T08:32:42 1770280362

I now have 3 x 100 plans. Only then am I able to use it full time. Otherwise I hit the limits. I am a heavy user. Often work on 5 apps at the same time.

auggierose · 2026-02-05T09:38:54 1770284334

Shouldn't the 200 plan give you 4x?? Why 3 x 100 then?

holoduke · 2026-02-05T13:36:31 1770298591

Good point. Need to look into that one. Pricing is also changing constantly with Claude

zozbot234 · 2026-02-04T22:05:11 1770242711

The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.

Aurornis · 2026-02-05T02:07:07 1770257227

> The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago

Kimi K2.5 is a trillion parameter model. You can't run it locally on anything other than extremely well equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.

Also the proprietary models a year ago were not that good for anything beyond basic tasks.

reilly3000 · 2026-02-04T22:11:11 1770243071

Which takes a $20k thunderbolt cluster of 2 512GB RAM Mac Studio Ultras to run at full quality…

0xbadcafebee · 2026-02-05T00:46:00 1770252360

Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

MuffinFlavored · 2026-02-05T01:23:17 1770254597

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

bigyabai · 2026-02-05T01:33:54 1770255234

> Are there a lot of options how "how far" do you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.

MuffinFlavored · 2026-02-05T02:01:59 1770256919

Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?

omneity · 2026-02-05T05:07:23 1770268043

It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will result in 4GB vram give or take, same as params. At 4 bits ~= 2GB and so on. Kimi is about 512GB at 4 bits.
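
A rough sketch of that rule of thumb in code, in case it helps. It counts weights only (no KV cache, context, or runtime overhead), treats 1 GB as 1e9 bytes, and assumes exactly 4 or 8 bits per parameter, whereas real quantization formats store a little extra for scales. The roughly 1T parameter count for Kimi K2.5 is taken from elsewhere in this thread.

```python
BITS_PER_PARAM = {"f32": 32, "f16": 16, "q8": 8, "q4": 4}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


for name, params in [("4B model", 4e9), ("~1T model (Kimi K2.5 scale)", 1e12)]:
    for fmt in ("f16", "q8", "q4"):
        print(f"{name} @ {fmt}: ~{weight_footprint_gb(params, fmt):,.0f} GB")
```

By this estimate, going from f16 to q4 cuts the weight footprint to a quarter, and a 1T-parameter model at q4 lands around 500 GB before any context is allocated.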

polynomial · 2026-02-05T06:23:52 1770272632

Depending on what your usage requirements are, Mac Minis running UMA over RDMA are becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)

https://buildai.substack.com/i/181542049/the-mac-mini-moment

danw1979 · 2026-02-05T12:26:54 1770294414

I did not expect this to be a limiting factor in the Mac mini RDMA setup!

> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.

Thermal throttling of network cables is a new thing to me…

cat_plus_plus · 2026-02-05T15:48:36 1770306516

I admire the patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model and get an answer in 30 seconds and then use cloud in rare cases it's not enough.

polynomial · 2026-02-05T17:26:43 1770312403

Luckily we're having a record cold winter and your setup can double as a personal space heater.

deaux · 2026-02-05T02:59:05 1770260345

And that's at unusable speeds - it takes about triple that amount to run it decently fast at int4.

Now as the other replies say, you should very likely run a quantized version anyway.

bigyabai · 2026-02-05T00:23:00 1770250980

"Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.

PlatoIsADisease · 2026-02-04T23:46:11 1770248771

[flagged]

zozbot234 · 2026-02-04T23:53:08 1770249188

70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and then quantized at int4. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.

PlatoIsADisease · 2026-02-05T00:00:09 1770249609

>may

I'm completely over these hypotheticals and 'testing grade'.

I know Nvidia VRAM works, not some marketing about 'integrated ram'. Heck look at /r/locallama/ There is a reason its entirely Nvidia.

hnfong · 2026-02-05T02:09:16 1770257356

> Heck look at /r/locallama/ There is a reason its entirely Nvidia.

That's simply not true. NVidia may be relatively popular, but people use all sorts of hardware there. Just a random couple of recent self-reported hardware from comments:

- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...

- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...

PlatoIsADisease · 2026-02-05T12:00:43 1770292843

I specifically mentioned "hypotheticals and 'testing grade'."

Then you sent over links describing such.

In real world use, Nvidia is probably over 90%.

hnfong · 2026-02-05T16:40:39 1770309639

r/LocalLLaMA is not entirely Nvidia.

You have a point that at scale everybody except maybe Google is using Nvidia. But r/LocalLLaMA is not your evidence of that, unless you apply your priors, filter out all the hardware that doesn't fit your so-called "hypotheticals and 'testing grade'" criteria, and engage in circular logic.

PS: In fact r/LocalLLaMA does not even cover your "real world use". Most mentions of Nvidia are people who have older GPUs, e.g. 3090s, lying around, or are looking at the Chinese VRAM mods to allow them to run larger models. Nobody is discussing how to run a cluster of H200s there.

K0balt · 2026-02-05T01:44:32 1770255872

Mmmm, not really. I have both a 4x 3090 box and an M1 Mac with 64GB. I find that the Mac performs about the same as a 2x 3090. That’s nothing stellar, but you can run 70b models at decent quants with moderate context windows. Definitely useful for a lot of stuff.

PlatoIsADisease · 2026-02-05T11:58:09 1770292689

>quants

>moderate context windows

Really had to modify the problem to make it seem equal? Not that quants are that bad, but the context windows thing is the difference between useful and not useful.

K0balt · 2026-02-06T19:08:19 1770404899

Equal to the 2x3090? Yeah it’s about equal in every way, context windows included.

As for useful at that scale?

I use mine for coding a fair bit, and I don’t find it a detractor overall. It enforces proper API discipline, modularity, and hierarchal abstraction. Perhaps the field of application makes that more important though. (Writing firmware and hardware drivers).

It also brings the advantage of focusing exclusively on the problems that are presented in the limited context, and not wandering off on side quests that it makes up.

I find it works well up to about 1KLOC at a time.

I wouldn’t imply they were equal to commercial models, but I would definitely say that local models are very useful tools.

They are also stable, which is not something I can say for SOTA models. You can learn how to get the best results from a model and the ground doesn’t move underneath you just when you’re on a roll.

sealeck · 2026-02-05T00:03:18 1770249798

Are you an NVIDIA fanboy?

This is a _remarkably_ aggressive comment!

PlatoIsADisease · 2026-02-05T01:11:06 1770253866

Not at all. I don't even know why someone would be incentivized by promoting Nvidia outside of holding large amounts of stock. Although, I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2xA6000s did work.

teaearlgraycold · 2026-02-04T22:13:37 1770243217

Which while expensive is dirt cheap compared to a comparable NVidia or AMD system.

SchemaLoad · 2026-02-04T22:39:29 1770244769

It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.

whatsupdog · 2026-02-05T03:23:38 1770261818

I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.

0. https://www.daifi.ai/

cactusplant7374 · 2026-02-04T23:01:22 1770246082

Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.

paxys · 2026-02-05T00:18:23 1770250703

Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.

cactusplant7374 · 2026-02-05T22:47:31 1770331651

I wouldn’t be so sure. Most users aren’t going to use up their quota every week.

teaearlgraycold · 2026-02-05T00:08:37 1770250117

For sure Claude Code isn’t profitable

bdangubic · 2026-02-05T00:50:04 1770252604

Neither was Uber and … and …

plagiarist · 2026-02-05T01:50:19 1770256219

Businesses will desire me for my insomnia once Anthropic starts charging congestion pricing.

bdangubic · 2026-02-05T13:32:52 1770298372

that is coming for sure to replace the "500" errors

blharr · 2026-02-04T22:35:19 1770244519

What speed are you getting at that level of hardware though?

paxys · 2026-02-04T22:24:43 1770243883

LOCAL models. No one is running Kimi 2.5 on their Macbook or RTX 4090.

deaux · 2026-02-05T03:01:40 1770260500

Some people spend $50k on a new car, others spend it on running Kimi K2.5 at good speeds locally.

No one's running Sonnet/Gemini/GPT-5 locally though.

DennisP · 2026-02-05T00:12:02 1770250322

On Macbooks, no. But there are a few lunatics like this guy:

https://www.youtube.com/watch?v=bFgTxr5yst0

HarHarVeryFunny · 2026-02-05T12:52:41 1770295961

Wow!

I've never heard of this guy before, but I see he's got 5M YouTube subscribers, which I guess is the clout you need to have Apple loan (I assume) you $50K worth of Mac Studios!

It'll be interesting to see how model sizes, capability, and local compute prices evolve.

A bit off topic, but I was in best buy the other day and was shocked to see 65" TVs selling for $300 ... I can remember the first large flat screen TVs (plasma?) selling for 100x that ($30K) when they first came out.

danw1979 · 2026-02-05T13:02:48 1770296568

He must be mad, accepting $50k of free (probably loaned?) hardware from Apple !

Great demo video though. Nice to see some benchmarks of Exo with this cluster across various models.

corysama · 2026-02-04T23:38:21 1770248301

The article mentions https://unsloth.ai/docs/basics/claude-codex

I'll add on https://unsloth.ai/docs/models/qwen3-coder-next

The full model is supposedly comparable to Sonnet 4.5. But you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. 8-bit needs 85GB.

teaearlgraycold · 2026-02-04T22:12:59 1770243179

Having used K2.5 I’d judge it to be a little better than that. Maybe as good as proprietary models from last June?

0xbadcafebee · 2026-02-05T00:43:51 1770252231

Kimi K2.5 is fourth place for intelligence right now. And it's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet. https://artificialanalysis.ai/models

EagnaIonat · 2026-02-05T07:54:57 1770278097

The secret is to not run out of quota.

Instead have Claude know when to offload work to local models and what model is best suited for the job. It will shape the prompt for the model. Then have Claude review the results. Massive reduction in costs.

btw, at least on Macbooks you can run good models with just M1 32GB of memory.

BuildTheRobots · 2026-02-05T08:12:43 1770279163

I don't suppose you could point to any resources on where I could get started. I have a M2 with 64gb of unified memory and it'd be nice to make it work rather than burning Github credits.

EagnaIonat · 2026-02-05T08:36:15 1770280575

https://ollama.com

Although I'm starting to like LMStudio more, as it has more features that Ollama is missing.

https://lmstudio.ai

You can then get Claude to create the MCP server to talk to either. Then a CLAUDE.md that tells it to read the models you have downloaded, determine their use and when to offload. Claude will make all that for you as well.
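
As a starting point, here is a minimal sketch of the call such an MCP server would ultimately make against a local Ollama instance. It assumes Ollama is running on its default port (11434) and uses its `/api/generate` endpoint in non-streaming mode; the function names and the model tag in the usage note are illustrative, and the MCP wrapper and CLAUDE.md routing rules described above are not shown.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With a model already pulled, something like `ask_local_model("gpt-oss:20b", "Summarize this function: ...")` would return the reply as a string (the tag is an assumption; use whatever `ollama list` shows on your machine).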

shen · 2026-02-05T16:39:15 1770309555

Which local models are you using for the 32gb MacBooks?

EagnaIonat · 2026-02-06T04:23:57 1770351837

Mainly gpt-oss-20b as the thinking mode is really good. I occasionally use granite4 as it is a very fast model. But any 4GB model should easily be used.

eek2121 · 2026-02-05T15:58:05 1770307085

LM Studio is fantastic for playing with local models.

kilroy123 · 2026-02-05T13:00:48 1770296448

I strongly think you're on to something here. I wish Apple would invest heavily in something like this.

The big powerful models think about tasks, then offload some stuff to a drastically cheaper cloud model or the model running on your hardware.

dheera · 2026-02-04T22:54:46 1770245686

Maybe add to the Claude system prompt that it should work efficiently or else its unfinished work will be handed off to a stupider junior LLM when its limits run out, and it will be forced to deal with the fallout the next day.

That might incentivize it to perform slightly better from the get go.

kridsdale3 · 2026-02-04T23:38:29 1770248309

"You must always take two steps forward, for when you are off the clock, your adversary will take one step back."

tracker1 · 2026-02-05T17:33:12 1770312792

For my relatively limited exposure, I'm not sure if I'd be able to tolerate it. I've found Claude/Opus to be pretty nice to work with... by contrast, I find Github Copilot to be the most annoying thing I've ever tried to work with.

Because of how the plugin works in VS Code, on my third day of testing with Claude Code, I didn't click the Claude button and was accidentally working with Copilot for about three hours of torture before I realized I wasn't in Claude Code. Will NEVER make that mistake again... I can only imagine anything I can run at any decent speed locally will be closer to the latter. I pretty quickly reach an "I can do this faster/better myself" point... even a few times with Claude/Opus, so my patience isn't always the greatest.

That said, I love how easy it is to build up a scaffold of a boilerplate app for the sole reason to test a single library/function in isolation from a larger application. In 5-10 minutes, I've got enough test harness around what I'm trying to work on/solve that it lets me focus on the problem at hand, while not worrying about doing this on the integrated larger project.

I've still got some thinking and experimenting to do with improving some of my workflows... but I will say that AI Assist has definitely been a multiplier in terms of my own productivity. At this point, there's literally no excuse not to have actual code running experiments when learning something new, connecting to something you haven't used before... etc. in terms of working on a solution to a problem. Assuming you have at least a rudimentary understanding of what you're actually trying to accomplish in the piece you are working on. I still don't have enough trust to use AI to build a larger system, or for that matter to truly just vibe code anything.

cat_plus_plus · 2026-02-05T15:30:45 1770305445

Depends on whether you want a programmer or a therapist. Given a clear description of class structure and key algorithms, Qwen3-Code is way more likely to do exactly what is being asked than any Gemini model. If you want to turn a vague idea into a design, yeah, a cloud bot is better. Let's not forget that cloud bots have web search; if you hook up a local model to GPT Researcher or an Onyx frontend, you will see reasonable performance, although open-ended research is where cloud model scale does pay off. Provided it actually bothers to search rather than hallucinating to save backend costs. Also, a local uncensored model is way better at doing proper security analysis of your app / network.

bityard · 2026-02-04T23:40:55 1770248455

Correct, a rack full of datacenter equipment is not going to compete with anything that fits on your desk or lap. Well spotted.

But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.

kamov · 2026-02-04T23:53:37 1770249217

What do you use local models for? I'm asking generally about possible applications of these smaller models

Lio · 2026-02-05T07:49:03 1770277743

Well for starters you get a real guarantee of privacy.

If you’re worried about others being able to clone your business processes if you share them with a frontier provider then the cost of a Mac Studio to run Kimi is probably a justifiable tax write-off.

Gravey · 2026-02-04T23:50:19 1770249019

Would you mind sharing your hardware setup and use case(s)?

CamperBob2 · 2026-02-04T23:55:28 1770249328

Not the GP but the new Qwen-Coder-Next release feels like a step change, at 60 tokens per second on a single 96GB Blackwell. And that's at full 8-bit quantization and 256K context, which I wasn't sure was going to work at all.

It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.

redwood_ · 2026-02-05T00:04:15 1770249855

Agree, these new models are a game changer. I switched from Claude to Qwen3-Coder-Next for day-to-day on dev projects and don't see a big difference. Just use Claude when I need comprehensive planning or review. Running Qwen3-Coder-Next-Q8 with 256K context.

paxys · 2026-02-05T03:13:51 1770261231

"Single 96GB Blackwell" is still $15K+ worth of hardware. You'd have to use it at full capacity for 5-10 years to break even when compared to "Max" plans from OpenAI/Anthropic/Google. And you'd still get nowhere near the quality of something like Opus. Yes there are plenty of valid arguments in favor of self hosting, but at the moment value simply isn't one of them.

lostmsu · 2026-02-05T15:30:38 1770305438

If you are not planning to batch, you can run it much cheaper with Ryzen AI Max SoC devices.

Hell, if you are willing to go even slower, any GPU + ~80GB of RAM will do it.

CamperBob2 · 2026-02-05T04:20:05 1770265205

Eh, they can be found in the $8K neighborhood, $9K at most. As zozbot234 suggests, a much cheaper card would probably be fine for this particular model.

I need to do more testing before I can agree that it is performing at a Sonnet-equivalent level (it was never claimed to be Opus-class.) But it is pretty cool to get beaten in a programming contest by my own video card. For those who get it, no explanation is necessary; for those who don't, no explanation is possible.

And unlike the hosted models, the ones you run locally will still work just as well several years from now. No ads, no spying, no additional censorship, no additional usage limits or restrictions. You'll get no such guarantee from Google, OpenAI and the other major players.

eek2121 · 2026-02-05T16:00:01 1770307201

I run it on my machine, which has a 4090 and 64GB RAM.

CamperBob2 · 2026-02-05T17:33:58 1770312838

How fast is it?

zozbot234 · 2026-02-05T00:01:22 1770249682

IIRC, that new Qwen model has 3B active parameters so it's going to run well enough even on far less than 96GB VRAM. (Though more VRAM may of course help wrt. enabling the full available context length.) Very impressive work from the Qwen folks.

dust42 · 2026-02-05T08:59:11 1770281951

The brand new Qwen3-Coder-Next runs at 300Tok/s PP and 40Tok/s TG on M1 64GB with 4-bit MLX quant. Together with Qwen Code (fork of Gemini) it is actually pretty capable.

Before that I used Qwen3-30B which is good enough for some quick javascript or Python, like 'add a new endpoint /api/foobar which does foobaz'. Also very decent for a quick summary of code.

It is 530Tok/s PP and 50Tok/s TG. If you have it spit out lots of the code that is just copy of the input, then it does 200Tok/s, i.e. 'add a new endpoint /api/foobar which does foobaz and return the whole file'
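
To put those PP (prompt processing) and TG (token generation) rates in terms of wall-clock time, here is a back-of-the-envelope sketch. The speeds are the ones quoted above; the 30,000-token prompt and 1,000-token reply are made-up illustrative sizes, and prompt caching or speculative decoding would change the picture.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tok_s: float, tg_tok_s: float) -> float:
    """Rough latency: time to ingest the prompt plus time to generate the reply."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


# Hypothetical request sizes, with the 300 Tok/s PP and 40 Tok/s TG figures above.
seconds = response_seconds(prompt_tokens=30_000, output_tokens=1_000,
                           pp_tok_s=300, tg_tok_s=40)
print(f"~{seconds:.0f} s per request")  # ~125 s per request
```

For agentic coding, where each step re-reads a large context, the PP term tends to dominate, which is why prompt-processing speed matters as much as generation speed on this kind of hardware.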

anon373839 · 2026-02-05T02:29:05 1770258545

It's true that open models are a half-step behind the frontier, but I can't say that I've seen "sheer intelligence" from the models you mentioned. Just a couple of days ago Gemini 3 Pro was happily writing naive graph traversal code without any cycle detection or safety measures. If nothing else, I would have thought these models could nail basic algorithms by now?
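
For context, the kind of safety measure being described can be as small as a visited set. This is a generic sketch of a cycle-safe breadth-first traversal, not the code from that Gemini session; the adjacency-dict graph shape and the function name are assumptions for illustration.

```python
from collections import deque


def reachable(graph: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first traversal that visits each node once, even if the graph has cycles."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:  # the guard a naive traversal omits
                visited.add(neighbor)
                queue.append(neighbor)
    return order


cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(reachable(cyclic, "a"))  # ['a', 'b', 'c']
```

Without the `visited` check, the same input would loop forever on the a → b → c → a cycle.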

cracki · 2026-02-05T08:03:47 1770278627

Did it have reason to assume the graph to be a certain type, such as directed or acyclic?

majormajor · 2026-02-05T05:54:33 1770270873

The amount of "prompting" stuff (meta-prompting?) the "thinking" models do behind the scenes even beyond what the harnesses do is massive; you could of course rebuild it locally, but it's gonna make it just that much slower.

I expect it'll come along but I'm not gonna spend the $$$$ necessary to try to DIY it just yet.

richstokes · 2026-02-05T00:25:24 1770251124

This. It's a false economy if you value your time even slightly, pay for the extra tokens and use the premium models.

seanmcdirmid · 2026-02-05T01:27:40 1770254860

> (ones you run on beefy 128GB+ RAM machines)

PC or Mac? A PC, ya, no way, not without beefy GPUs with lots of VRAM. A mac? Depends on the CPU, an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max CPU + 64GB of unified RAM (well, that's my setup at least).

QuantumNomad_ · 2026-02-05T01:36:12 1770255372

Which models do you use, and how do you run them?

seanmcdirmid · 2026-02-05T05:16:56 1770268616

I have a M3 max 64GB.

For VS Code code completion in Continue I'm using a Qwen3-coder 7b model. For CLI work and the sidebar, Qwen coder 32b. 8-bit quant for both.

I need to take a look at Qwen3-coder-next, it is supposed to have made things much faster with a larger model.

acchow · 2026-02-05T06:19:43 1770272383

I agree. You could spin for 100 hours on a sub-par model or get it done in 10 minutes with a frontier model

mycall · 2026-02-05T01:22:54 1770254574

There are tons of improvements coming in the near future. Even the Claude Code developer said he aimed at delivering a product that was built for future models he bet would improve enough to fulfill his assumptions. Parallel vLLM MoE local LLMs on a Strix Halo 128GB has some life in it yet.

0xbadcafebee · 2026-02-05T00:52:55 1770252775

The best local models are literally right behind Claude/Gemini/Codex. Check the benchmarks.

That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.

girvo · 2026-02-05T02:03:15 1770256995

The benchmarks simply do not match my experience though. I don’t put that much stock in them anymore.

Balinares · 2026-02-05T13:04:44 1770296684

I've repeatedly seen Opus 4.5 manufacture malpractice and then disable the checks complaining about it in order to be able to declare the job done, so I would agree with you about benchmarks versus experience.

mlrtime · 2026-02-05T03:11:35 1770261095

The local ones yeah...

I have claude pro $20/mo and sometimes run out. I just set ANTHROPIC_BASE_URL to a localllm API endpoint that connects to a cheaper Openai model. I can continue with smaller tasks with no problem. This has been done for a long time.
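
A minimal sketch of that override, for anyone who prefers to script it. It only prepares an environment with `ANTHROPIC_BASE_URL` set; the address is a hypothetical local proxy, and it is an assumption here that whatever listens there accepts the API format Claude Code sends. The helper name is made up for the example.

```python
import os


def claude_code_env(base_url: str) -> dict[str, str]:
    """Return a copy of the current environment with ANTHROPIC_BASE_URL overridden."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base_url
    return env


# Hypothetical local proxy address; replace with whatever your gateway listens on.
env = claude_code_env("http://localhost:4000")
# Passing env=env to subprocess.run(["claude"], ...) would start Claude Code against it.
```

Building a copy rather than mutating `os.environ` keeps the override scoped to the one process you launch with it.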

altern8 · 2026-02-05T16:05:42 1770307542

I was wondering the same thing, e.g. if it takes tens or hundreds of millions of dollars to train and keep a model up-to-date, how can an open source one compete with that?

gpm · 2026-02-05T16:44:23 1770309863

Less than a billion dollars to become the arbiter of truth probably sounds like a great deal to the well off dictatorial powers of the world. So long as models can be trained to have a bias (and it's hard to see that going away) I'd be pretty surprised if they stop being released for free.

Which definitely has some questionable implications... but just like with advertising it's not like paying makes the incentives for the people capable of training models to put their thumbs on the scales go away.

DANmode · 2026-02-05T00:11:52 1770250312

and you really should be measuring based on the worst-case scenario for tools like this.

bicx · 2026-02-04T22:53:05 1770245585

Exactly. The comparison benchmark in the local LLM community is often GPT _3.5_, and most home machines can’t achieve that level.

amelius · 2026-02-05T11:17:35 1770290255

And at best?

nik282000 · 2026-02-04T22:38:52 1770244732

> intelligence

Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.