Whisper.cpp has a coreml option which gives 3x speed up over cpu only according ...

zozbot234 · on May 3, 2025

Some outdated information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even more outdated information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... In short, the early (M1/M2) versions of ANE are unlikely to be useful for modern LLM inference due to their seemingly exclusive focus on statically scheduled FP16 and INT8 MADDs.

More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks, note that this is also similarly outdated) seems to basically confirm the above.

(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)

kamranjon · on May 3, 2025

I wouldn't say that they aren't useful for inference (there are pretty clear performance improvements even from the asahi effort you linked) - it's just that you have to convert the model ahead of time to be compatible with the ANE which is explained in the readme docs for whisper.cpp that I linked above.

I would say though that this likely excludes them from being useful for training purposes.

zozbot234 · on May 3, 2025

Note that I was only commenting on modern quantized LLMs that basically avoid formats like FP16 or INT8, preferring lower precision wherever feasible. When in-memory model values must be padded to FP16/INT8 this slashes your effective use of memory bandwidth, which is what determines token generation speed. So the only feasible benefits are really in the prompt pre-processing phase, and even then only in lower power use compared to GPU, not really in higher speed.
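A rough sketch of the bandwidth argument, assuming token generation streams every weight from memory once per token and nothing else matters. The 8B parameter count and the 100 GB/s figure are hypothetical values picked for illustration, not measurements of any ANE.

```python
def decode_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if every weight is read once per generated token."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical numbers, chosen only for illustration.
PARAMS_B = 8       # model size in billions of parameters
BANDWIDTH = 100    # usable memory bandwidth in GB/s

q4 = decode_tokens_per_second(PARAMS_B, 4, BANDWIDTH)     # weights kept at 4 bits
fp16 = decode_tokens_per_second(PARAMS_B, 16, BANDWIDTH)  # same weights padded to FP16

print(f"4-bit weights:  {q4:.2f} tok/s upper bound")
print(f"padded to FP16: {fp16:.2f} tok/s upper bound")
print(f"slowdown from padding: {q4 / fp16:.0f}x")
```

Under these assumptions the ceiling scales inversely with the bits read per weight, so padding 4-bit values to 16 bits costs a factor of four regardless of the bandwidth figure chosen.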

kamranjon · on May 3, 2025

That's really interesting! I didn't know that about the padding behavior here. I am interested to know which models this would include? I know Gemma 3 raw is bf16 - are you just talking about the quantized versions of these? Or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization Aware Training) model of Gemma 3 27b - but that base model was already released.

zozbot234 · on May 3, 2025

Models may be released as unquantized (and even then they are gradually shifting towards lower precisions over time), but most people are going to be running them in a quantized version simply because that gives you the best bang for your buck (you can fit more interesting models on the same hardware). Of course this is strictly about local LLM inference, though one may reasonably assume that the big players are also doing something similar.

conradev · on May 3, 2025

My understanding is that model throughput is fundamentally limited at some point by the fact that the ANE is less wide than the GPU.

At that point, the ANE loses because you have to split the model into chunks and only one fits at a time.

smpanaro · on May 3, 2025

What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. ANE has a much lower ceiling than CPU/GPU (yes, despite unified memory).

Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs and cached loads are negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.

conradev · on May 3, 2025

I was referring to both the lower memory bandwidth and lower FLOPs. The GPU can just do… more at once? For now. Or is that changing?

I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.

also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a bit ago

smpanaro · on May 4, 2025

For single batch inference of anything remotely LLM you'll hit the memory bound way before FLOPs, so I haven't actually looked at FLOPs much. For raw performance GPU is certainly better. ANE is more energy efficient, but you need larger batches to really benefit.
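A minimal roofline-style sketch of why single-batch inference hits the memory bound first. It assumes each weight is read once per forward pass and contributes one multiply-add (2 FLOPs) per sequence in the batch, and it ignores activation and KV-cache traffic. The 10 TFLOPS and 100 GB/s figures are hypothetical, not specs of any Apple part.

```python
def limiting_resource(batch, bytes_per_weight, peak_tflops, bandwidth_gb_s):
    """Return 'memory' or 'compute' depending on which ceiling is hit first."""
    # FLOPs performed per byte of weights read from memory.
    intensity = 2 * batch / bytes_per_weight
    # FLOPs the hardware can do in the time it takes to read one byte.
    machine_balance = (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)
    return "compute" if intensity >= machine_balance else "memory"

def break_even_batch(bytes_per_weight, peak_tflops, bandwidth_gb_s):
    """Batch size at which the two ceilings meet."""
    machine_balance = (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)
    return machine_balance * bytes_per_weight / 2

# Hypothetical accelerator: 10 TFLOPS peak, 100 GB/s bandwidth, FP16 weights.
PEAK_TFLOPS, BANDWIDTH_GB_S, FP16_BYTES = 10, 100, 2

for batch in (1, 16, 128):
    which = limiting_resource(batch, FP16_BYTES, PEAK_TFLOPS, BANDWIDTH_GB_S)
    print(f"batch {batch:>3}: {which} bound")

print("break-even batch:", break_even_batch(FP16_BYTES, PEAK_TFLOPS, BANDWIDTH_GB_S))
```

With these made-up numbers the crossover sits at a batch of 100, which is the sense in which larger batches are needed before compute (and compute efficiency) starts to matter.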

Maybe cache is the wrong word. This is a limit to how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit you have to unmap/remap in each forward pass which will be noticeable.

Awesome to hear about ModernBERT! Big fan of your work as well :)

anemll · on May 4, 2025

Right. I was thinking about it: you still need batch refill; however, Apple Core ML tools were failing on attention activation quantization. Long context, pre-fill is still compute bound.

echelon · on May 3, 2025

> coreml option which gives 3x speed up over cpu

Which is still painfully slow. CoreML is not a real ML platform.

jorvi · on May 3, 2025

...who is running LLMs on CPU instead of GPU or TPU/NPU?

kamranjon · on May 3, 2025

Actually that's a really good question, I hadn't considered that the comparison here is just CPU vs using Metal (CPU+GPU).

To answer the question though - I think this would be used for cases where you are building an app that wants to utilize a small AI model while at the same time having the GPU free to do graphics related things, which I'm guessing is why Apple stuck these into their hardware in the first place.

Here is an interesting comparison between the two from a whisper.cpp thread - ignoring startup times - the CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...

conradev · on May 3, 2025

It essentially never makes sense to run on the CPU and you will only ever see enthusiasts doing it.

Yes, hammering the GPU too hard can affect the display server, but no, switching to the CPU is not a good alternative

kamranjon · on May 3, 2025

Not switching to the CPU - switching to the ANE (Neural Cores) - if you read the research papers Apple has released - the example I gave is pretty much how it's being used - small image classification models running on the ANE, alongside a graphics app that needs the GPU to be free.

conradev · on May 3, 2025

Oh, yes, I misread! It’s great for that

fc417fc802 · on May 4, 2025

Depends on the size of the model and how much VRAM you have (and how long you're willing to wait).

yjftsjthsd-h · on May 3, 2025

Not all of us own GPUs worth using. Now, among people using Macs... Maybe if you had a hardware failure?

thot_experiment · on May 3, 2025

[flagged]

voidspark · on May 3, 2025

M3 Ultra has a big GPU with 819 GB/sec bandwidth.

LLM performance is twice as fast as RTX 5090

https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...

behnamoh · on May 3, 2025

> LLM performance is twice as fast as RTX 5090

your tests are wrong. you used MLX for Mac Studio (optimized for Apple Silicon) but you didn't use vLLM for 5090. There's no way a machine with half the bandwidth of 5090 delivers twice as fast tok/s.

seanmcdirmid · on May 3, 2025

Unless it’s a large model that doesn’t fit in the 5090, but that’s no longer a $4k Mac Studio I think.

behnamoh · on May 3, 2025

that's orthogonal to the speed discussion.

also, the GP was mostly testing models that fit in both 5090 and Mac Studio.

voidspark · on May 3, 2025

$4k will get you a 96 GB Mac Studio with M3 Ultra (819 GB/sec).

That's 3x the RAM of the 5090.

mdp2021 · on May 3, 2025

> That's 3x the RAM of the 5090

And a bit less than half the bandwidth (saying for completeness).
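The two ratios quoted here can be checked with a few lines. The 819 GB/s, 96 GB and 32 GB figures come from this thread; the 1792 GB/s figure for the RTX 5090 is an assumption taken from NVIDIA's published spec and is not stated anywhere above.

```python
# Figures quoted in the thread.
M3_ULTRA_BANDWIDTH_GB_S = 819
M3_ULTRA_RAM_GB = 96      # the $4k configuration mentioned above
RTX_5090_RAM_GB = 32

# Assumption: NVIDIA's published memory bandwidth for the RTX 5090.
RTX_5090_BANDWIDTH_GB_S = 1792

ram_ratio = M3_ULTRA_RAM_GB / RTX_5090_RAM_GB
bandwidth_ratio = M3_ULTRA_BANDWIDTH_GB_S / RTX_5090_BANDWIDTH_GB_S

print(f"RAM:       {ram_ratio:.1f}x the 5090")
print(f"Bandwidth: {bandwidth_ratio:.2f}x the 5090")
```

If token generation is bandwidth bound and the model fits on both machines, the bandwidth ratio is also a rough ceiling on the relative tok/s, which is the basis of the objection to the "twice as fast" claim.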

voidspark · on May 3, 2025

Yeah that's probably wrong. But the M3 Ultra is good enough for local inferencing, in any case.

kamranjon · on May 3, 2025

Pretty sure they're using the 80 GPU cores available in that case.

echelon · on May 3, 2025

And that still performs worse than entry-level Nvidia gaming cards.

Apple isn't serious about AI and needs to figure their AI story out. Every other big tech company is doing something about it.

kamranjon · on May 3, 2025

They're basically second place behind NVIDIA for model inference performance and often the only game in town for the average person if you're trying to run larger models that won't fit in the 16 or 24GB of memory available in top-shelf NVIDIA offerings.

I wouldn't say Apple isn't serious about AI, they had the forethought to build the shared memory architecture with the insane memory bandwidth needed for these types of tasks, while at the same time designing neural cores specifically for small on-device models needed for future apps.

I'd say Apple is currently ahead of NVIDIA in just sheer memory available - which for doing training and inference on large models, it's kinda crucial, at least right now. NVIDIA seems to be purposefully limiting the memory available in their consumer cards which is pretty short sighted I think.

balnazzar · on May 3, 2025

Not true. It performs 20-30% better than an RTX A6000 (I have both). Except it has more than 10 times the VRAM. For a comparison with newer Nvidia cards, benchmarks say it does substantially better than a 5070 Ti, a bit better than a 4080, and a bit worse than a 5080. But once again, it has 30 times the VRAM of the mentioned cards, which for AI workloads are just expensive toys due to their lack of VRAM.

voidspark · on May 3, 2025

Not for inferencing. M3 Ultra runs big LLMs twice as fast as RTX 5090.

https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...

RTX 5090 only has 32GB RAM. M3 Ultra has up to 512 GB with 819 GB/sec bandwidth. It can run models that will not fit on an RTX card.

EDIT: Benchmark may not be properly utilizing the 5090. But the M3 Ultra is way more capable than an entry level RTX card at LLM inferencing.

Spooky23 · on May 3, 2025

My little $599 Mac Mini does inference about 15-20% slower than a 5070 in my kids’ gaming rig. They cost about the same, and I got a free computer.

Nvidia makes an incredible product, but Apple's different market segmentation strategy might make it a real player in the long run.

balnazzar · on May 3, 2025

It can run models that cannot fit on TEN RTX 5090s (yes, it can run DeepSeek V3/R1, quantized at 4 bit, at an honest 18-19 tok/s, and that's a model you cannot fit into 10 5090s..).
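A back-of-the-envelope check of that fit claim, counting weights only. The 671B total parameter count for DeepSeek V3/R1 is an assumption taken from the model's published description and is not stated in this thread; the 32 GB and 512 GB figures are the ones quoted here. Real 4-bit formats store scales alongside the weights and inference also needs KV cache, so the true requirement is higher than this lower bound.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); ignores KV cache and overhead."""
    return params_billion * bits_per_weight / 8

DEEPSEEK_PARAMS_B = 671   # assumption: published total parameter count
RTX_5090_RAM_GB = 32      # per card, as quoted in the thread
MAC_STUDIO_RAM_GB = 512   # top M3 Ultra configuration, as quoted in the thread

needed = weight_footprint_gb(DEEPSEEK_PARAMS_B, 4)
ten_cards = 10 * RTX_5090_RAM_GB

print(f"4-bit weights: {needed:.1f} GB")
print(f"ten 5090s:     {ten_cards} GB -> fits: {needed <= ten_cards}")
print(f"Mac Studio:    {MAC_STUDIO_RAM_GB} GB -> fits: {needed <= MAC_STUDIO_RAM_GB}")
```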

voidspark · on May 3, 2025

Right, that's the $9500 Mac Studio with 512GB RAM and 80-core GPU.

16x the RAM of RTX 5090.

There are two versions of the M3 Ultra

28-core CPU, 60-core GPU

32-core CPU, 80-core GPU

Both have a 32-core Neural Engine.

briandear · on May 3, 2025

Can we stop with the derisive “fanboy” nonsense? Most people don’t say “FOSS” fanboy or Linux “fanboy”, but plenty of people here are exactly that. It’s a bit insulting to people that like and appreciate Mac hardware; just because you might not like it doesn’t mean you have to be so dismissive. And that Mac Studio is a very impressive computer, but it’s usually the ones that have never used one that seem to have the most opinions about them.