What is the meaning of 'A3B'?

simonw · 2026-03-04T17:00:18 1772643618

It's the number of active parameters for a Mixture of Experts (misleading name IMO) model.

Qwen3.5-35B-A3B means that the model itself consists of 35 billion floating point numbers - very roughly 35GB of data - which are all loaded into memory at once.

But... on any given pass through the model weights, only 3 billion of those parameters are "active", aka have matrix arithmetic applied against them.

This speeds up inference considerably because the computer has to do fewer operations for each token that is processed. It still needs the full amount of memory though, as the 3B active parameters it uses are likely different on every iteration.
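A rough back-of-the-envelope sketch of the numbers above. The bytes-per-parameter values are assumptions about how the weights are stored (the "35GB" figure corresponds to roughly one byte per parameter), and the helper name is made up for illustration. It ignores the KV cache, activations, and any shared layers.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

TOTAL_B = 35.0   # total parameters, in billions (the "35B" in the name)
ACTIVE_B = 3.0   # active parameters per token, in billions (the "A3B")

# Assumed storage formats; which one applies depends on the quantization used.
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    total_gb = weight_memory_gb(TOTAL_B, bytes_per_param)
    active_gb = weight_memory_gb(ACTIVE_B, bytes_per_param)
    print(f"{label}: ~{total_gb:.1f} GB resident, ~{active_gb:.1f} GB touched per token")

# Fraction of the weights that take part in the arithmetic for one token.
active_fraction = ACTIVE_B / TOTAL_B
print(f"active fraction: {active_fraction:.1%}")
```

The point of the split is that memory scales with the 35B figure while per-token compute scales with the 3B figure.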

zozbot234 · 2026-03-04T19:34:06 1772652846

It will benefit from the full amount of memory for sure, but AIUI if you use system memory and mmap for your experts, you can execute the model with only enough memory for the active parameters. It's just unbearably slow, since it has to swap in new experts for every token. So the more memory you have in excess of that, the more inactive but often-used experts can be kept in RAM for better performance.
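To make the mmap idea concrete, here is a toy sketch using only Python's standard library. The file layout (eight fixed-size "experts" stored back to back as float32), the sizes, and the function names are all hypothetical; real inference engines use their own weight formats. It only shows that slicing a memory-mapped file reads the requested expert's bytes on demand, leaving it to the OS to decide which pages stay in RAM.

```python
import mmap
import os
import struct
import tempfile

NUM_EXPERTS = 8                        # hypothetical toy sizes
FLOATS_PER_EXPERT = 1024
EXPERT_BYTES = FLOATS_PER_EXPERT * 4   # float32 is 4 bytes

def write_toy_weights(path):
    """Write NUM_EXPERTS blocks; expert e is filled with the value float(e)."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            values = [float(e)] * FLOATS_PER_EXPERT
            f.write(struct.pack(f"<{FLOATS_PER_EXPERT}f", *values))

def read_expert(mm, expert_id):
    """Slice one expert out of the mapping; the OS pages in the bytes touched here."""
    start = expert_id * EXPERT_BYTES
    raw = mm[start:start + EXPERT_BYTES]
    return struct.unpack(f"<{FLOATS_PER_EXPERT}f", raw)

def demo(expert_ids):
    """Map the toy file and return the first weight of each requested expert."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_experts.bin")
        write_toy_weights(path)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return [read_expert(mm, e)[0] for e in expert_ids]

print(demo([3, 5]))
```

With spare RAM, the page cache keeps recently used experts around, which is the "often-used experts can be kept in RAM" effect described above.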

EnPissant · 2026-03-04T22:32:08 1772663528

The ability to stream weights from disk has nothing to do with MoE or not. You can always do this. It will be unusable either way.

zozbot234 · 2026-03-04T22:55:52 1772664952

Agreed, but for a dense model you'd have to stream the whole model for every token, whereas with MoE there's at least the possibility that some experts may be "cold" for any given request and not be streamed in or cached. This will probably become more likely as models get even sparser. (The "it's unusable" judgment is correct if you're considering close-to-minimum requirements, but for just getting a model to fit, caching "almost all of it" in RAM may be an excellent choice.)
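A toy simulation of that difference, under loud assumptions: weights are split into 16 equal blocks, RAM is modelled as an LRU cache holding 12 of them, the dense model touches every block for every token, and the MoE model touches 2 blocks per token chosen with a made-up skewed popularity. Real routing happens per layer and real page caches are not pure LRU, so treat this as an illustration only.

```python
import random
from collections import OrderedDict

def count_misses(accesses, capacity):
    """Count how many block accesses miss an LRU cache of the given capacity."""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # would have to be streamed from disk
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

def dense_accesses(num_blocks, num_tokens):
    """Dense model: every token reads every block, in order."""
    return [b for _ in range(num_tokens) for b in range(num_blocks)]

def moe_accesses(num_experts, top_k, num_tokens, seed=0):
    """MoE model: every token reads top_k distinct experts (skew is an assumption)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_experts)]
    accesses = []
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        accesses.extend(sorted(chosen))
    return accesses

dense = dense_accesses(num_blocks=16, num_tokens=50)
moe = moe_accesses(num_experts=16, top_k=2, num_tokens=50)
print("dense misses:", count_misses(dense, capacity=12), "of", len(dense))
print("moe misses:  ", count_misses(moe, capacity=12), "of", len(moe))
```

In this model the dense case misses on every access, because cycling through 16 blocks with room for only 12 evicts each block just before it is needed again. The MoE case only misses when a cold expert is requested.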

EnPissant · 2026-03-05T02:18:48 1772677128

Unlike offloading weights from VRAM to system RAM, I just can't see a situation where you would want to offload to an SSD. The difference is just too large, and any model so large you can't run it in system RAM is going to be so large it is probably unusable except in VRAM.
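One way to put a number on "the difference is just too large" is a bandwidth-bound estimate: if every token requires reading some number of bytes from a storage tier, tokens per second cannot exceed bandwidth divided by bytes read. The bandwidth figures below are assumed round order-of-magnitude placeholders, not measurements of any particular hardware, and the 3 GB per token figure assumes 3B active parameters at one byte each with nothing cached.

```python
def tokens_per_second_upper_bound(bytes_read_per_token, bandwidth_bytes_per_s):
    """Upper bound on throughput when reading weights is the only bottleneck."""
    return bandwidth_bytes_per_s / bytes_read_per_token

BYTES_PER_TOKEN = 3e9  # assumption: 3B active parameters at 1 byte each, no caching

# Assumed placeholder bandwidths; substitute measured values for real hardware.
ASSUMED_BANDWIDTH = {
    "SSD": 5e9,
    "system RAM": 50e9,
    "VRAM": 500e9,
}

for tier, bandwidth in ASSUMED_BANDWIDTH.items():
    bound = tokens_per_second_upper_bound(BYTES_PER_TOKEN, bandwidth)
    print(f"{tier}: at most ~{bound:.1f} tokens/s")
```

Whatever the exact figures, the ratio between tiers carries straight through to the throughput bound, which is the gap being argued about here.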