
Can you back this up with documentation? I don't believe that this is the case.



The router that routes tokens between the "experts" is itself part of the training. The name MoE is really not a good one, as it makes people believe the split happens at a much coarser level and that each expert is somehow trained on a different corpus, etc. But what do I know; there are new architectures every week, and someone might have done MoE differently.

It's not only per token: each layer also has its own router and can choose different experts. https://huggingface.co/blog/moe#what-is-a-mixture-of-experts...
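
Roughly, it looks like this (a minimal PyTorch sketch, not any specific model's implementation; the names and sizes are made up). The point is that the router is just a linear layer trained jointly with the experts, and each transformer layer would own its own instance, so routing differs per token and per layer:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MoELayer(nn.Module):
      def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
          super().__init__()
          self.top_k = top_k
          self.router = nn.Linear(d_model, n_experts)  # learned, not hand-coded
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          )

      def forward(self, x):                      # x: (batch, seq, d_model)
          logits = self.router(x)                # (batch, seq, n_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
          out = torch.zeros_like(x)
          for k in range(self.top_k):            # mix top-k expert outputs per token
              for e, expert in enumerate(self.experts):
                  mask = (idx[..., k] == e)      # tokens routed to expert e in slot k
                  if mask.any():
                      out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
          return out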

Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they can all handle each token but some are better positioned to do so.
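
To illustrate why that's even possible (this is a toy sketch reusing the MoELayer from the comment above, not the actual REAP procedure): because every token is softly routed over all experts, you can simply stop the least-used experts from ever winning top-k and let the remaining weights renormalize:

  import torch

  @torch.no_grad()
  def prune_experts(moe, usage_counts: torch.Tensor, n_drop: int):
      """usage_counts: (n_experts,) selection counts from a calibration run;
      n_drop: how many of the least-used experts to disable."""
      drop = usage_counts.argsort()[:n_drop]    # least-used expert ids
      moe.router.bias[drop] = float("-inf")     # they can never be selected again
      for e in drop:                            # optionally free their weights
          moe.experts[e] = torch.nn.Identity()
      return drop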



