
This is a very interesting strategy that might pay off. This model is a very good option for enterprise self-hosting. I would argue a lot of companies are VRAM-constrained rather than compute-constrained. You could fit 4-5 running instances on one H100 cluster where you can only fit 1-2 instances of Kimi K2 or GLM5.
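The instance-count claim comes down to simple VRAM arithmetic. A rough sketch, where the node size, quantization width, and KV headroom are all illustrative assumptions rather than figures from the thread:

```python
# Rough instances-per-node arithmetic; all numbers are illustrative assumptions.
NODE_VRAM_GB = 8 * 80  # e.g. one 8x H100-80GB node

def instances_per_node(params_b, bytes_per_param=1.0, kv_headroom_gb=20):
    """How many model replicas fit in one node's VRAM.

    Each replica needs its weights (params_b billion parameters at
    bytes_per_param each, so fp8 ~= 1 byte) plus some KV-cache headroom.
    """
    per_instance_gb = params_b * bytes_per_param + kv_headroom_gb
    return int(NODE_VRAM_GB // per_instance_gb)

print(instances_per_node(128))    # 128B dense at fp8: several replicas per node
print(instances_per_node(1000))   # ~1T-total MoE at fp8: weights alone exceed one node
```

Under these assumptions a 128B model fits 4 replicas per node while a trillion-parameter MoE fits none without multi-node sharding, which is the gist of the parent's point.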



This is 128B dense though. The K/V cache at long context is going to be massive.

Don't think KV size correlates with dense/MoE.

KV size correlates with attention parameters which are a subset of active parameters. So a typical MoE model will have way lower KV size than a dense model of equal total parameter count.
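The standard back-of-envelope formula makes this concrete: KV cache size depends only on the attention dimensions (layers, KV heads, head dim), not on total parameter count. The model configs below are hypothetical stand-ins, not the actual specs of any model named in the thread:

```python
# KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * dtype_bytes
# Configs are illustrative assumptions, not real model specs.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-request KV-cache size in GiB (fp16/bf16 KV by default)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / 2**30

# Hypothetical 128B dense model at 128k context.
dense = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)

# Hypothetical ~1T-total MoE: far more total params, but KV size tracks the
# attention dims (a subset of *active* params), so its cache is comparable.
moe = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"dense: {dense:.1f} GiB, moe: {moe:.1f} GiB")
```

With these made-up configs the dense model's cache is actually the larger of the two (~39 GiB vs ~29 GiB per 128k-token request), despite having an eighth of the MoE's total parameters.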

With turbo quant, you would reduce it by over 6X.


