In late 2021, GLaM had 1.2T parameters. It's hard to find much use of it in the wild, and while the benchmarks it reports are rather dated, it scored 76.6% on HellaSwag and 73.5% on WinoGrande. GPT-3 scored 64.3% and 70.2%.
Meanwhile, Gemma 2 9B, a model from July 2024 with 133x fewer parameters than GLaM, scores 82% and 80.6% on those same two benchmarks. HellaSwag and WinoGrande have dropped out of modern benchmark suites, probably because they're too easy and largely memorised at this point.
And GPT-4 had 1.8T parameters, sure, but it's noticeably worse than modern models a fraction of its size, and the original incarnation was ridiculously expensive per token. In any case, that parameter count was only possible due to using mixture-of-experts, which I would classify as a sophisticated architecture rather than just throwing more parameters at a vanilla transformer. Even in 2021, GLaM was an MoE, because the limits of scaling dense transformers had already been hit.
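To make the MoE point concrete, here's a minimal sketch of why the headline parameter count of a mixture-of-experts model overstates the compute actually used per token: only the top-k routed experts fire for each token, so the "active" parameter count is far smaller than the stored total. The specific numbers below are illustrative assumptions chosen to land in the same ballpark as GLaM's published shape (64 experts per MoE layer, top-2 routing, ~1.2T total); they are not the real configuration of GLaM or GPT-4.

```python
# Illustrative sketch only: total stored parameters vs. parameters
# active per token in a mixture-of-experts transformer.
def moe_param_counts(shared_params, n_experts, expert_params, top_k):
    """Return (total stored params, params active per token)."""
    total = shared_params + n_experts * expert_params   # everything on disk / in memory
    active = shared_params + top_k * expert_params      # what a single token actually touches
    return total, active

# Hypothetical numbers (made up, merely shaped like GLaM's 64-expert, top-2 design).
total, active = moe_param_counts(
    shared_params=40e9,   # attention + dense layers shared by every token
    n_experts=64,         # experts stored per MoE layer, summed across layers
    expert_params=18e9,   # combined size of one expert across all MoE layers
    top_k=2,              # experts routed to per token
)
print(f"total ≈ {total/1e12:.2f}T params, active ≈ {active/1e9:.0f}B per token")
```

Run as-is, this prints roughly 1.19T total but only ~76B active per token, which is the sense in which a trillion-parameter MoE is not simply "a vanilla transformer with more parameters".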