Fabrice Bellard has run a standard set of benchmarks w/ lm-eval on a big chunk of open models here: https://bellard.org/ts_server/ - on average, Flan T5 XXL and GPT-NeoX 20B both outperform Pythia 12B (the LLaMA models at 13B+ top the charts).