
Mistral 7B runs inference about 18% faster for me as a 4-bit quantized model on an A100. That's definitely relevant when running anything but chatbots.


Are you measuring tokens/sec or words per second?

The difference matters: in my experience, Llama 3, by virtue of its giant vocabulary, generally tokenizes text with 20-25% fewer tokens than something like Mistral. So even if it's 18% slower in terms of tokens/second, it may, depending on the text content, actually output a given body of text faster.
