Hacker News

Why do none of the benchmarks test for hallucinations?



In the post, we did share one hallucination benchmark: claim-level errors fell by 33%, and responses containing an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts).

Hallucinations are the #1 problem with language models, and we are working hard to keep bringing the rate down.

(I work at OpenAI.)




