Hacker News

I actually made it, so I'm not sure how much credibility that gives it, but the tests are simply a set of various (quite simple) questions that each model is run against. I'm also surprised Gemini 3 Flash does so well (note that only the MEDIUM reasoning setting does exceptionally well).

When I look at the results, it does make sense, though. Larger models (like Gemini 3 Pro) tend to overthink, doubt themselves, and go with the wrong solution.

Claude usually fails in subtle ways, sometimes due to formatting or not respecting certain instructions.

Of the Chinese models, Qwen 3.5 Plus (Qwen3.5-397B-A17B) does extremely well. I actually started using it in an AI system for one of my clients, and today they emailed me to say they were impressed with one of the responses the AI gave a customer, so it does translate to real-world usage.

I am not testing any one specific thing; the categories there are just a hint as to what the tests are about.

I just added this page to maybe provide a bit more transparency, without divulging the tests: https://aibenchy.com/methodology/
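To make the setup concrete, here's a minimal sketch of what a pass/fail benchmark harness like this could look like. The questions, the stub model, and the exact-match scoring are all hypothetical illustrations, not the actual aibenchy.com tests or scoring (those are kept private):

```python
# Hypothetical sketch of a simple question-based benchmark harness.
# QUESTIONS and stub_model are illustrative stand-ins, not the real tests.

QUESTIONS = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def stub_model(prompt: str) -> str:
    # Stand-in for a real model API call.
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned.get(prompt, "")

def run_benchmark(model, questions):
    # Exact-match scoring: a response passes only if it equals the
    # expected answer after trimming whitespace.
    passed = sum(
        model(q["prompt"]).strip() == q["expected"] for q in questions
    )
    return passed / len(questions)

score = run_benchmark(stub_model, QUESTIONS)
print(f"score: {score:.0%}")
```

A real harness would swap `stub_model` for an API client and likely use fuzzier scoring (e.g. normalization or an LLM judge) for free-form answers, which is where subtle formatting failures like the ones I see with Claude start to matter.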
