I always wonder how consistent a given model's performance really is. Sometimes I ask for Claude Opus and the responses I get back are worse than other assistants' lowest-end models. Other times it surprises me and is clearly best in class.
Sometimes amid this variability a little survey pops up: "How's Claude doing this session from 1-5? 5 being great." And I suspect I'm in some experiment with extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
Well, knowing the state of the tech industry, they probably have a different, legal-team-approved definition of "reducing model quality" than the face-value one.
After all, using a smaller context window, subbing in a differently quantized model, throttling response length, or rate limiting features aren't technically "reducing model quality".