You can't really regression test a model this way because training isn't deterministic, so the output isn't stable across runs. Instead you have a huge test suite (you could even think of it more as an eval benchmark) and you try to determine whether, in aggregate, you're doing better or worse.
As a more concrete example, imagine you're training an image classifier, and you have a bunch of ground-truth human-labeled images. If version X gets a 95% score on the image classification task, and version X+1 gets a 96% score, you're probably going to prefer model X+1 over model X, even though some images that model X labeled correctly weren't labeled correctly by model X+1. Obviously you can weight some tasks more heavily when you do the eval, but whatever you do, you have to accept that new models aren't always going to be strictly better at every task than the old model.
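To make the idea concrete, here's a minimal sketch of that kind of aggregate comparison. All the names and the tiny eval set are hypothetical; the point is that the new model can win on aggregate accuracy while still regressing on individual examples.

```python
def compare_models(labels, preds_old, preds_new):
    """Compare two models on the same labeled eval set.

    Returns aggregate accuracy for each model, plus the per-example churn:
    'fixed'     = examples the new model gets right that the old got wrong,
    'regressed' = examples the old model got right that the new gets wrong.
    """
    correct_old = [p == y for p, y in zip(preds_old, labels)]
    correct_new = [p == y for p, y in zip(preds_new, labels)]
    n = len(labels)
    return {
        "acc_old": sum(correct_old) / n,
        "acc_new": sum(correct_new) / n,
        "fixed": sum(1 for o, c in zip(correct_old, correct_new) if not o and c),
        "regressed": sum(1 for o, c in zip(correct_old, correct_new) if o and not c),
    }

# Hypothetical eval set: model X scores 3/5, model X+1 scores 4/5,
# yet X+1 still regresses on one example X handled correctly.
labels   = ["cat", "dog", "cat", "bird", "dog"]
preds_x  = ["cat", "dog", "dog", "bird", "cat"]   # misses indices 2 and 4
preds_x1 = ["dog", "dog", "cat", "bird", "dog"]   # fixes 2 and 4, regresses on 0

report = compare_models(labels, preds_x, preds_x1)
# report["acc_new"] > report["acc_old"], but report["regressed"] > 0
```

In practice you'd also break the churn down per category or per task, since a regression concentrated in one important slice can matter more than the aggregate number suggests.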