That's kinda the point here I think. GPT-3 is trained on much more data than SD and contains much more knowledge. SD is actually similar in size to GPT-2.
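For a rough sense of scale, here is a quick back-of-the-envelope sketch. The parameter counts are my own assumptions from commonly cited public figures, not numbers from this thread, and the SD figure in particular is a rounded total for the v1 pipeline (UNet plus text encoder plus VAE).

```python
# Approximate, commonly cited parameter counts.
# These are assumptions for illustration, not measurements.
PARAMS = {
    "GPT-2": 1.5e9,                # largest GPT-2 variant
    "GPT-3": 175e9,                # largest GPT-3 variant
    "Stable Diffusion v1": 1.0e9,  # rough total, rounded
}


def ratio(a: str, b: str) -> float:
    """Return how many times larger model `a` is than model `b`."""
    return PARAMS[a] / PARAMS[b]


gpt3_vs_gpt2 = ratio("GPT-3", "GPT-2")
gpt3_vs_sd = ratio("GPT-3", "Stable Diffusion v1")
sd_vs_gpt2 = ratio("Stable Diffusion v1", "GPT-2")

print(f"GPT-3 / GPT-2: {gpt3_vs_gpt2:.0f}x")
print(f"GPT-3 / SD:    {gpt3_vs_sd:.0f}x")
print(f"SD / GPT-2:    {sd_vs_gpt2:.2f}x")
```

Under those assumptions SD and GPT-2 are within the same order of magnitude, while GPT-3 is about two orders of magnitude larger than either.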

Image models of the same size as GPT-3 should be much more impressive. The difference will probably be quite large, like the difference between GPT-2 and GPT-3.

Ask SD to write a step-by-step guide to do something and it will create an image that looks kinda like a set of instructions, but the contents will be nonsense.

An image model of the size of GPT-3 could probably do this task quite well in many cases.

Image models also need much better language understanding to get to the next level though, so multimodal models probably make more sense. Maybe feeding web pages rendered as images to an image model could give interesting results.