Are you getting six nines of accuracy on that with good latency? Did you watch “how our large driving model deals with stop signs” from Tesla’s AI department? Given the multiplicative effect of driving decisions and the weird real world out there, it has to be extremely reliable and robust to be a good driver as the miles mount up.
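To put rough numbers on the multiplicative effect, here is a back-of-the-envelope sketch. It assumes every decision is independent and has the same accuracy, which real driving certainly violates, and the function names are made up for illustration. Under those assumptions the chance of getting n decisions in a row right is just accuracy to the power n.

```python
import math


def p_no_error(per_decision_accuracy: float, n_decisions: int) -> float:
    """Probability of zero errors across n decisions.

    Assumes decisions are independent and equally accurate (a simplification).
    """
    return per_decision_accuracy ** n_decisions


def decisions_until_half(per_decision_accuracy: float) -> float:
    """Number of decisions after which an error-free run drops to 50% odds."""
    return math.log(0.5) / math.log(per_decision_accuracy)


if __name__ == "__main__":
    six_nines = 0.999999
    for n in (1_000, 100_000, 1_000_000, 10_000_000):
        print(f"{n:>10} decisions: P(no error) = {p_no_error(six_nines, n):.4f}")
    print(f"50/50 point: about {decisions_until_half(six_nines):,.0f} decisions")
```

Even at six nines, a million independent decisions leaves only about a 37% chance of a clean run, and the 50/50 point arrives at roughly 693,000 decisions. How many safety-relevant decisions a car makes per mile is a separate question this sketch doesn't try to answer.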
The reason you would insert an LLM into the vision stack is to deal with the weird and unexpected. Tesla’s current stop sign approach is to train a classifier from scratch on thousands of stop sign images. It’s not surprising that this architecture can’t deal with stop signs that fall outside the training distribution.
LLMs with vision work completely differently. You’re leveraging the world model, built from a terabyte of text data, to aid your classification. The classic example of an image they handle well is a man ironing clothes on the back of a taxi. Where traditional image classifiers wouldn’t have a hope of handling that, vision LLMs describe it with ease.