You getting six nines of accuracy on that with good latency? Did you watch the “...

valine · on July 14, 2023

The reason you would insert an LLM into the vision stack is to deal with the weird and unexpected. Tesla’s current stop sign approach is to train a classifier from scratch on thousands of stop sign images. It’s not surprising that this architecture can’t deal with stop signs that fall outside the training distribution.

LLMs with vision work completely differently. You’re leveraging the world model, built from a terabyte of text data, to aid your classification. The classic example of an image they handle well is a man ironing clothes on the back of a taxi. Where traditional image classifiers wouldn’t have a hope of handling that, vision LLMs describe it with ease.

https://llava.hliu.cc/