Indeed, really simple. And yes, the results are shockingly good. But what I find most remarkable about this is that the ViT-L+Q-former's hidden states are related by only a linear projection (plus bias) to the Vicuna-13B's token embeddings:
emb_in_vicuna_space = emb_in_qformer_space @ W + B
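For concreteness, that affine map is just a matrix multiply plus a bias add. Here's a minimal numpy sketch using the dimensions discussed further down (768 for the Q-former, 5120 for Vicuna-13B); the real W and B are learned, so random stand-ins are used here, and the token count is made up for the example:

```python
import numpy as np

# Hypothetical shapes: Q-former hidden size 768, Vicuna-13B hidden size 5120.
d_qformer, d_vicuna = 768, 5120
n_tokens = 32  # illustrative only

rng = np.random.default_rng(0)
# Random stand-ins for the trained projection weight W and bias B.
W = rng.normal(size=(d_qformer, d_vicuna)) * 0.02
B = np.zeros(d_vicuna)

emb_in_qformer_space = rng.normal(size=(n_tokens, d_qformer))
emb_in_vicuna_space = emb_in_qformer_space @ W + B
print(emb_in_vicuna_space.shape)  # (32, 5120)
```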
These two models are trained independently of each other, on very different data (RGB images vs integer token ids representing subwords), and yet somehow they learn to embed different data in feature vectors that are so... similar. WHY should that be the case?
It suggests to me there may be something universal about the embedding layers and hidden states of all trained deep learning models.
I think it’s just that affine transforms in high dimensions are surprisingly expressive. Since the map only has to be right on the sparse set of points the data actually occupies, it’s much less constrained than the low-dimensional affine transformations we usually picture.
Good point. Didn't think of that. It's a plausible explanation here, because the dimensionality of the spaces is so different, 5120 vs 768. Not surprisingly, the trained weight matrix has rank 768: it's using every feature in the lower-dimensional space.
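The rank cap is easy to check numerically. A quick numpy sketch, with a random stand-in matrix since the trained weights aren't reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the trained 768 -> 5120 projection; rank is capped at min(768, 5120).
W = rng.normal(size=(768, 5120))
rank = np.linalg.matrix_rank(W)
print(rank)  # 768: every feature of the lower-dimensional space is used
```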
Still, it's kind of shocking that it works so well!
I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.
The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," "greenness," etc. in the input space.
If both spaces have the same dimensionality, the learned weight matrix would be full-rank only if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
A random n×n matrix is full rank (almost surely)... So it's kinda the default: any amount of noise in the embedding is going to result in a full-rank transformation.
So it's really a less-than-full-rank matrix that would require an explanation - i.e., why does this image representation project into a perfectly isolated subspace of the language representation (or vice versa)?
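A quick numpy check of both claims: a random square matrix is (almost surely) full rank, and even tiny noise lifts an exactly low-rank matrix back to full rank. The size n is arbitrary here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512

# A random n x n Gaussian matrix is full rank almost surely.
A = rng.normal(size=(n, n))
print(np.linalg.matrix_rank(A))  # 512

# An exactly rank-1 matrix (outer product of two vectors)...
low_rank = np.outer(rng.normal(size=n), rng.normal(size=n))
print(np.linalg.matrix_rank(low_rank))  # 1

# ...becomes full rank once any noise is added.
noisy = low_rank + 1e-6 * rng.normal(size=(n, n))
print(np.linalg.matrix_rank(noisy))  # 512
```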
If that happened I would start looking for things like a vocabulary of smell which is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...
>somehow they learn to embed different data in feature vectors that are so... similar
At its core, BLIP2 already projects RGB inputs into text token space, and Vicuna (or rather LLaMA) uses such tokens as inputs as well as outputs. The only reason a linear layer is needed at all is that they are not trained at the same time, so you still have to move text embeddings from one space to another. But it should not be surprising at all that a single hidden linear layer suffices to do just that (see the universal approximation theorem [1]). This approach is just an efficient way to combine different models for downstream fine-tuning tasks while keeping their weights frozen, but it is neither new nor particularly surprising.
Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
>I don't see how the universal approximation theorem is relevant here. The fact that deep models
The universal approximation theorem is precisely not about deep models. Deep means many layers. But in the simplest (and proven) case, a perceptron with a single hidden layer is all it takes, according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
You'd need to increase width (dimensionality) if you make these models shallow.
My point still stands: The fact that models with sufficient capacity can approximate any function does not imply that two models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
The UAT states that depth is fundamentally not important, at least theoretically; it only has immense practical uses. So adding an intermediate linear layer plus some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping - at least as long as it's somewhat continuous.
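Not a proof of anything, but a toy illustration of width buying approximation power: a single hidden tanh layer with random frozen input weights, where only the linear readout is fit by least squares (a random-features sketch, not end-to-end training). The 1-D target function is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 500)[:, None]
target = np.sin(2 * x).ravel()  # arbitrary smooth target for illustration

def approx_error(width):
    # One hidden tanh layer with random frozen weights;
    # only the linear readout is fit, by least squares.
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    H = np.tanh(x @ W + b)  # (500, width) hidden activations
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.max(np.abs(H @ coef - target))

errors = {n: approx_error(n) for n in (4, 16, 64, 256)}
print(errors)  # error shrinks as the hidden layer widens
```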
BLIP2 is a contrastive image-language model. The embeddings from the BLIP2 image model are already aligned with text, and already linear. It should not be a surprise that only a projection is required to translate them to LLaMA's embedding space.
See also this: https://llava-vl.github.io/. And I just found a paper from a few months ago that demonstrated exactly this (that language and vision models somehow learn representations similar enough that a linear projection suffices): https://arxiv.org/abs/2209.15162
Man, you need to look at this - https://llava-vl.github.io/. They project from CLIP directly with a linear layer. With BLIP-2, you could say it already converts RGB into token space.