Indeed, really simple. And yes, the results are shockingly good. But what I find most remarkable about this is that the ViT-L+Q-former's hidden states are related by only a linear projection (plus bias) to the Vicuna-13B's token embeddings:
emb_in_vicuna_space = emb_in_qformer_space @ W + B
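For concreteness, that affine map is just a matrix multiply plus a bias add. Here's a minimal numpy sketch using the dimensions discussed further down (768 for the Q-former, 5120 for Vicuna-13B); the real W and B are learned, so random stand-ins are used here, and the token count is made up for the example:

```python
import numpy as np

# Hypothetical shapes: Q-former hidden size 768, Vicuna-13B hidden size 5120.
d_qformer, d_vicuna = 768, 5120
n_tokens = 32  # illustrative only

rng = np.random.default_rng(0)
# Random stand-ins for the trained projection weight W and bias B.
W = rng.normal(size=(d_qformer, d_vicuna)) * 0.02
B = np.zeros(d_vicuna)

emb_in_qformer_space = rng.normal(size=(n_tokens, d_qformer))
emb_in_vicuna_space = emb_in_qformer_space @ W + B
print(emb_in_vicuna_space.shape)  # (32, 5120)
```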
These two models are trained independently of each other, on very different data (RGB images vs integer token ids representing subwords), and yet somehow they learn to embed different data in feature vectors that are so... similar. WHY should that be the case?
It suggests to me there may be something universal about the embedding layers and hidden states of all trained deep learning models.
I think it’s just that affine transforms in high dimensions are surprisingly expressive. Since the map only has to be right on the sparse set of points the data actually occupies, it’s much less constrained than the low-dimensional affine transformations we usually picture.
Good point. Didn't think of that. It's a plausible explanation here, because the dimensionality of the spaces is so different, 5120 vs 768. Not surprisingly, the trained weight matrix has rank 768: it's using every feature in the lower-dimensional space.
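The rank cap is easy to check numerically. A quick numpy sketch, with a random stand-in matrix since the trained weights aren't reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the trained 768 -> 5120 projection; rank is capped at min(768, 5120).
W = rng.normal(size=(768, 5120))
rank = np.linalg.matrix_rank(W)
print(rank)  # 768: every feature of the lower-dimensional space is used
```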
Still, it's kind of shocking that it works so well!
I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.
The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," "greenness," etc. in the input space.
If both spaces have the same dimensionality, the learned weight matrix would be full-rank only if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
A random n×n matrix is full rank (almost surely)... So it's kinda the default: any amount of noise in the embedding is going to result in a full-rank transformation.
So it's really a less-than-full-rank matrix that would require an explanation - i.e., why does this image representation project into a perfectly isolated subspace of the language representation (or vice versa)?
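A quick numpy check of both claims: a random square matrix is (almost surely) full rank, and even tiny noise lifts an exactly low-rank matrix back to full rank. The size n is arbitrary here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512

# A random n x n Gaussian matrix is full rank almost surely.
A = rng.normal(size=(n, n))
print(np.linalg.matrix_rank(A))  # 512

# An exactly rank-1 matrix (outer product of two vectors)...
low_rank = np.outer(rng.normal(size=n), rng.normal(size=n))
print(np.linalg.matrix_rank(low_rank))  # 1

# ...becomes full rank once any noise is added.
noisy = low_rank + 1e-6 * rng.normal(size=(n, n))
print(np.linalg.matrix_rank(noisy))  # 512
```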
If that happened I would start looking for things like a vocabulary of smell which is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...
>somehow they learn to embed different data in feature vectors that are so... similar
At its core, BLIP2 already projects RGB inputs into text token space, and Vicuna (or rather LLaMA) uses such tokens as inputs as well as outputs. The only reason a linear layer is needed at all is that they are not trained at the same time, so you still have to move text embeddings from one space to another. But it should not be surprising at all that a single hidden linear layer suffices to do just that (see the universal approximation theorem [1]). This approach is just an efficient way to combine different models for downstream fine-tuning tasks while keeping their weights frozen, but it is neither new nor particularly surprising.
Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
>I don't see how the universal approximation theorem is relevant here. The fact that deep models
The universal approximation theorem is precisely not about deep models. Deep means many layers. But in the simplest (and proven) case, a perceptron with a single hidden layer is all it takes, according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
You'd need to increase width (dimensionality) if you make these models shallow.
My point still stands: The fact that models with sufficient capacity can approximate any function does not imply that two models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
The UAT states that depth is fundamentally not important, at least theoretically; it only has immense practical uses. So adding an intermediate linear layer plus some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping - at least as long as it's somewhat continuous.
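Not a proof of anything, but a toy illustration of width buying approximation power: a single hidden tanh layer with random frozen input weights, where only the linear readout is fit by least squares (a random-features sketch, not end-to-end training). The 1-D target function is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 500)[:, None]
target = np.sin(2 * x).ravel()  # arbitrary smooth target for illustration

def approx_error(width):
    # One hidden tanh layer with random frozen weights;
    # only the linear readout is fit, by least squares.
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    H = np.tanh(x @ W + b)  # (500, width) hidden activations
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.max(np.abs(H @ coef - target))

errors = {n: approx_error(n) for n in (4, 16, 64, 256)}
print(errors)  # error shrinks as the hidden layer widens
```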
BLIP2 is a contrastive image-language model. The embeddings from the BLIP2 image model are already aligned with text, and already linear. It should not be a surprise that only a projection is required to translate them to LLaMA's embedding space.
See also this: https://llava-vl.github.io/. And I just found a paper from a few months ago that demonstrated exactly this (that language and vision models somehow learn representations similar enough that a linear projection suffices): https://arxiv.org/abs/2209.15162
Man, you need to look at this - https://llava-vl.github.io/. They project from CLIP directly with a linear layer. With BLIP-2, you could say it already converts RGB into token space.