Literally the paragraph right before the one you quote is this:
> I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.
As I understand it, this article is basically a conglomeration of several attempts the author has made over the past decade or so to write about the impacts of AI on society. In their own words:
> Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.
As for the "Bitter Lesson" part, they pretty much directly said that it wasn't the Bitter Lesson exactly, saying it might be a variant of it. Honestly, it felt more like a way of throwing in a reference to something that also might provoke thought, which was done throughout the piece (which again, is the entire point).
It's totally valid to say "this article didn't provoke much thought for me". I'm a bit confused as to why you think a lack of specific knowledge in a domain that they literally state they are not an expert in would be disqualifying for that purpose though.
The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
> The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
Maybe some, but not that much given the disclaimers I cited above. There's value in a qualitative confidence level for a statement, and I'd argue that this is something that LLMs do not seem to produce in practice without someone explicitly asking for it. The human author's ability to anticipate potential mistakes in their logic and communicate those ahead of time is not equivalent to the type of fabrications that LLMs routinely make.
> If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
I don't know why an expert in LLM implementation would be inherently more qualified to analyze the second-order effects of their product than anyone else. There's precedent for people who are "too close" to something having biases that make them less effective at recognizing how tools will get used by non-experts, and society as a whole is largely composed of people who are not experts in LLM implementations. If you wanted to understand the net effect of everyone having access to LLMs, an understanding of people is probably more important than knowing exactly what an LLM does under the hood.
Might the conclusions be correct even if some of the facts are not? Even a stopped clock is right twice a day. And, "approximately correct" is still sometimes valuable.
The most obvious reason is that transformers accept a sequence as an input and produce a sequence as an output. The vast majority of pre-transformer architectures only accepted a fixed input and output size. Before 2016 I was somewhat interested in ML, but my curiosity vanished because of the fixed input and output size limitations.
RNNs, including LSTMs, were pretty bad at the time and difficult to train: gradients vanish or explode at long sequence lengths, and training has to step through the sequence one position at a time. Meanwhile, transformers can be parallelized along the sequence length.
Then there are theoretical limitations. Transformers re-read the entire sequence for every output. This leads to quadratic attention. There are plenty of papers that tell you why it is impossible to replicate the properties of quadratic attention with linear attention.
The reason is blatantly obvious. If you want linear attention to have the same capability, you need to re-read the entire input sequence after every output. If you do this at the token level, then you have basically implemented quadratic attention.
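A rough way to see where the quadratic cost comes from is to count comparisons. This is only a counting sketch (the function name is made up for illustration, and it ignores constant factors and any caching of intermediate results): if every new token looks back at every token so far, the total grows with the square of the length.

```python
def comparisons_when_rereading(n):
    """Count token-to-token comparisons if each new token looks at
    every token produced so far (itself included)."""
    total = 0
    for step in range(1, n + 1):
        total += step  # the token at this step compares against `step` tokens
    return total

for n in (512, 1024, 2048):
    # Doubling the length roughly quadruples the work.
    print(n, comparisons_when_rereading(n))
```

The closed form is n * (n + 1) / 2, which is the causal half of the full n * n score matrix.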
Transformers aren't a mystery success, they are using computational brute force, which is hard to beat with other architectures. If you go with a more efficient architecture, you are by definition giving up some non-zero capabilities. Nobody really cares about getting slightly worse results from a much more efficient architecture. In the current ML space, it's SOTA (state of the art) or go home.
Transformers do have a fixed input/output size though - that's what a context window is. It's just that, via scaling and algorithmic improvements, the length of usable context windows has increased to the point that they're much less of a bottleneck.
I think your points around parallelisation and the flexibility of quadratic attention are spot-on though.
Transformers have a fixed input size (padding the unneeded context window with null tokens). Whether you put in a sequence of things or just random tokens is irrelevant. To the network it is just "one input".
They also have a fixed output of one probability distribution for the next one token.
Running it in a loop does not mean it can work with sequences; by that definition, so can literally everything else.
Sorry, but that's false. You are conflating three things: the transformer as an architecture, auto-regressive generation, and padding during training.
Standard transformers take in an arbitrary input size and run blocks (self and possibly cross attention, positional encoding, MLPs) that don't care about its length.
> They also have a fixed output of one probability distribution for the next one token.
No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
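To make the "throw away all but the last distribution" step concrete, here is a toy sketch. `toy_model` is a made-up stand-in, not a real transformer: it just returns one distribution per input position and strongly favours "previous token + 1". The loop also picks the most likely token (greedy) rather than sampling, purely so the result is deterministic.

```python
def toy_model(tokens, vocab_size=5):
    """Hypothetical stand-in for a model: one probability distribution
    per input position, favouring (token + 1) mod vocab_size."""
    dists = []
    for tok in tokens:
        favoured = (tok + 1) % vocab_size
        dist = [0.1 / (vocab_size - 1)] * vocab_size
        dist[favoured] = 0.9
        dists.append(dist)
    return dists

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        dists = toy_model(tokens)   # one distribution per position
        last = dists[-1]            # only the final one is used at inference
        next_token = max(range(len(last)), key=last.__getitem__)  # greedy pick
        tokens.append(next_token)
    return tokens

print(generate([0, 1], 4))
```

The input grows by one token per iteration, and nothing in the loop assumes a fixed length.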
But a transformer doesn't have to be used for auto-regressive generation. You can use it for diffusion, as a classifier, or for embedding text. It doesn't even see a sequence as spatially organised: unlike a CNN or an RNN, it has no built-in architectural bias about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again, this is for a classic transformer without sliding window attention or any other special modifications.)
> (padding the unneeded context window with null tokens).
For efficient training, you pad all samples in a batch to the same length (and maybe round that up to a power of two). But if you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
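For what batch padding looks like in practice, here is a minimal sketch. The `pad_batch` helper and the `pad_id=0` choice are assumptions for illustration, not any particular library's API. The point is that padding only goes up to the longest sequence in the batch, and a mask records which positions are real.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch.
    Returns the padded batch plus a mask (1 = real token, 0 = padding)."""
    longest = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        extra = longest - len(seq)
        padded.append(list(seq) + [pad_id] * extra)
        mask.append([1] * len(seq) + [0] * extra)
    return padded, mask

batch, mask = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)
print(mask)

# A single sequence needs no padding at all.
single, single_mask = pad_batch([[5, 6, 7, 8, 9]])
print(single, single_mask)
```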
The network has a fixed number of input neurons. You have to put something in all of them.
If you enter "hello", the network might get " hello", but all of its inputs need some inputs. It doesn't (and can't) process tokens one at a time.
"No, in most implementations, they output a probability distribution for every token in the input."
A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P
> The network has a fixed number of input neurons. You have to put something in all of them.
The way transformers work is that they apply the same "input neurons" to each individual token! It's not:
Token 1 -> Neuron 1
Token 2 -> Neuron 2
Token 3 -> Neuron 3...
with any excess neurons going unused. It's:
Token 1 -> Vector of dimensions N -> ALL neurons
Token 2 -> Vector of dimensions N -> ALL neurons
Token 3 -> Vector of dimensions N -> ALL neurons
...
Grossly oversimplified, in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply them to each token, giving you, for each token, a "query", a "key", and a "value". You take the dot product of each token's query with every token's key, apply softmax, then use the resulting weights to sum up the values, giving you the vector to input for the next layer.
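Here is roughly what that looks like as code, as a single-head sketch in plain Python. The 2-dimensional weight matrices are invented for illustration, and the division by the square root of the key dimension is the usual scaling that the description above glosses over. Note that the same three matrices are applied to every token, so nothing depends on how many tokens there are.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: a list of embedding vectors, of any length."""
    # The same three projections are applied to every token.
    queries = [matvec(Wq, t) for t in tokens]
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per token.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

# Hypothetical weights, chosen only so the example runs.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.5], [-0.5, 0.5]]
Wv = [[1.0, 2.0], [0.0, 1.0]]

short = self_attention([[1.0, 0.0], [0.0, 1.0]], Wq, Wk, Wv)
longer = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                         [0.5, -0.5], [2.0, 0.0]], Wq, Wk, Wv)
print(len(short), len(longer))  # one output vector per input token
```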
> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
> In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not quite, the reason transformers train fast is because you can train on all columns at once.
For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
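A small sketch of both halves of that, with made-up token ids and made-up raw scores: the targets are just the inputs shifted by one, and the causal mask means row i of the attention weights puts zero weight on anything after position i. That is why every column gives a usable training signal in one pass.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query i against key j.
    Position i may only attend to positions 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = softmax(scores[i][: i + 1])
        weights.append(visible + [0.0] * (n - i - 1))  # future positions get 0
    return weights

tokens = [11, 12, 13, 14, 15]   # hypothetical token ids
inputs = tokens[:-1]            # what the model sees
targets = tokens[1:]            # what each position should predict
pairs = list(zip(inputs, targets))
print(pairs)

# Hypothetical raw scores for the 4 input positions.
scores = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```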