The phrase “the first token looks at 1 token” is shorthand for the self-attention step when the sequence length is one. There are no preceding tokens, so the first token attends only to itself (or to a special [BOS] token), and we count that step as an O(1^2) operation. Counting it this way supplies the first term of the sum over all tokens, so the overall big-O analysis is unchanged.
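
To make the counting concrete, here is a minimal sketch. It assumes a causal (decoder-style) mask in which each token attends to itself and to every earlier token, and it only counts attended positions rather than computing real attention. The names `causal_mask` and `attended_counts` are made up for illustration and do not come from any library.

```python
def causal_mask(n):
    """Row i marks which positions j token i may attend to (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def attended_counts(n):
    """Number of positions each token attends to under the causal mask."""
    return [sum(row) for row in causal_mask(n)]


n = 4
counts = attended_counts(n)   # first entry is 1: the first token sees only itself
total = sum(counts)           # equals n * (n + 1) // 2, which grows as O(n^2)
print(counts, total)
```

With a sequence length of one, the list has a single entry equal to 1, which is the "first token looks at 1 token" case. If a [BOS] token is prepended, the same count applies to [BOS] itself and every later count shifts up by one, which does not change the quadratic growth.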