Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The phrase, “the first token looks at 1 token,” is simply a shorthand for the self-attention step when the sequence length is one. Although there are no preceding tokens, we still treat it as an O(1^2) operation where the first token effectively attends to itself (or a special [BOS] token). This approach preserves the big-O analysis when summing over all tokens.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: