ML - Transformers & Attention
The architecture behind nearly every modern LLM (GPT, Claude, Llama), introduced in "Attention Is All You Need" (Vaswani et al., 2017). Transformers largely replaced RNNs and LSTMs because attention processes all positions of a sequence in parallel rather than one step at a time, and it captures long-range dependencies that recurrent models struggle with.
1. The Core Idea: Self-Attention

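For each token in a sequence, self-attention computes a score for how much it should "attend to" every token in the sequence, itself included. The scores are normalized with a softmax into weights that sum to 1, and the token's new representation is the resulting weighted combination of all tokens' representations.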

This is how a model such as Claude can link a pronoun to the specific noun it refers to 500 tokens earlier: attention creates a direct connection between any two positions in the sequence, regardless of the distance between them.
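
The idea above can be sketched in a few lines of plain Python. This is a simplified illustration, not the full mechanism from the paper: it assumes the token vectors act directly as queries, keys, and values, whereas a real Transformer first applies learned projections and uses several attention heads. The names `softmax`, `self_attention`, and the toy `tokens` embeddings are hypothetical, chosen only for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Simplified self-attention over a list of equal-length vectors.

    Assumption: each token vector serves as its own query, key, and value.
    Real Transformers apply learned projections first.
    Returns (outputs, weights).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)  # scaled dot-product, as in Vaswani et al.
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in tokens
        ]
        weights = softmax(scores)
        # New representation: weighted combination of all token vectors.
        mixed = [
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a 3-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, and the score between two positions depends only on their vectors, not on how far apart they sit in the sequence. In this toy input, token 0 attends to token 2 exactly as strongly as it attends to itself, and less to the adjacent token 1. Because each query row is computed independently of the others, the loop could run in parallel, which in practice is done with matrix multiplications.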