“Attention Is All You Need” spelled out in this great step-by-step tutorial on building your own transformer (GPT).
Attention was the key discovery that enabled the transformer architecture. The idea is that each token should be able to communicate with, or look at, every previous token in the sequence, but not future tokens. For example, given token number 4 in a sequence of 8 tokens, token 4 can attend to tokens 1, 2, and 3 (and, in the standard formulation, to itself), but not to tokens 5 through 8.
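To make this concrete, here is a minimal sketch of causal (masked) self-attention for a single head in PyTorch, in the spirit of the tutorial. The tensor shapes and head size below are illustrative assumptions, not values taken from the tutorial itself.

```python
# A minimal sketch of single-head causal self-attention (illustrative sizes).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 8, 32        # batch, sequence length (8 tokens), embedding size
head_size = 16            # assumed head size for this sketch

x = torch.randn(B, T, C)  # token embeddings for an 8-token sequence

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
v = value(x)  # (B, T, head_size)

# Affinity between every pair of tokens, scaled by 1/sqrt(head_size).
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T)

# Causal mask: position t may look at positions <= t (itself and the past),
# never at future positions. Future entries are set to -inf before softmax.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

out = wei @ v   # each token becomes a weighted sum of the values it may see

# Row 3 (token 4) has nonzero weights only on columns 0-3, i.e. tokens 1-4.
print(wei[0, 3])
```

Printing row 3 of the attention matrix shows weights only over the first four positions, which is exactly the "look back, never forward" rule described above.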
This masked self-attention mechanism, stacked into many layers and trained on large amounts of text, is the core building block behind language models that can hold an intelligent conversation with a human.