Transformer Architecture
The Transformer is the neural network architecture introduced by Vaswani et al. in 2017 that replaced RNNs and LSTMs for sequence tasks. It uses stacked layers of multi-head self-attention and feed-forward blocks, processing the whole sequence in parallel rather than token by token. Nearly every modern LLM — GPT, Llama, Claude, Gemini, DeepSeek — is a Transformer variant.
Quick reference
- Proficiency: Intermediate
- Also known as: Transformer, attention-based model, self-attention network
- Prerequisites: Neural networks (basics), Self-attention, Embeddings
Frequently asked questions
What is the Transformer architecture?
It is a neural network architecture introduced in the 2017 paper "Attention Is All You Need". Instead of processing tokens sequentially like an RNN, it uses self-attention to compare every token against every other token in parallel, which makes training faster and lets the model capture long-range dependencies.
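The core comparison of every token against every other token can be sketched in a few lines of NumPy. This is a minimal illustration, not the full multi-head, batched implementation; the function name and shapes are chosen here for clarity:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings. Wq/Wk/Wv project tokens to
    queries, keys, and values. Every token attends to every other token
    in a single matrix multiply -- no step-by-step loop as in an RNN.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)        # out.shape == (4, 8)
```

The (seq_len, seq_len) score matrix is exactly the "every token against every other token" comparison; real Transformers repeat this across multiple heads and layers.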
What's the difference between encoder-only, decoder-only, and encoder-decoder Transformers?
Encoder-only (BERT) is bidirectional and used for understanding tasks. Decoder-only (GPT, Llama) is causal and used for generation — this is the standard for chat LLMs. Encoder-decoder (T5, original Transformer) is used for translation and conditional generation.
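The causal (decoder-only) variant differs from the bidirectional one by a single mask: each token may only attend to itself and earlier positions. A minimal NumPy sketch of that mask, with illustrative names not taken from any library:

```python
import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    """Decoder-style attention: token i may only see tokens j <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_attention(X, Wq, Wk, Wv)
# The first token can only attend to itself, so its output is its own value vector.
```

An encoder-only model like BERT simply drops the mask, letting every token see the whole sequence in both directions.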
Why did Transformers replace RNNs?
Transformers are parallelizable across a sequence, so you can train on long contexts using GPUs efficiently. RNNs must process tokens one at a time. Transformers also model long-range dependencies better because every token can attend to every other token directly.
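The parallelism difference can be made concrete: an RNN's recurrence is a loop where step t depends on step t-1, while attention reduces all pairwise token interactions to one matrix product. A toy contrast (the weight shapes here are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 4
X = rng.normal(size=(seq_len, d))

# RNN: inherently sequential -- each hidden state needs the previous one.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
rnn_states = []
for x in X:                          # cannot be parallelized across steps
    h = np.tanh(h @ Wh + x @ Wx)
    rnn_states.append(h)

# Attention: all token-to-token interactions in a single matmul,
# so the whole sequence is processed at once on parallel hardware.
scores = X @ X.T / np.sqrt(d)        # (seq_len, seq_len) in one step
```

The matmul is also why every token can attend to every other token directly: long-range dependencies are one step away, not `seq_len` recurrence steps away.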
Are all modern LLMs Transformers?
Almost all frontier LLMs through 2026 are Transformer variants. A few research architectures (Mamba, RWKV, state-space models) challenge this, and hybrids exist, but the Transformer is still the dominant design.
Sources
- Vaswani et al. — Attention Is All You Need — accessed 2026-04-20
- The Illustrated Transformer — Jay Alammar — accessed 2026-04-20
- Hugging Face — Transformers documentation — accessed 2026-04-20