Curiosity · Concept
Decoder-Only Transformer
The original Transformer (Vaswani et al. 2017) had an encoder-decoder structure for translation. GPT-style models dropped the encoder entirely and kept only the decoder stack, with a causal attention mask that prevents each position from peeking at future tokens. Training is next-token prediction on massive text corpora; inference generates one token at a time, feeding outputs back as inputs. Scaling this architecture (more parameters, more data, more compute), plus tokenization, positional embeddings (rotary embeddings, RoPE, in most current models), and efficiency tricks such as grouped-query attention and FlashAttention, is essentially the recipe for every modern chat LLM.
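The training setup above can be sketched in a few lines: the next-token objective is just the input sequence shifted by one. A minimal sketch, assuming a toy vocabulary of integer token ids (`make_training_pairs` is an illustrative helper, not a real library function):

```python
def make_training_pairs(token_ids):
    """Shift the sequence by one: the model's input at position t
    is scored against the token that actually follows at t+1."""
    inputs = token_ids[:-1]   # model sees tokens 0..n-2
    targets = token_ids[1:]   # and must predict tokens 1..n-1
    return inputs, targets

# Usage: a sentence tokenized as [5, 12, 9]
inputs, targets = make_training_pairs([5, 12, 9])
# position 0 (token 5) is trained to predict 12; position 1 (token 12) to predict 9
```

Because every position has a target, one forward pass yields a loss term for every token in the sequence, which is part of why this objective scales so well.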
Quick reference
- Proficiency
- Intermediate
- Also known as
- GPT-style transformer, autoregressive transformer, causal LM
- Prerequisites
- attention mechanism, neural networks
Frequently asked questions
What is a decoder-only transformer?
A decoder-only transformer is a stack of transformer blocks using causal self-attention — each position can only attend to earlier positions. It's trained to predict the next token given all previous ones, and it's the architecture behind GPT, Claude, Llama, and most modern LLMs.
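The causal constraint can be shown concretely: mask the attention scores for future positions to negative infinity before the softmax, so they receive exactly zero weight. A minimal pure-Python sketch (real implementations do this on GPU tensors, often fused into the attention kernel):

```python
import math

def causal_softmax(scores):
    """Apply a causal mask to a square score matrix, then a row-wise softmax.
    Position i may only attend to positions j <= i."""
    n = len(scores)
    out = []
    for i, row in enumerate(scores):
        # Mask future positions with -inf so softmax assigns them weight 0.
        masked = [row[j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked[: i + 1])  # subtract the max for numerical stability
        exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

weights = causal_softmax([[0.0, 9.0, 9.0],
                          [1.0, 1.0, 9.0],
                          [1.0, 2.0, 3.0]])
# Row 0 attends only to itself; everything above the diagonal is exactly 0.
```

Note that the large scores at masked positions (the 9.0s) have no effect at all, which is the whole point: the model cannot leak information from the future, no matter what the score computation produces there.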
Why decoder-only instead of encoder-decoder?
Simpler (one stack instead of two), scales more efficiently, and next-token prediction on huge corpora turns out to be a remarkably general training objective. Encoder-decoder models (T5, BART) still shine on strict seq2seq tasks like translation, but the industry defaulted to decoder-only for general-purpose LLMs.
What's inside one transformer block?
Layer norm, multi-head causal self-attention, residual connection; then layer norm, MLP (usually SwiGLU in modern models), residual connection. Stack 32 to 100+ of these blocks and you have a frontier-scale LM.
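The wiring of that block can be sketched independently of the sublayers themselves. A minimal pre-norm sketch, assuming vectors as plain Python lists and stand-in sublayers (real models use multi-head attention and a SwiGLU MLP, with learned scale/shift in the norm):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean / unit variance (learned affine omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def block(x, attn, mlp):
    """Pre-norm transformer block: norm -> sublayer -> residual add, twice."""
    x = [a + b for a, b in zip(x, attn(layer_norm(x)))]  # attention sublayer
    x = [a + b for a, b in zip(x, mlp(layer_norm(x)))]   # MLP sublayer
    return x

# Stand-in sublayers just to exercise the wiring; not real attention or MLP.
halve = lambda v: [0.5 * u for u in v]
out = block([1.0, 2.0, 3.0], attn=halve, mlp=halve)
```

The residual connections are what make stacking 100+ blocks trainable: each block only learns an additive update to the stream, so gradients flow through the skip path unimpeded.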
How does inference work?
Autoregressively: feed the prompt, sample one token, append it to the sequence, and feed the result back in. A KV cache stores the attention keys/values from prior positions, so generating each new token costs O(seq_len) attention work instead of recomputing O(seq_len²) from scratch.
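The decode loop with its growing cache can be sketched as follows. This assumes a hypothetical `model_step(token, k_cache, v_cache)` returning `(next_token, new_key, new_value)`; the `toy_step` stand-in just reports the cache size, to show each step attending over one more position:

```python
def generate(prompt, steps, model_step):
    """Toy autoregressive decode with a KV cache (assumes a non-empty prompt)."""
    k_cache, v_cache = [], []
    tokens = list(prompt)
    # Prefill: process the prompt, caching one key/value per position;
    # the final call yields the first predicted token.
    for t in tokens:
        nxt, k, v = model_step(t, k_cache, v_cache)
        k_cache.append(k)
        v_cache.append(v)
    # Decode: append the prediction, then process only that one new token,
    # attending over the cache instead of re-running the whole sequence.
    for _ in range(steps):
        tokens.append(nxt)
        nxt, k, v = model_step(tokens[-1], k_cache, v_cache)
        k_cache.append(k)
        v_cache.append(v)
    return tokens, len(k_cache)

# Stand-in "model": predicts the number of cached positions it attended over.
toy_step = lambda tok, ks, vs: (len(ks), tok, tok)
tokens, cache_len = generate([7, 8], steps=3, model_step=toy_step)
```

The cache trades memory for compute: it holds one key and one value per layer per position, which is why long-context serving is dominated by KV-cache memory rather than by the model weights alone.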
Sources
- Radford et al. — Improving Language Understanding by Generative Pre-Training (GPT) — accessed 2026-04-20
- Hugging Face — Transformers overview — accessed 2026-04-20