Transformer Architecture

TL;DR

Introduced in 'Attention Is All You Need' (Vaswani et al., 2017, arXiv:1706.03762), the Transformer replaced sequential recurrence with parallel self-attention, unlocking GPU-scale language model training and every frontier model shipping today.
Three primitives carry the architecture — scaled dot-product attention, multi-head attention and positional encoding — wrapped in a residual block of attention + position-wise feed-forward + normalisation, then stacked 12 to 126 times.
Three family lines emerged: decoder-only causal LMs (GPT, Llama, Claude, Gemini) dominate generative tasks; encoder-only bidirectional models (BERT, modern embeddings) dominate retrieval and classification; encoder-decoder stacks (T5, BART, NLLB) lead translation and structured-output work.
Modern variants (Llama 3.1 405B, Qwen3, DeepSeek-V3, GPT-4o, Claude 4) retain the 2017 skeleton but swap in RoPE, SwiGLU, RMSNorm, Grouped-Query Attention, Flash Attention 3 and FP8 Tensor Cores — a Transformer in spirit, not in literal layer code.
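Training compute follows the standard approximation of roughly 6·P·T FLOPs (6 × parameters × tokens) used by Kaplan et al. and the Chinchilla paper; a 70B model on 15T tokens consumes about 6.3·10^24 FLOPs, or roughly 98,000 H100-days at 75 % MFU (about 184,000 at the ~40 % MFU that published large-scale runs report).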

Overview

The Transformer is the neural network design that turned sequence modelling into a GPU-native problem. Before it, the field was dominated by recurrent networks (LSTM, GRU) whose token-by-token data dependency made parallel execution structurally impossible. Vaswani et al. published 'Attention Is All You Need' at NeurIPS 2017, demonstrated state-of-the-art English-German translation with a model that contained no recurrence and no convolutions, and quietly rewrote the rest of the decade. Within five years, every credible frontier system — GPT-3, BERT, T5, PaLM, Llama, Claude — was a Transformer.

The architecture matters because it maps cleanly onto the hardware that exists. Self-attention reduces to a pair of batched matrix multiplications around a softmax, which is exactly what Tensor Cores were built to accelerate. Multi-head attention parallelises across the head dimension. Position-wise feed-forward networks are large GEMMs. Residual connections and layer norms add negligible cost. Every operation in a Transformer block is either a matmul or a cheap element-wise or row-wise op: there is no irregular control flow, no sequential state to carry across tokens, no memory-bound recurrence. That is why Transformer training sustains a large fraction of an H100 SXM5's 989 TFLOPS dense BF16 peak (the Llama 3 report cites roughly 38-43 % MFU at 16K-GPU scale, and smaller well-tuned jobs can do better), where RNN training would saturate far lower.

By 2026 the Transformer is not one architecture but a family. The dense decoder-only variant powers most chat-and-completion APIs (Llama 3.1 405B, Qwen2.5 72B, Claude Sonnet 4). Mixture-of-Experts variants (DeepSeek-V3, Mixtral 8x22B, Qwen3-MoE 235B) sparsify the FFN to push parameter count past 600 billion at a fraction of the dense FLOPs. Encoder-only models still produce many of the embeddings used by retrieval-augmented systems. Encoder-decoder remains the right tool for translation and structured-input-to-structured-output tasks. The shared skeleton makes the family interoperable in practice: a vLLM build that serves Llama can serve Qwen and Mistral with almost no code changes.

This entry is the reference for the operator who needs to reason about Transformers as a systems engineer, not just as a paper-reader: what each block does mechanically, how the modern variants differ, how memory and FLOPs scale, what fails and how to fix it, and where the architecture is in 2026 relative to the credible challengers (Mamba, RWKV, RetNet, Hyena). This entry helps you understand the Transformer well enough to decide which open-weights model fits your workload, size the GPUs that will serve it, and reason about the failure modes that bite production teams. If you are deploying models on Yobibyte or training on Yobitel NeoCloud, this matters because every model in the catalogue — Llama 3.1 405B, Qwen 3, DeepSeek-V3, Mistral Large 2, BGE embeddings — is some specific Transformer variant, and the routing logic that picks GPU and runtime per workload encodes the maths in this entry.

Quick start: build a scaled dot-product attention head in PyTorch

The shortest path to understanding the Transformer is to implement scaled dot-product attention from scratch. The snippet below runs today on any machine with pip install torch; CPU is fine for sanity checks, a single GPU for anything larger. It implements the single-head attention defined by Vaswani et al. with an optional causal mask (no multi-head, no Flash kernels), fits its three projection matrices to a random target on a random input, and prints the loss so you can verify it goes down.

# attention_minimal.py — runs with: pip install torch && python attention_minimal.py
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Vaswani et al. (2017) eq. 1: softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / (d_k ** 0.5)   # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # (..., n_q, n_k)
    return weights @ v, weights                          # (..., n_q, d_v)

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x: torch.Tensor, causal: bool = True):
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        n = x.size(-2)
        mask = torch.tril(torch.ones(n, n, device=x.device)) if causal else None
        out, _ = scaled_dot_product_attention(q, k, v, mask=mask)
        return out

# Smoke test: a tiny model overfitting a random sequence.
d_model, n, d_k = 64, 32, 64
x = torch.randn(4, n, d_model)
target = torch.randn(4, n, d_k)
attn = SingleHeadAttention(d_model, d_k, d_k)
opt = torch.optim.Adam(attn.parameters(), lr=1e-2)

for step in range(200):
    out = attn(x, causal=True)
    loss = F.mse_loss(out, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:>3} loss {loss.item():.4f}")
# Expect the loss to fall from roughly 1.0 as the projections fit the random
# target; the final value depends on the seed and PyTorch version.

Tip: Run this on CPU first to confirm the maths. Production code should call torch.nn.functional.scaled_dot_product_attention, which dispatches to a fused FlashAttention-style, memory-efficient or cuDNN kernel when one is available for the device and dtype, rather than writing the QK^T matmul by hand. The hand-written version above materialises the full n x n attention matrix in HBM and is typically several times slower at long sequence lengths.

How it works: the Transformer block, end-to-end

A Transformer is a stack of identical blocks, each containing two sub-layers — multi-head self-attention and a position-wise feed-forward network — with a residual connection and a normalisation around each. The entire model is built by chaining input embeddings, N of these blocks, and an output unembedding. There is no other architectural cleverness; the depth of the design is in the details.

Self-attention is the engine. Given queries Q, keys K and values V — all linear projections of the same input — attention computes softmax(QK^T / sqrt(d_k)) V. Read row by row: each query row asks 'which key rows am I compatible with?', the softmax converts those scores into a probability distribution, and the result is a weighted sum of value rows. The cost is O(n^2 * d) in time and naively O(n^2) in HBM memory — which is what Flash Attention exists to fix. Crucially, every (query, key, value) pair is computed in parallel as a single batched matmul, with no sequential dependency. On a GPU, that distinction is the difference between a workload and a benchmark.

Multi-head attention runs h independent attention operations in parallel, each on a learned projection of Q/K/V into d_k = d_model / h dimensions. The h outputs are concatenated and projected back to d_model by a final W_O matrix. Different heads learn different relationships (syntactic, semantic, positional); the concatenation lets the network combine them in a single layer. Modern variants — Multi-Query Attention (one shared K/V) and Grouped-Query Attention (g shared K/V pairs) — keep h query heads but shrink K and V to reduce the KV cache that dominates serving memory.
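The head-splitting and K/V sharing described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular model's implementation: the class name GroupedQueryAttention and its argument names are hypothetical, and setting n_kv_heads equal to n_heads recovers plain multi-head attention while n_kv_heads = 1 gives multi-query attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """Causal self-attention with h query heads sharing g key/value heads."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.h, self.g = n_heads, n_kv_heads
        self.d_k = d_model // n_heads
        self.w_q = nn.Linear(d_model, n_heads * self.d_k, bias=False)
        # K and V are projected to only g heads, which is what shrinks the KV cache.
        self.w_k = nn.Linear(d_model, n_kv_heads * self.d_k, bias=False)
        self.w_v = nn.Linear(d_model, n_kv_heads * self.d_k, bias=False)
        self.w_o = nn.Linear(n_heads * self.d_k, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Split projections into heads: (batch, heads, seq, d_k).
        q = self.w_q(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.g, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.g, self.d_k).transpose(1, 2)
        # Each K/V head serves h // g consecutive query heads.
        k = k.repeat_interleave(self.h // self.g, dim=1)
        v = v.repeat_interleave(self.h // self.g, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Concatenate the heads and mix them back to d_model with W_O.
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(out)


torch.manual_seed(0)
gqa = GroupedQueryAttention(d_model=64, n_heads=8, n_kv_heads=2)
y = gqa(torch.randn(2, 16, 64))
print(y.shape, gqa.w_k.weight.shape)
```

With 8 query heads and 2 K/V heads, the K and V projections (and therefore the cached keys and values) are a quarter of the multi-head size, while the output shape is unchanged.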

The position-wise feed-forward network expands the hidden dimension by a factor of about 4 (or 2.67 for SwiGLU), applies a non-linearity, and contracts back. It is by far the largest source of parameters in the block — typically two thirds of total weights — and is the layer that Mixture-of-Experts variants replace with a sparse gated router.
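To see why the SwiGLU expansion factor is about 2.67 rather than 4, it helps to compare parameter counts. The sketch below assumes the common gated formulation down(silu(gate(x)) * up(x)) with three bias-free matrices; the class name SwiGLUFFN and the helper count_params are illustrative, not taken from a specific library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


d_model = 512
# 2017-style FFN: two matrices with a 4x expansion and ReLU.
relu_ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model, bias=False),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model, bias=False),
)
# SwiGLU has three matrices, so an 8/3 expansion keeps the budget roughly equal.
swiglu_ffn = SwiGLUFFN(d_model, d_ff=int(8 * d_model / 3))

relu_params = count_params(relu_ffn)
swiglu_params = count_params(swiglu_ffn)
out = swiglu_ffn(torch.randn(2, 10, d_model))
print(relu_params, swiglu_params, out.shape)
```

The two counts differ by well under one percent, which is the reason the ratio is chosen. Real models usually round d_ff up to a hardware-friendly multiple, so published ratios vary somewhat.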

Two more pieces complete the block. The residual connection x -> x + Sublayer(x) gives gradients an unobstructed path back to the input, allowing 100+ layer stacks to train. The normalisation (LayerNorm originally, RMSNorm in modern decoder-only models) rescales activations to keep them in a numerically stable range. The 2017 design used post-norm, LN(x + Sublayer(x)), with the norm after the residual sum; nearly every large model since GPT-2 uses pre-norm, x + Sublayer(Norm(x)), with the norm inside the residual branch before attention or FFN, because it trains more stably at depth and is far less sensitive to learning-rate warm-up.
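The difference between the two normalisation placements is easiest to see in code. The following is a minimal sketch under simple assumptions: RMSNorm is written by hand with a learned per-channel gain, and pre_norm_step and post_norm_step are hypothetical helper names for the two residual layouts.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Rescale each vector to unit root-mean-square, then apply a learned gain."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms


def pre_norm_step(x, sublayer, norm):
    # Pre-norm: the norm sits inside the residual branch, the skip path is untouched.
    return x + sublayer(norm(x))


def post_norm_step(x, sublayer, norm):
    # Post-norm (2017 layout): the norm is applied after the residual sum.
    return norm(x + sublayer(x))


torch.manual_seed(0)
norm = RMSNorm(64)
x = 50.0 * torch.randn(2, 8, 64)            # deliberately large activations
rms_after = norm(x).pow(2).mean(dim=-1).sqrt()


def zero_sublayer(t):
    return torch.zeros_like(t)               # a sublayer that contributes nothing


pre_out = pre_norm_step(x, zero_sublayer, norm)
post_out = post_norm_step(x, zero_sublayer, norm)
print(rms_after.mean().item(), torch.equal(pre_out, x), torch.equal(post_out, x))
```

With a sublayer that outputs zeros, the pre-norm step returns its input exactly, which is the unobstructed identity path that helps deep stacks train; the post-norm step still rescales the stream at every layer.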

Position information is injected once at the input (sinusoidal in the original paper, learned in BERT/GPT-2) or repeatedly in attention (RoPE rotation, ALiBi linear bias). Without a positional signal, attention is permutation-equivariant and a sentence is just a bag of tokens.
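As an illustration of the "repeatedly in attention" option, here is a minimal RoPE sketch. It assumes the interleaved-pair convention, where channels (2i, 2i+1) are rotated by an angle of position × base^(-2i/d_k); some libraries pair channels differently, and the function names rope_angles and apply_rope are hypothetical. The check at the end demonstrates the property that makes RoPE useful: the query-key score depends only on the relative offset between positions.

```python
import torch


def rope_angles(n: int, d_k: int, base: float = 10000.0):
    """Return cos and sin tables of shape (n, d_k / 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    positions = torch.arange(n).float()
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate each channel pair of x (shape (..., n, d_k)) by its position angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


torch.manual_seed(0)
n, d_k = 16, 8
cos, sin = rope_angles(n, d_k)

# Place the same query and key vector at every position, then rotate.
q_vec, k_vec = torch.randn(d_k), torch.randn(d_k)
q_rot = apply_rope(q_vec.repeat(n, 1), cos, sin)
k_rot = apply_rope(k_vec.repeat(n, 1), cos, sin)
scores = q_rot @ k_rot.T

# Pairs (3, 1) and (10, 8) share the same offset, so their scores match.
print(scores[3, 1].item(), scores[10, 8].item())
```

Because the rotation preserves vector norms and encodes only relative offsets, it can be applied to Q and K inside every attention layer without adding anything to the residual stream.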

Embedding lookup: token id -> d_model-dimensional vector. Tied or untied with the output unembedding.
Positional signal: sinusoidal addition (2017), learned absolute (BERT/GPT-2), RoPE rotation (Llama/Qwen/DeepSeek), or ALiBi bias (BLOOM/MPT).
N identical blocks: each is RMSNorm -> Multi-head attention (with causal mask in decoders) -> residual -> RMSNorm -> SwiGLU FFN -> residual.
Final RMSNorm before the output projection (the 'unembedding' or 'lm_head'), producing logits of shape (sequence, vocabulary).
Loss: cross-entropy of predicted next-token vs ground-truth, summed over all positions (with the causal mask making the parallel-training-of-autoregression trick legal).
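The last step of the pipeline above, the next-token loss, comes down to shifting the targets by one position. The sketch below is a minimal version under two assumptions: the function name next_token_loss is hypothetical, and it uses PyTorch's default mean reduction, whereas the text describes a sum over positions (the two differ only by a constant factor for a fixed batch shape).

```python
import math

import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the logits at position t and the token at t + 1.

    logits: (batch, seq, vocab); token_ids: (batch, seq).
    """
    pred = logits[:, :-1, :]        # the last position has no next token to predict
    target = token_ids[:, 1:]       # the first token is never a prediction target
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))


torch.manual_seed(0)
vocab = 50
token_ids = torch.randint(0, vocab, (2, 6))
# All-zero logits are a uniform prediction, so the loss equals ln(vocab).
uniform_loss = next_token_loss(torch.zeros(2, 6, vocab), token_ids)
print(uniform_loss.item(), math.log(vocab))
```

Every position contributes a training signal in a single forward pass, and the causal mask is what guarantees that the logits at position t were computed without seeing token t + 1.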

Variants and architectural choices: a modern decoder-only block

Authoritative component-by-component breakdown of the standard 2026 decoder Transformer block. Every modern decoder-only LLM uses some variant of this layout; the only meaningful diversity is the choice of normalisation, positional encoding, FFN activation and attention sparsity pattern. Values shown for the 'modern default' column are what Llama 3.1, Qwen 3, DeepSeek-V3 (excluding MoE), Mistral Large and Gemma 2 converge on.

Component	2017 original	Modern default (2026)	Purpose
Input embedding	Learned, dim 512	Learned, dim 4,096-16,384	Token id to dense vector.
Positional encoding	Sinusoidal add at input	RoPE rotation inside attention	Inject sequence order.
Pre-block norm	Post-norm LayerNorm	Pre-norm RMSNorm	Stabilise activations, enable deep stacks.
Attention projections	Multi-Head W_Q, W_K, W_V (one per head)	Grouped-Query: h heads of W_Q, g of W_K/W_V	Shrink KV cache by factor h/g.
Attention compute	softmax(QK^T / sqrt(d_k)) V naive	Flash Attention 3 fused kernel	Same maths, O(n) extra memory instead of O(n^2), FP8/BF16 mixed.
Attention output	Concat heads, W_O linear	Concat heads, W_O linear	Mix head outputs back to d_model.
Residual + norm	LN(x + Sublayer(x)) (post)	x + Sublayer(RMSNorm(x)) (pre)	Gradient highway + scale stability.
FFN expansion	d_model -> 4 d_model -> d_model, ReLU	d_model -> ~2.67 d_model -> d_model, SwiGLU	Per-token non-linear processing.
FFN sparsity	Dense	Dense, or top-k MoE in MoE variants	Decouple parameters from per-token FLOPs.
Final norm	None separate	RMSNorm before unembedding	Stabilise logits.
Unembedding	Linear d_model -> vocab	Linear d_model -> vocab, often tied to input	Logits over vocabulary.
Loss	Cross-entropy on next token	Cross-entropy on next token	Autoregressive language modelling.

Note: Hyperparameter trio that recurs across modern frontier models: d_k = 128 per head (Tensor Core friendly), g = 8 KV groups (GQA sweet spot), FFN ratio ~2.67 for SwiGLU. Deviating from these usually loses either quality or kernel efficiency.
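A quick way to check that a set of table choices adds up to a headline model size is to count parameters directly. The sketch below is pure arithmetic with a hypothetical helper, decoder_params. The example configuration is an assumption taken from the publicly released Llama 3.1 70B model configuration rather than from this entry (d_model 8,192, 64 query heads, FFN width 28,672, vocabulary 128,256, untied embeddings), combined with the 80 layers and 8 KV heads quoted elsewhere in this entry.

```python
def decoder_params(d_model, n_layers, n_heads, n_kv_heads, d_ff, vocab, tied=False):
    """Count weights in a pre-norm, GQA, SwiGLU decoder-only Transformer."""
    d_k = d_model // n_heads
    attn = d_model * n_heads * d_k               # W_Q
    attn += 2 * d_model * n_kv_heads * d_k       # W_K and W_V (shared across groups)
    attn += n_heads * d_k * d_model              # W_O
    ffn = 3 * d_model * d_ff                     # SwiGLU gate, up and down matrices
    norms = 2 * d_model                          # two RMSNorm gains per block
    embed = vocab * d_model * (1 if tied else 2)  # input embedding (+ lm_head if untied)
    total = n_layers * (attn + ffn + norms) + embed + d_model  # + final RMSNorm
    return {"d_k": d_k, "attn": attn, "ffn": ffn, "total": total}


# Assumed configuration (public Llama 3.1 70B model config), see the lead-in.
cfg = decoder_params(
    d_model=8192, n_layers=80, n_heads=64, n_kv_heads=8, d_ff=28672, vocab=128256
)
ffn_share = 80 * cfg["ffn"] / cfg["total"]
print(cfg["d_k"], cfg["total"], round(ffn_share, 3))
```

Under these assumptions the count lands at about 70.6 billion with d_k = 128, and the FFN holds close to 80 % of the weights, more than the two-thirds figure for classic multi-head blocks, because GQA shrinks the attention projections.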

Where it is used today: production stacks across the three family lines

By 2026 every credible frontier system and a long tail of production workloads sits on a Transformer of one shape or another. The architecture has split into three family lines, each with a clear deployment pattern.

Decoder-only causal LMs dominate generative work. A decoder-only Transformer is trained to predict the next token from all previous tokens, using a causal attention mask (lower-triangular: position i can attend to positions 1..i but not i+1..n). At inference it generates one token at a time, with a KV cache holding all previously computed keys and values so each new token costs O(n) attention rather than O(n^2). GPT-2, GPT-3, GPT-4, Llama 1/2/3, Qwen, Mistral, Claude, Gemini and DeepSeek are all this shape. The architectural skeleton is identical across them; the differences are scale (parameters, training tokens), tokeniser (BPE vs SentencePiece), and the modern-variant choices in the table above. The vLLM snippet at the end of this section shows how a current open-weights decoder is loaded for serving; it is illustrative, not deployment-grade.

Encoder-only bidirectional models dominate retrieval and classification. They drop the causal mask, so every token attends to every other token bidirectionally: the right architecture for tasks where the full input is known upfront and the goal is to produce a representation (classification, retrieval, named entity recognition, embeddings for vector search). BERT-base (110M, 2018) was the canonical first member; modern descendants include DeBERTa-v3, E5, BGE and GTE. Many production embedding models in 2026 are encoder-only Transformers fine-tuned with contrastive objectives, although decoder-based embedders are increasingly common; the output is the final-layer hidden state at the [CLS] token (or mean-pooled across tokens), L2-normalised, written as a row into a vector database that retrieval-augmented generation queries against.
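The pooling step described above (mean over real tokens, then L2-normalise) is a common source of subtle bugs when padding is involved, so a small sketch may help. It operates on a random stand-in for the encoder's final hidden states; the function name mean_pool_embeddings and the tensor shapes are illustrative assumptions, not a specific model's API.

```python
import torch
import torch.nn.functional as F


def mean_pool_embeddings(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding tokens and L2-normalise.

    hidden: (batch, seq, d_model); attention_mask: (batch, seq), 1 = token, 0 = padding.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)       # guard against all-padding rows
    return F.normalize(summed / counts, p=2, dim=-1)


torch.manual_seed(0)
hidden = torch.randn(2, 5, 16)                    # stand-in for encoder output
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])
emb = mean_pool_embeddings(hidden, attention_mask)

# Padding positions must not influence the result.
hidden_noisy = hidden.clone()
hidden_noisy[0, 3:] = 999.0
emb_noisy = mean_pool_embeddings(hidden_noisy, attention_mask)

cosine = emb @ emb.T                              # unit vectors, so dot product = cosine
print(emb.norm(dim=-1), torch.allclose(emb, emb_noisy))
```

Because the rows are unit length, the dot product used by most vector databases is the cosine similarity.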

Encoder-decoder stacks remain the right tool for translation and structured-input-to-structured-output tasks. The original 2017 paper described an encoder-decoder for English-German translation: the encoder processes the source bidirectionally; the decoder generates the target autoregressively, attending to its own previous tokens and to the encoder output via cross-attention. T5 (Raffel et al., 2020) demonstrated the same skeleton handled summarisation, question answering and classification by framing every task as text-to-text; NLLB (Costa-jussa et al., 2022) scaled it to 200 languages. For general-purpose chat and reasoning, decoder-only has displaced encoder-decoder on every public leaderboard since GPT-3 — but for translation and constrained-grammar generation (e.g., natural-language-to-SQL with strict schema), the encoder-decoder form still wins.

Yobitel customers running production workloads on Yobibyte see all three family lines in active use: decoder-only Llama 3.1 70B serving chat and code completion, encoder-only BGE / E5 producing embeddings that feed retrieval pipelines, and encoder-decoder NLLB powering multilingual customer-support translation in UK and EU regions. The Transformer skeleton in this entry is not a textbook abstraction for these readers — it is the architecture of every model they consume.

# Illustrative: load a decoder-only causal LM with vLLM.
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,             # context window
    enable_prefix_caching=True,     # cache shared prompt prefixes
    enable_chunked_prefill=True,    # interleave prefill + decode
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
prompts = ["Summarise the Transformer architecture in three sentences."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)

Trade-offs and known limitations

The Transformer is uncontested at the frontier of mid-2026, but it is not unbeatable on every axis. The architecture has well-understood structural costs and a handful of credible challengers that have carved real niches.

The dominant structural cost is attention itself. Compute is O(n^2 * d) in time and naively O(n^2) in HBM memory. Flash Attention 3 tiles the computation so the n x n matrix is never materialised, cutting the extra memory to O(n), but it does not change the FLOP count. At n > 256k tokens, prefill latency dominates serving and either Ring Attention (sharding the sequence, and with it the blocks of the n x n computation, across devices) or chunked prefill becomes necessary. KV cache grows linearly with sequence length; at 1M tokens, even GQA-shrunk caches can exceed 70 GB per request, forcing offload to CPU or NVMe and an order-of-magnitude latency penalty on decode.
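The quadratic terms are easier to appreciate with numbers. The sketch below is back-of-envelope arithmetic only: the helper names are hypothetical, FLOPs are counted as two per multiply-add, and the causal mask (which roughly halves the useful work) is ignored.

```python
def naive_score_bytes(n_tokens: int, n_heads: int, bytes_per_el: int = 2) -> int:
    """Memory needed to materialise the full n x n score matrix for every head."""
    return n_heads * n_tokens * n_tokens * bytes_per_el


def attention_flops(n_tokens: int, d_model: int) -> int:
    """FLOPs for QK^T plus weights @ V in one layer (2 FLOPs per multiply-add)."""
    return 4 * n_tokens * n_tokens * d_model


n = 256 * 1024
one_head = naive_score_bytes(n, n_heads=1)          # BF16 scores, a single head
print(f"{one_head / 1e9:.0f} GB for one head at n = {n}")

# Doubling the context quadruples both the score memory and the attention FLOPs.
ratio = attention_flops(2 * n, 8192) / attention_flops(n, 8192)
print(ratio)
```

A single head at 256k tokens would need about 137 GB just for the score matrix in BF16, which is why tiled kernels that never write it out are a precondition for long context rather than an optimisation.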

Position-encoding extrapolation is the other persistent limitation. Vanilla RoPE degrades sharply past the training length; RoPE with YaRN scaling extends 4-16x; ALiBi extrapolates further but with lower short-context quality. Vocabulary size linearly grows the embedding and unembedding matrices, which becomes meaningful at the 256k+ token vocabularies needed for full multilingual coverage. Activation memory drives the maximum micro-batch at training, which is why activation checkpointing, ZeRO-3 / FSDP optimiser-state sharding and tensor parallelism are part of every credible training recipe.

Several architectures aim to beat the O(n^2) attention with linear or sub-quadratic alternatives; none has matched Transformer quality at frontier scale through mid-2026, but each has carved a niche. State Space Models (Mamba, Mamba-2; Dao & Gu, 2024) deliver linear-time selective state recurrence — Mamba-2 closes most of the quality gap on language modelling at <13B scale and is strictly faster on long context, with hybrid Jamba (AI21) and Zamba (Zyphra) interleaving SSM blocks with attention blocks to combine the strengths. RWKV (Peng et al., 2023-2024) is a linear-attention RNN with Transformer-style training parallelism that reaches Llama 2 quality at the same parameter scale with O(n) inference. Hyena and Striped Hyena use long convolutions plus gating, linear in n; strong on DNA and time-series, behind on natural language. RetNet (Microsoft, 2023) underperformed its paper's claims at scale and is no longer pursued by frontier teams. The shared lesson: when a team is HBM-constrained at the edge, an SSM hybrid is the right consideration; when serving at frontier scale on NVLink-connected GPUs, the Transformer is still the right answer.

Practical implementation notes

Libraries that implement the Transformer well in 2026: PyTorch's torch.nn.functional.scaled_dot_product_attention dispatches to a fused FlashAttention-style, memory-efficient or cuDNN backend depending on device, dtype and PyTorch version, and is the reference attention call; xFormers covers older kernels and memory-efficient variants; FlashAttention itself ships standalone wheels for cutting-edge kernel updates, including Flash Attention 3 on Hopper; Megatron-LM, TransformerEngine and DeepSpeed cover the distributed-training side; HuggingFace transformers and vLLM cover the model-loading and serving side. The common gotchas below are the ones that bite teams during a first build-out, regardless of which library combination is chosen.

Numerical precision is the most common foot-gun. Loss spikes to NaN in FP16 training almost always mean activation magnitudes exceeded the FP16 range (~65,504) — the cure is BF16 (the 8-bit exponent matches FP32 range), or PyTorch GradScaler loss scaling, not lowering the learning rate. Gradient norm collapsing to zero usually means dead activations in a ReLU FFN or an LR schedule that did not step after warm-up; switching to SwiGLU and verifying the schedule actually progresses fixes both. After FP8 conversion, inference quality drops sharply when the calibration set was unrepresentative — recalibrate with 512-1024 representative prompts, or fall back to FP8-weight plus BF16-activation mixed precision.
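The range difference behind the FP16 versus BF16 advice can be checked directly with torch.finfo. This is a small demonstration rather than a training recipe; the value 70,000 is an arbitrary example chosen to sit just above the FP16 maximum.

```python
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
print(f"fp16 max {fp16.max:.0f}  eps {fp16.eps:.6f}")
print(f"bf16 max {bf16.max:.3e}  eps {bf16.eps:.6f}")

# An activation just above the FP16 maximum overflows to inf in FP16
# but stays finite (at reduced precision) in BF16.
activation = torch.tensor(70000.0)
as_fp16 = activation.to(torch.float16)
as_bf16 = activation.to(torch.bfloat16)
print(as_fp16.item(), as_bf16.item())
```

BF16 buys FP32-like range by giving up mantissa bits, so its epsilon is larger than FP16's; that trade is usually the right one for training, where an overflow is fatal and a small rounding error is not.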

Attention-implementation bugs are subtle. If attention output is identical for every position, or earlier positions change when a later token changes, the causal mask was applied incorrectly (transposed or off-by-one): the mask should be torch.tril of shape (n, n), and the masked-out scores must be set to -inf BEFORE the softmax, not zeroed after it. A four-token sanity test catches it. If generated text becomes repetitive past N tokens, that is position-encoding extrapolation beyond the trained length: apply RoPE YaRN scaling with the trained-length-to-target ratio, or fine-tune at the longer context.
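A four-token sanity test of the kind mentioned above might look like the following sketch. The idea: perturb only the last token and confirm that the outputs at earlier positions do not move. The function name causal_attention is hypothetical, and the test uses the raw input as Q, K and V to keep it short.

```python
import torch
import torch.nn.functional as F


def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    n = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Mask BEFORE the softmax: disallowed positions get -inf, hence zero weight.
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
x = torch.randn(4, 8)                       # four tokens, d_k = 8
out = causal_attention(x, x, x)

x_perturbed = x.clone()
x_perturbed[3] += 1.0                       # change only the last token
out_perturbed = causal_attention(x_perturbed, x_perturbed, x_perturbed)

leak = (out[:3] - out_perturbed[:3]).abs().max().item()
print("max change at positions 0-2:", leak)
print("position 3 changed:", not torch.allclose(out[3], out_perturbed[3]))
```

A correct mask gives zero change at positions 0 to 2. A second cheap check is that position 0 can only attend to itself, so its output must equal its own value vector.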

At serving time, KV cache OOMs on long contexts mean the cache no longer fits in the HBM left over after weights and activations, so the PagedAttention block pool is exhausted. The fix is to lower max_num_seqs or max_model_len in vLLM, switch to a GQA model, enable prefix caching, or move to H200/B200 with larger HBM. Distributed training stalls at high GPU count typically come from NCCL all-reduce contention with dataloader I/O; pinning dataloader workers to NUMA nodes, tuning NCCL_IB_HCA and verifying NVLink topology with nvidia-smi topo -m fixes the bulk. Output logits biased toward common tokens usually means the output unembedding was tied to the input embedding without re-norming; apply an explicit RMSNorm before lm_head.

On sizing for a planning conversation: inference weight memory is 2 bytes per parameter at BF16, 1 byte at INT8/FP8, 0.5 byte at INT4/FP4, so a 70B BF16 model needs 140 GB just for weights. KV cache memory per request is 2 * num_layers * num_kv_heads * d_k * n * bytes_per_element: Llama 3.1 70B (80 layers, 8 KV heads with GQA, d_k = 128) at 32k tokens FP16 is about 11 GB per request, 43 GB at 128k. Training FLOPs follow ~6 * P * T (about 2 * P * T for the forward pass and 4 * P * T for the backward pass; the optimiser update is negligible by comparison); training peak memory is roughly 16-20 bytes per parameter without ZeRO/FSDP, dropping to ~2-3 bytes per parameter per GPU with ZeRO-3 sharding across a large data-parallel group plus activation checkpointing. Runtime-level throughput, $/token and per-second tables live in the inference-runtime entries (vllm, tensorrt-llm, sglang), not here.
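The sizing formulas above are simple enough to keep as a small calculator. The sketch below only restates that arithmetic; the function names are hypothetical, the 989 TFLOPS peak and 75 % MFU are the figures used elsewhere in this entry, and the 15T-token budget is the TL;DR example.

```python
def weight_bytes(params: int, bytes_per_param: float) -> float:
    return params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, d_k, n_tokens, bytes_per_el=2):
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * d_k * n_tokens * bytes_per_el


def training_flops(params: int, tokens: int) -> int:
    return 6 * params * tokens          # ~2PT forward + ~4PT backward


def gpu_days(flops: float, peak_flops_per_s: float, mfu: float) -> float:
    return flops / (peak_flops_per_s * mfu) / 86_400


params = 70 * 10**9
weights_bf16 = weight_bytes(params, 2)
kv_32k = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_k=128, n_tokens=32 * 1024)
kv_128k = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_k=128, n_tokens=128 * 1024)
flops = training_flops(params, 15 * 10**12)
days_75 = gpu_days(flops, 989e12, 0.75)
days_40 = gpu_days(flops, 989e12, 0.40)

print(f"weights  {weights_bf16 / 1e9:.0f} GB")
print(f"KV cache {kv_32k / 1e9:.1f} GB at 32k, {kv_128k / 1e9:.1f} GB at 128k")
print(f"training {flops:.2e} FLOPs, {days_75:,.0f} H100-days at 75% MFU, "
      f"{days_40:,.0f} at 40% MFU")
```

Dividing the H100-day figure by the cluster size gives a first-order wall-clock estimate; it ignores restarts, evaluation runs and any communication overhead not already captured in the MFU.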

Model card discipline is the standard practice for any team that ships a Transformer-based product. Document dataset composition, weight licence (verify the actual checkpoint licence, not the announcement blog), capabilities, limitations, evaluation results, intended and out-of-scope tasks. The UK government's AI Cyber Security Code of Practice (2025) cites model-card disclosure as a baseline control; the EU AI Act, whose obligations for general-purpose models have applied since August 2025, requires comparable technical documentation for any such model placed on the EU market.

Warning: If you see NaN losses in mixed-precision training, the cure is almost never 'lower the learning rate'. It is almost always 'switch FP16 to BF16' or 'add gradient clipping at norm 1.0'. Lowering LR just masks the underlying numerical issue until it surfaces somewhere else.

Where the Transformer fits in the Yobitel stack

Every workload that runs on Yobibyte — Yobitel's managed AI-native platform — sits on top of a Transformer. The platform exposes a model catalogue covering open-weights decoder-only LLMs (Llama 3.1, Qwen 3, DeepSeek-V3, Mistral), encoder embedding models (BGE, E5, multilingual variants), and encoder-decoder translation models (NLLB). Customers select a model by name in their workspace and Yobibyte routes inference through industry-standard runtimes selected per workload, with the architectural details (KV cache management, FP8 quantisation, GQA support, RoPE scaling) handled transparently.

Omniscient Compute, Yobitel's compute-orchestration fabric, picks the GPU SKU per workload from the inventory described in the compute-hardware entries (H100, H200, B200, B300, MI300X). The picker reasons about KV cache memory pressure, prefill FLOPs and decode bandwidth, all of which are Transformer-architecture-derived signals. A long-context Llama 70B request lands on an H200 or B200 not by accident but because the routing logic encodes the sizing maths from the Practical implementation notes above.

InferenceBench — Yobitel's public benchmark service — measures Transformer serving performance across the same GPU and runtime combinations, reporting tokens-per-second, time-to-first-token, p99 latency and cost-per-million-tokens. The benchmarks are reproducible, the configurations are published, and the methodology is open. For teams selecting a serving stack, InferenceBench is the empirical complement to the architectural reasoning in this entry.

References

Attention Is All You Need (Vaswani et al., 2017) · arXiv
The Illustrated Transformer (Jay Alammar) · Jay Alammar
The Annotated Transformer (Harvard NLP) · Harvard NLP
RoFormer: Enhanced Transformer with Rotary Position Embedding · arXiv
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · arXiv
Llama 3 Technical Report · arXiv
Scaling Laws for Neural Language Models (Kaplan et al., 2020) · arXiv
Training Compute-Optimal Large Language Models (Chinchilla, Hoffmann et al., 2022) · arXiv
