SwiGLU Activation

TL;DR

SwiGLU (Shazeer, 2020, arXiv:2002.05202) is a Gated Linear Unit where one branch passes through a Swish activation: SwiGLU(x) = (Swish(xW) ⊙ xV) · W_O.
It replaces the ReLU-based feed-forward block in the original Transformer with a gated three-matrix variant.
In published comparisons, SwiGLU FFNs reach slightly better perplexity than ReLU or GELU FFNs at matched compute, an improvement on the order of 1-2 per cent, and they are widely reported to train stably at large scale.
Llama, Mistral, Qwen, PaLM and DeepSeek all use SwiGLU; Gemma uses the closely related GeGLU. To keep parameter count roughly constant, the hidden expansion is reduced from 4× to about 2.67× (8/3).

Background: The Position-wise FFN

Every Transformer block has a position-wise feed-forward sub-layer: two linear projections with a non-linearity between them. The original 2017 paper used ReLU and expanded d_model by 4× in the hidden layer. BERT switched to GELU. T5 used ReLU. By 2020 it was empirically clear that activation choice modestly but reliably affected final perplexity.

Noam Shazeer's 2020 'GLU Variants Improve Transformer' paper systematically compared ReLU, GELU, Swish and several Gated Linear Unit variants. The winners — GeGLU and SwiGLU — both replaced the two-matrix FFN with a three-matrix gated variant.

The Operation

Standard FFN: FFN(x) = activation(x · W_1) · W_2. SwiGLU FFN: SwiGLU(x) = (Swish(x · W) ⊙ (x · V)) · W_O, where ⊙ is element-wise multiplication and Swish(x) = x · sigmoid(x).

Three matrices instead of two — W and V both project from d_model to d_ff, then W_O contracts back to d_model. The 'gate' is the Swish-activated branch; it modulates the linear branch element-wise. This gating gives the network more expressive capacity per parameter without dramatically more compute.

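```python
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # d_ff is typically ~2.67 * d_model so that the three matrices
        # total ~8 * d_model^2 parameters, matching a 4x two-matrix FFN
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # W: gate branch
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)  # V: linear branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # W_O: back to d_model

    def forward(self, x):
        # F.silu is Swish with beta = 1: silu(z) = z * sigmoid(z)
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```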
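
To make the formula easy to check by hand, here is a minimal dependency-free sketch of the same computation on a single input vector. The helper names (`matvec`, `swiglu_ffn`) and the toy weights are illustrative assumptions, not part of any library; real implementations use batched tensor operations as in the module above.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def swish(z: float) -> float:
    """Swish(z) = z * sigmoid(z), the function PyTorch exposes as SiLU."""
    return z * sigmoid(z)


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def swiglu_ffn(x, W, V, W_O):
    gate = [swish(g) for g in matvec(x, W)]      # Swish(x · W)
    up = matvec(x, V)                            # x · V
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise product
    return matvec(hidden, W_O)                   # contract back to d_model


# Toy sizes chosen for illustration only: d_model = 2, d_ff = 3.
x = [1.0, -2.0]
W = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
W_O = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y = swiglu_ffn(x, W, V, W_O)
print(y)
```

With these weights x · W = [0, -2, -2] and x · V = [1, -2, 4], so the first hidden unit is switched off entirely (Swish(0) = 0) while the other two are scaled by Swish(-2). The gate branch decides how much of each linear-branch value passes through.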

Why SwiGLU Wins

Three explanations are commonly offered. Shazeer's paper reports the improvement without proposing a mechanism, so these are plausible accounts rather than settled results:

Smoother gradients than ReLU: Swish is differentiable everywhere and its derivative is zero only at a single point (near x ≈ -1.28), which avoids the 'dying ReLU' regime in which a unit receives exactly zero gradient for every negative input.
Multiplicative gating: the element-wise product lets one projection of the input scale another, suppressing or amplifying each hidden unit independently, an input-dependent modulation that a single-branch FFN has no direct way to express.
Empirical scaling: SwiGLU's perplexity advantage is small at small scale but has been reported to persist as models grow, which is the property that matters in a frontier model.
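
The gradient claim can be checked numerically. The sketch below compares the analytic derivative of Swish, sigmoid(z) + z · sigmoid(z) · (1 - sigmoid(z)), with the derivative of ReLU at a few sample points. The function names are illustrative, and the ReLU derivative at exactly zero is set to 0 by convention.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def swish(z: float) -> float:
    return z * sigmoid(z)


def swish_grad(z: float) -> float:
    """d/dz [z * sigmoid(z)] = sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s + z * s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of max(0, z); taken as 0 at z == 0 by convention."""
    return 1.0 if z > 0 else 0.0


for z in (-3.0, -1.0, -0.1, 0.5, 2.0):
    print(f"z={z:+.1f}  swish'={swish_grad(z):+.4f}  relu'={relu_grad(z):.1f}")
```

For every negative sample the ReLU derivative is exactly zero, while the Swish derivative is small but non-zero, so a unit pushed into negative pre-activations still receives a learning signal.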

The 2.67× Convention

A standard ReLU FFN with 4× hidden expansion has 2 · 4 · d_model² = 8 d_model² parameters (ignoring biases). SwiGLU has three matrices, so to match parameter count the hidden expansion is reduced to 8/3 ≈ 2.67×, giving 3 · 2.67 · d_model² ≈ 8 d_model². The original Llama follows this convention, rounding d_ff up to a multiple of 256 for hardware-friendly tile sizes.
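
The arithmetic above can be reproduced in a few lines. This is a sketch under stated assumptions: the function names are made up for illustration, and the rounding rule (truncate two thirds of 4 · d_model, then round up to the next multiple of 256) is one reading of the convention described here, not a quotation of any particular codebase.

```python
def two_matrix_ffn_params(d_model: int, expansion: int = 4) -> int:
    """Weights in a standard FFN: d_model -> expansion*d_model -> d_model."""
    return 2 * d_model * (expansion * d_model)


def swiglu_d_ff(d_model: int, multiple_of: int = 256) -> int:
    """Assumed rule: 2/3 of the 4x width, rounded up to a multiple of `multiple_of`."""
    d_ff = int(2 * (4 * d_model) / 3)
    return multiple_of * ((d_ff + multiple_of - 1) // multiple_of)


def swiglu_ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a SwiGLU FFN: gate, up and down projections."""
    return 3 * d_model * d_ff


d_model = 4096
d_ff = swiglu_d_ff(d_model)
baseline = two_matrix_ffn_params(d_model)
gated = swiglu_ffn_params(d_model, d_ff)

print(d_ff, d_ff / d_model)   # hidden width and expansion ratio
print(gated / baseline)       # parameter count relative to the 4x baseline
```

For d_model = 4096 the rule yields d_ff = 11008, an expansion of 2.6875×, and the gated block ends up within about one per cent of the two-matrix baseline's parameter count.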

When you see 'hidden_size 8192, intermediate_size 28672' in a Llama config, that 28672 / 8192 = 3.5 is the d_ff/d_model ratio. It sits well above 2.67 because the 70B-class Llama configurations widen the FFN beyond the parameter-matched baseline. Different model families pick different exact ratios.
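
A quick way to read such a config is to compute the ratio and compare the resulting FFN size against the 8 d_model² baseline. The sketch below uses a plain dictionary holding the two fields quoted above; the helper names are hypothetical.

```python
def ffn_ratio(config: dict) -> float:
    """d_ff / d_model for a config with hidden_size and intermediate_size."""
    return config["intermediate_size"] / config["hidden_size"]


def swiglu_params_vs_baseline(config: dict) -> float:
    """SwiGLU FFN weights (3 * d_model * d_ff) relative to 8 * d_model^2."""
    d_model = config["hidden_size"]
    d_ff = config["intermediate_size"]
    return (3 * d_model * d_ff) / (8 * d_model * d_model)


config = {"hidden_size": 8192, "intermediate_size": 28672}
print(ffn_ratio(config))                  # expansion ratio
print(swiglu_params_vs_baseline(config))  # FFN size relative to the 4x baseline
```

A ratio of 3.5 corresponds to 3 · 3.5 / 8 ≈ 1.31 times the parameters of the 4× two-matrix FFN, so this configuration spends extra capacity on the FFN instead of matching the baseline.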

Adoption

PaLM (Google, 2022) was among the first frontier models to publicly use SwiGLU. Llama (Meta, 2023) brought it into the open-source mainstream. Since then it has been the default FFN in most major decoder-only LLM releases, including Llama 2/3, Mistral 7B, Mixtral, Qwen 1.5/2/2.5/3, DeepSeek-V2/V3 and Phi-3.

GeGLU (which uses GELU instead of Swish) is less common in modern models, with Gemma a notable exception. In Shazeer's experiments the two variants performed comparably, so the preference for SwiGLU is not explained by a large measured gap.

References

GLU Variants Improve Transformer (Shazeer, 2020) · arXiv
PaLM: Scaling Language Modeling with Pathways · arXiv
Llama 2: Open Foundation and Fine-Tuned Chat Models · arXiv
