GGML Format

TL;DR

GGML is both a C tensor library (still active) and the name of an early single-file model format (now legacy).
The tensor library underpins llama.cpp, whisper.cpp and other ggerganov projects; it provides quantised dtypes and CPU-first execution.
The model format was superseded first by GGJT and then, in August 2023, by GGUF.
Old `.ggml` and `.bin` files are still found in the wild but most tools have dropped support; convert to GGUF for modern workflows.

Two Things Called GGML

GGML refers to two related-but-distinct things. The first is the C tensor library written by Georgi Gerganov — a small, dependency-free implementation of tensor operations with quantised dtypes, designed for CPU and accelerator backends. The library remains very much alive and powers llama.cpp and whisper.cpp.

The second is a single-file model format that llama.cpp used in its early days. That format went through several iterations (GGML, GGMF, GGJT) before being replaced by GGUF in August 2023, because the older formats lacked the metadata flexibility needed for new model architectures.

Tensor Library

C99, no external dependencies.
Quantised dtypes: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1 and the K-quants (Q2_K through Q6_K).
Operators include matrix multiplication, attention, RoPE, normalisation, activations.
Backends: CPU (with AVX, AVX2, AVX-512, NEON, AMX paths), CUDA, Metal, Vulkan, SYCL, ROCm.
Used by llama.cpp, whisper.cpp, stable-diffusion.cpp, bark.cpp.
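
To make the quantised dtypes above more concrete, here is a minimal sketch of Q4_0-style block quantisation in plain Python. It assumes the commonly described layout of 32 weights per block sharing one scale, with 4-bit codes offset by 8. The function names are hypothetical, and the code is not bit-exact with ggml, which stores the scale as fp16 and packs two codes per byte.

```python
BLOCK_SIZE = 32  # assumption: Q4_0 groups weights into blocks of 32


def quantize_block(values):
    """Quantise one block to (scale, 4-bit codes) in a Q4_0-like scheme."""
    assert len(values) == BLOCK_SIZE
    # Take the element with the largest magnitude, keeping its sign.
    extreme = max(values, key=abs)
    scale = extreme / -8.0 if extreme != 0 else 0.0
    inv = 1.0 / scale if scale != 0 else 0.0
    # Round each value to an integer code in [0, 15]; code 8 represents zero.
    codes = [min(15, max(0, int(v * inv + 8.5))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and 4-bit codes."""
    return [scale * (c - 8) for c in codes]


weights = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions a block costs 16 bytes of codes plus a 2-byte scale, or about 4.5 bits per weight. The reconstruction error per weight stays within one quantisation step (the absolute value of `scale`).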

Legacy Format

Old GGML model files used a fixed-layout binary structure with limited metadata. As model architectures diversified — RoPE variants, GQA, sliding windows, MoE — adding new fields required breaking the format. GGUF replaced it with a key-value metadata section that lets new architectures land without breaking existing readers.
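
The difference between a fixed layout and key-value metadata can be illustrated with a toy example. The encoding below is deliberately simplified and is not the real GGUF wire format; the field names and helper functions are hypothetical. It only shows why a reader that looks fields up by key keeps working when a writer adds a new one, whereas a positional reader has no room for the extra field.

```python
import struct


def read_fixed(buf: bytes) -> dict:
    """Hypothetical fixed-layout header: three int32 fields in a set order."""
    n_vocab, n_embd, n_layer = struct.unpack_from("<3i", buf, 0)
    return {"n_vocab": n_vocab, "n_embd": n_embd, "n_layer": n_layer}


def write_kv(fields: dict) -> bytes:
    """Toy key-value encoding: uint32 count, then (key length, key, int32 value)."""
    out = struct.pack("<I", len(fields))
    for key, value in fields.items():
        raw = key.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw + struct.pack("<i", value)
    return out


def read_kv(buf: bytes) -> dict:
    """Read every key-value pair; callers pick the keys they understand."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset, fields = 4, {}
    for _ in range(count):
        (key_len,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        key = buf[offset:offset + key_len].decode("utf-8")
        offset += key_len
        (value,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        fields[key] = value
    return fields


# A newer writer adds "n_head_kv"; an older reader ignores keys it does not know.
OLD_READER_KEYS = ("n_vocab", "n_embd", "n_layer")
blob = write_kv({"n_vocab": 32000, "n_embd": 4096, "n_layer": 32, "n_head_kv": 8})
known = {k: v for k, v in read_kv(blob).items() if k in OLD_READER_KEYS}
```

In the fixed layout, inserting a fourth field shifts every later byte, so older readers misparse the file unless the format version changes.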

Practically, any model file with a `.ggml` extension or marked as 'GGJT' should be considered legacy. Tools dropped support over 2023-2024; convert to GGUF (or download a re-published GGUF version) for use with current llama.cpp builds.
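
File extensions are unreliable, so a quick way to check which generation a file belongs to is to read its first four bytes. The sketch below assumes the magic values used by llama.cpp's loaders as best recalled (the legacy formats written as a little-endian uint32, and GGUF starting with the ASCII bytes `GGUF`). Verify them against the llama.cpp version you target. The function names are hypothetical.

```python
import struct

# Assumed magic values, interpreted as little-endian uint32.
MAGICS = {
    0x67676D6C: "GGML (legacy)",
    0x67676D66: "GGMF (legacy)",
    0x67676A74: "GGJT (legacy)",
    0x46554747: "GGUF",  # the bytes "GGUF" on disk
}


def identify_format(header: bytes) -> str:
    """Classify a model file from its first four bytes."""
    if len(header) < 4:
        return "unknown (file too short)"
    (magic,) = struct.unpack("<I", header[:4])
    return MAGICS.get(magic, "unknown")


def identify_file(path: str) -> str:
    """Open a file and classify it by its magic number."""
    with open(path, "rb") as f:
        return identify_format(f.read(4))
```

A result of "unknown" for a `.bin` file often means it is not a GGML-family file at all, for example a PyTorch checkpoint.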

Pre-GGUF model files are not forward compatible: current loaders cannot read them. If you find a `.ggml` or `.bin` model that no longer loads, the most reliable fix is to reconvert the original checkpoint (for example, safetensors) to GGUF, or to download a re-published GGUF version.

When to Care

Engineers writing custom inference for non-LLM models (audio, image, scientific) sometimes prefer GGML as a tensor library because of its simplicity and broad hardware reach. For LLM users the library is an implementation detail; the user-visible format is GGUF.
