TL;DR
- Axolotl (axolotl-ai-cloud/axolotl, formerly OpenAccess-AI-Collective/axolotl) is an Apache 2.0 open-source fine-tuning framework that wraps Transformers, PEFT, TRL, bitsandbytes, DeepSpeed, FSDP and Accelerate behind a single YAML config file — turning 'train this model with this recipe on this dataset' into a Git-checkable artefact.
- First-class support for LoRA, QLoRA, full fine-tunes, continued pretraining, multi-modal training, sample packing, NEFTune, FlashAttention 2/3, Liger kernels and Unsloth kernels — plus the entire preference family (DPO, ORPO, KTO, IPO, CPO, SimPO, GRPO) using the same config schema.
- Standard invocation is `axolotl train config.yml` (formerly `accelerate launch -m axolotl.cli.train config.yml`); the framework parses the YAML, validates it against a Pydantic schema, infers chat templates and tokeniser configuration, builds the dataset pipeline, wires DeepSpeed / FSDP, and hands off to TRL's SFTTrainer / DPOTrainer / GRPOTrainer.
- Hardware sweep: a single H100 80 GB runs 8-13B QLoRA fine-tunes in 1-3 hours per epoch on 10k examples; a single H200 141 GB handles 70B QLoRA at long context; 8x H100 with DeepSpeed ZeRO-3 covers 70B full FT and 405B QLoRA; multi-node FSDP scales close to linearly to 64+ GPUs.
- It is the recipe-of-choice for many open-model release teams (Nous Research, Cognitive Computations, Teknium, Arcee, Allen AI) — most public fine-tune cards on Hugging Face that say 'trained with Axolotl' link a YAML you can clone unchanged. Yobibyte's FineTune resource exposes Axolotl as one of its execution backends, hidden behind a customer-facing API.
Overview
Axolotl exists because writing a correct fine-tuning script from scratch is a minefield. The engineer has to choose an optimiser, a learning-rate schedule, the right LoRA target modules for the architecture, a chat template, a tokeniser pad strategy, a data collator, a distributed backend, gradient checkpointing strategy, mixed-precision settings, attention implementation, sample packing, sequence-length policy and dozens of smaller knobs. Get any one wrong — a missing pad token, the wrong chat template, gradient accumulation interacting badly with sample packing — and the run will look fine in the loss curve but converge to a model that quietly underperforms in evaluation. Axolotl encodes the institutional knowledge of those choices into a YAML schema and a Pydantic validator: if your config makes sense, the run will work; if it does not, Axolotl tells you why before the first batch ships.
Axolotl started in 2023 as OpenAccess-AI-Collective/axolotl and now lives at axolotl-ai-cloud/axolotl under the same Apache 2.0 licence and a commercial-cloud arm (Axolotl AI Cloud) that operates managed training. The open-source library remains the substrate every commercial offering relies on; nothing in the YAML is gated. By mid-2026 it supports every mainstream open-weights family (Llama 1/2/3/3.1/3.2/3.3, Mistral, Mixtral, Gemma 1/2/3, Qwen 1.5/2/2.5/3, Phi-2/3/3.5, DeepSeek-V2/V3, Yi, CodeLlama, StarCoder2, Granite), every PEFT method (LoRA, QLoRA, DoRA, LoftQ, GaLore, ReLoRA), every preference method TRL ships (DPO, ORPO, KTO, IPO, CPO, SimPO, GRPO) and every distributed backend Hugging Face Accelerate exposes (DeepSpeed ZeRO-1/2/3, FSDP, FSDP2, plain DDP, single-GPU).
This entry helps you decide whether Axolotl is the right fine-tune harness for your workload, write a config that will train successfully on the first attempt, and reason about where it sits versus Unsloth, LLaMA-Factory and writing your own TRL loop. Yobitel's Yobibyte FineTune resource uses Axolotl as one of its internal execution backends — customers submit a high-level job spec and Yobibyte runs the equivalent Axolotl pipeline on Yobitel-managed H100 / H200 capacity in UK and EU NeoCloud regions with NCSC OFFICIAL alignment — so understanding Axolotl is the fastest route to understanding what Yobibyte FineTune is doing under the hood and when to switch from managed to self-hosted.
Quick start: 8B QLoRA fine-tune in 60 seconds of typing
The shortest path from an empty directory to a trained adapter on a single H100 is four shell commands. The config below targets Llama 3.1 8B with QLoRA r=32 on the Alpaca dataset, runs for a single epoch, and produces a portable adapter directory of roughly 350 MB.
# 1. Install (Python 3.10+, CUDA 12.4+).
pip install "axolotl[flash-attn,deepspeed]>=0.8.0"
# 2. Pull a known-good example config to start from.
axolotl fetch examples/llama-3/qlora-fsdp-70b.yaml # or 'lora-8b.yaml'
# 3. Edit base_model, datasets and output_dir in the YAML.
$EDITOR llama-3-qlora.yml
# 4. Train. Single-GPU; for multi-GPU add 'accelerate launch' or set 'deepspeed:' in the config.
axolotl train llama-3-qlora.yml
# Outputs:
# ./outputs/llama3-qlora/adapter_model.safetensors (~350 MB)
# ./outputs/llama3-qlora/adapter_config.json
# ./outputs/llama3-qlora/training_args.bin
# ./outputs/llama3-qlora/trainer_state.json (loss curve, eval metrics)
# 5. Merge for serving, or push the adapter as-is for multi-LoRA hosting.
axolotl merge-lora llama-3-qlora.yml --lora-model-dir ./outputs/llama3-qlora
Start from `axolotl fetch examples/<family>/<recipe>.yaml` rather than a hand-written config. The example files at `examples/llama-3/`, `examples/mistral/`, `examples/qwen/`, `examples/gemma/` and `examples/phi/` are CI-tested every release on real hardware and codify the right target modules, chat template and optimiser per architecture.
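The "roughly 350 MB" adapter size quoted for the quick-start recipe can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for Llama 3.1 8B with r=32 on all seven linear projections. The `LlamaShape` helper is hypothetical, and the 4 bytes per parameter figure assumes the adapter is saved in FP32, which is an assumption rather than something the recipe states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    """Published Llama 3.1 8B dimensions (hypothetical helper, not an Axolotl class)."""
    hidden: int = 4096         # hidden_size
    intermediate: int = 14336  # intermediate_size of the MLP
    layers: int = 32           # num_hidden_layers
    kv_dim: int = 1024         # 8 key/value heads * head_dim 128 (grouped-query attention)


def lora_param_count(shape: LlamaShape, r: int) -> int:
    """Trainable parameters when LoRA wraps all seven linear projections."""
    h, m, kv = shape.hidden, shape.intermediate, shape.kv_dim
    # (in_features, out_features) of each wrapped projection in one decoder layer.
    projections = [
        (h, h),   # q_proj
        (h, kv),  # k_proj
        (h, kv),  # v_proj
        (h, h),   # o_proj
        (h, m),   # gate_proj
        (h, m),   # up_proj
        (m, h),   # down_proj
    ]
    # LoRA adds two matrices per projection: A (r x in) and B (out x r).
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections)
    return per_layer * shape.layers


params = lora_param_count(LlamaShape(), r=32)
size_mb = params * 4 / 1e6  # assumption: 4 bytes per parameter (FP32 storage)
print(f"{params:,} trainable parameters, ~{size_mb:.0f} MB")
# prints: 83,886,080 trainable parameters, ~336 MB
```

If the adapter were stored in BF16 instead, the same count would come to about half that size, so treat the figure as an order-of-magnitude check rather than a guarantee.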
How it works: from YAML to a TRL trainer
Axolotl is not a new trainer. It is a config-validation, dataset-normalisation and orchestration layer over the standard Hugging Face stack. The CLI entry point reads the YAML, validates every field against an internal Pydantic schema, resolves defaults and architecture-specific overrides, then constructs the same Transformers / PEFT / TRL / Accelerate objects you would build by hand — wired together correctly.
Internally the run unfolds in roughly six phases:
1. Config load: the YAML is parsed, validated with Pydantic, and dotted CLI overrides are applied (`--learning_rate 1e-4` works on any field).
2. Tokeniser and chat template selection: the canonical template is pulled from the tokeniser's `chat_template` attribute when present, with architecture-specific fallbacks for older models.
3. Dataset pipeline: each entry in `datasets:` is loaded via Hugging Face `datasets`, normalised through a per-format converter (Alpaca, ShareGPT, OpenAI conversations, raw completion, or a custom Jinja template), tokenised, optionally sample-packed up to `sequence_len`, and shuffled.
4. Model load: the base model is loaded in BF16 or 4-bit NF4 (`load_in_4bit: true`) via bitsandbytes; PEFT wrappers are attached if `adapter:` is set; `prepare_model_for_kbit_training` is called automatically when quantising.
5. Distributed wrap: Accelerate selects the backend, which is DeepSpeed (if `deepspeed:` points to a ZeRO config), FSDP (`fsdp:` set), or plain DDP, and wraps the model accordingly.
6. Train: TRL's SFTTrainer (default), DPOTrainer, ORPOTrainer, KTOTrainer, CPOTrainer or GRPOTrainer is constructed with the resolved arguments and `.train()` is invoked.
Axolotl adds periodic eval, MLflow / W&B / TensorBoard logging hooks, and a final `save_pretrained` to `output_dir`.
The value Axolotl adds over a hand-written TRL script is the validation layer. The Pydantic schema knows that `adapter: qlora` requires `load_in_4bit: true` and refuses to start if both are not set. It knows that `sample_packing: true` needs `pad_to_sequence_len: true` and a specific data collator. It knows that `lora_target_modules: [q_proj, k_proj]` on a Mixture-of-Experts model misses the expert layers and warns. It knows that mixing `gradient_checkpointing: true` with `use_reentrant: true` on PyTorch 2.5+ produces a silent NaN and patches it. These rules are written down in the validator so the run either works or refuses to start with a precise error, which is the difference between Axolotl and 'a Python script that calls SFTTrainer'.
- Entry point: `axolotl train <config.yml>` (or `accelerate launch -m axolotl.cli.train <config.yml>` for explicit multi-GPU control).
- Config validation: Pydantic schema, fail-fast with precise error messages before any GPU memory is touched.
- Dataset converters built in: alpaca, sharegpt, openai (conversations), llama2_chat, chatml, completion, jinja (custom template).
- Sample packing: concatenates short sequences up to `sequence_len`, with FlashAttention's variable-length attention masking — 2-4x throughput on short-sequence chat data.
- Distributed backends: DeepSpeed ZeRO-1/2/3 (`deepspeed: deepspeed_configs/zero3.json`), FSDP / FSDP2 (`fsdp: ['full_shard', 'auto_wrap']`), plain DDP, single-GPU.
- Output format: standard PEFT adapter directory (adapter_model.safetensors + adapter_config.json) or merged BF16 model — both load directly into vLLM, TensorRT-LLM and SGLang.
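The fail-fast validation idea described above can be illustrated with a small standalone sketch. It is not Axolotl's validator: `FineTuneConfig` and `validate` are hypothetical names, only a handful of fields are modelled, and plain dataclasses stand in for Pydantic so the example runs with the standard library alone. It encodes two of the cross-field rules mentioned in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FineTuneConfig:
    # Hypothetical subset of fields, named after the YAML keys used in this entry.
    base_model: str
    adapter: Optional[str] = None      # "lora", "qlora", or None for full fine-tuning
    load_in_4bit: bool = False
    sample_packing: bool = False
    pad_to_sequence_len: bool = False


def validate(cfg: FineTuneConfig) -> List[str]:
    """Collect every cross-field problem; an empty list means the run may start."""
    errors = []
    if cfg.adapter not in (None, "lora", "qlora"):
        errors.append(f"adapter: unknown value {cfg.adapter!r}")
    if cfg.adapter == "qlora" and not cfg.load_in_4bit:
        errors.append("adapter: qlora requires load_in_4bit: true")
    if cfg.sample_packing and not cfg.pad_to_sequence_len:
        errors.append("sample_packing: true requires pad_to_sequence_len: true")
    return errors


bad = FineTuneConfig(
    base_model="meta-llama/Meta-Llama-3.1-8B",
    adapter="qlora",
    sample_packing=True,
)
for problem in validate(bad):
    print("config error:", problem)
```

Returning all problems at once, rather than raising on the first, mirrors the "precise error before any GPU memory is touched" behaviour the section describes and saves a round trip per mistake.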
Reference: every config.yml field worth knowing
Axolotl's config schema is the surface most operators interact with. The table below is a reference for the fields that show up in nearly every real-world fine-tune run, listed roughly in the order they appear in a config.
| Field | Type | Typical value | What it does |
|---|---|---|---|
| base_model | string | meta-llama/Meta-Llama-3.1-8B | Hugging Face model ID or local path |
| model_type | string | auto | Override architecture detection (rarely needed) |
| tokenizer_type | string | auto | Override tokeniser class |
| load_in_4bit | bool | true | Enable QLoRA NF4 quantisation of the base |
| load_in_8bit | bool | false | Enable 8-bit quantisation (less common in 2026) |
| bnb_4bit_quant_type | string | nf4 | NF4 (default, near-optimal) or fp4 |
| bnb_4bit_use_double_quant | bool | true | Double-quantise scaling constants |
| bnb_4bit_compute_dtype | string | bfloat16 | Compute dtype after dequant (use BF16, not FP16) |
| adapter | string | qlora | lora, qlora, or empty for full FT |
| lora_r | int | 32 | LoRA rank (sweep [8, 16, 32, 64]) |
| lora_alpha | int | 64 | LoRA alpha (convention: 2 * lora_r) |
| lora_dropout | float | 0.05 | LoRA dropout (0 for large datasets) |
| lora_target_modules | list[string] | [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj] | Linear layers to wrap with LoRA |
| lora_modules_to_save | list[string] | [embed_tokens, lm_head] | Layers to fully fine-tune alongside LoRA (used for vocab expansion) |
| peft_use_rslora | bool | true | Rank-stabilised LoRA scaling (helpful for r >= 64) |
| peft_use_dora | bool | false | Enable DoRA on top of LoRA |
| peft_use_loftq | bool | false | LoftQ init for QLoRA quality recovery |
| datasets | list[object] | [{path: tatsu-lab/alpaca, type: alpaca}] | One or more dataset entries; each has path + type |
| chat_template | string | llama3 | llama3, chatml, gemma, qwen, mistral, alpaca, jinja |
| sequence_len | int | 4096 | Max tokens per sample; longer needs more activation memory |
| sample_packing | bool | true | Concatenate short samples up to sequence_len |
| pad_to_sequence_len | bool | true | Required when sample_packing is on |
| train_on_inputs | bool | false | If true, loss flows on prompt tokens too (rarely wanted) |
| eval_sample_packing | bool | false | Disable packing on eval split for accurate per-example loss |
| val_set_size | float | 0.05 | Fraction of training data held out for validation |
| micro_batch_size | int | 2 | Per-device train batch size |
| gradient_accumulation_steps | int | 4 | Effective batch = micro * grad_accum * world_size |
| num_epochs | int | 3 | Total passes over the dataset |
| max_steps | int | -1 | Cap total optimiser steps (overrides num_epochs) |
| learning_rate | float | 0.0002 | Peak LR (2e-4 for LoRA, 1e-5 for full FT) |
| lr_scheduler | string | cosine | cosine, linear, constant, constant_with_warmup |
| warmup_ratio | float | 0.03 | Fraction of total steps used for LR warmup |
| optimizer | string | paged_adamw_8bit | Paged AdamW for QLoRA; adamw_torch_fused for full FT |
| weight_decay | float | 0.0 | L2 regularisation |
| bf16 | bool | true | Use bfloat16 mixed precision (default on Ampere+) |
| fp16 | bool | false | Use float16 (legacy, risk of NaN with LoRA) |
| tf32 | bool | true | Enable TF32 matmuls on Ampere+ |
| flash_attention | bool | true | FlashAttention 2 (or 3 on Hopper) |
| liger_kernel | bool | false | Liger fused kernels — extra throughput on supported models |
| unsloth_lora_mlp | bool | false | Unsloth kernels for LoRA MLP (single-GPU speed-up) |
| gradient_checkpointing | bool | true | Trade compute for activation memory |
| gradient_checkpointing_kwargs | object | {use_reentrant: false} | Required false on PyTorch 2.5+ |
| neftune_noise_alpha | float | 5 | NEFTune embedding noise (5-15 helps chat fluency) |
| deepspeed | string | deepspeed_configs/zero3.json | Path to ZeRO config; switches Accelerate to DeepSpeed |
| fsdp | list[string] | [full_shard, auto_wrap] | Enable FSDP wrapping (alternative to DeepSpeed) |
| fsdp_config | object | {fsdp_offload_params: false} | FSDP detailed options |
| rl | string | dpo | Switch to preference training (dpo, orpo, kto, ipo, cpo, simpo, grpo) |
| rl_beta | float | 0.1 | DPO/ORPO/KTO regularisation strength |
| output_dir | string | ./outputs/llama3-qlora | Where adapters and checkpoints land |
| save_steps | int | 200 | Checkpoint cadence |
| save_total_limit | int | 3 | Keep only N most recent checkpoints |
| logging_steps | int | 10 | Loss logging cadence |
| eval_steps | int | 200 | Validation cadence |
| wandb_project | string | yobitel-finetune | W&B project (optional) |
| mlflow_tracking_uri | string | https://mlflow.example.com | MLflow tracking server (optional) |
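Several rows in the table are linked by simple arithmetic: the effective batch size formula, the `lora_alpha` / `lora_r` scaling factor, and the number of warmup steps implied by `warmup_ratio`. The helpers below are a minimal sketch with hypothetical function names. The warmup estimate assumes one optimiser step per effective batch and ignores sample packing, which reduces the real step count.

```python
import math


def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         world_size: int = 1) -> int:
    """Effective batch = micro * grad_accum * world_size."""
    return micro_batch_size * gradient_accumulation_steps * world_size


def lora_scaling(lora_alpha: float, lora_r: int, rank_stabilised: bool = False) -> float:
    """Multiplier applied to the LoRA update.

    Standard LoRA scales by alpha / r; rank-stabilised LoRA scales by alpha / sqrt(r).
    """
    if rank_stabilised:
        return lora_alpha / math.sqrt(lora_r)
    return lora_alpha / lora_r


def warmup_steps(num_examples: int, num_epochs: int,
                 batch_size: int, warmup_ratio: float) -> int:
    """Rough warmup length, assuming unpacked data (an assumption of this sketch)."""
    total_steps = math.ceil(num_examples / batch_size) * num_epochs
    return math.ceil(total_steps * warmup_ratio)


batch = effective_batch_size(micro_batch_size=2, gradient_accumulation_steps=8)
print(batch)                                            # 16
print(lora_scaling(64, 32))                             # 2.0
print(round(lora_scaling(64, 32, rank_stabilised=True), 2))
print(warmup_steps(10_000, 3, batch, 0.03))
```

Note how much larger the rank-stabilised multiplier is at the same `lora_alpha`; switching that option on without revisiting alpha or the learning rate changes the effective update size considerably.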
# llama-3-8b-qlora.yml — production-ready single-H100 fine-tune
base_model: meta-llama/Meta-Llama-3.1-8B
strict: false
# QLoRA: 4-bit NF4 base + BF16 LoRA on top.
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
peft_use_rslora: true
# Data — Alpaca format auto-detected from 'type'.
datasets:
- path: tatsu-lab/alpaca
  type: alpaca
chat_template: llama3
val_set_size: 0.05
# Context + packing.
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false
# Optimisation.
micro_batch_size: 2
gradient_accumulation_steps: 8 # effective batch = 16 on 1 GPU
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
warmup_ratio: 0.03
# Precision + kernels.
bf16: true
tf32: true
flash_attention: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
# Logging + checkpointing.
output_dir: ./outputs/llama3-qlora
save_steps: 200
save_total_limit: 3
logging_steps: 10
eval_steps: 200
wandb_project: yobitel-finetune
wandb_run_name: llama3-8b-alpaca-r32
Workload patterns: what people actually run on Axolotl
Three patterns cover the overwhelming majority of real Axolotl runs in 2026; Patterns 4 and 5 below are less common. Each pattern has a canonical YAML shape and a known hardware profile; deviating from the pattern usually means deviating from the path most users have validated.
- Pattern 1 — Single-GPU QLoRA on a 7-13B base. Most common workload. `adapter: qlora`, `load_in_4bit: true`, `lora_r: 16-32`, sample packing on, sequence_len 4096-8192, single H100 or A100 80 GB (or RTX 4090 24 GB for 7B). One epoch over 10-50k examples completes in 1-3 hours. Used for: instruction tuning, domain specialisation, persona / style fine-tunes, RAG quality lifts. Yobibyte FineTune's default profile for sub-30B bases.
- Pattern 2 — Multi-GPU QLoRA / full FT on a 30-70B base. `adapter: qlora` (or empty for full FT), `deepspeed: deepspeed_configs/zero3.json` or `fsdp: [full_shard, auto_wrap]`, 8x H100 80 GB or 4x H200 141 GB, micro_batch_size 1-2 with grad_accum 4-8. Full epoch on a 100k example dataset takes 4-12 hours. Used for: serious open-model release fine-tunes (the dolphin-3, OpenHermes-3, nous-hermes-3 line, etc.), high-quality instruction datasets at scale.
- Pattern 3 — DPO / ORPO / GRPO preference training stacked on an SFT artefact. `rl: dpo` (or orpo, kto, grpo), base = the already-SFT'd model, preference dataset of (prompt, chosen, rejected) triples, lower LR (5e-7 to 1e-6 for full FT, 5e-6 to 5e-5 for LoRA), 1 epoch, KL beta 0.1. This is the standard recipe behind every modern open-weights instruction model's final stage. Yobibyte FineTune exposes DPO as `method: dpo` and resolves to the corresponding Axolotl `rl:` configuration internally.
- Pattern 4 (less common) — Continued pretraining on a domain corpus. `adapter:` empty (full FT), raw text dataset (`type: completion`), sequence_len 8k-32k, learning rate 1e-5 with linear warmup over 5% of steps, mixed in 10-20% original pretraining distribution data to prevent catastrophic forgetting. Used for: legal, medical, code, multilingual domain specialisation where SFT alone is insufficient.
- Pattern 5 — Multi-modal SFT (text + image). `processor_type: auto`, vision-language base (LLaVA, Qwen2-VL, Pixtral), `chat_template: chatml` or model-specific, dataset of conversations with image references. Sequence_len typically 8k+. Requires more careful collator handling — Axolotl 0.8+ ships built-in support for the major VL families.
The boundary between Patterns 1 and 2 is set by base model size and your GPU budget. The boundary between Patterns 2 and 3 is whether you have preference data. If your dataset has only (prompt, response) pairs, Pattern 1 or 2 is the only option. If it has (prompt, chosen, rejected) triples or scalar preference scores, Pattern 3 layered on top of a Pattern-1 SFT artefact reliably outperforms SFT alone.
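The decision rule in the paragraph above can be written down as a small function. This is only a sketch: `choose_pattern` is a hypothetical helper, the 30B cut-off is an assumption borrowed from the "sub-30B" default profile mentioned under Pattern 1, and real choices also depend on GPU budget and sequence length.

```python
PREFERENCE_KEYS = {"prompt", "chosen", "rejected"}


def choose_pattern(record_keys, base_params_billions: float,
                   single_gpu_limit_b: float = 30.0) -> str:
    """Map dataset shape and base model size onto Patterns 1-3.

    record_keys: the field names found in one dataset record.
    single_gpu_limit_b: assumed size (in billions of parameters) below which a
    single-GPU QLoRA run is the default choice.
    """
    if PREFERENCE_KEYS <= set(record_keys):
        # Preference triples: train with rl: dpo (or similar) on top of an SFT artefact.
        return "pattern-3-preference"
    if base_params_billions < single_gpu_limit_b:
        return "pattern-1-single-gpu-qlora"
    return "pattern-2-multi-gpu"


print(choose_pattern({"prompt", "response"}, 8))
print(choose_pattern({"prompt", "chosen", "rejected"}, 8))
print(choose_pattern({"instruction", "output"}, 70))
```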
Sizing and capacity planning
Sizing for an Axolotl run is dominated by three variables: base model size, sequence length and adapter type (full FT vs LoRA vs QLoRA). The table below gives working-set estimates for the common configurations in 2026, assuming `gradient_checkpointing: true`, `flash_attention: true`, sample packing on and the standard `paged_adamw_8bit` optimiser.
| Base size | Method | Seq len | Working VRAM | GPU class | Time / 10k examples |
|---|---|---|---|---|---|
| 7B (Mistral 7B) | QLoRA r=16 | 4k | 12-15 GB | RTX 4090 24 GB | ~45 min |
| 8B (Llama 3.1 8B) | QLoRA r=32 | 4k | 14-18 GB | RTX 4090 / L40S | ~60 min |
| 8B (Llama 3.1 8B) | BF16 LoRA r=32 | 4k | 28-34 GB | A100 40 GB / H100 80 GB | ~40 min |
| 13B (Qwen 14B) | QLoRA r=32 | 4k | 18-24 GB | A100 40 GB / H100 80 GB | ~90 min |
| 34B (Yi 34B) | QLoRA r=32 | 4k | 30-38 GB | A100 80 GB / H100 80 GB | ~3 hr |
| 70B (Llama 3.1 70B) | QLoRA r=32 | 4k | 55-70 GB | H100 80 GB / H200 141 GB | ~6-8 hr |
| 70B (Llama 3.1 70B) | QLoRA r=32 | 16k | 75-95 GB | H200 141 GB | ~12-16 hr |
| 70B (Llama 3.1 70B) | Full FT (ZeRO-3) | 4k | ~500 GB total | 8x H100 80 GB | ~10-14 hr |
| 141B (Mixtral 8x22B) | QLoRA r=32 | 4k | ~95 GB | H200 141 GB / 2x H100 | ~10-14 hr |
| 405B (Llama 3.1 405B) | FSDP-QLoRA r=32 | 4k | ~250 GB total | 4x H100 / 2x H200 | ~20-30 hr |
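A quick way to read the table is to separate the base-weight footprint from everything else (adapter, optimiser state, activations, CUDA overhead). The sketch below computes only the first term from parameter count and bytes per parameter. The function name is hypothetical, and the NF4 figure of 0.5 bytes per parameter ignores the small overhead of quantisation constants.

```python
# Approximate storage cost per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_footprint_gb(params_billions: float, dtype: str) -> float:
    """Memory taken by the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


for size_b in (8, 70, 405):
    nf4 = weight_footprint_gb(size_b, "nf4")
    bf16 = weight_footprint_gb(size_b, "bf16")
    print(f"{size_b}B: {nf4:.1f} GB in NF4, {bf16:.1f} GB in BF16")
```

For a 70B base this gives about 35 GB of NF4 weights, so the gap up to the 55-70 GB working set in the table is what activations, adapter gradients and optimiser state consume at 4k context.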
Limits and quotas
Axolotl itself has no fixed limits — it inherits whatever the underlying PyTorch, Transformers, PEFT, TRL and Accelerate stack supports. The practical ceilings worth knowing are operational, not framework limits.
| Limit | Practical ceiling (2026) | Notes |
|---|---|---|
| Max sequence length | 131,072 tokens (Llama 3.1) / 1M+ (Qwen2.5) | Activation memory grows linearly even with FlashAttention; reduce micro_batch_size accordingly |
| Max base model size | 405B params (FSDP-QLoRA) | Above 405B, multi-node ZeRO-3 with offload is the only path |
| Max dataset size | Unlimited (streaming) | Set `streaming: true` per dataset entry for >1 TB corpora |
| Max world size | 256+ GPUs (tested on Slurm + DeepSpeed) | Communication overhead scales; ZeRO-3 with FlashAttention recommended |
| Max LoRA rank | 1024+ | Quality plateaus well before this; r=64-128 is the practical upper bound |
| Max checkpoint frequency | Every optimiser step | I/O bound; save_steps >= 200 is the sane default |
| Max batch size (effective) | 8192+ | Limited by gradient accumulation precision; 256-1024 is typical for SFT |
| Max tokenisers per run | 1 | One base = one tokeniser; multi-tokeniser distillation requires a separate harness |
The most common 'limit hit' in practice is OOM caused by sample_packing concatenating to exactly `sequence_len` when activation memory is tight. If you see OOM mid-epoch, drop `sequence_len` by 25%, drop `micro_batch_size` to 1, or both — sample packing keeps throughput high even at micro_batch_size=1.
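To see why packed batches can land exactly on `sequence_len`, it helps to look at what packing does. The sketch below is a generic first-fit-decreasing bin packer over token counts, not Axolotl's actual packing code; `pack_sequences` is a hypothetical name.

```python
from typing import List


def pack_sequences(lengths: List[int], sequence_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing.

    Returns bins of sample indices whose token counts sum to at most sequence_len.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # tokens still available in each bin
    # Place the longest samples first so short ones fill the leftover gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > sequence_len:
            raise ValueError(f"sample {idx} has {n} tokens, more than sequence_len={sequence_len}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([idx])
            free.append(sequence_len - n)
    return bins


lengths = [3000, 900, 1200, 4096, 700, 2500]
packed = pack_sequences(lengths, sequence_len=4096)
print(packed)  # [[3], [0, 1], [5, 2], [4]]
print([sum(lengths[i] for i in b) for b in packed])  # [4096, 3900, 3700, 700]
```

Six samples collapse into four rows, and at least one row is completely full. That full row is the worst case for activation memory, which is why an OOM can appear mid-epoch rather than on the first step.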
Observability: loss curves, eval metrics and run hygiene
Axolotl emits training signals through Hugging Face's standard logging stack and integrates with Weights & Biases, MLflow, TensorBoard and ClearML by default: set `wandb_project:`, `mlflow_tracking_uri:`, `use_tensorboard: true` or `clearml_project:` in the config and metrics ship automatically. The fields worth watching on a healthy run are:
- `train/loss` — the live training loss. Should fall steadily, level off around epoch 2-3. A flat curve from step 0 means the model is not learning (check lora_target_modules, learning_rate, dataset format).
- `train/learning_rate` — the LR schedule. Confirms warmup completed and cosine decay engaged. Useful sanity check after editing scheduler config.
- `train/grad_norm` — gradient norm before clipping. Healthy SFT: 0.3-2.0. Spikes to 10+ usually mean LR too high or a poisoned batch.
- `eval/loss` — held-out validation loss. Should track training loss until late in training, then diverge slightly. Large divergence early = overfitting (epochs too high or data too small).
- `eval/perplexity` — exp(eval_loss); easier to compare across runs. Drop relative to base is the headline quality signal for SFT.
- `train/global_step`, `train/epoch` — progress markers. Useful for ETA calculations.
- GPU utilisation — `nvidia-smi dmon` should show >95% SM utilisation in steady state; <80% means the data loader is the bottleneck (increase `dataloader_num_workers`, enable `dataloader_pin_memory`).
- Memory pressure — keep peak VRAM ~10 GB below GPU capacity to allow for activation spikes; check `train/runtime/max_memory` if logged.
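Two of the signals above are easy to compute from logged values: perplexity is the exponential of the eval loss, and gradient-norm spikes can be flagged with a simple threshold. The helpers below are a sketch with hypothetical names; the threshold of 10 comes from the rule of thumb in the list, not from any Axolotl default.

```python
import math
from typing import List


def perplexity(eval_loss: float) -> float:
    """Perplexity is exp(eval_loss), with the loss in nats per token."""
    return math.exp(eval_loss)


def grad_norm_spikes(grad_norms: List[float], threshold: float = 10.0) -> List[int]:
    """Return the indices of logged steps whose gradient norm reaches the threshold."""
    return [step for step, norm in enumerate(grad_norms) if norm >= threshold]


base_loss, tuned_loss = 1.85, 1.20  # illustrative numbers, not measured results
print(f"base perplexity  {perplexity(base_loss):.2f}")
print(f"tuned perplexity {perplexity(tuned_loss):.2f}")
print(grad_norm_spikes([0.8, 1.1, 12.4, 0.9]))  # [2]
```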
# Observability section of a production Axolotl config.
wandb_project: yobitel-finetune
wandb_entity: yobitel-ml
wandb_run_name: llama3-8b-alpaca-r32-v3
wandb_log_model: end
# OR MLflow:
mlflow_tracking_uri: https://mlflow.yobitel.internal
mlflow_experiment_name: llama3-finetune
# OR TensorBoard (local only):
use_tensorboard: true
# Eval cadence and logging.
logging_steps: 10
eval_steps: 200
save_steps: 200
save_total_limit: 3
# Periodic generation samples — qualitative check during training.
do_eval: true
eval_strategy: steps
eval_table_size: 5 # log 5 generated completions per eval step
Cost and FinOps
Axolotl runs are billed by GPU-hour on whichever cloud you use. The dominant cost drivers are base model size, dataset size, sequence length and number of epochs — in that order. Indicative 2026 costs in USD for the common workloads, computed at Yobitel NeoCloud reference pricing ($2.60/H100/hr, $3.20/H200/hr) which sets a useful market baseline.
| Workload | GPUs | Wall time | GPU-hours | Cost (NeoCloud) |
|---|---|---|---|---|
| 8B QLoRA r=32, 10k examples, 1 epoch | 1x H100 | ~1 hr | 1 | ~$2.60 |
| 8B QLoRA r=32, 50k examples, 3 epochs | 1x H100 | ~15 hr | 15 | ~$39 |
| 13B QLoRA r=32, 50k examples, 3 epochs | 1x H100 | ~22 hr | 22 | ~$57 |
| 70B QLoRA r=32, 50k examples, 1 epoch | 1x H200 | ~30 hr | 30 | ~$96 |
| 70B QLoRA r=32, 100k examples, 3 epochs | 2x H200 | ~45 hr | 90 | ~$288 |
| 70B full FT (ZeRO-3), 100k examples, 1 epoch | 8x H100 | ~12 hr | 96 | ~$250 |
| 405B FSDP-QLoRA r=32, 50k examples | 4x H100 | ~25 hr | 100 | ~$260 |
| 8B DPO (after SFT), 20k preference pairs | 1x H100 | ~2 hr | 2 | ~$5 |
QLoRA wins on cost by 3-5x vs full FT for nearly identical quality on standard instruction tuning. Start at QLoRA r=32, evaluate, and only escalate to full FT if your evaluation shows a measurable gap. Yobibyte FineTune defaults to QLoRA for this reason.
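Every row in the cost table is the same calculation: GPU count times wall-clock hours gives GPU-hours, and GPU-hours times the hourly rate gives the cost. A minimal sketch, using the reference rates quoted above and a hypothetical `run_cost` helper:

```python
from typing import Tuple

# Reference rates quoted in this section, in USD per GPU-hour.
RATES_USD_PER_GPU_HOUR = {"H100": 2.60, "H200": 3.20}


def run_cost(gpu: str, gpu_count: int, wall_hours: float) -> Tuple[float, float]:
    """Return (gpu_hours, cost_usd) for a run."""
    gpu_hours = gpu_count * wall_hours
    return gpu_hours, gpu_hours * RATES_USD_PER_GPU_HOUR[gpu]


for label, gpu, count, hours in [
    ("8B QLoRA, 50k examples, 3 epochs", "H100", 1, 15),
    ("70B QLoRA, 100k examples, 3 epochs", "H200", 2, 45),
    ("70B full FT, 100k examples, 1 epoch", "H100", 8, 12),
]:
    gpu_hours, cost = run_cost(gpu, count, hours)
    print(f"{label}: {gpu_hours:.0f} GPU-hours, ~${cost:.0f}")
```

Swapping in your own provider's rates is a one-line change to the dictionary, which makes it easy to compare the table against a different cloud.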
Security and compliance
Axolotl is a training harness — it executes whatever code, dataset and model the operator points it at. Security posture is the operator's responsibility, but the framework supports the controls you need to satisfy NCSC OFFICIAL, SOC 2, ISO 27001 and GDPR Article 32 requirements when running on regulated infrastructure.
- Model and dataset provenance: `base_model:` and `datasets:` accept local paths and S3 / GCS / Azure URIs alongside Hugging Face IDs — keep sensitive bases and datasets inside your control plane rather than pulling from public Hub.
- Token handling: `hf_use_auth_token: true` consumes the `HF_TOKEN` env var; rotate and scope tokens with read-only access to private repos.
- Trust-remote-code: `trust_remote_code: false` is the safe default; some VL and exotic architectures require true, which executes arbitrary code from the model repo — audit before enabling on production capacity.
- Output isolation: `output_dir:` should live on encrypted storage (LUKS, AWS EBS KMS, etc.); adapters can contain learned representations of training data and should be treated as sensitive artefacts.
- Audit logging: when running on Yobitel NeoCloud, every Axolotl invocation is captured through standards-based observability (OpenTelemetry, Prometheus) and shipped to the customer's tenancy audit log — sufficient evidence for SOC 2 CC6 / ISO 27001 A.12.4.
- Air-gapped operation: Axolotl runs offline with `HF_HUB_OFFLINE=1` once base, tokeniser and dataset are pre-staged; required for OFFICIAL-SENSITIVE workloads where outbound network access is disallowed.
- Reproducibility: the YAML config + dataset hash + base model SHA together form the reproducibility manifest. Pin every dependency in `pip install` (e.g. `axolotl==0.8.0`, `transformers==4.45.0`, `peft==0.13.0`) and commit the lockfile so the run can be reproduced to evidence grade.
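The reproducibility manifest from the last bullet can be produced with a few lines of standard-library code. This is a sketch under stated assumptions: `build_manifest` and its field names are hypothetical, the dataset is assumed to be a single local file, and the base model SHA is passed in as a string you have already looked up.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(config_path: Path, dataset_path: Path, base_model_sha: str) -> dict:
    """Bundle the three identifiers that pin a fine-tune run."""
    return {
        "config_sha256": sha256_file(config_path),
        "dataset_sha256": sha256_file(dataset_path),
        "base_model_sha": base_model_sha,
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as sorted, indented JSON so diffs stay stable."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Calling `build_manifest(Path("llama-3-qlora.yml"), Path("train.jsonl"), "<base model commit SHA>")` and committing the result next to the YAML gives an auditor a single file to check; the file names here are placeholders.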
Migration and alternatives
Axolotl is one of four real options for serious LLM fine-tuning in 2026. Picking between them is mostly a function of priorities — throughput, ecosystem freshness, UI, multi-node — rather than capability.
| Tool | Strength | Weakness | When to pick |
|---|---|---|---|
| Axolotl | Most flexible, ships every new TRL technique fast, YAML-versioned, multi-node native | Steeper config than Unsloth | Production teams shipping many fine-tunes; multi-GPU / multi-node; preference training |
| Unsloth | 2x throughput + 50-70% less VRAM on single GPU via Triton kernels | Single-GPU only in OSS; model-architecture-specific | Solo researcher on one GPU; supported model family; throughput-bound |
| LLaMA-Factory | Gradio web UI, 100+ pre-baked model templates, broad family coverage | More opinionated; harder to customise deeply | Teams that want UI-driven workflow; rapid model-zoo exploration |
| Hand-written TRL | Total control; no YAML schema lock-in | You re-implement everything Axolotl validates | Research code experimenting with novel recipes; one-off custom losses |
| Yobibyte FineTune (managed) | API-only; runs Axolotl on Yobitel-managed H100/H200; multi-LoRA serving included | Less granular than self-hosted Axolotl | Teams that want fine-tuning as a service, not infrastructure to operate |
Axolotl + Unsloth compose: setting `unsloth_lora_mlp: true`, `unsloth_lora_qkv: true`, `unsloth_lora_o: true` in an Axolotl config wires Unsloth's kernels into the Axolotl path for supported architectures, giving you most of Unsloth's single-GPU speed-up with Axolotl's flexibility. This is the highest-throughput single-GPU recipe in 2026.
Troubleshooting
Failure modes that bite real Axolotl users and the fixes that resolve them.
| Symptom | Most likely cause | Fix |
|---|---|---|
| Pydantic validation error on `axolotl train` | Missing required field or invalid combination | Read the error — Pydantic names the field. Most common: `adapter: qlora` without `load_in_4bit: true` |
| Loss is NaN from step 1 | FP16 with LoRA (use BF16) or wrong chat template | Set `bf16: true`, `fp16: false`; confirm `chat_template:` matches base model |
| Loss decreases but eval quality is terrible | Loss masking off — `train_on_inputs: true` | Set `train_on_inputs: false` (the default; check if overridden) |
| GPU at 60% utilisation, slow training | Data loader is the bottleneck | Increase `dataloader_num_workers` (4-8) and `dataloader_pin_memory: true` |
| OOM mid-epoch despite stable start | Sample packing hit a max-length batch | Drop `micro_batch_size` to 1 or `sequence_len` by 25% |
| `RuntimeError: CUDA error: device-side assert` | Tokeniser produced an out-of-vocab ID | Likely added special tokens without `lora_modules_to_save: [embed_tokens, lm_head]` |
| DeepSpeed ZeRO-3 hangs at start | Mismatched CUDA / NCCL versions across workers | Re-install with consistent CUDA 12.4+; check `NCCL_DEBUG=INFO` output |
| FSDP wraps but does not save adapter correctly | FSDP state-dict type | Set `fsdp_state_dict_type: FULL_STATE_DICT` in `fsdp_config` |
| DPO loss does not decrease | Reference model is wrong or beta too high | Confirm `rl_beta: 0.1` (not 1.0), reference defaults to frozen base |
| Merged model produces garbage | Merged QLoRA adapter directly into 4-bit base | Use `axolotl merge-lora` with `--save_safetensors` — dequantises base first |
| Sample packing reduces throughput instead of increasing it | `eval_sample_packing: true` triggers re-packing on every eval | Set `eval_sample_packing: false` |
| Wandb logs but training loss never appears | `report_to: []` or wandb credentials missing | Set `report_to: [wandb]` and `WANDB_API_KEY` env var |
Where Axolotl fits in the Yobitel stack
Yobibyte FineTune — the customer-facing fine-tune resource on Yobitel's Yobibyte platform — uses Axolotl as one of its internal execution backends for the open-weights families Axolotl supports best (Llama, Mistral, Mixtral, Gemma, Qwen, Phi, DeepSeek). The customer-facing API accepts a high-level job spec (base, method = lora|qlora|dpo, dataset reference, rank, epochs, learning rate, spend cap) and the platform resolves it into the equivalent Axolotl YAML, runs the job on Yobitel-managed H100 / H200 capacity in NCSC OFFICIAL-aligned UK and EU NeoCloud regions, and returns the resulting adapter directly into the Yobibyte multi-LoRA inference surface — so the customer can call their fine-tuned model through an OpenAI-compatible endpoint within minutes of the job completing.
For teams that want self-hosted control rather than the managed Yobibyte FineTune surface, Yobitel NeoCloud rents H100 and H200 SXM5 capacity by the hour with the same NCSC OFFICIAL alignment; the same Axolotl YAML the customer would run locally runs identically on rented NeoCloud GPUs. The choice between managed Yobibyte FineTune and self-managed Axolotl-on-NeoCloud is the standard build-vs-buy axis: managed wins on time-to-first-adapter and integrated multi-LoRA serving; self-managed wins on direct control over every field in the YAML.
InferenceBench, Yobitel's public AI-model benchmark, evaluates fine-tuned adapters alongside base models on its leaderboard so customers can compare the empirical quality lift of an Axolotl-produced fine-tune against the base it derived from before committing to production rollout. Cross-link: see the LoRA, QLoRA, SFT, DPO and Yobibyte entries for the corresponding methods and the platform that consumes the outputs.