Axolotl

TL;DR

Axolotl (axolotl-ai-cloud/axolotl, formerly OpenAccess-AI-Collective/axolotl) is an Apache 2.0 open-source fine-tuning framework that wraps Transformers, PEFT, TRL, bitsandbytes, DeepSpeed, FSDP and Accelerate behind a single YAML config file — turning 'train this model with this recipe on this dataset' into a Git-checkable artefact.
First-class support for LoRA, QLoRA, full fine-tunes, continued pretraining, multi-modal training, sample packing, NEFTune, FlashAttention 2/3, Liger kernels and Unsloth kernels — plus the entire preference family (DPO, ORPO, KTO, IPO, CPO, SimPO, GRPO) using the same config schema.
Standard invocation is `axolotl train config.yml` (formerly `accelerate launch -m axolotl.cli.train config.yml`); the framework parses the YAML, validates it against a Pydantic schema, infers chat templates and tokeniser configuration, builds the dataset pipeline, wires DeepSpeed / FSDP, and hands off to TRL's SFTTrainer / DPOTrainer / GRPOTrainer.
Hardware sweep: a single H100 80 GB runs 8-13B QLoRA fine-tunes in 1-3 hours per epoch on 10k examples; a single H200 141 GB handles 70B QLoRA at long context; 8x H100 with DeepSpeed ZeRO-3 covers 70B full FT and 405B QLoRA; multi-node FSDP scales to 64+ GPUs.
It is the recipe-of-choice for many open-model release teams (Nous Research, Cognitive Computations, Teknium, Arcee, Allen AI) — most public fine-tune cards on Hugging Face that say 'trained with Axolotl' link a YAML you can clone unchanged. Yobibyte's FineTune resource exposes Axolotl as one of its execution backends, hidden behind a customer-facing API.

Overview

Axolotl exists because writing a correct fine-tuning script from scratch is a minefield. The engineer has to choose an optimiser, a learning-rate schedule, the right LoRA target modules for the architecture, a chat template, a tokeniser pad strategy, a data collator, a distributed backend, gradient checkpointing strategy, mixed-precision settings, attention implementation, sample packing, sequence-length policy and dozens of smaller knobs. Get any one wrong — a missing pad token, the wrong chat template, gradient accumulation interacting badly with sample packing — and the run will look fine in the loss curve but converge to a model that quietly underperforms in evaluation. Axolotl encodes the institutional knowledge of those choices into a YAML schema and a Pydantic validator: if your config makes sense, the run will work; if it does not, Axolotl tells you why before the first batch ships.

Axolotl started in 2023 as OpenAccess-AI-Collective/axolotl and now lives at axolotl-ai-cloud/axolotl under the same Apache 2.0 licence and a commercial-cloud arm (Axolotl AI Cloud) that operates managed training. The open-source library remains the substrate every commercial offering relies on; nothing in the YAML is gated. By mid-2026 it supports every mainstream open-weights family (Llama 1/2/3/3.1/3.2/3.3, Mistral, Mixtral, Gemma 1/2/3, Qwen 1.5/2/2.5/3, Phi-2/3/3.5, DeepSeek-V2/V3, Yi, CodeLlama, StarCoder2, Granite), every PEFT method (LoRA, QLoRA, DoRA, LoftQ, GaLore, ReLoRA), every preference method TRL ships (DPO, ORPO, KTO, IPO, CPO, SimPO, GRPO) and every distributed backend Hugging Face Accelerate exposes (DeepSpeed ZeRO-1/2/3, FSDP, FSDP2, plain DDP, single-GPU).

This entry helps you decide whether Axolotl is the right fine-tune harness for your workload, write a config that will train successfully on the first attempt, and reason about where it sits versus Unsloth, LLaMA-Factory and writing your own TRL loop. Yobitel's Yobibyte FineTune resource uses Axolotl as one of its internal execution backends — customers submit a high-level job spec and Yobibyte runs the equivalent Axolotl pipeline on Yobitel-managed H100 / H200 capacity in UK and EU NeoCloud regions with NCSC OFFICIAL alignment — so understanding Axolotl is the fastest route to understanding what Yobibyte FineTune is doing under the hood and when to switch from managed to self-hosted.

Quick start: 8B QLoRA fine-tune in 60 seconds of typing

The shortest path from an empty directory to a trained adapter on a single H100 is four shell commands. The recipe below targets Llama 3.1 8B with QLoRA r=32 on the Alpaca dataset, runs for a single epoch, and produces a portable adapter directory of roughly 350 MB.

# 1. Install (Python 3.10+, CUDA 12.4+).
pip install "axolotl[flash-attn,deepspeed]>=0.8.0"

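# 2. Pull the example configs, then copy an 8B QLoRA recipe to start from.
#    (File names under examples/llama-3/ vary by release; list the directory to confirm.)
axolotl fetch examples
cp examples/llama-3/qlora.yml llama-3-qlora.yml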

# 3. Edit base_model, datasets and output_dir in the YAML.
$EDITOR llama-3-qlora.yml

# 4. Train. Single-GPU; for multi-GPU add 'accelerate launch' or set 'deepspeed:' in the config.
axolotl train llama-3-qlora.yml

# Outputs:
#   ./outputs/llama3-qlora/adapter_model.safetensors  (~350 MB)
#   ./outputs/llama3-qlora/adapter_config.json
#   ./outputs/llama3-qlora/training_args.bin
#   ./outputs/llama3-qlora/trainer_state.json         (loss curve, eval metrics)

# 5. Merge for serving, or push the adapter as-is for multi-LoRA hosting.
axolotl merge-lora llama-3-qlora.yml --lora-model-dir ./outputs/llama3-qlora

Tip: Start from a config under examples/<family>/ (pulled with axolotl fetch examples) rather than a hand-written one. The example files at examples/llama-3/, examples/mistral/, examples/qwen/, examples/gemma/ and examples/phi/ are CI-tested every release on real hardware and codify the right target modules, chat template and optimiser per architecture.
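The adapter size of roughly 350 MB quoted above can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for Llama 3.1 8B at r=32 on all seven linear projections. The layer shapes are the published Llama 3.1 8B dimensions; the 4 bytes per parameter is an assumption (adapter weights stored in FP32), so treat the result as an estimate rather than a guarantee.

```python
# Back-of-envelope LoRA adapter size for Llama 3.1 8B at r=32.
# Assumption: adapter weights are saved in FP32 (4 bytes per parameter).

HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 KV heads * 128 head dim (grouped-query attention)
LAYERS = 32

# (in_features, out_features) for each wrapped linear layer in one block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every wrapped layer."""
    per_block = sum(rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_block * LAYERS


params = lora_param_count(32)
size_mb = params * 4 / 1e6  # assumed FP32 storage
print(f"{params:,} trainable parameters, ~{size_mb:.0f} MB on disk")
```

This gives about 84 million trainable parameters and roughly 336 MB, in line with the quoted figure. Halving the rank halves the adapter size.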

How it works: from YAML to a TRL trainer

Axolotl is not a new trainer. It is a config-validation, dataset-normalisation and orchestration layer over the standard Hugging Face stack. The CLI entry point reads the YAML, validates every field against an internal Pydantic schema, resolves defaults and architecture-specific overrides, then constructs the same Transformers / PEFT / TRL / Accelerate objects you would build by hand — wired together correctly.

Internally the run unfolds in roughly six phases.

(1) Config load: the YAML is parsed, validated against the Pydantic schema, and CLI overrides are applied (--learning_rate 1e-4 works on any field).
(2) Tokeniser and chat template selection: Axolotl pulls the canonical template from the tokeniser's chat_template attribute when present and applies architecture-specific fallbacks for older models.
(3) Dataset pipeline: each entry in datasets: is loaded via Hugging Face datasets, normalised through a per-format converter (Alpaca, ShareGPT, OpenAI conversations, raw completion, or a custom Jinja template), tokenised, optionally sample-packed up to sequence_len, and shuffled.
(4) Model load: the base model is loaded in BF16 or 4-bit NF4 (load_in_4bit: true) via bitsandbytes; PEFT wrappers are attached if adapter: is set; prepare_model_for_kbit_training is called automatically when quantising.
(5) Distributed wrap: Accelerate selects the backend, either DeepSpeed (if deepspeed: points to a ZeRO config), FSDP (fsdp: set) or plain DDP, and wraps the model accordingly.
(6) Train: TRL's SFTTrainer (default), DPOTrainer, ORPOTrainer, KTOTrainer, CPOTrainer or GRPOTrainer is constructed with the resolved arguments and .train() is invoked. Axolotl adds periodic eval, MLflow / W&B / TensorBoard logging hooks, and a final save_pretrained to output_dir.
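To make phase (3) concrete, here is a toy sketch of what a per-format converter does with one Alpaca-style record when train_on_inputs is false: prompt tokens stay in input_ids but are masked out of the labels. The whitespace tokeniser, the prompt wording and the function names are illustrative stand-ins, not Axolotl's own code.

```python
# Toy illustration of dataset normalisation plus loss masking.
# The whitespace "tokeniser" and prompt wording are hypothetical stand-ins.

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def toy_tokenise(text: str, vocab: dict[str, int]) -> list[int]:
    """Map each whitespace-separated word to an integer ID, growing the vocab."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(record: dict, vocab: dict[str, int], train_on_inputs: bool = False) -> dict:
    """Turn an Alpaca-style record into input_ids and (optionally masked) labels."""
    prompt = f"### Instruction: {record['instruction']} ### Response:"
    prompt_ids = toy_tokenise(prompt, vocab)
    response_ids = toy_tokenise(record["output"], vocab)
    input_ids = prompt_ids + response_ids
    if train_on_inputs:
        labels = list(input_ids)  # loss flows on every token
    else:
        labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids  # loss on the response only
    return {"input_ids": input_ids, "labels": labels}


vocab: dict[str, int] = {}
example = build_example({"instruction": "Name a colour", "output": "Blue"}, vocab)
print(example["input_ids"])
print(example["labels"])
```

Only the final label carries a real token ID; the seven prompt positions are set to -100, so the model is never rewarded for reproducing the prompt.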

The value Axolotl adds over a hand-written TRL script is the validation layer. The Pydantic schema knows that adapter: qlora requires load_in_4bit: true and refuses to start unless both are set. It knows that sample_packing: true needs pad_to_sequence_len: true and a specific data collator. It knows that lora_target_modules: [q_proj, k_proj] on a Mixture-of-Experts model misses the expert layers, and warns. It knows that mixing gradient_checkpointing: true with use_reentrant: true on PyTorch 2.5+ produces a silent NaN, and patches it. Because these rules are written down in the validator, a run either works or refuses to start with a precise error; that is the difference between Axolotl and 'a Python script that calls SFTTrainer'.
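The idea of fail-fast cross-field validation can be sketched in a few lines. The example below is a simplified, plain-Python stand-in that mirrors two of the rules described above; validate_config and its error strings are hypothetical, and Axolotl's real validator is a Pydantic schema with far more rules.

```python
# Simplified stand-in for cross-field config validation.
# Not Axolotl's actual code: it only mirrors two rules from the text above.


def validate_config(cfg: dict) -> list[str]:
    """Return human-readable errors; an empty list means the config passes."""
    errors = []
    if cfg.get("adapter") == "qlora" and not cfg.get("load_in_4bit", False):
        errors.append("adapter: qlora requires load_in_4bit: true")
    if cfg.get("sample_packing") and not cfg.get("pad_to_sequence_len", False):
        errors.append("sample_packing: true needs pad_to_sequence_len: true")
    return errors


bad = {"adapter": "qlora", "sample_packing": True, "pad_to_sequence_len": True}
good = {**bad, "load_in_4bit": True}

print(validate_config(bad))   # one error naming the missing field
print(validate_config(good))  # empty list
```

The point of the pattern is that every check runs before any model weights are loaded, so a bad combination costs seconds rather than GPU-hours.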

Entry point: axolotl train <config.yml> (or accelerate launch -m axolotl.cli.train <config.yml> for explicit multi-GPU control).
Config validation: Pydantic schema, fail-fast with precise error messages before any GPU memory is touched.
Dataset converters built in: alpaca, sharegpt, openai (conversations), llama2_chat, chatml, completion, jinja (custom template).
Sample packing: concatenates short sequences up to sequence_len, with FlashAttention's variable-length attention masking — 2-4x throughput on short-sequence chat data.
Distributed backends: DeepSpeed ZeRO-1/2/3 (deepspeed: deepspeed_configs/zero3.json), FSDP / FSDP2 (fsdp: ['full_shard', 'auto_wrap']), plain DDP, single-GPU.
Output format: standard PEFT adapter directory (adapter_model.safetensors + adapter_config.json) or merged BF16 model — both load directly into vLLM, TensorRT-LLM and SGLang.
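Sample packing is essentially a bin-packing problem: fit variable-length samples into rows of at most sequence_len tokens. The sketch below uses a greedy first-fit-decreasing heuristic on token counts only. It is an illustration of the idea under that assumption, not Axolotl's packing implementation, and it leaves out the attention masking that stops packed samples from attending to each other.

```python
# Greedy first-fit-decreasing packing of sample lengths into fixed-size rows.
# Illustrative only; real packing also builds per-sample attention boundaries.


def pack_first_fit(lengths: list[int], sequence_len: int) -> list[list[int]]:
    """Group sample lengths into rows whose totals never exceed sequence_len."""
    rows: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        if length > sequence_len:
            raise ValueError(f"sample of {length} tokens exceeds sequence_len={sequence_len}")
        for row in rows:
            if sum(row) + length <= sequence_len:
                row.append(length)  # fits in an existing row
                break
        else:
            rows.append([length])   # no row had space; open a new one
    return rows


lengths = [3000, 1200, 900, 800, 500, 400, 300, 100]
packed = pack_first_fit(lengths, sequence_len=4096)
utilisation = sum(lengths) / (len(packed) * 4096)
print(f"{len(lengths)} samples -> {len(packed)} packed rows, {utilisation:.0%} token utilisation")
```

Without packing, these eight samples would each occupy a padded 4096-token row; packed, they fit in two rows, which is where the throughput gain on short chat data comes from.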

Reference: every config.yml field worth knowing

Axolotl's config schema is the surface most operators interact with. The table below is a reference for the fields that show up in every real-world fine-tune run, followed by a complete example config.

Field	Type	Typical value	What it does
base_model	string	meta-llama/Meta-Llama-3.1-8B	Hugging Face model ID or local path
model_type	string	auto	Override architecture detection (rarely needed)
tokenizer_type	string	auto	Override tokeniser class
load_in_4bit	bool	true	Enable QLoRA NF4 quantisation of the base
load_in_8bit	bool	false	Enable 8-bit quantisation (less common in 2026)
bnb_4bit_quant_type	string	nf4	NF4 (default, near-optimal) or fp4
bnb_4bit_use_double_quant	bool	true	Double-quantise scaling constants
bnb_4bit_compute_dtype	string	bfloat16	Compute dtype after dequant (use BF16, not FP16)
adapter	string	qlora	lora, qlora, or empty for full FT
lora_r	int	32	LoRA rank (sweep [8, 16, 32, 64])
lora_alpha	int	64	LoRA alpha (convention: 2 * lora_r)
lora_dropout	float	0.05	LoRA dropout (0 for large datasets)
lora_target_modules	list[string]	[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]	Linear layers to wrap with LoRA
lora_modules_to_save	list[string]	[embed_tokens, lm_head]	Layers to fully fine-tune alongside LoRA (used for vocab expansion)
peft_use_rslora	bool	true	Rank-stabilised LoRA scaling (helpful for r >= 64)
peft_use_dora	bool	false	Enable DoRA on top of LoRA
peft.loftq_config.loftq_bits	int	4	LoftQ init for QLoRA quality recovery (nested under peft: loftq_config:)
datasets	list[object]	[{path: tatsu-lab/alpaca, type: alpaca}]	One or more dataset entries; each has path + type
chat_template	string	llama3	llama3, chatml, gemma, qwen, mistral, alpaca, jinja
sequence_len	int	4096	Max tokens per sample; longer needs more activation memory
sample_packing	bool	true	Concatenate short samples up to sequence_len
pad_to_sequence_len	bool	true	Required when sample_packing is on
train_on_inputs	bool	false	If true, loss flows on prompt tokens too (rarely wanted)
eval_sample_packing	bool	false	Disable packing on eval split for accurate per-example loss
val_set_size	float	0.05	Fraction of training data held out for validation
micro_batch_size	int	2	Per-device train batch size
gradient_accumulation_steps	int	4	Effective batch = micro * grad_accum * world_size
num_epochs	int	3	Total passes over the dataset
max_steps	int	-1	Cap total optimiser steps (overrides num_epochs)
learning_rate	float	0.0002	Peak LR (2e-4 for LoRA, 1e-5 for full FT)
lr_scheduler	string	cosine	cosine, linear, constant, constant_with_warmup
warmup_ratio	float	0.03	Fraction of total steps used for LR warmup
optimizer	string	paged_adamw_8bit	Paged AdamW for QLoRA; adamw_torch_fused for full FT
weight_decay	float	0.0	L2 regularisation
bf16	bool	true	Use bfloat16 mixed precision (default on Ampere+)
fp16	bool	false	Use float16 (legacy, risk of NaN with LoRA)
tf32	bool	true	Enable TF32 matmuls on Ampere+
flash_attention	bool	true	FlashAttention 2 (or 3 on Hopper)
liger_kernel	bool	false	Liger fused kernels — extra throughput on supported models
unsloth_lora_mlp	bool	false	Unsloth kernels for LoRA MLP (single-GPU speed-up)
gradient_checkpointing	bool	true	Trade compute for activation memory
gradient_checkpointing_kwargs	object	{use_reentrant: false}	Required false on PyTorch 2.5+
neftune_noise_alpha	float	5	NEFTune embedding noise (5-15 helps chat fluency)
deepspeed	string	deepspeed_configs/zero3.json	Path to ZeRO config; switches Accelerate to DeepSpeed
fsdp	list[string]	[full_shard, auto_wrap]	Enable FSDP wrapping (alternative to DeepSpeed)
fsdp_config	object	{fsdp_offload_params: false}	FSDP detailed options
rl	string	dpo	Switch to preference training (dpo, orpo, kto, ipo, cpo, simpo, grpo)
rl_beta	float	0.1	DPO/ORPO/KTO regularisation strength
output_dir	string	./outputs/llama3-qlora	Where adapters and checkpoints land
save_steps	int	200	Checkpoint cadence
save_total_limit	int	3	Keep only N most recent checkpoints
logging_steps	int	10	Loss logging cadence
eval_steps	int	200	Validation cadence
wandb_project	string	yobitel-finetune	W&B project (optional)
mlflow_tracking_uri	string	https://mlflow.example.com	MLflow tracking server (optional)

# llama-3-8b-qlora.yml — production-ready single-H100 fine-tune
base_model: meta-llama/Meta-Llama-3.1-8B
strict: false

# QLoRA: 4-bit NF4 base + BF16 LoRA on top.
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16

adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
peft_use_rslora: true

# Data — Alpaca format auto-detected from 'type'.
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
chat_template: llama3
val_set_size: 0.05

# Context + packing.
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

# Optimisation.
micro_batch_size: 2
gradient_accumulation_steps: 8        # effective batch = 16 on 1 GPU
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
warmup_ratio: 0.03

# Precision + kernels.
bf16: true
tf32: true
flash_attention: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Logging + checkpointing.
output_dir: ./outputs/llama3-qlora
save_steps: 200
save_total_limit: 3
logging_steps: 10
eval_steps: 200
wandb_project: yobitel-finetune
wandb_run_name: llama3-8b-alpaca-r32
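The optimisation block above implies a concrete step budget. The sketch below works it out from the formula in the field table (effective batch = micro * grad_accum * world_size). The figure of 10,000 rows is a hypothetical post-packing dataset size chosen for illustration; with sample_packing on, the real row count depends on how the samples pack.

```python
import math


def schedule(num_rows: int, micro_batch_size: int, gradient_accumulation_steps: int,
             world_size: int, num_epochs: int, warmup_ratio: float) -> dict:
    """Derive optimiser-step counts from the batch and epoch settings."""
    effective_batch = micro_batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the example config; num_rows is an assumed dataset size.
plan = schedule(num_rows=10_000, micro_batch_size=2, gradient_accumulation_steps=8,
                world_size=1, num_epochs=3, warmup_ratio=0.03)
print(plan)
```

With these settings the run takes 625 optimiser steps per epoch and 1,875 in total, so save_steps: 200 and eval_steps: 200 yield nine checkpoints and evaluations over the run.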

Workload patterns: what people actually run on Axolotl

Five patterns cover the overwhelming majority of real Axolotl runs in 2026, with the first three accounting for most of them. Each pattern has a canonical YAML shape and a known hardware profile; deviating from the pattern usually means deviating from the path most users have validated.

Pattern 1 — Single-GPU QLoRA on a 7-13B base. Most common workload. adapter: qlora, load_in_4bit: true, lora_r: 16-32, sample packing on, sequence_len 4096-8192, single H100 or A100 80 GB (or RTX 4090 24 GB for 7B). One epoch over 10-50k examples completes in 1-3 hours. Used for: instruction tuning, domain specialisation, persona / style fine-tunes, RAG quality lifts. Yobibyte FineTune's default profile for sub-30B bases.
Pattern 2 — Multi-GPU QLoRA / full FT on a 30-70B base. adapter: qlora (or empty for full FT), deepspeed: deepspeed_configs/zero3.json or fsdp: [full_shard, auto_wrap], 8x H100 80 GB or 4x H200 141 GB, micro_batch_size 1-2 with grad_accum 4-8. Full epoch on a 100k example dataset takes 4-12 hours. Used for: serious open-model release fine-tunes (the dolphin-3, OpenHermes-3, nous-hermes-3 line, etc.), high-quality instruction datasets at scale.
Pattern 3 — DPO / ORPO / GRPO preference training stacked on an SFT artefact. rl: dpo (or orpo, kto, grpo), base = the already-SFT'd model, preference dataset of (prompt, chosen, rejected) triples, lower LR (5e-7 to 1e-6 for full FT, 5e-6 to 5e-5 for LoRA), 1 epoch, KL beta 0.1. This is the standard recipe behind every modern open-weights instruction model's final stage. Yobibyte FineTune exposes DPO as method: dpo and resolves to the corresponding Axolotl rl: configuration internally.
Pattern 4 (less common) — Continued pretraining on a domain corpus. adapter: empty (full FT), raw text dataset (type: completion), sequence_len 8k-32k, learning rate 1e-5 with linear warmup over 5% of steps, mixed in 10-20% original pretraining distribution data to prevent catastrophic forgetting. Used for: legal, medical, code, multilingual domain specialisation where SFT alone is insufficient.
Pattern 5 — Multi-modal SFT (text + image). processor_type: auto, vision-language base (LLaVA, Qwen2-VL, Pixtral), chat_template: chatml or model-specific, dataset of conversations with image references. Sequence_len typically 8k+. Requires more careful collator handling — Axolotl 0.8+ ships built-in support for the major VL families.

Note: The boundary between Patterns 1 and 2 is set by base model size and your GPU budget. The boundary between Patterns 2 and 3 is whether you have preference data. If your dataset has only (prompt, response) pairs, Pattern 1 or 2 is the only option. If it has (prompt, chosen, rejected) triples or scalar preference scores, Pattern 3 layered on top of a Pattern-1 SFT artefact typically outperforms SFT alone.
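For readers new to Pattern 3, the role of beta is easiest to see in the DPO objective itself. The sketch below computes the standard DPO loss for a single (prompt, chosen, rejected) triple from sequence log-probabilities. The numeric log-probabilities are made up for illustration, and a real trainer computes this over batches of tensors rather than Python floats.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference triple, given summed token log-probabilities."""
    # Implicit rewards: how far the policy has moved from the frozen reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities for illustration only.
untrained = dpo_loss(-14.0, -15.0, -14.0, -15.0)  # policy identical to reference
improved = dpo_loss(-12.0, -20.0, -14.0, -15.0)   # policy now favours the chosen answer
print(f"untrained: {untrained:.4f}, improved: {improved:.4f}")
```

When the policy equals the reference the loss is log 2, and it falls as the policy raises the chosen response relative to the rejected one. A larger beta makes the same shift count for more, which is why beta acts as the regularisation strength.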

Sizing and capacity planning

Sizing for an Axolotl run is dominated by three variables: base model size, sequence length and adapter type (full FT vs LoRA vs QLoRA). The table below gives working-set estimates for the common configurations in 2026, assuming gradient_checkpointing: true, flash_attention: true, sample packing on and the standard paged_adamw_8bit optimiser.

Base size	Method	Seq len	Working VRAM	GPU class	Time / 10k examples
7B (Mistral 7B)	QLoRA r=16	4k	12-15 GB	RTX 4090 24 GB	~45 min
8B (Llama 3.1 8B)	QLoRA r=32	4k	14-18 GB	RTX 4090 / L40S	~60 min
8B (Llama 3.1 8B)	BF16 LoRA r=32	4k	28-34 GB	A100 40 GB / H100 80 GB	~40 min
13B (Qwen 14B)	QLoRA r=32	4k	18-24 GB	A100 40 GB / H100 80 GB	~90 min
34B (Yi 34B)	QLoRA r=32	4k	30-38 GB	A100 80 GB / H100 80 GB	~3 hr
70B (Llama 3.1 70B)	QLoRA r=32	4k	55-70 GB	H100 80 GB / H200 141 GB	~6-8 hr
70B (Llama 3.1 70B)	QLoRA r=32	16k	75-95 GB	H200 141 GB	~12-16 hr
70B (Llama 3.1 70B)	Full FT (ZeRO-3)	4k	~500 GB total	8x H100 80 GB	~10-14 hr
141B (Mixtral 8x22B)	QLoRA r=32	4k	~95 GB	H200 141 GB / 2x H100	~10-14 hr
405B (Llama 3.1 405B)	FSDP-QLoRA r=32	4k	~250 GB total	4x H100 / 2x H200	~20-30 hr
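A quick way to sanity-check the QLoRA rows is to estimate the quantised base weights on their own. The sketch below assumes NF4 with double quantisation at about 4.127 bits per parameter (the figure reported in the QLoRA paper for a block size of 64) and treats every parameter as quantised, which slightly understates real usage. It gives a floor for the base weights only; activations, adapter weights, optimiser state and CUDA overhead make up the rest of the working VRAM column.

```python
# Lower-bound estimate of base-weight memory under NF4 + double quantisation.
# Assumption: ~4.127 bits per parameter, every parameter quantised.

NF4_BITS_PER_PARAM = 4 + 0.127  # 4-bit weights plus double-quantised scaling constants


def nf4_weight_gb(params_billion: float) -> float:
    """Approximate size of the quantised base weights in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * NF4_BITS_PER_PARAM
    return total_bits / 8 / 1e9


for name, size_b in [("8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name}: ~{nf4_weight_gb(size_b):.0f} GB of quantised base weights")
```

For a 70B base this comes to about 36 GB, so roughly half of the 55-70 GB working set in the table is the frozen base and the remainder scales with sequence length and batch size.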

Limits and quotas

Axolotl itself has no fixed limits — it inherits whatever the underlying PyTorch, Transformers, PEFT, TRL and Accelerate stack supports. The practical ceilings worth knowing are operational, not framework limits.

Limit	Practical ceiling (2026)	Notes
Max sequence length	131,072 tokens (Llama 3.1) / 1M+ (Qwen2.5)	Activation memory grows linearly even with FlashAttention; reduce micro_batch_size accordingly
Max base model size	405B params (FSDP-QLoRA)	Above 405B, multi-node ZeRO-3 with offload is the only path
Max dataset size	Unlimited (streaming)	Set `streaming: true` per dataset entry for >1 TB corpora
Max world size	256+ GPUs (tested on Slurm + DeepSpeed)	Communication overhead scales; ZeRO-3 with FlashAttention recommended
Max LoRA rank	1024+	Quality plateaus well before this; r=64-128 is the practical upper bound
Max checkpoint frequency	Every optimiser step	I/O bound; save_steps >= 200 is the sane default
Max batch size (effective)	8192+	Limited by gradient accumulation precision; 256-1024 is typical for SFT
Max tokenisers per run	1	One base = one tokeniser; multi-tokeniser distillation requires a separate harness

Warning: The most common 'limit hit' in practice is OOM caused by sample_packing concatenating to exactly sequence_len when activation memory is tight. If you see OOM mid-epoch, drop sequence_len by 25%, drop micro_batch_size to 1, or both — sample packing keeps throughput high even at micro_batch_size=1.

Observability: loss curves, eval metrics and run hygiene

Axolotl emits training signals through Hugging Face's standard logging stack and integrates with Weights & Biases, MLflow, TensorBoard and ClearML out of the box: set wandb_project:, mlflow_tracking_uri:, use_tensorboard: true or clearml_project: in the config and metrics ship automatically. The signals worth watching on a healthy run are listed below.

train/loss — the live training loss. Should fall steadily, level off around epoch 2-3. A flat curve from step 0 means the model is not learning (check lora_target_modules, learning_rate, dataset format).
train/learning_rate — the LR schedule. Confirms warmup completed and cosine decay engaged. Useful sanity check after editing scheduler config.
train/grad_norm — gradient norm before clipping. Healthy SFT: 0.3-2.0. Spikes to 10+ usually mean LR too high or a poisoned batch.
eval/loss — held-out validation loss. Should track training loss until late in training, then diverge slightly. Large divergence early = overfitting (epochs too high or data too small).
eval/perplexity — exp(eval_loss); easier to compare across runs. Drop relative to base is the headline quality signal for SFT.
train/global_step, train/epoch — progress markers. Useful for ETA calculations.
GPU utilisation — nvidia-smi dmon should show >95% SM utilisation in steady state; <80% means the data loader is the bottleneck (increase dataloader_num_workers, enable dataloader_pin_memory).
Memory pressure — keep peak VRAM ~10 GB below GPU capacity to allow for activation spikes; check train/runtime/max_memory if logged.
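Since eval/perplexity is just exp(eval_loss), the headline comparison against the base model takes only a few lines. The loss values below are hypothetical and stand in for numbers you would read off your own eval logs.

```python
import math


def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean token-level cross-entropy loss."""
    return math.exp(eval_loss)


def relative_drop(base_loss: float, tuned_loss: float) -> float:
    """Fractional perplexity reduction of the tuned model relative to the base."""
    base, tuned = perplexity(base_loss), perplexity(tuned_loss)
    return (base - tuned) / base


# Hypothetical eval losses for illustration only.
base_loss, tuned_loss = 1.60, 1.25
print(f"base perplexity:  {perplexity(base_loss):.2f}")
print(f"tuned perplexity: {perplexity(tuned_loss):.2f}")
print(f"relative drop:    {relative_drop(base_loss, tuned_loss):.1%}")
```

Because perplexity is exponential in the loss, a fixed absolute drop in eval loss always corresponds to the same relative drop in perplexity, which is what makes it convenient for comparing runs.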

# Observability section of a production Axolotl config.
wandb_project: yobitel-finetune
wandb_entity: yobitel-ml
wandb_run_name: llama3-8b-alpaca-r32-v3
wandb_log_model: end

# OR MLflow:
mlflow_tracking_uri: https://mlflow.yobitel.internal
mlflow_experiment_name: llama3-finetune

# OR TensorBoard (local only):
use_tensorboard: true

# Eval cadence and logging.
logging_steps: 10
eval_steps: 200
save_steps: 200
save_total_limit: 3

# Periodic generation samples — qualitative check during training.
do_eval: true
eval_strategy: steps
eval_table_size: 5         # log 5 generated completions per eval step

Cost and FinOps

Axolotl runs are billed by GPU-hour on whichever cloud you use. The dominant cost drivers are base model size, dataset size, sequence length and number of epochs, in that order. The table gives indicative 2026 costs in USD for the common workloads, computed at Yobitel NeoCloud reference pricing ($2.60/H100/hr, $3.20/H200/hr) as a baseline.

Workload	GPUs	Wall time	GPU-hours	Cost (NeoCloud)
8B QLoRA r=32, 10k examples, 1 epoch	1x H100	~1 hr	1	~$2.60
8B QLoRA r=32, 50k examples, 3 epochs	1x H100	~15 hr	15	~$39
13B QLoRA r=32, 50k examples, 3 epochs	1x H100	~22 hr	22	~$57
70B QLoRA r=32, 50k examples, 1 epoch	1x H200	~30 hr	30	~$96
70B QLoRA r=32, 100k examples, 3 epochs	2x H200	~45 hr	90	~$288
70B full FT (ZeRO-3), 100k examples, 1 epoch	8x H100	~12 hr	96	~$250
405B FSDP-QLoRA r=32, 50k examples	4x H100	~25 hr	100	~$260
8B DPO (after SFT), 20k preference pairs	1x H100	~2 hr	2	~$5

Tip: QLoRA wins on cost by 3-5x vs full FT for nearly identical quality on standard instruction tuning. Start at QLoRA r=32, evaluate, and only escalate to full FT if your evaluation shows a measurable gap. Yobibyte FineTune defaults to QLoRA for this reason.
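Every row in the cost table is GPUs times wall time times the hourly rate, so it is easy to re-run the numbers for your own job. The sketch below uses the reference rates quoted above; run_cost is a hypothetical helper name, and the wall-time figures are the table's estimates rather than measurements.

```python
# Re-derive cost-table rows from GPU count, wall time and hourly rate.
RATES_USD_PER_GPU_HOUR = {"H100": 2.60, "H200": 3.20}  # reference pricing from the text


def run_cost(gpu: str, num_gpus: int, wall_hours: float) -> tuple[float, float]:
    """Return (gpu_hours, cost_usd) for a run."""
    gpu_hours = num_gpus * wall_hours
    return gpu_hours, gpu_hours * RATES_USD_PER_GPU_HOUR[gpu]


workloads = [
    ("8B QLoRA, 50k examples, 3 epochs", "H100", 1, 15),
    ("70B QLoRA, 100k examples, 3 epochs", "H200", 2, 45),
    ("70B full FT, 100k examples, 1 epoch", "H100", 8, 12),
]
for label, gpu, count, hours in workloads:
    gpu_hours, cost = run_cost(gpu, count, hours)
    print(f"{label}: {gpu_hours:.0f} GPU-hours, ~${cost:.0f}")
```

Swapping in your own provider's rates or a measured wall time from a short pilot run gives a more reliable budget than the table's estimates.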

Security and compliance

Axolotl is a training harness — it executes whatever code, dataset and model the operator points it at. Security posture is the operator's responsibility, but the framework supports the controls you need to satisfy NCSC OFFICIAL, SOC 2, ISO 27001 and GDPR Article 32 requirements when running on regulated infrastructure.

Model and dataset provenance: base_model: and datasets: accept local paths and S3 / GCS / Azure URIs alongside Hugging Face IDs — keep sensitive bases and datasets inside your control plane rather than pulling from public Hub.
Token handling: hf_use_auth_token: true consumes the HF_TOKEN env var; rotate and scope tokens with read-only access to private repos.
Trust-remote-code: trust_remote_code: false is the safe default; some VL and exotic architectures require true, which executes arbitrary code from the model repo — audit before enabling on production capacity.
Output isolation: output_dir: should live on encrypted storage (LUKS, AWS EBS KMS, etc.); adapters can contain learned representations of training data and should be treated as sensitive artefacts.
Audit logging: when running on Yobitel NeoCloud, every Axolotl invocation is captured through standards-based observability (OpenTelemetry, Prometheus) and shipped to the customer's tenancy audit log — sufficient evidence for SOC 2 CC6 / ISO 27001 A.12.4.
Air-gapped operation: Axolotl runs offline with HF_HUB_OFFLINE=1 once base, tokeniser and dataset are pre-staged; required for OFFICIAL-SENSITIVE workloads where outbound network access is disallowed.
Reproducibility: the YAML config, dataset hash and base model SHA together form the reproducibility manifest. Pin every dependency at install time (e.g. axolotl==0.8.0 plus the exact transformers and peft versions that release resolves to) and commit the lockfile for evidence-grade reproducibility.
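A reproducibility manifest of the kind described in the last item can be produced with the standard library alone. The sketch below hashes a config file and a dataset file and records the base model revision. The manifest layout, the function names and the placeholder revision string are all assumptions for illustration; Axolotl does not emit this file itself.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(config_path: Path, dataset_path: Path, base_model_revision: str) -> dict:
    """Collect the three identifiers that pin a fine-tune run."""
    return {
        "config_sha256": sha256_file(config_path),
        "dataset_sha256": sha256_file(dataset_path),
        "base_model_revision": base_model_revision,
    }


# Demo with throwaway files; in practice point these at the real config and dataset.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.yml"
    dataset = Path(tmp) / "train.jsonl"
    config.write_bytes(b"base_model: example/base\n")
    dataset.write_bytes(b'{"instruction": "hi", "output": "hello"}\n')
    manifest = build_manifest(config, dataset, base_model_revision="<commit-sha>")

print(json.dumps(manifest, indent=2))
```

Committing the manifest next to the YAML and the dependency lockfile gives an auditor one place to confirm that two runs used identical inputs.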

Migration and alternatives

Axolotl is one of four real self-hosted options for serious LLM fine-tuning in 2026; the table also lists managed Yobibyte FineTune for comparison. Picking between them is mostly a function of priorities (throughput, ecosystem freshness, UI, multi-node) rather than capability.

Tool	Strength	Weakness	When to pick
Axolotl	Most flexible, ships every new TRL technique fast, YAML-versioned, multi-node native	Steeper config than Unsloth	Production teams shipping many fine-tunes; multi-GPU / multi-node; preference training
Unsloth	2x throughput + 50-70% less VRAM on single GPU via Triton kernels	Single-GPU only in OSS; model-architecture-specific	Solo researcher on one GPU; supported model family; throughput-bound
LLaMA-Factory	Gradio web UI, 100+ pre-baked model templates, broad family coverage	More opinionated; harder to customise deeply	Teams that want UI-driven workflow; rapid model-zoo exploration
Hand-written TRL	Total control; no YAML schema lock-in	You re-implement everything Axolotl validates	Research code experimenting with novel recipes; one-off custom losses
Yobibyte FineTune (managed)	API-only; runs Axolotl on Yobitel-managed H100/H200; multi-LoRA serving included	Less granular than self-hosted Axolotl	Teams that want fine-tuning as a service, not infrastructure to operate

Note: Axolotl + Unsloth compose: setting unsloth_lora_mlp: true, unsloth_lora_qkv: true, unsloth_lora_o: true in an Axolotl config wires Unsloth's kernels into the Axolotl path for supported architectures, giving you most of Unsloth's single-GPU speed-up with Axolotl's flexibility. This is among the highest-throughput single-GPU recipes available in 2026.

Troubleshooting

Failure modes that bite real Axolotl users and the fixes that resolve them.

Symptom	Most likely cause	Fix
Pydantic validation error on `axolotl train`	Missing required field or invalid combination	Read the error — Pydantic names the field. Most common: `adapter: qlora` without `load_in_4bit: true`
Loss is NaN from step 1	FP16 with LoRA (use BF16) or wrong chat template	Set `bf16: true`, `fp16: false`; confirm `chat_template:` matches base model
Loss decreases but eval quality is terrible	Loss masking off — `train_on_inputs: true`	Set `train_on_inputs: false` (the default; check if overridden)
GPU at 60% utilisation, slow training	Data loader is the bottleneck	Increase `dataloader_num_workers` (4-8) and `dataloader_pin_memory: true`
OOM mid-epoch despite stable start	Sample packing hit a max-length batch	Drop `micro_batch_size` to 1 or `sequence_len` by 25%
`RuntimeError: CUDA error: device-side assert`	Tokeniser produced an out-of-vocab ID	Likely added special tokens without `lora_modules_to_save: [embed_tokens, lm_head]`
DeepSpeed ZeRO-3 hangs at start	Mismatched CUDA / NCCL versions across workers	Re-install with consistent CUDA 12.4+; check `NCCL_DEBUG=INFO` output
FSDP wraps but does not save adapter correctly	FSDP state-dict type	Set `fsdp_state_dict_type: FULL_STATE_DICT` in `fsdp_config`
DPO loss does not decrease	Reference model is wrong or beta too high	Confirm `rl_beta: 0.1` (not 1.0), reference defaults to frozen base
Merged model produces garbage	Merged QLoRA adapter directly into 4-bit base	Use `axolotl merge-lora` with `--save_safetensors` — dequantises base first
Sample packing reduces throughput instead of increasing it	`eval_sample_packing: true` triggers re-packing on every eval	Set `eval_sample_packing: false`
Wandb logs but training loss never appears	`report_to: []` or wandb credentials missing	Set `report_to: [wandb]` and `WANDB_API_KEY` env var

Where Axolotl fits in the Yobitel stack

Yobibyte FineTune — the customer-facing fine-tune resource on Yobitel's Yobibyte platform — uses Axolotl as one of its internal execution backends for the open-weights families Axolotl supports best (Llama, Mistral, Mixtral, Gemma, Qwen, Phi, DeepSeek). The customer-facing API accepts a high-level job spec (base, method = lora|qlora|dpo, dataset reference, rank, epochs, learning rate, spend cap) and the platform resolves it into the equivalent Axolotl YAML, runs the job on Yobitel-managed H100 / H200 capacity in NCSC OFFICIAL-aligned UK and EU NeoCloud regions, and returns the resulting adapter directly into the Yobibyte multi-LoRA inference surface — so the customer can call their fine-tuned model through an OpenAI-compatible endpoint within minutes of the job completing.

For teams that want self-hosted control rather than the managed Yobibyte FineTune surface, Yobitel NeoCloud rents H100 and H200 SXM5 capacity by the hour with the same NCSC OFFICIAL alignment; the same Axolotl YAML the customer would run locally runs identically on rented NeoCloud GPUs. The choice between managed Yobibyte FineTune and self-managed Axolotl-on-NeoCloud is the standard build-vs-buy axis: managed wins on time-to-first-adapter and integrated multi-LoRA serving; self-managed wins on the marginal control benefits of writing the YAML yourself.

InferenceBench, Yobitel's public AI-model benchmark, evaluates fine-tuned adapters alongside base models on its leaderboard so customers can compare the empirical quality lift of an Axolotl-produced fine-tune against the base it derived from before committing to production rollout. Cross-link: see the LoRA, QLoRA, SFT, DPO and Yobibyte entries for the corresponding methods and the platform that consumes the outputs.

References

Axolotl on GitHub · GitHub
Axolotl documentation · Axolotl Project
TRL — Transformer Reinforcement Learning · GitHub
Hugging Face PEFT · GitHub
DeepSpeed ZeRO · DeepSpeed Project

Quick start: 8B QLoRA fine-tune in 60 seconds of typing

# 1. Install (Python 3.10+, CUDA 12.4+).
pip install "axolotl[flash-attn,deepspeed]>=0.8.0"

# 2. Pull the example configs, then copy a known-good 8B QLoRA recipe to start from.
axolotl fetch examples
cp examples/llama-3/qlora.yml llama-3-qlora.yml  # file names vary by release; check examples/llama-3/

# 3. Edit base_model, datasets and output_dir in the YAML.
$EDITOR llama-3-qlora.yml

# 4. Train. Single-GPU; for multi-GPU add 'accelerate launch' or set 'deepspeed:' in the config.
axolotl train llama-3-qlora.yml

# Outputs:
#   ./outputs/llama3-qlora/adapter_model.safetensors  (~350 MB)
#   ./outputs/llama3-qlora/adapter_config.json
#   ./outputs/llama3-qlora/training_args.bin
#   ./outputs/llama3-qlora/trainer_state.json         (loss curve, eval metrics)

# 5. Merge for serving, or push the adapter as-is for multi-LoRA hosting.
axolotl merge-lora llama-3-qlora.yml --lora-model-dir ./outputs/llama3-qlora

Tip: Start from a fetched example config rather than a hand-written one: run axolotl fetch examples, then copy a recipe from examples/<family>/. The example files at examples/llama-3/, examples/mistral/, examples/qwen/, examples/gemma/ and examples/phi/ are CI-tested every release on real hardware and codify the right target modules, chat template and optimiser per architecture.

How it works: from YAML to a TRL trainer

Entry point: axolotl train <config.yml> (or accelerate launch -m axolotl.cli.train <config.yml> for explicit multi-GPU control).
Config validation: Pydantic schema, fail-fast with precise error messages before any GPU memory is touched.
Dataset converters built in: alpaca, sharegpt, openai (conversations), llama2_chat, chatml, completion, jinja (custom template).
Sample packing: concatenates short sequences up to sequence_len, with FlashAttention's variable-length attention masking — 2-4x throughput on short-sequence chat data.
Distributed backends: DeepSpeed ZeRO-1/2/3 (deepspeed: deepspeed_configs/zero3.json), FSDP / FSDP2 (fsdp: ['full_shard', 'auto_wrap']), plain DDP, single-GPU.
Output format: standard PEFT adapter directory (adapter_model.safetensors + adapter_config.json) or merged BF16 model — both load directly into vLLM, TensorRT-LLM and SGLang.
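
The sample-packing step above can be pictured as a bin-packing problem. The following is a minimal sketch only, assuming a greedy first-fit-decreasing strategy over token counts; it is not Axolotl's actual packer, and pack_lengths is a hypothetical helper name used for illustration.

```python
from typing import List


def pack_lengths(lengths: List[int], sequence_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of sample lengths into bins.

    Each bin holds samples whose token counts sum to at most sequence_len.
    Illustrative only: a real packer also tracks sample boundaries so that
    attention does not leak between packed samples.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin
    for n in sorted(lengths, reverse=True):
        if n > sequence_len:
            raise ValueError(f"sample of {n} tokens exceeds sequence_len={sequence_len}")
        for i, cap in enumerate(free):
            if n <= cap:
                bins[i].append(n)
                free[i] -= n
                break
        else:
            # No existing bin has room, so open a new packed sequence.
            bins.append([n])
            free.append(sequence_len - n)
    return bins


if __name__ == "__main__":
    lengths = [900, 300, 1200, 2500, 700, 1500, 400, 600]  # made-up token counts
    packed = pack_lengths(lengths, sequence_len=4096)
    waste = 1 - sum(lengths) / (len(packed) * 4096)
    print(f"{len(lengths)} samples -> {len(packed)} packed sequences")
    print(f"padding waste after packing: {waste:.1%}")
```

Without packing, each of those samples would occupy its own padded sequence, which is where the throughput gain on short chat data comes from.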

Reference: every config.yml field worth knowing

Axolotl's config schema is the surface most operators interact with. The table below lists the fields that show up in most real-world fine-tune runs, ordered from model loading through adapters, data, optimisation, kernels, distributed training and logging.

Field	Type	Typical value	What it does
base_model	string	meta-llama/Meta-Llama-3.1-8B	Hugging Face model ID or local path
model_type	string	auto	Override architecture detection (rarely needed)
tokenizer_type	string	auto	Override tokeniser class
load_in_4bit	bool	true	Enable QLoRA NF4 quantisation of the base
load_in_8bit	bool	false	Enable 8-bit quantisation (less common in 2026)
bnb_4bit_quant_type	string	nf4	NF4 (default, near-optimal) or fp4
bnb_4bit_use_double_quant	bool	true	Double-quantise scaling constants
bnb_4bit_compute_dtype	string	bfloat16	Compute dtype after dequant (use BF16, not FP16)
adapter	string	qlora	lora, qlora, or empty for full FT
lora_r	int	32	LoRA rank (sweep [8, 16, 32, 64])
lora_alpha	int	64	LoRA alpha (convention: 2 * lora_r)
lora_dropout	float	0.05	LoRA dropout (0 for large datasets)
lora_target_modules	list[string]	[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]	Linear layers to wrap with LoRA
lora_modules_to_save	list[string]	[embed_tokens, lm_head]	Layers to fully fine-tune alongside LoRA (used for vocab expansion)
peft_use_rslora	bool	true	Rank-stabilised LoRA scaling (helpful for r >= 64)
peft_use_dora	bool	false	Enable DoRA on top of LoRA
peft_use_loftq	bool	false	LoftQ init for QLoRA quality recovery
datasets	list[object]	[{path: tatsu-lab/alpaca, type: alpaca}]	One or more dataset entries; each has path + type
chat_template	string	llama3	llama3, chatml, gemma, qwen, mistral, alpaca, jinja
sequence_len	int	4096	Max tokens per sample; longer needs more activation memory
sample_packing	bool	true	Concatenate short samples up to sequence_len
pad_to_sequence_len	bool	true	Recommended when sample_packing is on; constant-size buffers reduce memory fragmentation
train_on_inputs	bool	false	If true, loss flows on prompt tokens too (rarely wanted)
eval_sample_packing	bool	false	Disable packing on eval split for accurate per-example loss
val_set_size	float	0.05	Fraction of training data held out for validation
micro_batch_size	int	2	Per-device train batch size
gradient_accumulation_steps	int	4	Effective batch = micro * grad_accum * world_size
num_epochs	int	3	Total passes over the dataset
max_steps	int	-1	Cap total optimiser steps (overrides num_epochs)
learning_rate	float	0.0002	Peak LR (2e-4 for LoRA, 1e-5 for full FT)
lr_scheduler	string	cosine	cosine, linear, constant, constant_with_warmup
warmup_ratio	float	0.03	Fraction of total steps used for LR warmup
optimizer	string	paged_adamw_8bit	Paged AdamW for QLoRA; adamw_torch_fused for full FT
weight_decay	float	0.0	L2 regularisation
bf16	bool	true	Use bfloat16 mixed precision (default on Ampere+)
fp16	bool	false	Use float16 (legacy, risk of NaN with LoRA)
tf32	bool	true	Enable TF32 matmuls on Ampere+
flash_attention	bool	true	FlashAttention 2 (or 3 on Hopper)
liger_kernel	bool	false	Liger fused kernels — extra throughput on supported models
unsloth_lora_mlp	bool	false	Unsloth kernels for LoRA MLP (single-GPU speed-up)
gradient_checkpointing	bool	true	Trade compute for activation memory
gradient_checkpointing_kwargs	object	{use_reentrant: false}	Selects non-reentrant checkpointing, the variant recommended on recent PyTorch
neftune_noise_alpha	float	5	NEFTune embedding noise (5-15 helps chat fluency)
deepspeed	string	deepspeed_configs/zero3.json	Path to ZeRO config; switches Accelerate to DeepSpeed
fsdp	list[string]	[full_shard, auto_wrap]	Enable FSDP wrapping (alternative to DeepSpeed)
fsdp_config	object	{fsdp_offload_params: false}	FSDP detailed options
rl	string	dpo	Switch to preference training (dpo, orpo, kto, ipo, cpo, simpo, grpo)
rl_beta	float	0.1	DPO/ORPO/KTO regularisation strength
output_dir	string	./outputs/llama3-qlora	Where adapters and checkpoints land
save_steps	int	200	Checkpoint cadence
save_total_limit	int	3	Keep only N most recent checkpoints
logging_steps	int	10	Loss logging cadence
eval_steps	int	200	Validation cadence
wandb_project	string	yobitel-finetune	W&B project (optional)
mlflow_tracking_uri	string	https://mlflow.example.com	MLflow tracking server (optional)

# llama-3-8b-qlora.yml — production-ready single-H100 fine-tune
base_model: meta-llama/Meta-Llama-3.1-8B
strict: false

# QLoRA: 4-bit NF4 base + BF16 LoRA on top.
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16

adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
peft_use_rslora: true

# Data: Alpaca format is selected by 'type' (not auto-detected).
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
chat_template: llama3
val_set_size: 0.05

# Context + packing.
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

# Optimisation.
micro_batch_size: 2
gradient_accumulation_steps: 8        # effective batch = 16 on 1 GPU
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
warmup_ratio: 0.03

# Precision + kernels.
bf16: true
tf32: true
flash_attention: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Logging + checkpointing.
output_dir: ./outputs/llama3-qlora
save_steps: 200
save_total_limit: 3
logging_steps: 10
eval_steps: 200
wandb_project: yobitel-finetune
wandb_run_name: llama3-8b-alpaca-r32
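
To get a feel for what the adapter and batch settings above imply, the arithmetic can be sketched as follows. This is a rough estimate, assuming the published Llama 3.1 8B shapes (hidden size 4096, MLP intermediate size 14336, 32 layers, 1024-wide k/v projections); lora_param_count and effective_batch are hypothetical helper names, not Axolotl functions.

```python
# Assumed Llama 3.1 8B projection shapes as (d_in, d_out).
LLAMA_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_param_count(rank, shapes, num_layers):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per wrapped linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def effective_batch(micro_batch_size, grad_accum_steps, world_size=1):
    """Effective batch = micro * grad_accum * world_size."""
    return micro_batch_size * grad_accum_steps * world_size


if __name__ == "__main__":
    n = lora_param_count(rank=32, shapes=LLAMA_8B_SHAPES, num_layers=32)
    print(f"trainable LoRA params: {n:,}")
    print(f"adapter size at 4 bytes/param: {n * 4 / 1e6:.0f} MB")
    print(f"effective batch on 1 GPU: {effective_batch(2, 8)}")
```

Stored at 4 bytes per parameter, the result lands close to the adapter size quoted in the quick start; an adapter saved in BF16 would be about half that.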

Workload patterns: what people actually run on Axolotl

Pattern 1: Single-GPU QLoRA on a 7-13B base. Most common workload. adapter: qlora, load_in_4bit: true, lora_r: 16-32, sample packing on, sequence_len 4096-8192, single H100 or A100 80 GB (or RTX 4090 24 GB for 7B). One epoch over 10k examples completes in roughly 45-90 minutes, and run time scales about linearly with dataset size (so 50k examples is closer to 4-8 hours). Used for: instruction tuning, domain specialisation, persona / style fine-tunes, RAG quality lifts. Yobibyte FineTune's default profile for sub-30B bases.
Pattern 2: Multi-GPU QLoRA / full FT on a 30-70B base. adapter: qlora (or empty for full FT), deepspeed: deepspeed_configs/zero3.json or fsdp: [full_shard, auto_wrap], 8x H100 80 GB or 4x H200 141 GB, micro_batch_size 1-2 with grad_accum 4-8. Full epoch on a 100k example dataset takes 4-12 hours. Used for: serious open-model release fine-tunes (the Dolphin, OpenHermes and Nous Hermes lines, etc.), high-quality instruction datasets at scale.
Pattern 3: DPO / ORPO / GRPO preference training stacked on an SFT artefact. rl: dpo (or orpo, kto, grpo), base = the already-SFT'd model, preference dataset of (prompt, chosen, rejected) triples (KTO takes unpaired good/bad labels, and GRPO instead scores sampled completions with reward functions), lower LR (5e-7 to 1e-6 for full FT, 5e-6 to 5e-5 for LoRA), 1 epoch, KL beta 0.1. This is the standard final-stage recipe behind many modern open-weights instruction models. Yobibyte FineTune exposes DPO as method: dpo and resolves to the corresponding Axolotl rl: configuration internally.
Pattern 4 (less common): Continued pretraining on a domain corpus. adapter: empty (full FT), raw text dataset (type: completion), sequence_len 8k-32k, learning rate 1e-5 with linear warmup over 5% of steps, with 10-20% original pretraining distribution data mixed in to prevent catastrophic forgetting. Used for: legal, medical, code, multilingual domain specialisation where SFT alone is insufficient.
Pattern 5: Multi-modal SFT (text + image). processor_type: auto, vision-language base (LLaVA, Qwen2-VL, Pixtral), chat_template: chatml or model-specific, dataset of conversations with image references. sequence_len is typically 8k+. Collator handling needs more care; Axolotl 0.8+ ships built-in support for the major VL families.

Note: The boundary between Patterns 1-2 is set by base model size and your GPU budget. The boundary between Patterns 2 and 3 is whether you have preference data. If your dataset has only (prompt, response) pairs, Pattern 1 / 2 is the only option. If it has (prompt, chosen, rejected) triples or scalar preference scores, Pattern 3 layered on top of a Pattern-1 SFT artefact usually outperforms SFT alone.
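
For readers new to Pattern 3, the role of beta is easiest to see in the DPO objective itself. The sketch below computes the standard DPO loss for a single (prompt, chosen, rejected) triple from summed log-probabilities; it is a plain-Python illustration, not the TRL implementation, and the log-probability values are made up.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss starts at log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability towards the chosen response: loss falls.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A larger beta penalises drift from the reference more strongly per unit of log-probability margin, which is one reason a beta of 1.0 can stall training where 0.1 does not.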

Sizing and capacity planning

Base size	Method	Seq len	Working VRAM	GPU class	Time / 10k examples
7B (Mistral 7B)	QLoRA r=16	4k	12-15 GB	RTX 4090 24 GB	~45 min
8B (Llama 3.1 8B)	QLoRA r=32	4k	14-18 GB	RTX 4090 / L40S	~60 min
8B (Llama 3.1 8B)	BF16 LoRA r=32	4k	28-34 GB	A100 40 GB / H100 80 GB	~40 min
13-14B (e.g. Qwen 14B)	QLoRA r=32	4k	18-24 GB	A100 40 GB / H100 80 GB	~90 min
34B (Yi 34B)	QLoRA r=32	4k	30-38 GB	A100 80 GB / H100 80 GB	~3 hr
70B (Llama 3.1 70B)	QLoRA r=32	4k	55-70 GB	H100 80 GB / H200 141 GB	~6-8 hr
70B (Llama 3.1 70B)	QLoRA r=32	16k	75-95 GB	H200 141 GB	~12-16 hr
70B (Llama 3.1 70B)	Full FT (ZeRO-3)	4k	~500 GB total	8x H100 80 GB	~10-14 hr
141B (Mixtral 8x22B)	QLoRA r=32	4k	~95 GB	H200 141 GB / 2x H100	~10-14 hr
405B (Llama 3.1 405B)	FSDP-QLoRA r=32	4k	~250 GB total	4x H100 / 2x H200	~20-30 hr
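
As a sanity check on the table, the memory taken by the base weights alone follows directly from parameter count and bit width. The sketch below is a lower bound under that simple assumption: it ignores quantisation constants, LoRA parameters, optimiser state and activations, which is why the working VRAM figures above are higher. weight_memory_gb is a hypothetical helper name.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    for size in (8, 70, 405):
        nf4 = weight_memory_gb(size, 4)
        bf16 = weight_memory_gb(size, 16)
        print(f"{size}B base: ~{nf4:.0f} GB at 4-bit, ~{bf16:.0f} GB at BF16")
```

This is why a 70B QLoRA run can fit on a single 80 GB card while a BF16 full fine-tune of the same base has to be sharded across a node.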

Limits and quotas

Limit	Practical ceiling (2026)	Notes
Max sequence length	131,072 tokens (Llama 3.1) / 1M+ (Qwen2.5)	Activation memory grows linearly even with FlashAttention; reduce micro_batch_size accordingly
Max base model size	405B params (FSDP-QLoRA)	Above 405B, multi-node ZeRO-3 with offload is the only path
Max dataset size	Unlimited (streaming)	Set `streaming: true` per dataset entry for >1 TB corpora
Max world size	256+ GPUs (tested on Slurm + DeepSpeed)	Communication overhead scales; ZeRO-3 with FlashAttention recommended
Max LoRA rank	1024+	Quality plateaus well before this; r=64-128 is the practical upper bound
Max checkpoint frequency	Every optimiser step	I/O bound; save_steps >= 200 is the sane default
Max batch size (effective)	8192+	Limited by gradient accumulation precision; 256-1024 is typical for SFT
Max tokenisers per run	1	One base = one tokeniser; multi-tokeniser distillation requires a separate harness

Warning: The most common 'limit hit' in practice is OOM caused by sample_packing concatenating to exactly sequence_len when activation memory is tight. If you see OOM mid-epoch, drop sequence_len by 25%, drop micro_batch_size to 1, or both — sample packing keeps throughput high even at micro_batch_size=1.

Observability: loss curves, eval metrics and run hygiene

train/loss — the live training loss. Should fall steadily, level off around epoch 2-3. A flat curve from step 0 means the model is not learning (check lora_target_modules, learning_rate, dataset format).
train/learning_rate — the LR schedule. Confirms warmup completed and cosine decay engaged. Useful sanity check after editing scheduler config.
train/grad_norm — gradient norm before clipping. Healthy SFT: 0.3-2.0. Spikes to 10+ usually mean LR too high or a poisoned batch.
eval/loss — held-out validation loss. Should track training loss until late in training, then diverge slightly. Large divergence early = overfitting (epochs too high or data too small).
eval/perplexity — exp(eval_loss); easier to compare across runs. Drop relative to base is the headline quality signal for SFT.
train/global_step, train/epoch — progress markers. Useful for ETA calculations.
GPU utilisation — nvidia-smi dmon should show >95% SM utilisation in steady state; <80% means the data loader is the bottleneck (increase dataloader_num_workers, enable dataloader_pin_memory).
Memory pressure — keep peak VRAM ~10 GB below GPU capacity to allow for activation spikes; check train/runtime/max_memory if logged.
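
The perplexity metric above is a one-line transformation of eval loss. A small sketch, assuming eval_loss is the mean per-token cross-entropy in nats; the loss values in the example are hypothetical and the helper names are for illustration only.

```python
import math


def perplexity(eval_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(eval_loss)


def relative_ppl_drop(base_loss, tuned_loss):
    """Fractional perplexity reduction of the fine-tune relative to the base."""
    base_ppl = perplexity(base_loss)
    return (base_ppl - perplexity(tuned_loss)) / base_ppl


if __name__ == "__main__":
    base_loss, tuned_loss = 1.60, 1.25  # hypothetical eval/loss values
    print(f"base perplexity:  {perplexity(base_loss):.2f}")
    print(f"tuned perplexity: {perplexity(tuned_loss):.2f}")
    print(f"relative drop:    {relative_ppl_drop(base_loss, tuned_loss):.1%}")
```

Compare perplexities only between runs that share a tokeniser and eval split, since both change what a "token" of loss means.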

# Observability section of a production Axolotl config.
wandb_project: yobitel-finetune
wandb_entity: yobitel-ml
wandb_run_name: llama3-8b-alpaca-r32-v3
wandb_log_model: end

# OR MLflow (uncomment to use instead of W&B):
# mlflow_tracking_uri: https://mlflow.yobitel.internal
# mlflow_experiment_name: llama3-finetune

# OR TensorBoard (local only):
# use_tensorboard: true

# Eval cadence and logging.
logging_steps: 10
eval_steps: 200
save_steps: 200
save_total_limit: 3

# Periodic generation samples — qualitative check during training.
do_eval: true
eval_strategy: steps
eval_table_size: 5         # log 5 generated completions per eval step

Cost and FinOps

Workload	GPUs	Wall time	GPU-hours	Cost (NeoCloud)
8B QLoRA r=32, 10k examples, 1 epoch	1x H100	~1 hr	1	~$2.60
8B QLoRA r=32, 50k examples, 3 epochs	1x H100	~15 hr	15	~$39
13B QLoRA r=32, 50k examples, 3 epochs	1x H100	~22 hr	22	~$57
70B QLoRA r=32, 50k examples, 1 epoch	1x H200	~30 hr	30	~$96
70B QLoRA r=32, 100k examples, 3 epochs	2x H200	~45 hr	90	~$288
70B full FT (ZeRO-3), 100k examples, 1 epoch	8x H100	~12 hr	96	~$250
405B FSDP-QLoRA r=32, 50k examples	4x H100	~25 hr	100	~$260
8B DPO (after SFT), 20k preference pairs	1x H100	~2 hr	2	~$5

Tip: QLoRA wins on cost by 3-5x vs full FT, with comparable quality in many standard instruction-tuning setups. Start at QLoRA r=32, evaluate, and only escalate to full FT if your evaluation shows a measurable gap. Yobibyte FineTune defaults to QLoRA for this reason.

Security and compliance

Model and dataset provenance: base_model: accepts a local path as well as a Hugging Face ID, and datasets: additionally accepts S3 / GCS / Azure URIs, so sensitive bases and datasets can stay inside your control plane rather than being pulled from the public Hub.
Token handling: hf_use_auth_token: true consumes the HF_TOKEN env var; rotate and scope tokens with read-only access to private repos.
Trust-remote-code: trust_remote_code: false is the safe default; some VL and exotic architectures require true, which executes arbitrary code from the model repo — audit before enabling on production capacity.
Output isolation: output_dir: should live on encrypted storage (LUKS, AWS EBS KMS, etc.); adapters can contain learned representations of training data and should be treated as sensitive artefacts.
Audit logging: when running on Yobitel NeoCloud, every Axolotl invocation is captured through standards-based observability (OpenTelemetry, Prometheus) and shipped to the customer's tenancy audit log — sufficient evidence for SOC 2 CC6 / ISO 27001 A.12.4.
Air-gapped operation: Axolotl runs offline with HF_HUB_OFFLINE=1 once base, tokeniser and dataset are pre-staged; required for OFFICIAL-SENSITIVE workloads where outbound network access is disallowed.
Reproducibility: the YAML config + dataset hash + base model SHA together form the reproducibility manifest. Pin every dependency in pip install (e.g. axolotl==0.8.0 together with the exact transformers and peft versions that release resolves to) and commit the lockfile so runs can be reproduced to an evidence-grade standard.
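
One way to materialise the reproducibility manifest is to hash the inputs and store the result next to the adapter. This is a minimal sketch using only the standard library; the manifest layout and the names sha256_file and build_manifest are assumptions made for illustration, not an Axolotl feature.

```python
import hashlib
import json


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(config_path, dataset_path, base_model_revision):
    """Collect the three identifiers that pin a fine-tune run."""
    return {
        "config_sha256": sha256_file(config_path),
        "dataset_sha256": sha256_file(dataset_path),
        "base_model_revision": base_model_revision,  # e.g. a Hub commit SHA
    }


def write_manifest(manifest, out_path):
    """Write the manifest as stable, diff-friendly JSON."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
        fh.write("\n")
```

Calling build_manifest with the config path, a local dataset file and the base model's commit SHA, then write_manifest into output_dir, gives an auditor one file to compare across runs.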

Migration and alternatives

Tool	Strength	Weakness	When to pick
Axolotl	Most flexible, ships every new TRL technique fast, YAML-versioned, multi-node native	Steeper config than Unsloth	Production teams shipping many fine-tunes; multi-GPU / multi-node; preference training
Unsloth	2x throughput + 50-70% less VRAM on single GPU via Triton kernels	Single-GPU only in OSS; model-architecture-specific	Solo researcher on one GPU; supported model family; throughput-bound
LLaMA-Factory	Gradio web UI, 100+ pre-baked model templates, broad family coverage	More opinionated; harder to customise deeply	Teams that want UI-driven workflow; rapid model-zoo exploration
Hand-written TRL	Total control; no YAML schema lock-in	You re-implement everything Axolotl validates	Research code experimenting with novel recipes; one-off custom losses
Yobibyte FineTune (managed)	API-only; runs Axolotl on Yobitel-managed H100/H200; multi-LoRA serving included	Less granular than self-hosted Axolotl	Teams that want fine-tuning as a service, not infrastructure to operate

Note: Axolotl + Unsloth compose: setting unsloth_lora_mlp: true, unsloth_lora_qkv: true, unsloth_lora_o: true in an Axolotl config wires Unsloth's kernels into the Axolotl path for supported architectures, giving you most of Unsloth's single-GPU speed-up with Axolotl's flexibility. On supported models this is among the highest-throughput single-GPU recipes in 2026.

Troubleshooting

Failure modes that bite real Axolotl users and the fixes that resolve them.

Symptom	Most likely cause	Fix
Pydantic validation error on `axolotl train`	Missing required field or invalid combination	Read the error — Pydantic names the field. Most common: `adapter: qlora` without `load_in_4bit: true`
Loss is NaN from step 1	FP16 with LoRA (use BF16) or wrong chat template	Set `bf16: true`, `fp16: false`; confirm `chat_template:` matches base model
Loss decreases but eval quality is terrible	Loss masking off — `train_on_inputs: true`	Set `train_on_inputs: false` (the default; check if overridden)
GPU at 60% utilisation, slow training	Data loader is the bottleneck	Increase `dataloader_num_workers` (4-8) and `dataloader_pin_memory: true`
OOM mid-epoch despite stable start	Sample packing hit a max-length batch	Drop `micro_batch_size` to 1 or `sequence_len` by 25%
`RuntimeError: CUDA error: device-side assert`	Tokeniser produced an out-of-vocab ID, usually after adding special tokens	Set `lora_modules_to_save: [embed_tokens, lm_head]` so the resized embeddings are trained and saved
DeepSpeed ZeRO-3 hangs at start	Mismatched CUDA / NCCL versions across workers	Re-install with consistent CUDA 12.4+; check `NCCL_DEBUG=INFO` output
FSDP wraps but does not save adapter correctly	FSDP state-dict type	Set `fsdp_state_dict_type: FULL_STATE_DICT` in `fsdp_config`
DPO loss does not decrease	Reference model is wrong or beta too high	Confirm `rl_beta: 0.1` (not 1.0), reference defaults to frozen base
Merged model produces garbage	Merged QLoRA adapter directly into 4-bit base	Use `axolotl merge-lora` with `--save_safetensors` — dequantises base first
Sample packing reduces throughput instead of increasing it	`eval_sample_packing: true` triggers re-packing on every eval	Set `eval_sample_packing: false`
Wandb logs but training loss never appears	`report_to: []` or wandb credentials missing	Set `report_to: [wandb]` and `WANDB_API_KEY` env var
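
Several rows in the table can be caught before a run starts. The sketch below is a hypothetical pre-flight check over a config that has already been loaded into a dict (for example with a YAML parser); it only encodes pitfalls named in the table and is no substitute for Axolotl's own schema validation.

```python
def preflight(config):
    """Return human-readable warnings for common config pitfalls."""
    warnings = []
    adapter = config.get("adapter")
    if adapter == "qlora" and not config.get("load_in_4bit"):
        warnings.append("adapter: qlora is set without load_in_4bit: true")
    if config.get("fp16") and adapter in ("lora", "qlora"):
        warnings.append("fp16 with LoRA risks NaN loss; prefer bf16: true")
    if config.get("train_on_inputs"):
        warnings.append("train_on_inputs: true puts loss on prompt tokens")
    if config.get("eval_sample_packing"):
        warnings.append("eval_sample_packing: true skews per-example eval loss")
    if config.get("rl") == "dpo" and config.get("rl_beta", 0.1) >= 1.0:
        warnings.append("rl_beta looks high for DPO; 0.1 is the usual start")
    return warnings


if __name__ == "__main__":
    risky = {"adapter": "qlora", "fp16": True, "train_on_inputs": True}
    for message in preflight(risky):
        print("WARN:", message)
```

Running it in CI against every committed YAML keeps the cheap mistakes out of the GPU queue.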
