Professional Services · Training + Fine-tuning
Training that lands on a working checkpoint
From LoRA fine-tunes to large-scale GPU pre-training runs. Distributed training engineering across PyTorch FSDP, DeepSpeed, NeMo, Megatron, TRL, and Axolotl. Data prepared properly, recipes picked deliberately, and runs that recover from node loss without restarting from zero. Reproducible from git SHA + seed.
Representative run
StableLlama-3 8B · SFT on customer data · 64×H100
Eval loss
step 0 → 50
Final loss
0.62
MFU
48%
Throughput
6.4k tok/s
No diverged steps. Checkpoints every 500 steps to parallel FS. Run-record fully reproducible from git SHA + seed.
The stages of work
From dataset to signed-off checkpoint
Every training engagement runs through the same stages. The ratio shifts with the recipe; the discipline doesn't.
Data preparation
Dataset assembly, deduplication, tokenisation, train/val split, contamination check. The unglamorous step that most regressions trace back to.
Exit
Dataset card · tokenised shards · contamination report
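As an illustration of the contamination check, here is a minimal sketch that flags training documents sharing any word-level n-gram with the eval set. The function names, the whitespace tokenisation, and the default n-gram length are assumptions for illustration only; a production check would normally run on the real tokeniser's output and at much larger scale.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text` (empty if text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_docs: list, eval_docs: list, n: int = 8) -> dict:
    """Flag every training document that shares at least one n-gram with any eval document."""
    eval_grams = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc, n)

    flagged = [
        idx for idx, doc in enumerate(train_docs)
        if ngrams(doc, n) & eval_grams
    ]
    return {"ngram_size": n, "train_docs": len(train_docs), "flagged": flagged}


if __name__ == "__main__":
    # Hypothetical toy data: the first training document overlaps the eval set.
    train = [
        "the quick brown fox jumps over the lazy dog today",
        "completely unrelated text about gpu clusters",
    ]
    held_out = ["Quick brown fox jumps over the lazy dog"]
    print(contamination_report(train, held_out, n=5))
```

The flagged indices would feed the contamination report listed among the exit artefacts; whether flagged documents are dropped or only logged is a per-engagement decision.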
Recipe selection
LoRA vs full fine-tune vs continued pre-training vs alignment. Picked against your data volume, compute budget, and eval target.
Exit
Recipe doc · ablation plan · accuracy/cost projection
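One input to the LoRA-versus-full-fine-tune decision is how many parameters each option actually trains. LoRA replaces the update to a d_out × d_in weight matrix with two low-rank factors of shapes d_out × r and r × d_in, so it trains r · (d_in + d_out) parameters for that matrix. The sketch below works through that arithmetic; the layer shape and rank are hypothetical example values, not figures from any engagement.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters trained when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters trained by a rank-`rank` LoRA adapter on the same matrix."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA trainable parameters as a fraction of the full matrix."""
    return lora_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection adapted at rank 16.
    print(full_params(4096, 4096))
    print(lora_params(4096, 4096, 16))
    print(trainable_fraction(4096, 4096, 16))
```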
Training engineering
Parallelism strategy (DP / FSDP / TP / PP), optimiser config, learning-rate schedule, checkpoint cadence. Cluster tuned for MFU, not for marketing numbers.
Exit
Run scripts · cluster topology · MFU report
Run + recovery
Live monitoring, divergence handling, hardware-failure recovery, and restarts from the last good checkpoint. Long pre-training runs that recover from a node failure without restarting from zero.
Exit
Telemetry dashboards · incident playbook · recovery runbook
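Recovering without restarting from zero depends on finding the newest checkpoint that finished writing before the failure. The sketch below assumes a hypothetical layout of `step_<N>` directories, each containing a `COMPLETE` marker file written last; both the naming scheme and the marker are illustrative assumptions, not a description of any particular framework's checkpoint format.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical naming scheme: one directory per checkpoint, e.g. "step_1500".
CKPT_PATTERN = re.compile(r"^step_(\d+)$")


def latest_complete_checkpoint(ckpt_root: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-step checkpoint that has a COMPLETE marker.

    Checkpoints without the marker are treated as partially written
    (for example, a node died mid-save) and are ignored.
    """
    best = None
    for entry in ckpt_root.iterdir():
        match = CKPT_PATTERN.match(entry.name)
        if not match or not entry.is_dir():
            continue
        if not (entry / "COMPLETE").exists():
            continue
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, entry)
    return best
```

A launcher script could call this on start-up and resume from the returned step, falling back to a fresh run only when it returns `None`.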
Eval + sign-off
Evaluation against your held-out set plus standard public benchmarks. Accuracy delta vs baseline, regression checks, signed decision record.
Exit
Eval report · model card · sign-off record
Where training runs go wrong
The failures we've already debugged for you
Long runs are a different discipline from notebook experiments. These are the failure shapes that wreck schedules.
Loss spike at step 30k
Naïve response
Restart from scratch
What we ship
Skip-step recovery + LR rewind
Long pre-training runs hit instabilities. The right response is not a restart; it is rewinding the optimiser state, lowering the LR, skipping the offending batch range, and continuing. We script this so on-call handles it without paging the model team.
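The recovery steps above can be sketched as two small functions: one that detects a spike against a rolling median, and one that turns the spike step into a resume point, a batch range to skip, and an LR scale factor. The thresholds, the skip margin, the 0.5 LR scale, and all names are hypothetical illustration values, not a fixed recipe.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class RecoveryPlan:
    resume_step: int     # checkpoint step to reload model + optimiser state from
    skip_batches: range  # batch indices to skip when replaying
    lr_scale: float      # multiplier applied to the learning rate after resume


def detect_spike(losses: list, window: int = 100, ratio: float = 1.5) -> Optional[int]:
    """Return the first step whose loss exceeds `ratio` x the median of the previous `window` steps."""
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > ratio * baseline:
            return step
    return None


def plan_recovery(spike_step: int, ckpt_every: int = 500,
                  skip_margin: int = 50, lr_scale: float = 0.5) -> RecoveryPlan:
    """Rewind to the last checkpoint strictly before the spike and skip batches around it."""
    resume_step = ((spike_step - 1) // ckpt_every) * ckpt_every
    skip_start = max(resume_step, spike_step - skip_margin)
    return RecoveryPlan(
        resume_step=resume_step,
        skip_batches=range(skip_start, spike_step + skip_margin),
        lr_scale=lr_scale,
    )
```

Because the plan is plain data, it can be logged in the incident record and reviewed later, which is part of what lets on-call apply it without the model team.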
Gradient norm exploding
Naïve response
Cap and pray
What we ship
Activation checkpointing + gradient-norm telemetry
When grad-norm climbs, it is telling you something. Capping it hides the symptom; reading it tells you whether to drop LR, scale FSDP, or fix a tokeniser bug.
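A minimal sketch of what reading the grad-norm can look like: compute the global L2 norm from per-parameter norms, then record how often a clipping threshold would have been active instead of clipping silently. The function names and the example threshold are assumptions for illustration; in a real run the per-parameter norms would come from the training framework.

```python
import math


def global_grad_norm(per_param_norms: list) -> float:
    """Global L2 norm: square root of the sum of squared per-parameter gradient norms."""
    return math.sqrt(sum(n * n for n in per_param_norms))


def clip_telemetry(norm_history: list, max_norm: float) -> dict:
    """Summarise how often clipping at `max_norm` would have been active over a window of steps."""
    clipped_steps = sum(1 for n in norm_history if n > max_norm)
    return {
        "steps": len(norm_history),
        "clip_fraction": clipped_steps / len(norm_history),
        "max_pre_clip_norm": max(norm_history),
    }


if __name__ == "__main__":
    # Hypothetical pre-clip norms logged over four steps, with a threshold of 1.0.
    print(clip_telemetry([0.5, 0.8, 1.5, 3.0], max_norm=1.0))
```

A clip fraction that keeps rising is the kind of signal a bare cap would hide.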
MFU stuck below 40%
Naïve response
Buy more GPUs
What we ship
Parallelism re-tile + comms overlap
Most fleets sit at sub-40% MFU because nobody ever re-tiled the parallelism for the model and cluster they actually have. The right tile, comms overlap, and activation-recompute settings typically lift MFU into the 50–60% band without buying anything.
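For reference, MFU can be estimated from measured throughput. The sketch below assumes the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass (the attention term is ignored) and a user-supplied peak FLOP/s figure per GPU. All numbers in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilisation: achieved model FLOP/s divided by the cluster's peak FLOP/s.

    Uses the 6 * N FLOPs-per-token approximation (an assumption; it ignores
    attention FLOPs and any activation recomputation).
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical: 7B parameters, 200k tokens/s across 32 GPUs,
    # each with an assumed peak of 1e15 FLOP/s.
    print(mfu(n_params=7e9, tokens_per_sec=200_000, n_gpus=32, peak_flops_per_gpu=1e15))
```

Tracking this figure before and after a parallelism change shows whether a re-tile actually helped.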
Eval regression on held-out
Naïve response
Ship anyway, fix in v2
What we ship
Run-bisect against eval suite
When a fine-tuned model regresses on the customer eval, the answer is not to ship anyway. We bisect the training run against the eval suite to isolate which data segment or optimiser step introduced the regression.
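Bisecting a run can be sketched as a binary search over saved checkpoint steps. The sketch assumes the regression is monotonic (once a checkpoint fails the eval, every later one does too) and takes a hypothetical `passes_eval` callable that stands in for running the eval suite against one checkpoint.

```python
from typing import Callable, Optional, Sequence


def first_regressed_checkpoint(steps: Sequence[int],
                               passes_eval: Callable[[int], bool]) -> Optional[int]:
    """Return the earliest checkpoint step that fails the eval, or None if the last one passes.

    `steps` must be sorted ascending. Assumes failures are monotonic in step.
    """
    if not steps or passes_eval(steps[-1]):
        return None
    if not passes_eval(steps[0]):
        return steps[0]

    lo, hi = 0, len(steps) - 1  # invariant: steps[lo] passes, steps[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_eval(steps[mid]):
            lo = mid
        else:
            hi = mid
    return steps[hi]
```

With checkpoints every 500 steps, the data consumed between `steps[hi - 1]` and `steps[hi]` is the segment to inspect first. If the regression is not monotonic, a linear sweep over the checkpoints is the safer fallback.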
Frameworks we drive
We pick the framework that fits the recipe
No religion. The right framework is the one that gets your checkpoint to your eval set fastest, with reproducibility on the way.
PyTorch FSDP / DeepSpeed
Default for full fine-tunes and most pre-training runs. Mature ecosystem.
NVIDIA NeMo + Megatron
Megatron-LM core for large parallelism setups. NeMo for the framework around it.
Hugging Face TRL + Accelerate
SFT + alignment (DPO, KTO, PPO) with the smallest amount of glue code.
Axolotl + Unsloth
LoRA / QLoRA + sane defaults for small / mid teams. Fast onramp.
JAX / Flax (on GPU or TPU)
JIT-compiled training when you need the throughput edge and the team is comfortable with JAX.
Custom in-house
We work inside your existing training framework when there is one and it makes sense.
Your handover pack
What lands at sign-off
Concrete, version-controlled artefacts that make the run investable, not a one-shot. Your team can pick up the next training cycle without us.
Dataset + tokeniser package
Tokenised shards, dataset card, contamination report. The training data your model actually saw, with a paper trail.
Training as-code repo
Run scripts, parallelism config, optimiser + LR schedule, checkpoint policy. Reproducible from a git SHA + seed.
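As a sketch of what reproducibility from a git SHA and seed can mean in practice: a run record that pins the code revision, the seed, and a hash of the config, plus a data order derived only from the seed. The field names and helpers are hypothetical; the SHA is passed in as a string (for example, the output of `git rev-parse HEAD`) so the sketch does not depend on being run inside a repository.

```python
import hashlib
import json
import random


def run_record(git_sha: str, seed: int, config: dict) -> dict:
    """Build a record that identifies a run by code revision, seed, and config hash."""
    # sort_keys makes the hash independent of dict insertion order.
    config_blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "git_sha": git_sha,
        "seed": seed,
        "config_sha256": hashlib.sha256(config_blob).hexdigest(),
    }


def shuffled_order(n_samples: int, seed: int) -> list:
    """Return a sample order that depends only on `seed`, using a private RNG instance."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    return order
```

Two runs with identical records should see identical data order. Bit-identical results on GPU also depend on deterministic kernels and a fixed cluster topology, which this sketch does not cover.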
Run-record + telemetry pack
Full run history, loss curves, MFU, grad-norm, throughput. Importable into your existing W&B / MLflow / Aim.
Incident + recovery playbook
What on-call does when a node dies, loss spikes, or eval regresses. Tested on the day, not invented during the next incident.
Final checkpoint + model card
The signed-off model with its evaluation results, intended use, known limitations, and the licence position.
Continuation plan
What changes if you want to re-run on more data, alignment-tune, or swap base model. So the work is investable, not one-shot.
How we engage
Pick the shape that fits your team
Yobitel-led
We own the training stack end-to-end
Data prep through final-checkpoint sign-off, plus optional managed-ops handover for the running fleet. Best for teams without a dedicated training-engineering function.
Collaborative
We engineer with your team
Paired work on the tricky surfaces: parallelism tile, optimiser tuning, eval design, divergence recovery. Your team owns the run; we sign off on the design and join the long-watch.
Advisory
Time-boxed review
Fixed-window engagement to review a training plan you've already drafted. We spot risk, suggest focused changes, deliver a written report.
Related
Inference engineering
The checkpoint you trained still has to serve. We engineer the serving stack to your latency and cost-per-token targets.
Related
Network fabrics for AI clusters
The east-west fabric that decides whether your training cluster scales linearly or goes comms-bound and stalls MFU at 256 GPUs.
Tell us what you want trained.
A short questionnaire covers workload, scale, and engagement model. Our training practice lead replies within one working day with a fitted recipe, a parallelism plan, and a checkpoint timeline.
Same engineering bench that designs the fabric, the platform, and the inference cluster around the model. Engagements scoped to any sovereignty perimeter. Optional 24/7 day-2 handover. Reproducible from git SHA + seed.