Annotation Practice · Synthetic Data Generation
Synthetic data your training run can defend
LLM-augmented prompt and response synthesis, diffusion-conditioned imagery, and the human-in-the-loop quality gates that decide what reaches the training set. Contamination checks, MinHash dedup, toxicity scoring, factuality verification, and signed lineage on every row.
Representative run
QA-gated · Product-support intents · text · en-GB + en-US
Seed · 10k items · 200 product-support intents · taxonomy v3 · human-authored
Generate · 500k rows · 50× factor · Self-Instruct + Magpie expansion + Evol-Instruct difficulty steps
Quality gates
Contamination · n-gram + embed · 94% pass
Toxicity · Detoxify < 0.05 · 98% pass
Human review · 5% sample · 89% pass
Accepted to dataset
412k rows
Seeds 10k · Generated 500k · Accept rate 82%
Dedup by MinHash similarity > 0.85. Provenance signed per row.
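As a rough illustration of the MinHash dedup rule above, here is a minimal pure-Python sketch. It assumes word 3-gram shingles and 128 hash functions, both illustrative choices rather than the production settings, and it compares signatures pairwise instead of using an LSH index. The function names are hypothetical.

```python
import hashlib
import re

NUM_PERM = 128  # number of hash functions; an illustrative choice


def shingles(text: str, k: int = 3) -> set[str]:
    """Lower-cased word k-grams used as the set representation of a row."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, token: str) -> int:
    # A salted 64-bit BLAKE2b digest stands in for one random permutation.
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list[int]:
    """One minimum per hash function over the row's shingle set."""
    row_shingles = shingles(text)
    return [min(_hash(seed, s) for s in row_shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(candidate_sig, accepted_sigs, threshold: float = 0.85) -> bool:
    """True when the candidate exceeds the threshold against any accepted row."""
    return any(estimated_jaccard(candidate_sig, s) > threshold for s in accepted_sigs)
```

The similarity a pair scores depends heavily on shingle size and tokenisation, so the 0.85 threshold only means something alongside those settings.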
Real synthesis funnel
What a real synthesis run does to your candidates
A representative run: 10k seeds expand to 500k candidates, and four QA gates prune them to 412k accepted rows. Below: a per-row audit showing which gate accepted what, and the exact reason each rejected row failed.
Run: syn-2026-q2-csat-v3
seed → generate → 4 gates → ship · accept-rate 82%
Seed corpus · 10k · customer-grounded topics
Self-Instruct + Magpie · 500k · 50× expansion
Accepted to dataset · 412k · signed provenance per row
Dedup (MinHash > 0.85) · 93.6% pass · −32k dropped · 468k kept
Toxicity (Detoxify < 0.05) · 98.0% pass · −10k dropped · 459k kept
Contamination (vs eval) · 96.4% pass · −16k dropped · 442k kept
Human review (5% sample) · 93.2% pass · −30k dropped · 412k kept
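The funnel arithmetic can be checked with a short sketch. It assumes each pass rate applies to the rows that survived the previous gate; the displayed counts are rounded to the nearest thousand, so individual dropped figures can differ from the exact products by about 1k.

```python
CANDIDATES = 500_000

# Gate names and pass rates as shown in the funnel above.
GATES = [
    ("Dedup (MinHash > 0.85)", 0.936),
    ("Toxicity (Detoxify < 0.05)", 0.980),
    ("Contamination (vs eval)", 0.964),
    ("Human review (5% sample)", 0.932),
]


def run_funnel(candidates: int, gates):
    """Apply each pass rate to the survivors of the previous gate."""
    rows = []
    kept = float(candidates)
    for name, pass_rate in gates:
        survivors = kept * pass_rate
        rows.append((name, kept - survivors, survivors))
        kept = survivors
    return rows


funnel = run_funnel(CANDIDATES, GATES)
for name, dropped, kept in funnel:
    print(f"{name}: dropped {dropped / 1000:.1f}k, kept {kept / 1000:.1f}k")

accept_rate = funnel[-1][2] / CANDIDATES
print(f"accept rate: {accept_rate:.1%}")
```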
Per-row gate audit · 5 sample candidates
full audit log per accepted row, signed provenance
I see the duplicate charge on the 14th. The original invoice was paid; the second attempt was an automatic retry that should not have run. Refunding now; you'll see it within 3 working days.
I see the duplicate charge on the 14th. The original invoice was paid; the second attempt was an automatic retry that should not have run. We will refund this and the credit will appear within 3 working days.
Reject reason · MinHash 0.91 vs syn-018221 — near-duplicate seed expansion
Line 14: `while left <= right` should be `while left < right` if you intend a half-open interval, otherwise the recursive call on line 17 will overshoot when the array has one element.
What an absolute mess. You should expect compensation for this kind of incompetence. We'll write off the entire month and you can demand more.
Reject reason · Detoxify score 0.21 — combative tone in agent voice
Common side effects of amiodarone include thyroid changes, photosensitivity, and pulmonary toxicity. Routine TFT and PFT monitoring is standard. Patients should report new cough or breathlessness.
Reject reason · 12-gram overlap with held-out clinical eval set (BMJ Best Practice)
Every accepted row carries a lineage record: seed_id, generator_id, gate scores, human_reviewer_id_hashed. Rejected rows kept for audit but never reach the training split.
Sample rows written for illustration; counters reflect the gate-pass-rate band a real Magpie + Self-Instruct run typically lands in.
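One way to picture the signed lineage record is the sketch below. It uses the field names listed above, but the signing scheme is an assumption: HMAC-SHA256 over canonical JSON with a shared demo key. A production setup might well use asymmetric signatures and managed keys instead, and all values here are made up for illustration.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical key for illustration


def canonical_bytes(record: dict) -> bytes:
    """Stable serialisation so the same record always signs identically."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_lineage(record: dict, key: bytes = SIGNING_KEY) -> dict:
    """Return a copy of the record with an HMAC-SHA256 signature attached."""
    signature = hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_lineage(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the signature over every field except the signature itself."""
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, canonical_bytes(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


record = {
    "seed_id": "seed-00042",          # illustrative values throughout
    "generator_id": "gen-v3",
    "gate_scores": {"minhash_max_sim": 0.41, "toxicity": 0.01},
    "human_reviewer_id_hashed": hashlib.sha256(b"reviewer-17").hexdigest(),
}
signed = sign_lineage(record)
```

Any change to a signed field, such as a gate score edited after the fact, makes `verify_lineage` return False.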
The shapes of synthesis
Different corpora collapse differently
Synthesis is not one technique. Instruction data, code data, imagery, dialogue, and long-tail slices each have a method that works and a way they fail. We run each one as its own pipeline.
Instruction-pair synthesis
Self-Instruct expansion off a seed taxonomy, plus Magpie-style template-primed synthesis for scale. Useful when the SFT corpus is thin and the model needs broad task coverage before it can be evaluated honestly.
Self-Instruct · Magpie · Evol-Instruct difficulty steps
Code-pair synthesis
OSS-Instruct methodology against a real-code seed pool. Function bodies, paired tests, docstring rewrites, refactor pairs. The mix that supervised code fine-tunes actually need.
OSS-Instruct · WizardCoder-style · paired tests
Diffusion-conditioned imagery
Stable Diffusion XL or FLUX.1 with ControlNet conditioning on pose, depth, edge (Canny), and layout primitives, or Stable Diffusion 3.5 Large with the supported Canny / depth / blur conditioners. Used to balance under-represented slices in vision datasets without scraping more raw web data.
SDXL · FLUX.1 · SD3.5 Large · ControlNet conditioning
Multi-turn dialogue augmentation
Conversation trees grown from a single-turn seed, with persona variation and follow-up branching. Most useful where supervised dialogue or agent-trace data is expensive to author by hand.
Tree expansion · persona variation · branch sampling
Long-tail slice generation
Targeted synthesis against the slices your evaluation says you fail on. Generate to the under-represented intent, locale, or visual condition, not to the bulk of the distribution you already cover.
Slice-targeted · failure-mode-driven · eval-aligned
Adversarial prompt synthesis
Jailbreak attempts, prompt-injection patterns, and contrastive hard-negative pairs. The data your safety post-training and your red-team release gate both consume.
Jailbreaks · injections · hard negatives
What we engineer around
The failure modes that silently ship
Every synthetic-data programme we audit hits some subset of these. The recipe runs, the row counts look healthy, and a downstream eval exposes the cracks. Knowing they exist is most of the win.
Mode collapse looks like signal
What bad looks like
Generator returns 5,000 paraphrases of the same intent
What we design for
MinHash + embedding-diversity check before acceptance
Self-instruction loops drift toward the easiest part of the distribution. The dataset grows, inter-annotator agreement (IAA) stays high, and the model trains on a corpus that has the variance of a single template. We measure inter-row diversity and reject batches that collapse on it.
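A minimal sketch of a batch-level diversity check follows. It swaps in mean pairwise token-set Jaccard similarity as a simple stand-in for the MinHash and embedding-diversity measures named above, and the 0.5 ceiling is a hypothetical value, not a recommended setting.

```python
import re
from itertools import combinations


def token_set(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(rows: list[str]) -> float:
    """Average Jaccard similarity over all row pairs (quadratic in batch size)."""
    sets = [token_set(r) for r in rows]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def batch_has_collapsed(rows: list[str], max_mean_similarity: float = 0.5) -> bool:
    """Flag a batch whose rows are, on average, too similar to each other."""
    return mean_pairwise_similarity(rows) > max_mean_similarity
```

Because the comparison is quadratic, a large batch would normally be checked on a random sample of rows rather than in full.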
Eval-set contamination is the silent killer
What bad looks like
Generated rows leak the held-out benchmark verbatim
What we design for
n-gram + embedding overlap check against the eval set
A teacher model that was trained on your evaluation set can reproduce it during generation. The benchmark number then says nothing. Every generated row is checked against the named eval sets before it can join the training corpus.
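The n-gram half of that check can be sketched as follows. The 12-token window mirrors the 12-gram overlap cited in the audit example, but the right window size is a per-project assumption. The embedding-similarity half is not shown.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of the lower-cased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts: list[str], n: int = 12) -> set[tuple[str, ...]]:
    """Union of every n-gram that appears in any held-out eval item."""
    index: set[tuple[str, ...]] = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(row: str, eval_index: set[tuple[str, ...]], n: int = 12) -> bool:
    """True when the row shares at least one n-gram with the eval set."""
    return not ngrams(row, n).isdisjoint(eval_index)
```

Rows shorter than the window produce no n-grams and are never flagged by this check, which is one reason to pair it with an embedding-based comparison.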
Distribution drift from the real corpus
What bad looks like
Synthetic batch length distribution looks nothing like the real corpus
What we design for
Real-vs-synth distribution tests on every batch (length, topic, sentiment)
If the synthetic share of the dataset is shaped differently from the real share, the fine-tune learns the difference instead of the task. We score real-vs-synth on a small set of distribution probes and gate batches on the delta.
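As one example of a distribution probe, the sketch below compares word-count distributions with a two-sample Kolmogorov-Smirnov statistic. The choice of statistic and the 0.2 envelope are assumptions for illustration; topic and sentiment probes would need their own measures.

```python
from bisect import bisect_right


def ks_statistic(real: list[float], synth: list[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def length_drift(real_rows: list[str], synth_rows: list[str]) -> float:
    """KS statistic over whitespace-token counts of each corpus."""
    real_lengths = [len(r.split()) for r in real_rows]
    synth_lengths = [len(r.split()) for r in synth_rows]
    return ks_statistic(real_lengths, synth_lengths)


def batch_within_envelope(real_rows, synth_rows, max_ks: float = 0.2) -> bool:
    """Hypothetical gate: accept the batch only if length drift stays small."""
    return length_drift(real_rows, synth_rows) <= max_ks
```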
Hallucinated facts becoming ground truth
What bad looks like
Teacher LLM invents an API that does not exist, dataset ships it
What we design for
Factuality verifier on factual rows, retrieval-grounded rewrite on the rest
Generative teachers hallucinate. For factual or code rows, we verify against a ground source (docs, API spec, retrieval index) and rewrite or reject. The model card declares the verifier coverage so consumers know what is checked and what is not.
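For code rows, a very small version of that verification is to check every call a row makes against the known API surface. The sketch below assumes a hypothetical `client` object and a hand-written set of valid method names standing in for a real API spec; a real verifier would load the spec and parse the code properly rather than use a regular expression.

```python
import re

# Hypothetical API surface, standing in for a real spec or docs index.
KNOWN_API = {"client.refund", "client.get_invoice", "client.list_charges"}

# Matches calls such as client.get_invoice( ... ) in generated code.
CALL_PATTERN = re.compile(r"\b(client\.[A-Za-z_]\w*)\s*\(")


def unknown_calls(row_text: str, known: set[str] = KNOWN_API) -> list[str]:
    """Calls made in the row that do not exist in the known API surface."""
    return sorted(set(CALL_PATTERN.findall(row_text)) - known)


def verify_row(row_text: str) -> dict:
    """Accept the row only if every call it makes is in the spec."""
    missing = unknown_calls(row_text)
    return {
        "accepted": not missing,
        "reject_reason": f"unknown API: {', '.join(missing)}" if missing else None,
    }
```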
Toxicity slippage in expanded outputs
What bad looks like
Toxic edge cases appear in the long-tail expansion only
What we design for
Per-row toxicity scoring with batch-level escalation review
Toxicity often sneaks in at the long tail of self-instruction trees, where the seed never went. Per-row scoring catches the rows; batch-level escalation catches the patterns. Both are needed.
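The two levels can be sketched together. The scorer is passed in as a plain callable, so any classifier (a Detoxify model, for instance) could be wrapped to fit; the row shape, the 0.05 row threshold, and the 10% slice bar are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable


def toxicity_gate(
    rows: list[dict],
    score: Callable[[str], float],
    row_threshold: float = 0.05,
    slice_reject_bar: float = 0.10,
):
    """Drop rows at or above the threshold and list slices needing review.

    Each row is assumed to be a dict with "text" and "slice" keys.
    """
    kept, rejected = [], []
    totals, rejects = defaultdict(int), defaultdict(int)
    for row in rows:
        row_score = score(row["text"])
        totals[row["slice"]] += 1
        if row_score < row_threshold:
            kept.append(row)
        else:
            rejected.append({**row, "toxicity": row_score})
            rejects[row["slice"]] += 1
    # Batch-level escalation: slices whose reject share exceeds the bar.
    escalate = sorted(
        name for name in totals if rejects[name] / totals[name] > slice_reject_bar
    )
    return kept, rejected, escalate
```

Per-row rejection removes the individual rows, while the escalation list points reviewers at the slice or expansion branch that keeps producing them.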
No provenance, no recourse
What bad looks like
Row is bad in production, no trace back to seed or generator version
What we design for
Signed lineage: seed-id, generator-id, gate-results, reviewer-id per row
When a generated row causes a downstream regression, the question is which seed, which generator version, which gate let it through. Without per-row lineage the answer is a re-run. With it, the answer is a fix.
The gates each row clears
Generation is cheap. Acceptance is the work.
A teacher model can produce a million rows in a weekend. The hard, slow part is filtering the rows that should never join the training set. These are the gates we run.
Eval-set contamination check
n-gram overlap and embedding similarity against every named held-out eval set. Rows over threshold are dropped, logged, and counted in the dataset card so the benchmark score remains honest.
Near-duplicate dedup
MinHash + LSH similarity at a configurable threshold (typically 0.85). Removes the paraphrase-cluster collapse that breaks model diversity without showing up in row counts.
Toxicity + safety scoring
Per-row scoring with classifier models trained for the task (Detoxify family and equivalents). Threshold per project, batch-level escalation when a slice exceeds the bar.
Human-review sampling
A statistically scoped sample of accepted rows reviewed by a vetted reviewer pool. Sample size is set by the risk tier of the dataset, not by convenience.
Distribution-shift monitoring
Length, topic, sentiment, and embedding-density probes against the real-data baseline. Batches that drift outside the agreed envelope are returned to synthesis, not shipped.
Factuality verification on factual rows
Retrieval-grounded verification on rows that carry factual claims or code interfaces. Rejected rows are rewritten against the ground source or dropped with a logged reason.
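For the human-review sampling gate, one common way to scope a sample statistically is Cochran's formula with a finite-population correction. The sketch below is an assumption about how a risk tier could map to a margin of error; the tier names and margins are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical mapping from dataset risk tier to margin of error.
RISK_TIER_MARGIN = {"low": 0.05, "medium": 0.02, "high": 0.01}


def review_sample_size(
    population: int,
    margin: float,
    confidence: float = 0.95,
    expected_defect_rate: float = 0.5,
) -> int:
    """Rows to review so the defect-rate estimate lands within the margin.

    Uses n0 = z^2 * p * (1 - p) / e^2, then corrects for a finite population.
    A defect rate of 0.5 is the most conservative assumption.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = expected_defect_rate
    n0 = (z * z) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def sample_size_for_tier(population: int, tier: str) -> int:
    return review_sample_size(population, RISK_TIER_MARGIN[tier])
```

A tighter margin for a higher risk tier raises the review load sharply, which is why the sample size is a decision to make before the batch is generated.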
The synthesis stack
Open frameworks, opinionated orchestration
The tools below cover most of what synthetic data needs. The interesting work is wiring them into a gated, lineage-tracked loop that your training run can rely on batch over batch.
Distilabel
Argilla's pipeline framework for LLM-driven synthesis and preference data.
Argilla
Human-in-the-loop curation and labelling surface for review samples.
NVIDIA NeMo Curator
GPU-accelerated dedup, contamination, and filtering at scale.
Datatrove
Hugging Face's distributed text-processing toolkit for pipelines.
fastdup
Image-corpus near-duplicate detection and quality triage.
ControlNet
Conditioning layer for diffusion generation on pose, depth, Canny edge, and layout (SDXL / SD3.5 supported set).
SDXL · FLUX.1 · SD3.5 Large
Open-weights diffusion stacks for image synthesis.
Custom orchestration
Bespoke synthesis-gate-review loop where off-the-shelf does not fit.
Where regulated source data rules out a managed teacher model, we operate an open-weights teacher inside your perimeter. The recipe stays the same; the hosting posture changes.
Your handover pack
What ships with the dataset
A synthetic dataset on its own is a liability. A dataset plus its synthesis recipe, gate config, lineage trace, and reject log is an asset your training programme and your release-gate review can both work with.
Every batch ships with these artefacts. If you commission a one-shot programme they arrive once. If you commission a steady cadence they refresh per batch.
Synthesis recipe document
Versioned. Names every generator, prompt template, sampling parameter, and expansion factor. Reproducible end-to-end so the next batch is not an art project.
Seed corpus + taxonomy
The cleaned, versioned seed set the synthesis ran against. Taxonomy mapping included so additions in v2 don't quietly change the meaning of v1 rows.
Gate configuration + thresholds
Every gate, its threshold, and the rationale. The same config file that ran in production. Cited in the dataset card so consumers know exactly what was filtered.
Per-row lineage trace
Seed-id, generator-id and version, prompt-template-id, gate-results, reviewer-id (where reviewed). Cryptographically signed so a downstream audit can trust it.
Accept / reject log
Every generated row, its gate outcomes, and the reject reason where applicable. The artefact that lets you tune the recipe without re-running everything.
Dataset card with synthesis disclosure
Composition, real-vs-synth share, generators named, gates and thresholds named, known biases. The dataset card a model release can cite verbatim.
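To make the gate configuration and the accept / reject log concrete, here is a minimal sketch of how the two could fit together. The thresholds come from the representative run described earlier; the class, field names, and rationale strings are hypothetical.

```python
import json
import operator
from dataclasses import asdict, dataclass

COMPARISONS = {"lt": operator.lt, "le": operator.le}


@dataclass(frozen=True)
class GateConfig:
    name: str
    threshold: float
    comparison: str  # a row passes when score <comparison> threshold holds
    rationale: str


GATES = [
    GateConfig("minhash_dedup", 0.85, "le", "drop rows with similarity above 0.85"),
    GateConfig("toxicity", 0.05, "lt", "keep rows scoring below 0.05"),
]


def evaluate(scores: dict[str, float], gates: list[GateConfig] = GATES) -> dict:
    """Produce one accept / reject log entry from a row's gate scores."""
    results = {
        g.name: COMPARISONS[g.comparison](scores[g.name], g.threshold) for g in gates
    }
    failed = [name for name, passed in results.items() if not passed]
    return {
        "accepted": not failed,
        "gate_results": results,
        "reject_reason": failed[0] if failed else None,
    }


# The same config can be serialised for the dataset card.
config_json = json.dumps([asdict(g) for g in GATES], indent=2)
```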
How we engage
Pick the shape that fits your team
From end-to-end programme delivery to a time-boxed audit of the pipeline you already run. The scope call confirms which fits; the statement of work names the deliverables.
Yobitel-led
We own seeds, generators, gates, review, lineage
End-to-end synthesis programme. You give us the taxonomy and the eval sets to avoid. We deliver accepted batches against the agreed gate config, with the full handover pack per batch. Best when synthetic data is on the critical path of a model release.
Collaborative
You bring seeds and reviewers, we run the stack
You hold the seed corpus and the reviewer pool. We operate the synthesis pipeline, the gates, the dedup, and the lineage emission. Best when you already have a curation team and want a stronger methodology behind it.
Advisory
Audit the synthetic-data pipeline you already run
Fixed-window review of your existing synthesis stack. We sample your outputs, re-run our gates on them, and write a remediation plan. Best when your last release used synthetic data and you want a second pair of eyes before the next one.
Back to hub
Annotation + RLHF practice
The wider practice. Supervised labelling, preference data, instruction-tune curation, eval sets, safety datasets, multimodal, synthetic.
Related
Model training + fine-tuning
The training-run engineering that consumes the synthetic corpus we curate. SFT, DPO, RLHF, evaluation. Same engineering bench across both.
Related
ML pipelines + continuous evaluation
The synthesis recipe runs as a pipeline. Same orchestration practice that runs your eval re-runs and your drift-triggered re-labelling loops.
Tell us what the synthesis is for.
A short questionnaire covers modality, seed corpus, target output size, and the gate posture your release can defend. Our synthetic-data lead replies inside one working day with a recipe sketch and a candidate gate config fitted to the risk tier of the dataset.
Eval-set contamination checked on every batch. Per-row signed lineage in the handover pack. MinHash dedup against the agreed similarity threshold. Programmes scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).