Annotation Practice · Synthetic Data Generation

Synthetic data your training run can defend

LLM-augmented prompt and response synthesis, diffusion-conditioned imagery, and the human-in-the-loop quality gates that decide what reaches the training set. Contamination checks, MinHash dedup, toxicity scoring, factuality verification, and signed lineage on every row.

Self-Instruct · Magpie · Evol-Instruct · OSS-Instruct · ControlNet

Eval-set contamination checked on every batch

Per-row signed lineage shipped with the dataset

Representative run

QA-gated

Product-support intents · text · en-GB + en-US

Seed · 10k items

200 product-support intents · taxonomy v3 · human-authored

Generate · 500k rows · 50× factor

Magpie + Self-Instruct expansion + Evol-Instruct difficulty steps

Quality gates

Contamination · n-gram + embed · 94% pass

Toxicity · Detoxify < 0.05 · 98% pass

Human review · 5% sample · 89% pass

Accepted to dataset · 412k rows

Seeds · 10k

Generated · 500k

Accept rate · 82%

Dedup by MinHash similarity > 0.85. Provenance signed per row.

Real synthesis funnel

What a real synthesis run does to your candidates

A representative run: 10k seeds expand to 500k candidates, and four QA gates prune them to 412k accepted rows. Below is a per-row audit showing which gates each row cleared and the exact reason each rejected row failed.

Run: syn-2026-q2-csat-v3

seed → generate → 4 gates → ship · accept-rate 82%

Seed corpus · 10k · customer-grounded topics

→

Self-Instruct + Magpie · 500k · 50× expansion

→

Accepted to dataset · 412k · signed provenance per row

Dedup (MinHash > 0.85) · 93.6% pass · −32k dropped · 468k kept

Toxicity (Detoxify < 0.05) · 98.0% pass · −10k dropped · 459k kept

Contamination (vs eval) · 96.4% pass · −16k dropped · 442k kept

Human review (5% sample) · 93.2% pass · −30k dropped · 412k kept
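
The funnel arithmetic can be checked with a few lines. This is a minimal sketch, assuming the gates run sequentially in the order listed and each pass rate applies to the rows that survived the previous gate; the function name is illustrative. The displayed counts are rounded to the nearest thousand, so the sketch's dropped counts can differ from the display by about 1k.

```python
def run_funnel(generated, gates):
    """Apply each (name, pass_rate) gate to the survivors of the previous gate."""
    kept = generated
    steps = []
    for name, pass_rate in gates:
        survivors = round(kept * pass_rate)
        steps.append((name, kept - survivors, survivors))
        kept = survivors
    return steps, kept / generated


# Pass rates as displayed in the funnel above.
GATES = [
    ("dedup", 0.936),
    ("toxicity", 0.980),
    ("contamination", 0.964),
    ("human_review", 0.932),
]

steps, accept_rate = run_funnel(500_000, GATES)
for name, dropped, survivors in steps:
    print(f"{name:14s} -{dropped:>6,} dropped  {survivors:>7,} kept")
print(f"accept rate: {accept_rate:.1%}")
```

Multiplying the four pass rates gives the overall accept rate, which is why a small change at any single gate moves the final row count.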

Per-row gate audit · 5 sample candidates

full audit log per accepted row, signed provenance

syn-018221 · customer-support · billing-dispute

Accepted

I see the duplicate charge on the 14th. The original invoice was paid; the second attempt was an automatic retry that should not have run. Refunding now; you'll see it within 3 working days.

dedup · tox · contam. · human

syn-018222 · customer-support · billing-dispute

Rejected

I see the duplicate charge on the 14th. The original invoice was paid; the second attempt was an automatic retry that should not have run. We will refund this and the credit will appear within 3 working days.

dedup · tox · contam. · human

Reject reason · MinHash 0.91 vs syn-018221 — near-duplicate seed expansion

syn-018223 · code-review · python · off-by-one

Accepted

Line 14: `while left <= right` should be `while left < right` if you intend a half-open interval, otherwise the recursive call on line 17 will overshoot when the array has one element.

dedup · tox · contam. · human

syn-018224 · customer-support · escalation-tone

Rejected

What an absolute mess. You should expect compensation for this kind of incompetence. We'll write off the entire month and you can demand more.

dedup · tox · contam. · human

Reject reason · Detoxify score 0.21 — combative tone in agent voice

syn-018225 · patient-edu · drug-side-effects

Rejected

Common side effects of amiodarone include thyroid changes, photosensitivity, and pulmonary toxicity. Routine TFT and PFT monitoring is standard. Patients should report new cough or breathlessness.

dedup · tox · contam. · human

Reject reason · 12-gram overlap with held-out clinical eval set (BMJ Best Practice)

Every accepted row carries a lineage record: seed_id, generator_id, gate scores, human_reviewer_id_hashed. Rejected rows are kept for audit but never reach the training split.

Sample rows written for illustration; counters reflect the gate-pass-rate band a real Magpie + Self-Instruct run typically lands in.

The shapes of synthesis

Different corpora collapse differently

Synthesis is not one technique. Instruction data, code data, imagery, dialogue, and long-tail slices each have a method that works and a characteristic way of failing. We run each differently.

Instruction-pair synthesis

Self-Instruct expansion off a seed taxonomy, plus Magpie-style template-primed synthesis for scale. Useful when the SFT corpus is thin and the model needs broad task coverage before it can be evaluated honestly.

Self-Instruct · Magpie · Evol-Instruct difficulty steps

Code-pair synthesis

OSS-Instruct methodology against a real-code seed pool. Function bodies, paired tests, docstring rewrites, refactor pairs. The mix that supervised code fine-tunes actually need.

OSS-Instruct · WizardCoder-style · paired tests

Diffusion-conditioned imagery

Stable Diffusion XL or FLUX.1 with ControlNet conditioning on pose, depth, edge (Canny), and layout primitives, or Stable Diffusion 3.5 Large with the supported Canny / depth / blur conditioners. Used to balance under-represented slices in vision datasets without scraping more raw web data.

SDXL · FLUX.1 · SD3.5 Large · ControlNet conditioning

Multi-turn dialogue augmentation

Conversation trees grown from a single-turn seed, with persona variation and follow-up branching. Used where supervised dialogue or agent-trace data is most expensive to author by hand.

Tree expansion · persona variation · branch sampling

Long-tail slice generation

Targeted synthesis against the slices your evaluation says you fail on. Generate to the under-represented intent, locale, or visual condition, not to the bulk of the distribution you already cover.
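
One way to turn "generate to the slices you fail on" into a number is to split the generation budget in proportion to each slice's evaluation failure rate. The sketch below is an assumption about how that could be done, not a description of a fixed recipe; the function name, slice names, and failure rates are hypothetical.

```python
def allocate_generation_budget(failure_rates, total_rows):
    """Split total_rows across slices in proportion to their eval failure rate."""
    total = sum(failure_rates.values())
    if total == 0:
        return {name: 0 for name in failure_rates}
    raw = {name: total_rows * rate / total for name, rate in failure_rates.items()}
    budget = {name: int(value) for name, value in raw.items()}
    # Hand leftover rows to the slices with the largest fractional remainder.
    leftover = total_rows - sum(budget.values())
    by_fraction = sorted(raw, key=lambda name: raw[name] - budget[name], reverse=True)
    for name in by_fraction[:leftover]:
        budget[name] += 1
    return budget


# Hypothetical slices and failure rates, for illustration only.
budget = allocate_generation_budget(
    {"en-GB refunds": 0.30, "en-US refunds": 0.10, "billing-dispute": 0.0},
    total_rows=1000,
)
print(budget)
```

A slice the model already handles gets no new rows, which keeps the batch from reinforcing the bulk of the distribution.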

When a generated row causes a downstream regression, the question is which seed, which generator version, which gate let it through. Without per-row lineage the answer is a re-run. With it, the answer is a fix.

The gates each row clears

Generation is cheap. Acceptance is the work.

A teacher model can produce a million rows in a weekend. The hard, slow part is filtering the rows that should never join the training set. These are the gates we run.

Eval-set contamination check

n-gram overlap and embedding similarity against every named held-out eval set. Rows over threshold are dropped, logged, and counted in the dataset card so the benchmark score remains honest.
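
The n-gram half of this check can be sketched in plain Python. This is a minimal illustration, assuming word-level 12-grams (the window size echoes the sample reject reason above) and a drop-on-any-match rule; the embedding-similarity half is omitted, and all function names are hypothetical.

```python
import re


def ngrams(text, n):
    """Set of word-level n-grams from a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_eval_index(eval_texts, n=12):
    """Union of all n-grams across the held-out eval sets."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(candidate, eval_index, n=12):
    """True if the candidate shares at least one n-gram with any eval item."""
    return not ngrams(candidate, n).isdisjoint(eval_index)
```

Building the index once per eval release and reusing it across batches keeps the per-row check to a set lookup. Texts shorter than the window produce no n-grams and pass this half of the gate, which is one reason the embedding check runs alongside it.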

Near-duplicate dedup

MinHash + LSH similarity at a configurable threshold (typically 0.85). Removes the paraphrase clusters that erode model diversity without showing up in row counts.
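
For readers new to MinHash, the sketch below shows the core idea: the fraction of matching signature slots estimates the Jaccard similarity between two rows' shingle sets. It is a standard-library illustration under stated assumptions (3-word shingles, 128 seeded hashes standing in for permutations, pairwise comparison only); the LSH banding used to avoid all-pairs comparison at scale is left out, and the function names are hypothetical.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of k-word shingles for a lowercased, tokenised text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.85):
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) > threshold
```

The estimate depends on shingle size and tokenisation, so a threshold such as 0.85 only means something alongside the shingling settings recorded in the gate config.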

Toxicity + safety scoring

Per-row scoring with classifier models trained for the task (Detoxify family and equivalents). Thresholds are set per project, with batch-level escalation when a slice exceeds the bar.
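
The per-row threshold and the slice-level escalation can be sketched as one pass over scored rows. This assumes each row already carries a toxicity score from whichever classifier the project uses; the 0.05 row threshold matches the representative run, while the 10% slice escalation rate, the field names, and the 0.01 score are illustrative assumptions.

```python
from collections import defaultdict


def toxicity_gate(rows, row_threshold=0.05, slice_escalation_rate=0.10):
    """rows: dicts with 'id', 'slice', and a precomputed 'tox_score'."""
    accepted, rejected = [], []
    totals, fails = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["slice"]] += 1
        if row["tox_score"] < row_threshold:
            accepted.append(row["id"])
        else:
            rejected.append(row["id"])
            fails[row["slice"]] += 1
    # Escalate any slice whose reject share exceeds the agreed bar.
    escalate = sorted(
        name for name in totals
        if fails[name] / totals[name] > slice_escalation_rate
    )
    return accepted, rejected, escalate


rows = [
    {"id": "syn-018221", "slice": "billing-dispute", "tox_score": 0.01},  # illustrative score
    {"id": "syn-018224", "slice": "escalation-tone", "tox_score": 0.21},  # score from the audit above
]
accepted, rejected, escalate = toxicity_gate(rows)
print(accepted, rejected, escalate)
```

Escalating by slice rather than by batch points the fix at the prompt template or seed group producing the problem rows.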

Human-review sampling

A statistically scoped sample of accepted rows reviewed by a vetted reviewer pool. Sample size is set by the risk tier of the dataset, not by convenience.
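
As a rough guide to what "statistically scoped" can mean, the sketch below sizes a review sample with Cochran's formula for a proportion plus a finite-population correction. The margin of error, confidence level, and worst-case defect rate of 0.5 are assumptions; in practice they would follow from the dataset's risk tier.

```python
import math
from statistics import NormalDist


def review_sample_size(population, margin=0.02, confidence=0.95,
                       expected_defect_rate=0.5):
    """Rows to review so the defect-rate estimate is within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = (z ** 2) * expected_defect_rate * (1 - expected_defect_rate) / margin ** 2
    # Finite-population correction keeps the sample below the batch size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(review_sample_size(442_000))               # default tier
print(review_sample_size(442_000, margin=0.01))  # tighter margin, larger sample
```

Halving the margin roughly quadruples the sample, which is why the risk tier, not the reviewers' available hours, should set the number.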

Distribution-shift monitoring

Length, topic, sentiment, and embedding-density probes against the real-data baseline. Batches that drift outside the agreed envelope are returned to synthesis, not shipped.
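
A length probe is the simplest of these to illustrate. The sketch below compares a batch's length histogram with the real-data baseline using the population stability index (PSI). PSI is a common drift statistic chosen here for illustration; the page does not name the statistic actually used, and the bin edges and the 0.2 envelope are assumptions.

```python
import math


def bin_shares(values, edges):
    """Share of values in each bin [edges[i], edges[i+1]); last bin is open-ended."""
    counts = [0] * len(edges)
    for value in values:
        index = 0
        for i, edge in enumerate(edges):
            if value >= edge:
                index = i
        counts[index] += 1
    return [count / len(values) for count in counts]


def population_stability_index(baseline, batch, edges, eps=1e-6):
    """PSI = sum((q - p) * ln(q / p)) over bins, smoothed so empty bins stay finite."""
    p = [share + eps for share in bin_shares(baseline, edges)]
    q = [share + eps for share in bin_shares(batch, edges)]
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


LENGTH_EDGES = [0, 20, 40, 80, 160]  # token-length bins, illustrative


def within_envelope(baseline_lengths, batch_lengths, max_psi=0.2):
    psi = population_stability_index(baseline_lengths, batch_lengths, LENGTH_EDGES)
    return psi <= max_psi
```

The same comparison works for topic or sentiment once each row is mapped to a bin; embedding-density probes need a different statistic.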

Factuality verification on factual rows

Retrieval-grounded verification on rows that carry factual claims or code interfaces. Rejected rows are rewritten against the grounding source or dropped with a logged reason.

The synthesis stack

Open frameworks, opinionated orchestration

The tools below cover most of what synthetic data needs. The interesting work is wiring them into a gated, lineage-tracked loop that your training run can rely on batch over batch.

Distilabel

Argilla's pipeline framework for LLM-driven synthesis and preference data.

Argilla

Human-in-the-loop curation and labelling surface for review samples.

NVIDIA NeMo Curator

GPU-accelerated dedup, contamination, and filtering at scale.

Datatrove

Hugging Face's distributed text-processing toolkit for pipelines.

fastdup

Image-corpus near-duplicate detection and quality triage.

ControlNet

Conditioning layer for diffusion generation on pose, depth, Canny edge, and layout (SDXL / SD3.5 supported set).

SDXL · FLUX.1 · SD3.5 Large

Open-weights diffusion stacks for image synthesis.

Custom orchestration

Bespoke synthesis-gate-review loop where off-the-shelf tooling does not fit.

Where regulated source data rules out a managed teacher model, we operate an open-weights teacher inside your perimeter. The recipe stays the same; the hosting posture changes.

Your handover pack

What ships with the dataset

A synthetic dataset on its own is a liability. A dataset plus its synthesis recipe, gate config, lineage trace, and reject log is an asset your training programme and your release-gate review can both work with.

Every batch ships with these artefacts. If you commission a one-shot programme they arrive once. If you commission a steady cadence they refresh per batch.

Synthesis recipe document

Versioned. Names every generator, prompt template, sampling parameter, and expansion factor. Reproducible end-to-end so the next batch is not an art project.

Seed corpus + taxonomy

The cleaned, versioned seed set the synthesis ran against. Taxonomy mapping included so additions in v2 don't quietly change the meaning of v1 rows.

Gate configuration + thresholds

Every gate, its threshold, and the rationale. The same config file that ran in production. Cited in the dataset card so consumers know exactly what was filtered.

Per-row lineage trace

Seed-id, generator-id and version, prompt-template-id, gate-results, reviewer-id (where reviewed). Cryptographically signed so a downstream audit can trust it.
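
To make the shape of a lineage record concrete, here is a minimal sketch that signs and verifies one. It uses HMAC-SHA256 from the standard library as a stand-in: the page does not name the signing scheme, and an asymmetric signature would let an auditor verify without holding the secret. The field values and the key are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(record):
    """Stable serialisation so the same record always signs to the same value."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_lineage(record, key):
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify_lineage(record, signature, key):
    return hmac.compare_digest(sign_lineage(record, key), signature)


# Hypothetical record using the fields listed above.
record = {
    "row_id": "syn-018221",
    "seed_id": "seed-00042",
    "generator_id": "teacher-a",
    "generator_version": "1.3.0",
    "prompt_template_id": "tmpl-billing-07",
    "gate_results": {"dedup": "pass", "tox": "pass", "contam": "pass", "human": "pass"},
    "reviewer_id_hashed": "placeholder-hash",
}
key = b"demo-key-not-for-production"
signature = sign_lineage(record, key)
print(verify_lineage(record, signature, key))
```

Any change to a signed field, such as a swapped generator version, makes verification fail, which is the property a downstream audit relies on.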

Accept / reject log

Every generated row, its gate outcomes, and the reject reason where applicable. The artefact that lets you tune the recipe without re-running everything.

Dataset card with synthesis disclosure

Composition, real-vs-synth share, generators named, gates and thresholds named, known biases. The dataset card a model release can cite verbatim.

How we engage

Pick the shape that fits your team

From end-to-end programme delivery to a time-boxed audit of the pipeline you already run. The scope call confirms which fits; the statement of work names the deliverables.

Yobitel-led

We own seeds, generators, gates, review, lineage

End-to-end synthesis programme. You give us the taxonomy and the eval sets to avoid. We deliver accepted batches against the agreed gate config, with the full handover pack per batch. Best when synthetic data is on the critical path of a model release.

Collaborative

You bring seeds and reviewers, we run the stack

You hold the seed corpus and the reviewer pool. We operate the synthesis pipeline, the gates, the dedup, and the lineage emission. Best when you already have a curation team and want to strengthen its methodology.

Advisory

Audit the synthetic-data pipeline you already run

Fixed-window review of your existing synthesis stack. We sample your outputs, re-run our gates on them, and write a remediation plan. Best when your last release used synthetic data and you want a second pair of eyes before the next one.

Back to hub

Annotation + RLHF practice

The wider practice. Supervised labelling, preference data, instruction-tune curation, eval sets, safety datasets, multimodal, synthetic.

Model training + fine-tuning

The training-run engineering that consumes the synthetic corpus we curate. SFT, DPO, RLHF, evaluation. Same engineering bench across both.

ML pipelines + continuous evaluation

The synthesis recipe runs as a pipeline. Same orchestration practice that runs your eval re-runs and your drift-triggered re-labelling loops.

Tell us what the synthesis is for.

A short questionnaire covers modality, seed corpus, target output size, and the gate posture your release can defend. Our synthetic-data lead replies inside one working day with a recipe sketch and a candidate gate config fitted to the risk tier of the dataset.

Eval-set contamination checked on every batch. Per-row signed lineage in the handover pack. MinHash dedup against the agreed similarity threshold. Programmes scoped to any sovereignty perimeter (NCSC, GDPR, HIPAA, MeitY, and beyond).