Annotation Practice · Preference data
The signal your reward model actually trains against
Pairwise comparisons, ranked lists, and pointwise rubrics over model outputs. Double-blind by protocol. Order randomised per pair. Bias surfaces engineered around. Method-fit picked against the alignment loss you actually plan to run.
Pair · double-blind
Prompt
Explain transformer attention to a 10-year-old. Keep it under 80 words and use one concrete everyday analogy.
Response A
Attention is a mechanism in transformer architectures that computes scaled dot-product similarity scores between query and key vectors over a sequence, normalised via softmax, then weights the corresponding value vectors to produce contextual representations.
Response B (chosen)
Imagine reading a sentence about a dog at the park. Your brain quietly jumps to the words that matter most. Attention is the model doing that same jumping. It looks at every word, picks the few that help, and pays them the most attention. The rest sit quiet.
Why was B chosen? (rationale required)
“Age-appropriate. Uses the ‘jumping to useful words’ analogy a 10yo can hold. A's response is technically correct but reads like a paper abstract.”
Three shapes the preference signal can take
Pick the shape that fits your alignment loss
The shape we pick depends on the method below: DPO needs pairs, KTO works from pointwise labels, and a ranked list of N responses yields C(N,2) pairs from a single labelling item.
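As a rough illustration of the C(N,2) point, the sketch below expands one ranked list into chosen/rejected pairs. It assumes the list is ordered best first with no ties; the function name and the three-field row are illustrative, not the shipped schema.

```python
from itertools import combinations


def ranked_list_to_pairs(prompt, ranked_responses):
    """Expand one ranked list (best first, no ties) into preference pairs.

    Assumption: an earlier position in the list means "preferred".
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        # combinations() keeps list order, so `better` always outranks `worse`.
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranked_list_to_pairs("Explain attention.", ["r1", "r2", "r3", "r4"])
print(len(pairs))  # 6, i.e. C(4, 2)
```

Pairs derived from the same ranked list are not independent judgements, so they should not be counted as if each had been labelled separately.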
Labeller sees the prompt plus two anonymised responses, picks the better one, and writes one or two sentences of rationale. The signal DPO and reward-model training consume directly.
Response A
Sure! Here's a polite acknowledgement followed by a thorough explanation of the policy in question, with citations to the relevant clauses.
Response B (chosen)
That policy applies in your case. The specific clause is 4.2.1. You'll need to submit form B by the 14th. Anything else?
Use when: reward model or DPO loss is the target. Cheap per pair. Rate ~120 pairs/hour per labeller.
Ships as
{ prompt, chosen, rejected, rationale, iaa }
Method-fit
Which alignment method needs which signal shape
The labelling shape is downstream of the loss. Pick the row that matches your training pipeline and we collect the shape it needs.
| Method | Signal we collect | Recipe | When it fits |
|---|---|---|---|
| DPO (Direct Preference Optimisation) | Pairwise (chosen / rejected) | Reference model + pair-loss. No reward model, no RL loop. | First call when you have ≥10k clean pairs and want the simplest stable alignment. |
| KTO (Kahneman–Tversky Optimisation) | Pointwise binary (desirable / undesirable) | Single-sided judgements. No pairs required. | When you have lots of single-response labels but few side-by-side comparisons. |
| IPO (Identity Preference Optimisation) | Pairwise (chosen / rejected) | Regularised DPO variant. Less prone to overfitting on small datasets. | When DPO overfits, typically with <5k pairs or repetitive prompt distributions. |
| ORPO (Odds Ratio Preference Optimisation) | Pairwise + SFT joint | Joint SFT + preference loss in one pass. No separate reference model. | When you want to skip the SFT-then-DPO two-stage handoff. |
| RLHF with PPO (Reinforcement Learning from Human Feedback) | Pairwise → train reward model → RL with PPO | Reward model from pairs, then PPO against the live policy. | When you want frontier-style alignment and have the team to run the RL loop. |
| Constitutional AI (AI Feedback, RLAIF) | Pointwise critique against a written constitution | Model self-critiques against principles; pairs used to train preference model. | When human labelling cost is the constraint and you can write a high-quality rubric. |
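To make "reference model + pair-loss" concrete, here is a minimal sketch of the standard DPO objective for a single pair, written in plain Python. The inputs are assumed to be summed sequence log-probabilities already computed by your own policy and reference models, and `beta=0.1` is an illustrative default rather than a recommendation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Each *_logp is the summed log-probability of a full response under the
    policy or the frozen reference model (assumed to be computed elsewhere).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss falls below log(2) once the policy favours the chosen response
# more strongly than the reference model does.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0) < math.log(2))  # True
```

The loss only needs the pair and two sets of log-probabilities, which is why no reward model or RL loop appears in the DPO row.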
Bias traps we engineer around
The four ways preference data silently breaks your reward model
Each trap is real, well-documented, and shows up in batches that look fine on IAA alone. The guardrails below run as standard, not as an upsell.
Trap
Position bias
Symptom
Labellers default to A or B independent of content. Win-rate skews >55/45 toward one slot across thousands of pairs.
Standard guardrail
Randomise A/B order per pair. Audit win-rate-by-position weekly. Reject batches that drift >52/48.
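The randomise-then-audit guardrail could look roughly like the sketch below. The function names and the input format (a list of winning slot labels) are assumptions for illustration; the 52/48 threshold is the one quoted above.

```python
import random


def assign_slots(response_x, response_y, rng):
    """Randomly place two responses into slots A and B for one pair."""
    if rng.random() < 0.5:
        return {"A": response_x, "B": response_y}
    return {"A": response_y, "B": response_x}


def slot_a_win_rate(winning_slots):
    """Share of pairs won by slot A, given a list of 'A'/'B' labels."""
    if not winning_slots:
        raise ValueError("no labelled pairs to audit")
    return sum(1 for slot in winning_slots if slot == "A") / len(winning_slots)


def passes_position_audit(winning_slots, max_share=0.52):
    """True if neither slot wins more than `max_share` of the batch."""
    share_a = slot_a_win_rate(winning_slots)
    return max(share_a, 1.0 - share_a) <= max_share


rng = random.Random(0)  # seeded only to make the example repeatable
print(sorted(assign_slots("x", "y", rng).values()))      # ['x', 'y']
print(passes_position_audit(["A"] * 55 + ["B"] * 45))    # False
```

On small batches a 52/48 split can occur by chance, so a threshold like this is more meaningful over thousands of pairs than over a few dozen.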
Trap
Length bias
Symptom
The longer response wins regardless of quality. A model trained on that data learns to be verbose.
Standard guardrail
Truncate both responses to a common length budget when applicable. Inject reverse-length canaries into 5% of batches as IAA tests.
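One simple way to spot the symptom before reaching for the guardrails is to measure how often the longer response wins. The sketch below is a diagnostic under stated assumptions, not the guardrail itself: it uses character count as the length measure and reads `chosen` / `rejected` fields from each row.

```python
def longer_wins_rate(pairs):
    """Fraction of pairs in which the chosen response is the longer one.

    Assumption: length is measured in characters; pairs of equal length
    are skipped because they say nothing about length preference.
    """
    decided = [p for p in pairs if len(p["chosen"]) != len(p["rejected"])]
    if not decided:
        raise ValueError("no pairs with differing lengths")
    longer_wins = sum(1 for p in decided if len(p["chosen"]) > len(p["rejected"]))
    return longer_wins / len(decided)


sample = [
    {"chosen": "short and right", "rejected": "a much longer but weaker answer"},
    {"chosen": "a long, thorough, correct answer", "rejected": "meh"},
    {"chosen": "same", "rejected": "size"},
]
print(longer_wins_rate(sample))  # 0.5
```

A rate well above 0.5 does not prove length bias on its own, since better answers are sometimes longer; it is a prompt to inspect the batch.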
Trap
Sycophancy preference
Symptom
Labellers prefer responses that agree with the prompter's framing. Reward model learns to flatter, not correct.
Standard guardrail
Constitution rule: rationale must cite an objective criterion, not 'felt right'. Spot-check 5% of the pairs where labellers disagree.
Trap
Assertive tone bias
Symptom
Confident-sounding response wins over hedged-but-correct response. The model unlearns calibrated uncertainty.
Standard guardrail
Mix in calibration-critical prompts (medical, legal, financial) and pair confident-wrong with hedged-right. Labellers trained to favour the hedged one.
What lands in your training repo
The actual row your trainer consumes
Versioned in your repo. Lineage-signed. Labeller identity is reversible only by Yobitel ops, never by anyone consuming the dataset. The format below ships as JSONL by default; Parquet on request.
{
  "id": string,                  // Stable across reshuffles
  "prompt": string,              // Verbatim, not paraphrased
  "chosen": string,              // The preferred response
  "rejected": string,            // The other side of the pair
  "rationale": string,           // Labeller's written reason (audit hook)
  "criteria_scores": {helpful, truthful, harmless, concise, grounded},  // Optional rubric per side
  "iaa_batch": float,            // Krippendorff α for the parent batch
  "labeller_id_hashed": string,  // Salted hash. Identity reversible only by Yobitel ops.
  "guidelines_version": string,  // Pinned to the doc the labeller saw
  "captured_at": ISO 8601 string // UTC
}
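If you want a light check on the consuming side, a sketch like the following parses one JSONL line and verifies field presence and basic types. It assumes every field except `criteria_scores` is required (the schema only marks that one as optional) and does not validate the timestamp format; `parse_row` and `REQUIRED_FIELDS` are illustrative names.

```python
import json

# Assumption: everything except criteria_scores is required.
REQUIRED_FIELDS = {
    "id": str,
    "prompt": str,
    "chosen": str,
    "rejected": str,
    "rationale": str,
    "iaa_batch": (int, float),  # JSON may encode 1.0 as 1
    "labeller_id_hashed": str,
    "guidelines_version": str,
    "captured_at": str,         # ISO 8601 string; format not checked here
}


def parse_row(line):
    """Parse one JSONL line and check required fields and basic types."""
    row = json.loads(line)
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            raise ValueError(f"missing field: {name}")
        value = row[name]
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    if row["chosen"] == row["rejected"]:
        raise ValueError("chosen and rejected are identical")
    return row


example_line = json.dumps({
    "id": "pair-0001",
    "prompt": "Explain transformer attention to a 10-year-old.",
    "chosen": "Imagine reading a sentence about a dog at the park.",
    "rejected": "Attention computes scaled dot-product similarity scores.",
    "rationale": "Age-appropriate analogy.",
    "iaa_batch": 0.81,
    "labeller_id_hashed": "hypothetical-hash",
    "guidelines_version": "hypothetical-v1",
    "captured_at": "2024-01-01T00:00:00+00:00",
})
print(parse_row(example_line)["id"])  # pair-0001
```

All values in `example_line` are made up for the example, including the identifier, hash, version, timestamp, and α value.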
How we engage
Three engagement shapes
Yobitel-led labelling programme
We staff the pool. You consume batches.
Guidelines, calibration, blind protocol, adjudication, IAA tracking, dataset shipping. Best when preference data is on the critical path of a training milestone.
Pair with your team
Your labellers, our methodology
You bring the pool (often domain SMEs we cannot easily source). We bring the bias guardrails, calibration cycles, IAA dashboards, and adjudication craft. Your labellers earn the rate uplift; we earn the quality uplift.
Audit an existing dataset
Fixed-window forensic review
We sample your shipped preference data, re-label a control set, compute IAA against your batches, surface the bias traps above, and write a remediation plan. Useful when last quarter's RLHF underperformed.
Tell us the loss you're training against.
We pick the signal shape, calibrate the labellers, run the bias guardrails, and ship the preferences in the schema above. The trainer is unmodified.
Same bench across annotation, training, eval
DPO · KTO · IPO · ORPO · PPO · Constitutional AI
UK-resident infra available for OFFICIAL