Annotation Practice · Preference data

The signal your reward model actually trains against

Pairwise comparisons, ranked lists, and pointwise rubrics over model outputs. Double-blind by protocol. Order randomised per pair. Known bias surfaces engineered around. Signal shape matched to the alignment loss you actually plan to run.

Double-blind, order shuffled
Krippendorff α per batch
Rationale captured per item

#2487

Pair · double-blind

Order shuffled

Prompt

Explain transformer attention to a 10-year-old. Keep it under 80 words and use one concrete everyday analogy.

Response A

Attention is a mechanism in transformer architectures that computes scaled dot-product similarity scores between query and key vectors over a sequence, normalised via softmax, then weights the corresponding value vectors to produce contextual representations.

Helpful · Truthful · Concise · Safe

Response B

Chosen

Imagine reading a sentence about a dog at the park. Your brain quietly jumps to the words that matter most. Attention is the model doing that same jumping. It looks at every word, picks the few that help, and pays them the most attention. The rest sit quiet.

Helpful · Truthful · Concise · Safe

Why was B chosen? (rationale required)

142 chars

“Age-appropriate. Uses the ‘jumping to useful words’ analogy a 10yo can hold. A's response is technically correct but reads like a paper abstract.”

Model identity hidden · labeller ID hashed
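As an illustration of the blinding shown in the item above, the sketch below shuffles display order per item and hashes the labeller ID. It is a minimal sketch under stated assumptions: the function names, the `seed` argument, and the choice of plain salt-plus-ID concatenation are hypothetical, not a description of the production pipeline.

```python
import hashlib
import random


def hash_labeller_id(labeller_id: str, salt: bytes) -> str:
    """Salted SHA-256 of a labeller ID. The salt is assumed to stay with the operator."""
    return hashlib.sha256(salt + labeller_id.encode("utf-8")).hexdigest()


def present_pair(item_id: str, responses: dict[str, str], seed: str) -> dict:
    """Build a blinded A/B view for one item.

    `responses` maps a model name to its output. Model names never appear
    in the labeller view; they are kept in a separate unblinding key.
    """
    if len(responses) != 2:
        raise ValueError("expected exactly two responses")
    models = sorted(responses)                # canonical order before shuffling
    rng = random.Random(f"{seed}:{item_id}")  # order is reproducible per item
    rng.shuffle(models)
    view = {"id": item_id, "A": responses[models[0]], "B": responses[models[1]]}
    key = {"id": item_id, "A": models[0], "B": models[1]}
    return {"labeller_view": view, "unblinding_key": key}
```

Seeding the shuffle from the item ID keeps the A/B assignment stable if the same item is re-served, while the unblinding key stays out of the labelling UI.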

IAA 0.78 · 8.4k pairs · batch 14/32
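For readers who want to see how a per-batch IAA figure like the one above can be computed, here is a minimal sketch of Krippendorff's α for nominal labels (for example "A" or "B" per pair). The function name and input layout are assumptions for illustration; a production setup may use a library implementation or a different distance metric.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per item, one label per labeller who
    saw it. Items with fewer than two labels carry no agreement
    information and are skipped.
    """
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different labellers on this item
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label value in use: alpha is undefined
    return 1.0 - (n - 1) * observed / expected
```

With two labellers agreeing on two items and splitting on a third, `krippendorff_alpha_nominal([["A", "A"], ["B", "B"], ["A", "B"]])` evaluates to 4/9, roughly 0.44.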

Three shapes the preference signal can take

Pick the shape that fits your alignment loss

The shape we pick depends on the method below: DPO needs pairs, KTO works from pointwise labels, and a ranked list of N responses yields C(N,2) pairs for the cost of one item.
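The C(N,2) claim for ranked lists can be made concrete with a short sketch. It assumes a strict ranking with the best response first and no ties; the function name is hypothetical.

```python
from itertools import combinations


def ranked_list_to_pairs(prompt: str, ranked: list[str]) -> list[dict]:
    """Expand one ranking (best first, no ties) into chosen/rejected pairs."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked, 2)
    ]
```

A ranking of four responses yields six pairs, so one labelling pass over a ranked item produces several rows in the pairwise shape.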

Pairwise: the labeller sees the prompt plus two anonymised responses, picks the better one, and writes one or two sentences of rationale. This is the signal DPO and reward-model training consume directly.

Response A

Sure! Here's a polite acknowledgement followed by a thorough explanation of the policy in question, with citations to the relevant clauses.

Response B

Chosen

That policy applies in your case. The specific clause is 4.2.1. You'll need to submit form B by the 14th. Anything else?

Use when: reward model or DPO loss is the target. Cheap per pair. Rate ~120 pairs/hour per labeller.

Ships as

{ prompt, chosen, rejected, rationale, iaa }

Method-fit

Which alignment method needs which signal shape

The labelling shape is downstream of the loss. Pick the row that matches your training pipeline and we collect the shape it needs.

Method	Signal we collect	Recipe	When it fits
DPO (Direct Preference Optimisation)	Pairwise (chosen / rejected)	Reference model + pair-loss. No reward model, no RL loop.	First call when you have ≥10k clean pairs and want the simplest stable alignment.
KTO (Kahneman–Tversky Optimisation)	Pointwise binary (desirable / undesirable)	Single-sided judgements. No pairs required.	When you have lots of single-response labels but few side-by-side comparisons.
IPO (Identity Preference Optimisation)	Pairwise (chosen / rejected)	Regularised DPO variant. Less prone to overfitting on small datasets.	When DPO overfits — typical with <5k pairs or repetitive prompt distributions.
ORPO (Odds Ratio Preference Optimisation)	Pairwise + SFT joint	Joint SFT + preference loss in one pass. No separate reference model.	When you want to skip the SFT-then-DPO two-stage handoff.
RLHF (PPO) (Reinforcement Learning from Human Feedback)	Pairwise → train reward model → RL with PPO	Reward model from pairs, then PPO against the live policy.	When you want frontier-style alignment and have the team to run the RL loop.
Constitutional AI (AI Feedback, RLAIF)	Pointwise critique against a written constitution	Model self-critiques against principles; pairs used to train preference model.	When human labelling cost is the constraint and you can write a high-quality rubric.
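To show why the DPO row needs a chosen/rejected pair plus a reference model, here is a minimal sketch of the standard per-pair DPO loss. The function name, the argument names, and the default `beta=0.1` are assumptions for illustration; the inputs are assumed to be summed token log-probabilities of each response.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tune per training run
) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    Each argument is the log-probability of a response under either the
    policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as softplus(-x) for numerical stability
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference model the loss is log 2, and it falls as the policy raises the chosen response relative to the rejected one. A single-response label cannot fill both sides of this expression, which is why pointwise data routes to KTO instead.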

JSONL · one preference item per line
Labeller ID hashed with salted SHA-256
Diff against prior batch shipped per release

preferences.jsonl

row schema

{
  "id": string  // Stable across reshuffles
  "prompt": string  // Verbatim, not paraphrased
  "chosen": string  // The preferred response
  "rejected": string  // The other side of the pair
  "rationale": string  // Labeller's written reason (audit hook)
  "criteria_scores": {helpful, truthful, harmless, concise, grounded}  // Optional rubric per side
  "iaa_batch": float  // Krippendorff α for the parent batch
  "labeller_id_hashed": string  // Salted hash. Identity reversible only by Yobitel ops.
  "guidelines_version": string  // Pinned to the doc the labeller saw
  "captured_at": ISO 8601 string  // UTC
}
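A consumer of `preferences.jsonl` might sanity-check each row before training. The sketch below is one way to do that against the schema above; the function name and the specific checks (non-empty strings, distinct chosen and rejected, parseable timestamp) are assumptions, and `criteria_scores` is treated as optional as the schema states.

```python
import json
from datetime import datetime

REQUIRED_STRING_FIELDS = (
    "id", "prompt", "chosen", "rejected", "rationale",
    "labeller_id_hashed", "guidelines_version", "captured_at",
)


def parse_preference_row(line: str) -> dict:
    """Parse and sanity-check one line of preferences.jsonl."""
    row = json.loads(line)
    for field in REQUIRED_STRING_FIELDS:
        if not isinstance(row.get(field), str) or not row[field]:
            raise ValueError(f"missing or empty field: {field}")
    if row["chosen"] == row["rejected"]:
        raise ValueError("chosen and rejected are identical")
    if not isinstance(row.get("iaa_batch"), (int, float)):
        raise ValueError("iaa_batch must be a number")
    # Accept a trailing 'Z' for UTC; fromisoformat rejects it before Python 3.11
    datetime.fromisoformat(row["captured_at"].replace("Z", "+00:00"))
    return row
```

Reading a file is then a loop over its lines, for example `rows = [parse_preference_row(l) for l in open(path, encoding="utf-8") if l.strip()]`, where `path` is wherever the batch was delivered.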

How we engage

Three engagement shapes

Yobitel-led labelling programme

We staff the pool. You consume batches.

Guidelines, calibration, blind protocol, adjudication, IAA tracking, dataset shipping. Best when preference data is on the critical path of a training milestone.

Pair with your team

Your labellers, our methodology

You bring the pool (often domain SMEs we cannot easily source). We bring the bias guardrails, calibration cycles, IAA dashboards, and adjudication craft. Your labellers earn the rate uplift; we earn the quality uplift.

Audit an existing dataset

Fixed-window forensic review

We sample your shipped preference data, re-label a control set, compute IAA against your batches, identify the bias surfaces involved, and write a remediation plan. Useful when last quarter's RLHF run underperformed.

Tell us the loss you're training against.

We pick the signal shape, calibrate the labellers, run the bias guardrails, and ship the preferences in the schema above. The trainer is unmodified.

Prefer email? Contact us

Same bench across annotation, training, eval
DPO · KTO · IPO · ORPO · PPO · Constitutional AI
UK-resident infra available for OFFICIAL
