LLM Evaluation Frameworks

TL;DR

LLM evaluation in 2026 spans three layers: capability benchmarks (MMLU, HumanEval, GSM8K, SWE-bench, AIME, MMMU) run via open harnesses (EleutherAI LM-Eval-Harness, Stanford HELM, OpenCompass, UK AISI Inspect AI), application metric frameworks (RAGAS for RAG, DeepEval for general regression, Promptfoo for matrix comparison), and hosted observability + eval platforms (LangSmith, Braintrust, Phoenix / Arize).
Each tool has a clear niche. RAGAS is the principled choice for retrieval-augmented generation. DeepEval is a pytest-style general regression harness. Promptfoo is the matrix comparator for prompt and provider A/B. LangSmith is the LangChain-native traces-into-datasets-into-evals pipeline. Phoenix is the open-source self-hostable equivalent.
Modern application evaluation relies heavily on LLM-as-judge scoring for open-ended outputs; an uncalibrated judge is the largest source of error, and calibrating it against a small hand-graded set is the practice most teams under-invest in.
Continuous evaluation — running evals on every code change against a versioned dataset — is the single highest-leverage practice for keeping LLM applications stable; the cost is small, the regressions it catches are routinely large.
Yobibyte ships a built-in eval harness that wraps RAGAS for RAG faithfulness and context metrics and DeepEval for general regression; customers pipe LangSmith and Phoenix traces from Yobibyte endpoints into their own eval pipelines, and Yobitel's InferenceBench publishes provider-level scoring for the same model families using the same harness shape.

Overview

Treat LLM evaluation as three distinct activities. Capability benchmarks (MMLU, HumanEval, GSM8K, MATH, MMMU, SWE-bench, AIME, LiveBench) measure base-model abilities and inform model selection; they are the input to choosing which model your application calls. Evaluation harnesses (EleutherAI LM-Eval-Harness, Stanford HELM, OpenCompass, UK AISI Inspect AI) run those benchmarks reproducibly across model providers and produce the numbers you see in release announcements. Application evaluation frameworks (RAGAS for RAG, DeepEval for general regression, Promptfoo for matrix comparison, LangSmith / Braintrust / Phoenix for hosted continuous evaluation) measure whether your specific system, on your specific data, meets your specific quality bar.

The common mistake is conflating them. A model that scores 88 percent on MMLU does not necessarily perform well on your customer support task; a system that passes your application evals may still have capability gaps that only benchmarks expose. Production teams need both — benchmarks at model-selection time, application evals continuously thereafter. Yobitel maintains both surfaces: InferenceBench publishes capability and serving benchmarks across NeoCloud providers, and Yobibyte ships built-in application-level evals that customers run against their own corpora.

By mid-2026 the field has stabilised on a small number of widely used tools rather than a sprawling list. This entry helps you stand up a continuous LLM evaluation practice (capability benchmarks for model selection, RAGAS for RAG faithfulness, DeepEval for general regression, Promptfoo for prompt-and-provider matrix comparison, and LangSmith or Phoenix for trace-driven loops) wired to the Yobibyte or NeoCloud endpoints you are already serving from. It maps the tooling landscape, explains where each tool fits, walks through a quick start, gives a sizing model for eval at scale, lists the limits and quotas that bite, covers security and FinOps, compares the obvious alternatives, and shows where Yobitel's built-in harness slots in.

Quick start

The block below installs each of the four tools most teams adopt — RAGAS, DeepEval, Promptfoo, and Phoenix — and shows a minimal run for each against a Yobibyte (or any OpenAI-compatible) endpoint. None of these tools require Yobitel-specific configuration: every one targets the same OpenAI-compatible base URL plus API key that the application already uses.

# 0. Common setup — point evals at the same endpoint the app uses
export LLM_BASE_URL=https://api.yobibyte.example/v1
export LLM_API_KEY=sk-yb-...
export JUDGE_BASE_URL=https://api.anthropic.com/v1   # separate judge family; must be an OpenAI-compatible path
export JUDGE_API_KEY=sk-ant-...

# 1. RAGAS — RAG faithfulness, context precision/recall, answer relevancy
pip install "ragas>=0.2" "langchain-openai>=0.2"
cat > rag_eval.py <<'PY'
import os

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

ds = Dataset.from_dict({
    "question": [...], "answer": [...], "contexts": [[...], ...],
    "ground_truth": [...],
})
report = evaluate(
    ds,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    # Judge: a different model family, reached through its own
    # OpenAI-compatible base URL and its own key.
    llm=ChatOpenAI(model="claude-opus-4-7",
                   base_url=os.environ["JUDGE_BASE_URL"],
                   api_key=os.environ["JUDGE_API_KEY"]),
    # Embeddings: served from the same endpoint the application uses.
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large",
                                base_url=os.environ["LLM_BASE_URL"],
                                api_key=os.environ["LLM_API_KEY"]),
)
print(report.to_pandas())
PY

# 2. DeepEval — pytest-style general regression
pip install "deepeval>=2.0"
cat > test_agent.py <<'PY'
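import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# load_golden_cases() and my_agent() are placeholders for your own dataset
# loader and application entry point; import them from your code base.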

@pytest.mark.parametrize("case", load_golden_cases())
def test_agent_response(case):
    actual = my_agent(case["input"], retrieved=case["context"])
    tc = LLMTestCase(input=case["input"], actual_output=actual,
                     retrieval_context=case["context"],
                     expected_output=case["expected"])
    assert_test(tc, [
        AnswerRelevancyMetric(threshold=0.8),
        FaithfulnessMetric(threshold=0.9),
        GEval(name="Tone", criteria="Polite, concise, cites sources.",
              evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]),
    ])
PY
deepeval test run test_agent.py

# 3. Promptfoo — matrix comparison across prompts / models / providers
npm i -g promptfoo
cat > promptfooconfig.yaml <<'YML'
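# Shell-style ${VAR} is not expanded inside this quoted heredoc and is not
# valid inside a YAML flow mapping; promptfoo renders {{ env.VAR }} templates.
providers:
  - id: openai:chat:claude-opus-4-7   # judge / strong baseline
    config:
      apiBaseUrl: "{{ env.JUDGE_BASE_URL }}"
      apiKeyEnvar: JUDGE_API_KEY
  - id: openai:chat:llama-3.1-70b-instruct  # Yobibyte-served
    config:
      apiBaseUrl: "{{ env.LLM_BASE_URL }}"
      apiKeyEnvar: LLM_API_KEY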
prompts: [ file://prompts/v1.txt, file://prompts/v2.txt ]
tests: [ file://datasets/golden.csv ]
defaultTest:
  assert:
    - type: contains
      value: "ORD-"
    - type: llm-rubric
      value: "Answer is grounded in the provided context."
YML
promptfoo eval && promptfoo view

# 4. Phoenix — open-source trace + eval UI; runs locally or self-hosted
pip install "arize-phoenix>=5"
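python - <<'PY'
import time

import phoenix as px
from phoenix.evals import RAG_RELEVANCY_PROMPT_TEMPLATE, llm_classify

px.launch_app()
# Phoenix UI now serves at http://localhost:6006; traces appear once the app's
# LangChain / LlamaIndex auto-instrumentation is registered.
# launch_app() stops when this process exits, so keep the script alive.
while True:
    time.sleep(60)
PY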

Tip: Use a different model family for the judge than for the system under test. LLM-as-judge is biased toward outputs from its own family (self-enhancement bias). For a Yobibyte-served Llama-70B, judge with Claude or GPT-4 class on a separate JUDGE_BASE_URL — exactly the pattern shown above.

How it works

Capability harnesses iterate a fixed task suite over a model and report aggregate metrics. EleutherAI's LM-Eval-Harness is the de-facto standard: the numbers in most model release announcements are produced by it. HELM is the gold standard for holistic evaluation across accuracy, calibration, robustness, fairness and bias. OpenCompass is the strongest on Chinese plus English reasoning. UK AISI's Inspect AI is the recommended choice for safety-critical evaluation — designed with programmatic control and audit trails in mind, used by the UK AISI for frontier-model safety testing.

Application metric frameworks operate one layer up: instead of asking 'how capable is this model in general', they ask 'how well does this specific pipeline, on this specific dataset, do at this specific task'. RAGAS focuses on RAG (faithfulness, answer relevancy, context precision and recall, noise sensitivity). DeepEval is a general pytest-style framework with a catalogue of metrics (G-Eval rubrics, hallucination, bias, toxicity, task completion, tool correctness). Promptfoo runs a matrix of prompts x models x providers x test cases and reports comparison tables — the natural choice for prompt A/B and provider bake-off (including a 'this prompt on Yobibyte vs Anthropic vs OpenAI' shape).
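
To make the faithfulness idea concrete, here is a minimal sketch of the ratio such a metric reports: the share of claims in the answer that the retrieved context supports. It is not RAGAS's implementation. In a real run an LLM judge extracts the claims and decides support; the `substring_judge` function below is a hypothetical stand-in so the arithmetic can run offline.

```python
from typing import Callable, Sequence


def faithfulness_score(
    claims: Sequence[str],
    contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of answer claims that the retrieved contexts support."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def substring_judge(claim: str, contexts: Sequence[str]) -> bool:
    """Hypothetical stand-in for an LLM judge: case-insensitive substring match."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


# Two claims extracted from an answer; only the first appears in the context.
claims = ["the order shipped on monday", "the refund takes 30 days"]
contexts = ["Status: the order shipped on Monday via courier."]
score = faithfulness_score(claims, contexts, substring_judge)
print(score)
```

One of the two claims is supported, so the score is 0.5. The per-claim structure is also why faithfulness needs several judge calls per example.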

Hosted observability-plus-eval platforms close the loop. LangSmith captures every LangChain / LangGraph run, lets you promote production traces into versioned datasets, and runs evals on every commit via CI. Braintrust focuses on experiment comparison with strong side-by-side workflows. Phoenix (Arize) is the open-source self-hostable equivalent, with deep LlamaIndex and LangChain auto-instrumentation. Helicone offers a low-friction proxy approach (one HTTP header to enable). All of them speak the same trace-into-dataset-into-eval pattern; the choice is mostly about deployment shape and where your team already lives.

Yobibyte ships a built-in eval harness that wraps RAGAS for RAG metrics and DeepEval for general regression, exposed through the workspace console without customers having to operate the underlying tooling themselves. Customers who run their own LangSmith / Phoenix instance pipe traces directly from Yobibyte endpoints (no Yobitel-specific instrumentation; the OpenAI-compatible surface is the integration). Yobitel's InferenceBench publishes provider-level scores for the same model families using the same harness shape, giving Yobitel customers a continuous third-party signal on the model + provider mix they consume.

Capability benchmarks — MMLU, HumanEval, GSM8K, MATH, MMMU, SWE-bench, AIME, LiveBench, BIG-Bench Hard.
Open harnesses — EleutherAI LM-Eval-Harness (de-facto), Stanford HELM (holistic), OpenCompass (CN/EN), UK AISI Inspect AI (safety + control), BIG-Bench (Google + community).
Preference leaderboards — Chatbot Arena (LMSYS) for pairwise human preference; Artificial Analysis and InferenceBench for serving-side latency / throughput / price-quality.
Application metric libraries — RAGAS (RAG), DeepEval (general pytest-style), Promptfoo (matrix), Inspect AI (safety), TruLens (RAG + agent observability).
Hosted observability + eval — LangSmith (LangChain-native), Braintrust (experiment comparison), Phoenix / Arize (open-source self-host), Helicone (proxy-based), Langfuse (open-source self-host).
Judge models — recommended: a strong frontier model from a different family than the system under test. Claude Opus or GPT-4 class for a Llama-served Yobibyte tenant.
Dataset sources — hand-curated golden examples (foundation), production traffic samples (bulk), adversarial cases from incidents (edge), synthetic generation (coverage floor only).

Note: Capability benchmarks tell you which model. Application evals tell you whether the model plus your prompt plus your context plus your tools does the job. Production needs both — and InferenceBench publishes the capability + serving signal continuously for the model families Yobitel customers reach through Yobibyte.

Reference and specifications

The table below is the canonical reference for the eval tooling production teams actually compose in 2026. Each row gives the tool, the layer it belongs to, its licence, and its headline shape.

Tool	Layer	Licence	Headline shape
EleutherAI LM-Eval-Harness	Capability harness	MIT	100+ academic benchmarks, model-agnostic; the source of most release-announcement numbers.
Stanford HELM	Capability harness (holistic)	Apache 2.0	Accuracy + calibration + robustness + fairness + bias on a fixed task suite.
OpenCompass	Capability harness	Apache 2.0	Strong on Chinese + English reasoning; 100+ datasets.
UK AISI Inspect AI	Capability + safety harness	MIT	Programmatic, auditable; UK AISI's own evaluation tool for frontier safety.
BIG-Bench / BIG-Bench Hard	Capability suite	Apache 2.0	200+ tasks in BIG-Bench; BIG-Bench Hard is the 23-task hard-reasoning subset; widely cited.
Chatbot Arena (LMSYS)	Preference leaderboard	Apache 2.0	Pairwise human preference; the most-cited public arena.
Artificial Analysis	Serving leaderboard	Commercial	Latency + throughput + price-quality across providers.
InferenceBench (Yobitel)	Serving + price-quality leaderboard	Open + free	Independent serving and price-performance for the providers Yobitel customers consume.
RAGAS	Application metric library	Apache 2.0	RAG faithfulness, answer relevancy, context precision/recall, noise sensitivity.
DeepEval	Application regression	Apache 2.0	pytest-style; G-Eval rubrics + 20+ pre-built metrics.
Promptfoo	Matrix comparator	MIT	YAML-driven prompts x models x providers x tests; CLI + web UI.
TruLens	RAG + agent observability	MIT	Feedback functions over traces; LangChain / LlamaIndex auto-integration.
LangSmith	Hosted observability + eval	Commercial (free tier)	LangChain-native traces → datasets → evals; prompt hub.
Braintrust	Hosted observability + eval	Commercial	Experiment comparison, side-by-side UI.
Phoenix (Arize)	OSS observability + eval	ELv2 / Apache 2.0	Self-hostable trace + eval UI; deep LlamaIndex / LangChain auto-instrumentation.
Langfuse	OSS observability + eval	MIT	Self-hostable; GDPR-friendly default.
Helicone	Proxy-based observability + eval	Apache 2.0	One HTTP header to enable; very low friction.
Yobibyte built-in evals	Managed regression for customers	Yobitel-operated	Wraps RAGAS and DeepEval; exposed through workspace console.

Warning: Public capability benchmarks suffer from contamination drift — many are present in modern pre-training corpora, so high scores partly measure memorisation. Cross-reference with held-out community benchmarks (LiveBench refreshes monthly) and with your own application evals before treating a leaderboard number as ground truth.

Workload patterns

Three eval workloads cover the bulk of production usage: (A) CI regression — run a small fast eval on every pull request, fail the build below threshold; (B) Nightly full-eval — run the complete versioned dataset, archive a report, gate on a quality budget; (C) Pre-release bake-off — compare candidate model / prompt / provider matrices side-by-side before promoting. Yobitel customers running on Yobibyte typically use the same harness across all three shapes; the difference is dataset size and judge cost.

Pattern A — CI regression. 50-200 examples, DeepEval pytest under GitHub Actions or GitLab CI. Runs in 2-10 minutes, judge model is a fast frontier tier, fails the build if any metric drops below a per-metric threshold.
Pattern B — Nightly full-eval. 1,000-10,000 examples, RAGAS + DeepEval composed, report archived into LangSmith or Phoenix, alerts on regression.
Pattern C — Pre-release bake-off. Promptfoo or LangSmith comparison runs over a curated matrix (current prompt + candidate prompts) x (current model + candidate models) x (current provider + candidate providers; for Yobitel customers this is typically Yobibyte vs Anthropic vs OpenAI vs Together).
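
The Pattern A gate, failing the build when any metric drops below its own threshold, can be sketched framework-free as below. This is an illustration of the gating logic only, not DeepEval's internals; the metric names and scores are hypothetical, and the thresholds reuse the 0.8 and 0.9 values from the quick-start test.

```python
from typing import Dict, List


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message per metric that is missing or below threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


# Hypothetical aggregate scores from one CI run.
thresholds = {"answer_relevancy": 0.80, "faithfulness": 0.90}
run_scores = {"answer_relevancy": 0.86, "faithfulness": 0.84}

failures = gate(run_scores, thresholds)
for line in failures:
    print("FAIL", line)

# A CI wrapper would exit non-zero when this is 1.
exit_code = 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a metric that silently stops reporting should not let the build pass.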

# Pattern A — DeepEval in GitHub Actions
name: llm-regression
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install "deepeval>=2.0" "ragas>=0.2"
      - env:
          LLM_BASE_URL: ${{ secrets.YOBIBYTE_URL }}
          LLM_API_KEY:  ${{ secrets.YOBIBYTE_KEY }}
          JUDGE_API_KEY: ${{ secrets.ANTHROPIC_KEY }}
        run: deepeval test run tests/eval/   # thresholds are set per metric inside the test files

# Pattern C — Promptfoo provider bake-off
description: Yobibyte vs Anthropic vs OpenAI for our support agent
providers:
  - id: openai:chat:llama-3.1-70b-instruct
    label: yobibyte-llama70b
    config: { apiBaseUrl: https://api.yobibyte.example/v1 }
  - id: anthropic:messages:claude-opus-4-7
    label: anthropic-opus
  - id: openai:chat:gpt-4.1
    label: openai-gpt41
prompts: [ file://prompts/system_v3.txt ]
tests: [ file://datasets/support_golden_200.csv ]
defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-opus-4-7
      value: "Cites the order id, includes the SLA, polite tone."
    - type: latency
      threshold: 4000

Tip: Run InferenceBench's published numbers alongside your application evals when picking a provider. Capability and latency on InferenceBench tell you the ceiling and the speed; your application evals tell you whether your prompt and your context plus that provider hit your quality bar. They answer different questions.

Sizing and capacity planning

Eval throughput is bounded by the judge model's RPS and the eval framework's parallelism. The table below gives realistic per-example overhead for the most common shapes; treat it as a planning anchor for budget and CI duration.

The single biggest CI-speed lever is async parallelism plus a fast judge. RAGAS and DeepEval both support async judge evaluation; a 200-example CI regression run that takes 20 minutes serially typically completes in 2-3 minutes with parallelism set to 16-32 against a frontier-class judge. For nightly full-evals at 10,000+ examples, batched judge calls via the provider's batch API (Anthropic Batches, OpenAI Batches) cut judge spend by 50 percent at the cost of longer wall-clock.

Dimension	Typical value	Notes
RAGAS per-example judge latency (frontier judge)	1-4 s	Faithfulness decomposes claims; multiple judge calls per example.
DeepEval AnswerRelevancy / Faithfulness	1-3 s	Per-metric; G-Eval rubric is slower (2-6 s).
Promptfoo per-test	Latency of the model under test + assert overhead	Dominated by the SUT; assertions are cheap.
LM-Eval-Harness MMLU full run	30-90 min	On a single H100; varies with model size and batch.
HELM full task suite	10-40 hours	Full holistic run; usually scoped down.
CI regression (200 ex, parallel 32)	2-5 min	Frontier judge; fits a sensible PR-gate budget.
Nightly full-eval (10K ex, batched)	1-4 hours wall-clock	Anthropic / OpenAI batch API; lowest cost path.
Phoenix UI load on self-host	1-2 vCPU, 2 GB RAM	Per replica; idle most of the time.
LangSmith ingestion rate	Plan-dependent	Free tier caps at 5K runs/month; paid tiers scale to millions.
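
The async-parallelism lever described in this section amounts to keeping a bounded number of judge calls in flight. A minimal sketch with `asyncio.Semaphore` follows; `fake_judge` is a hypothetical stand-in that sleeps instead of calling a provider, and the concurrency of 16 is taken from the 16-32 range quoted above. RAGAS and DeepEval expose their own concurrency settings, so this shows the mechanism rather than either framework's API.

```python
import asyncio
from typing import Awaitable, Callable, List


async def judge_all(
    examples: List[str],
    judge: Callable[[str], Awaitable[float]],
    max_concurrency: int = 16,
) -> List[float]:
    """Score every example with at most max_concurrency judge calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(example: str) -> float:
        async with semaphore:
            return await judge(example)

    # gather preserves input order, so scores[i] belongs to examples[i].
    return list(await asyncio.gather(*(bounded(e) for e in examples)))


in_flight = 0
peak_in_flight = 0


async def fake_judge(example: str) -> float:
    """Hypothetical judge: sleeps briefly and tracks concurrent calls."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return float(len(example) % 2)


examples = [f"example-{i}" for i in range(200)]
scores = asyncio.run(judge_all(examples, fake_judge, max_concurrency=16))
print(len(scores), peak_in_flight)
```

Set `max_concurrency` from the judge provider's rate limit rather than from CPU count, since the work is network-bound.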

Limits and quotas

Eval frameworks place no hard limits on dataset size or run frequency. Every limit you hit will be one set by the judge provider's RPS quota, the hosted eval platform's plan tier, or the size of the eval dataset itself. The list below covers what bites in practice when running CI regression at scale.

Judge provider rate limit — Anthropic and OpenAI frontier judges typically allow 5-50 RPS at standard tiers, 100+ on enterprise. RAGAS / DeepEval honour 429 Retry-After; tune the framework's concurrency to stay under the cap.
Judge token cost — frontier judges at ~$15-75 per 1M output tokens dominate eval cost; for high-frequency CI use a cheaper judge (Claude Haiku, GPT-4o-mini class) calibrated against the strong judge.
Dataset versioning — LangSmith, Braintrust and Phoenix all store dataset versions; the limit is plan-tier, not framework. Pin dataset versions per release tag to make regressions reproducible.
Trace ingestion rate — LangSmith Plus and Braintrust paid tiers cap per-minute trace ingestion. High-RPS production traffic should sample (e.g. 1-5 percent) rather than ingest every run.
Promptfoo matrix size — the cartesian product of prompts x models x providers x tests x assertions blows up fast; a 4x4x4x500x3 matrix is 96K calls. Use the `--filter-first-n`, `--filter-pattern` and `--filter-providers` options of `promptfoo eval` to scope each run.
Inspect AI evaluation depth — designed for safety-critical eval, so logs include full prompts and completions. Plan storage accordingly.
Capability harness reproducibility — pin the harness version, the model checkpoint hash, and the eval config. Even minor harness version drift can shift MMLU scores by 1-2 points.
Dataset PII — judge calls send the full content to the judge provider. Under UK GDPR, eval datasets containing customer data are subject to the same processing-record obligations as production data.
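
Two of the limits above reduce to simple arithmetic: the size of a comparison matrix and how long it takes under a judge rate limit. The sketch below reproduces the 4x4x4x500x3 figure and estimates a lower bound on wall-clock time at the 50 RPS top of the standard-tier range. The helper names are illustrative, and the estimate ignores retries and per-call latency.

```python
from math import prod


def matrix_calls(prompts: int, models: int, providers: int,
                 tests: int, assertions: int) -> int:
    """Size of the full cartesian product."""
    return prod([prompts, models, providers, tests, assertions])


def min_wall_clock_minutes(calls: int, rps_cap: float) -> float:
    """Lower bound on duration when the provider rate limit is the bottleneck."""
    return calls / rps_cap / 60


full = matrix_calls(4, 4, 4, 500, 3)
scoped = matrix_calls(4, 4, 4, 50, 3)  # same matrix, first 50 tests only
print(full, scoped)
print(min_wall_clock_minutes(full, rps_cap=50))
```

Cutting the test axis from 500 to 50 brings the matrix from 96,000 to 9,600 calls, which is why scoping the test set is usually the first lever to pull.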

Warning: Calibrate your judge before trusting it in CI. Hand-grade 50-200 examples, run the judge over the same set, compute Cohen's kappa or simple correlation. If agreement is below 0.7, iterate on the rubric or change the judge model. Recalibrate every quarter and every time you change the rubric or the judge.
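
The calibration check in the warning above can be computed without extra dependencies. This is a minimal sketch of Cohen's kappa for two raters over categorical labels; the ten pass/fail labels are made-up illustration data, and the 0.7 bar is the one stated above.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same examples."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical hand grades versus judge verdicts on the same ten examples.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3), "recalibrate" if kappa < 0.7 else "ok")
```

Here raw agreement is 80 percent, yet kappa is about 0.58 because half of that agreement is expected by chance. That gap is the reason to prefer kappa over a plain agreement rate when the label mix is skewed.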

Observability

Evaluation observability has two flavours: traces of the evaluation runs themselves (what the judge said about each example, why), and aggregate dashboards across runs (regression vs the last release tag). The hosted platforms — LangSmith, Braintrust, Phoenix, Langfuse — provide both. The open-source RAGAS and DeepEval CLIs emit structured JSON suitable for piping into any backend.

For Yobitel customers, the recommended pattern is to keep eval traces in the same observability tier as production traces. If LangChain is the framework, LangSmith is the natural choice (both production and eval runs land in the same project). If LlamaIndex is the framework, Phoenix is the natural choice (both flow through the same OpenInference instrumentation). Yobibyte's gateway emits OTel GenAI spans for the production side, so a customer's eval runs and their Yobibyte-served production runs can be reconciled by trace_id in a single Datadog / Honeycomb / Tempo backend.

Per-eval-run attributes — dataset id and version, model under test, judge model, framework version, threshold, pass/fail count, per-metric distribution.
Per-example attributes — input, expected, actual, retrieved context (for RAG), judge score per metric, judge reasoning text.
Dashboards worth standing up — pass rate over time per metric, judge spend over time, mean latency per provider, regression alerts on threshold breach.
Promote traces to eval datasets — the highest-leverage practice. Sample production runs (errors, low confidence, user thumbs-down, random N), annotate the desired output, append to the next eval cycle's dataset.
InferenceBench feeds — Yobitel publishes InferenceBench provider scores; consume them as an additional metric alongside your own application evals when picking a provider.
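
The trace-promotion practice in the list above (errors, low confidence, thumbs-down, plus a random N) can be sketched as a selection function. The field names `trace_id`, `error`, `thumbs_down` and `confidence`, and the 0.5 confidence cut-off, are assumptions for illustration; map them to whatever your trace store actually exports.

```python
import random
from typing import Dict, List


def select_for_promotion(traces: List[Dict], random_n: int = 20,
                         seed: int = 0) -> List[Dict]:
    """Pick every flagged trace plus a seeded random sample of the rest."""
    flagged = [
        t for t in traces
        if t.get("error") or t.get("thumbs_down") or t.get("confidence", 1.0) < 0.5
    ]
    flagged_ids = {t["trace_id"] for t in flagged}
    rest = [t for t in traces if t["trace_id"] not in flagged_ids]
    rng = random.Random(seed)  # seeded so the selection is reproducible
    sampled = rng.sample(rest, min(random_n, len(rest)))
    return flagged + sampled


# Hypothetical production traces: three flagged, the rest healthy.
traces = [{"trace_id": i} for i in range(100)]
traces[3]["error"] = True
traces[7]["thumbs_down"] = True
traces[11]["confidence"] = 0.2

selected = select_for_promotion(traces, random_n=10)
print(len(selected))
```

The selected traces still need a human to annotate the desired output before they join the next dataset version.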

Cost and FinOps

Open-source eval tooling is free; the cost lives in judge calls, hosted-platform plans, and the compute to run capability harnesses. The dominant variable is judge cost: scaling the per-1K figures in the table below, a 10K-example RAGAS run with a frontier judge can cost $50-250 in judge tokens alone; the same run with a cheap judge (Haiku, GPT-4o-mini class) costs $5-25 and, with calibration, achieves comparable quality on most metrics.

The cheapest path is: small-dataset CI regression (200-1,000 examples) with a calibrated cheap-tier judge, medium-dataset nightly full-eval (5-10K) with a frontier judge at the batched-API discount, and quarterly recalibration of the cheap judge against the strong judge on a 500-example reference set.

Cost component	Typical USD range	Driver
Framework licences	$0	RAGAS / DeepEval / Promptfoo / Phoenix / Langfuse all open-source.
Judge calls (frontier, 1K ex RAGAS)	$5-25 per run	Anthropic / OpenAI list price; faithfulness decomposes claims (~3-5 judge calls / ex).
Judge calls (cheap-tier, 1K ex)	$0.50-2.50 per run	Haiku / GPT-4o-mini class; calibrate quarterly.
Judge calls (batched API)	~50% of synchronous	Anthropic Batches, OpenAI Batches; nightly full-eval shape.
LangSmith Plus	$39/user/month + per-run	Free tier 5K runs/month; paid tiers scale.
Braintrust paid tier	Custom commercial	Typically $500-5,000/month for production teams.
Phoenix self-host	$50-300/month	Container + Postgres + S3 trace store.
Capability harness compute (MMLU on H100)	$0.50-3 per full run	On-demand H100 pricing; ~1 hour wall-clock.
InferenceBench data feed	$0	Yobitel publishes openly; no API surcharge.
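
The judge-call rows in the table can be reproduced with a small estimator. The calls-per-example figure and the 50 percent batch discount come from this entry; the token counts per call and the input price are assumptions for illustration, so substitute your provider's list prices and your measured token usage.

```python
def judge_cost_usd(
    examples: int,
    calls_per_example: float,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    batch_discount: float = 0.0,
) -> float:
    """Estimated judge spend for one eval run, in USD."""
    calls = examples * calls_per_example
    input_cost = calls * input_tokens_per_call / 1_000_000 * usd_per_m_input
    output_cost = calls * output_tokens_per_call / 1_000_000 * usd_per_m_output
    return (input_cost + output_cost) * (1.0 - batch_discount)


# Assumed shape: 1K examples, 4 judge calls each, 1,000 input and 200 output
# tokens per call, $3 / $15 per 1M input / output tokens (illustrative prices).
sync_cost = judge_cost_usd(1000, 4, 1000, 200, 3.0, 15.0)
batched_cost = judge_cost_usd(1000, 4, 1000, 200, 3.0, 15.0, batch_discount=0.5)
print(round(sync_cost, 2), round(batched_cost, 2))
```

Under these assumptions the synchronous run comes to about $24 and the batched run to about $12. Input tokens matter as much as output tokens here, because every judge call carries the retrieved context.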

Tip: Eval spend of around 5 percent of inference spend is well-calibrated. Below 1 percent, you are probably missing regressions you should catch; above 15 percent, your judge model is over-specified for your CI rhythm. The Yobibyte built-in eval harness lands in the 3-7 percent range for most customers.
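
The spend bands in the tip above are easy to turn into a monthly check. A minimal sketch, using the 1 percent and 15 percent bounds as stated; the function name and the returned labels are illustrative.

```python
def classify_eval_spend(eval_usd: float, inference_usd: float) -> str:
    """Place eval spend, as a share of inference spend, into the bands above."""
    if inference_usd <= 0:
        raise ValueError("inference spend must be positive")
    ratio = eval_usd / inference_usd
    if ratio < 0.01:
        return "under-invested"
    if ratio > 0.15:
        return "over-specified"
    return "within band"


# Hypothetical monthly figures in USD.
for eval_usd in (50, 500, 2000):
    print(eval_usd, classify_eval_spend(eval_usd, inference_usd=10_000))
```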

Security and compliance

Evaluation pipelines touch the same data as production: golden datasets contain example user inputs, retrieved contexts, and expected outputs. For regulated workloads they inherit the same handling obligations as production traffic. The checklist below is the production-security stance consistent with Yobibyte's alignment to the NCSC Cloud Security Principles and with UK GDPR Article 32 evidence requirements.

Judge data path — calling Anthropic / OpenAI as a judge ships dataset content to the judge provider. Under UK GDPR this is processing by a third-party processor; ensure the contract covers it.
Sovereign workloads — for UK Sovereign customers, the judge endpoint should be a UK-resident frontier model (Anthropic UK, OpenAI EU region) or a Yobibyte UK Sovereign tenant serving a frontier-class model. Routing eval data outbound to a US endpoint defeats the sovereign posture of the production stack.
Dataset access control — eval datasets often contain PII; treat the LangSmith / Phoenix / Braintrust dataset store as a production data store with the same RBAC and audit obligations.
Judge calibration audit — keep a record of the calibration set, the judge model, the calibration date, and the agreement metric. This is the evidence that defends an LLM-as-judge metric to a sceptical auditor.
Synthetic data — synthesised eval examples may be safer for sovereign processing, but quality is lower; treat them as the coverage floor, not the ceiling.
Reproducibility — pin framework version, judge model version, dataset version, and prompt version for every run. Article 30 records of processing benefit from the same discipline.
Yobibyte built-in evals — Yobitel-operated; the judge call path stays inside the Yobibyte regulatory boundary unless customers opt to bring their own external judge endpoint.

Migration and alternatives

Most production teams compose several tools; pure-substitution migrations are rare. The list below covers the common migration paths, and the table after it compares the main alternative shapes Yobitel customers weigh.

From hand-rolled judge scripts to DeepEval — wrap existing graders as BaseMetric subclasses; reuse golden examples as LLMTestCase objects; CI integration is deepeval test run.
From DeepEval to RAGAS for the RAG slice — keep DeepEval for general regression; add RAGAS for the faithfulness and context metrics specifically. The two compose cleanly.
From LangSmith to Phoenix self-host — auto-instrumentation translates; the substantive piece is operating Postgres and S3 for the trace store.
From self-hosted evals to Yobibyte built-in — keep your datasets; Yobibyte console runs them on the Yobibyte-served model + judge model combination, returns the same metric shape.
From a single judge model to ensemble judges — average scores across two judges from different families; reduces self-enhancement and verbosity bias.
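
The ensemble-judge path in the last item can be sketched as averaging per-judge scores and reporting how far the judges disagree. The two judges below are hypothetical callables returning fixed scores; in practice each would wrap a model from a different family.

```python
from statistics import mean
from typing import Callable, Dict

Judge = Callable[[str], float]


def ensemble_score(output: str, judges: Dict[str, Judge]) -> Dict[str, object]:
    """Average the judges' scores and report the spread between them."""
    per_judge = {name: judge(output) for name, judge in judges.items()}
    values = list(per_judge.values())
    return {
        "per_judge": per_judge,
        "mean": mean(values),
        "spread": max(values) - min(values),
    }


# Hypothetical judges standing in for two different model families.
judges = {
    "family_a": lambda output: 1.0,
    "family_b": lambda output: 0.5,
}
result = ensemble_score("The order ORD-123 ships Monday.", judges)
print(result["mean"], result["spread"])
```

A large spread is worth logging alongside the mean: examples where the judges disagree are good candidates for hand grading in the next calibration round.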

Approach	Strengths	Weaknesses	When to pick
RAGAS only	RAG-focused; principled metric set; reference-free where possible.	RAG only; light on agent / multi-turn metrics.	When the application is exclusively RAG.
DeepEval only	General pytest-style; 20+ metrics; CI-native.	Less RAG-deep than RAGAS; smaller community for RAG-specific rubrics.	When the application is general-purpose.
Promptfoo only	Matrix comparator; CLI + web UI; YAML-driven.	Less production-trace integration.	When the question is prompt or provider A/B.
LangSmith	Hosted; LangChain-native; trace → dataset → eval pipeline.	Commercial; LangChain-shaped abstractions.	When the stack is already LangChain / LangGraph.
Phoenix self-host	Open-source; deep LlamaIndex / LangChain auto-instrumentation.	Self-host operational overhead.	When sovereign or air-gapped, or LlamaIndex-heavy.
Yobibyte built-in evals	Managed; wraps RAGAS + DeepEval; console UI.	Tied to Yobibyte; less flexibility than self-hosted tooling.	When team capacity to operate eval tooling is limited.
Hand-rolled with provider SDK	Total control; no framework overhead.	Reinvent dataset versioning, parallelism, judge calibration.	When the eval shape is very domain-specific.

Troubleshooting

The failure modes below are the ones eval pipeline operators hit repeatedly when running RAGAS, DeepEval, Promptfoo, LangSmith and Phoenix against managed endpoints — including Yobibyte, Yobitel NeoCloud vLLM, and third-party providers.

Symptom	Likely cause	Fix
Judge scores wildly inconsistent run-to-run	Judge temperature too high; or judge model itself drifts.	Set judge temperature=0; pin judge model version; recalibrate.
RAGAS faithfulness 0 on perfectly fine answers	Judge cannot decompose multi-sentence claims; or context contains no overlapping vocabulary.	Use a stronger judge; check that retrieved context actually contains the cited content.
DeepEval test_run hangs	Synchronous judge calls; concurrency=1.	Enable async with `run_async=True`; set concurrency 8-32.
Promptfoo cartesian product explodes	Too many providers x prompts x tests.	Scope with `--filter-first-n N` or `--filter-providers`; split into per-axis runs.
CI judge cost exploding	Frontier judge on every PR; large dataset.	Switch CI to cheap-tier judge calibrated quarterly; reserve frontier judge for nightly.
LangSmith dataset ingestion slow	Large per-trace payloads; serial upload.	Use the bulk dataset API; pre-truncate inputs / outputs above a sensible size.
MMLU score 2-3 points off published number	Harness version drift; prompt template difference.	Pin LM-Eval-Harness version; use the canonical prompt template from the model card.
Phoenix UI empty despite instrumented app	OpenInference handler not registered; or wrong project.	Verify `LlamaIndexInstrumentor().instrument()` ran before any traced call.
Inspect AI eval fails to record judgements	Dataset definition missing target fields.	Match the Sample / Task definition to Inspect AI's typed schema; consult the upstream tutorial.
Eval scores good in dev, bad in prod	Dataset not representative; production has long-tail content.	Add production-traffic-sampled examples to the dataset; tag by failure mode.
G-Eval rubric ignored	Rubric too vague ("is it good?").	Rewrite as numbered criteria with examples of pass / fail; recalibrate against human scores.

Note: When CI eval scores drop, check in this order: (1) was the dataset version pinned, (2) was the judge model version pinned, (3) was the framework version pinned. Nine times out of ten the regression is silent drift in one of these, not a change in the system under test.
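
The ordered check in the note above is straightforward to automate if each run records its pins. A minimal sketch that compares two run manifests and reports the first pin that drifted; the manifest keys and the version strings are hypothetical.

```python
from typing import Dict, Optional

# Checked in the same order as the note: dataset, then judge, then framework.
PIN_ORDER = ("dataset_version", "judge_model", "framework_version")


def first_drift(baseline: Dict[str, str], current: Dict[str, str]) -> Optional[str]:
    """Return the first pinned field that differs between two runs, if any."""
    for key in PIN_ORDER:
        if baseline.get(key) != current.get(key):
            return key
    return None


# Hypothetical manifests recorded alongside two eval runs.
baseline = {"dataset_version": "v12", "judge_model": "judge-2026-01",
            "framework_version": "2.0.1"}
current = {"dataset_version": "v12", "judge_model": "judge-2026-03",
           "framework_version": "2.1.0"}

drifted = first_drift(baseline, current)
print(drifted or "no drift: investigate the system under test")
```

In this example both the judge and the framework changed, and the function reports the judge first because it sits earlier in the order.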

Where this fits in the Yobitel stack

Yobibyte ships a built-in eval harness that wraps RAGAS for RAG metrics and DeepEval for general regression — Yobitel-operated, exposed through the workspace console, no customer-side eval infrastructure to stand up. Customers who already run LangSmith or Phoenix at scale pipe their traces from Yobibyte endpoints directly into their existing eval loop with no Yobitel-specific instrumentation; the OpenAI-compatible surface and the OTel GenAI spans Yobibyte emits at the gateway are the integration.

Yobitel's InferenceBench is the public companion: an open, free, continuously refreshed leaderboard of serving and price-quality numbers across the providers Yobitel customers consume (Yobibyte, Yobitel NeoCloud, and the third-party providers most teams compare against). Treat InferenceBench as the model-selection signal at the top of the funnel and Yobibyte's built-in evals (or your own LangSmith / Phoenix loop) as the continuous-quality signal below it. For UK Sovereign customers, keep the judge endpoint inside the regulatory boundary, pin the dataset, framework, judge and prompt versions for every run, and let Yobitel Professional Services help wire the reference stack, following the version-pinning and eval-loop discipline this entry documents.

References

LM Evaluation Harness · GitHub (EleutherAI)
Stanford HELM · Stanford CRFM
UK AISI Inspect AI · UK AI Safety Institute
RAGAS Documentation · Exploding Gradients
DeepEval Documentation · Confident AI
Promptfoo Documentation · Promptfoo
LangSmith Documentation · LangChain
Phoenix Documentation (Arize) · Arize
Chatbot Arena (LMSYS) · LMSYS

Quick start

# 0. Common setup — point evals at the same endpoint the app uses
export LLM_BASE_URL=https://api.yobibyte.example/v1
export LLM_API_KEY=sk-yb-...
export JUDGE_BASE_URL=https://api.anthropic.com/v1   # separate judge family; OpenAI-compatible endpoint
export JUDGE_API_KEY=sk-ant-...

# 1. RAGAS — RAG faithfulness, context precision/recall, answer relevancy
pip install "ragas>=0.2" "langchain-openai>=0.2"
cat > rag_eval.py <<'PY'
import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Replace each [...] with your own rows; all columns must have the same length.
ds = Dataset.from_dict({
    "question": [...], "answer": [...], "contexts": [[...], ...],
    "ground_truth": [...],
})
report = evaluate(
    ds,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    # Judge runs on its own endpoint and key, separate from the system under test.
    llm=ChatOpenAI(model="claude-opus-4-7", base_url=os.environ["JUDGE_BASE_URL"],
                   api_key=os.environ["JUDGE_API_KEY"]),
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large",
                                base_url=os.environ["LLM_BASE_URL"],
                                api_key=os.environ["LLM_API_KEY"]),
)
print(report.to_pandas())
PY

# 2. DeepEval — pytest-style general regression
pip install "deepeval>=2.0"
cat > test_agent.py <<'PY'
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
import pytest

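# load_golden_cases() and my_agent() are your own helpers: the first returns dicts
# with "input", "context" and "expected" keys, the second calls the application.
@pytest.mark.parametrize("case", load_golden_cases())
def test_agent_response(case):
    actual = my_agent(case["input"], retrieved=case["context"])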
    tc = LLMTestCase(input=case["input"], actual_output=actual,
                     retrieval_context=case["context"],
                     expected_output=case["expected"])
    assert_test(tc, [
        AnswerRelevancyMetric(threshold=0.8),
        FaithfulnessMetric(threshold=0.9),
        GEval(name="Tone", criteria="Polite, concise, cites sources.",
              evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]),
    ])
PY
deepeval test run test_agent.py

# 3. Promptfoo — matrix comparison across prompts / models / providers
npm i -g promptfoo
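# Unquoted heredoc so the shell expands ${JUDGE_BASE_URL} / ${LLM_BASE_URL} at write time
cat > promptfooconfig.yaml <<YML
providers:
  - id: openai:chat:claude-opus-4-7   # judge / strong baseline
    config: { apiBaseUrl: "${JUDGE_BASE_URL}", apiKeyEnvar: JUDGE_API_KEY }
  - id: openai:chat:llama-3.1-70b-instruct  # Yobibyte-served
    config: { apiBaseUrl: "${LLM_BASE_URL}", apiKeyEnvar: LLM_API_KEY }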
prompts: [ file://prompts/v1.txt, file://prompts/v2.txt ]
tests: [ file://datasets/golden.csv ]
defaultTest:
  assert:
    - type: contains
      value: "ORD-"
    - type: llm-rubric
      value: "Answer is grounded in the provided context."
YML
promptfoo eval && promptfoo view

# 4. Phoenix — open-source trace + eval UI; runs locally or self-hosted
pip install "arize-phoenix>=5"
python - <<'PY'
import time
import phoenix as px

session = px.launch_app()
print(session.url)  # Phoenix UI, http://localhost:6006 by default
# Traces appear once the application registers an OpenInference instrumentor
# (LangChain / LlamaIndex). The UI stops when this process exits, so keep it alive.
while True:
    time.sleep(60)
PY

Tip: Use a different model family for the judge than for the system under test. LLM-as-judge is biased toward outputs from its own family (self-enhancement bias). For a Yobibyte-served Llama-70B, judge with Claude or GPT-4 class on a separate JUDGE_BASE_URL — exactly the pattern shown above.

How it works

Capability benchmarks — MMLU, HumanEval, GSM8K, MATH, MMMU, SWE-bench, AIME, LiveBench, BIG-Bench Hard.
Open harnesses — EleutherAI LM-Eval-Harness (de-facto), Stanford HELM (holistic), OpenCompass (CN/EN), UK AISI Inspect AI (safety + control), BIG-Bench (Google + community).
Preference leaderboards — Chatbot Arena (LMSYS) for pairwise human preference; Artificial Analysis and InferenceBench for serving-side latency / throughput / price-quality.
Application metric libraries — RAGAS (RAG), DeepEval (general pytest-style), Promptfoo (matrix), Inspect AI (safety), TruLens (RAG + agent observability).
Hosted observability + eval — LangSmith (LangChain-native), Braintrust (experiment comparison), Phoenix / Arize (open-source self-host), Helicone (proxy-based), Langfuse (open-source self-host).
Judge models — recommended: a strong frontier model from a different family than the system under test. Claude Opus or GPT-4 class for a Llama-served Yobibyte tenant.
Dataset sources — hand-curated golden examples (foundation), production traffic samples (bulk), adversarial cases from incidents (edge), synthetic generation (coverage floor only).

Note: Capability benchmarks tell you which model. Application evals tell you whether the model plus your prompt plus your context plus your tools does the job. Production needs both — and InferenceBench publishes the capability + serving signal continuously for the model families Yobitel customers reach through Yobibyte.

Reference and specifications

Tool	Layer	Licence	Headline shape
EleutherAI LM-Eval-Harness	Capability harness	MIT	100+ academic benchmarks, model-agnostic; the source of most release-announcement numbers.
Stanford HELM	Capability harness (holistic)	Apache 2.0	Accuracy + calibration + robustness + fairness + bias on a fixed task suite.
OpenCompass	Capability harness	Apache 2.0	Strong on Chinese + English reasoning; 100+ datasets.
UK AISI Inspect AI	Capability + safety harness	MIT	Programmatic, auditable; UK AISI's own evaluation tool for frontier safety.
BIG-Bench / Hard	Capability suite	Apache 2.0	200+ hard-reasoning tasks; widely cited.
Chatbot Arena (LMSYS)	Preference leaderboard	Apache 2.0	Pairwise human preference; the most-cited public arena.
Artificial Analysis	Serving leaderboard	Commercial	Latency + throughput + price-quality across providers.
InferenceBench (Yobitel)	Serving + price-quality leaderboard	Open + free	Independent serving and price-performance for the providers Yobitel customers consume.
RAGAS	Application metric library	Apache 2.0	RAG faithfulness, answer relevancy, context precision/recall, noise sensitivity.
DeepEval	Application regression	Apache 2.0	pytest-style; G-Eval rubrics + 20+ pre-built metrics.
Promptfoo	Matrix comparator	MIT	YAML-driven prompts x models x providers x tests; CLI + web UI.
TruLens	RAG + agent observability	MIT	Feedback functions over traces; LangChain / LlamaIndex auto-integration.
LangSmith	Hosted observability + eval	Commercial (free tier)	LangChain-native traces → datasets → evals; prompt hub.
Braintrust	Hosted observability + eval	Commercial	Experiment comparison, side-by-side UI.
Phoenix (Arize)	OSS observability + eval	ELv2 / Apache 2.0	Self-hostable trace + eval UI; deep LlamaIndex / LangChain auto-instrumentation.
Langfuse	OSS observability + eval	MIT	Self-hostable; GDPR-friendly default.
Helicone	Proxy-based observability + eval	Apache 2.0	One HTTP header to enable; very low friction.
Yobibyte built-in evals	Managed regression for customers	Yobitel-operated	Wraps RAGAS and DeepEval; exposed through workspace console.

Warning: Public capability benchmarks suffer from contamination drift — many are present in modern pre-training corpora, so high scores partly measure memorisation. Cross-reference with held-out community benchmarks (LiveBench refreshes monthly) and with your own application evals before treating a leaderboard number as ground truth.

Workload patterns

# Pattern A — DeepEval in GitHub Actions
name: llm-regression
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install "deepeval>=2.0" "ragas>=0.2"
      - env:
          LLM_BASE_URL: ${{ secrets.YOBIBYTE_URL }}
          LLM_API_KEY:  ${{ secrets.YOBIBYTE_KEY }}
          JUDGE_API_KEY: ${{ secrets.ANTHROPIC_KEY }}
        run: deepeval test run tests/eval/   # pass/fail thresholds are set per metric in the test code

# Pattern C — Promptfoo provider bake-off
description: Yobibyte vs Anthropic vs OpenAI for our support agent
providers:
  - id: openai:chat:llama-3.1-70b-instruct
    label: yobibyte-llama70b
    config: { apiBaseUrl: https://api.yobibyte.example/v1 }
  - id: anthropic:messages:claude-opus-4-7
    label: anthropic-opus
  - id: openai:chat:gpt-4.1
    label: openai-gpt41
prompts: [ file://prompts/system_v3.txt ]
tests: [ file://datasets/support_golden_200.csv ]
defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-opus-4-7
      value: "Cites the order id, includes the SLA, polite tone."
    - type: latency
      threshold: 4000

Tip: Read InferenceBench's published numbers alongside your application evals when picking a provider. Capability and latency on InferenceBench tell you the ceiling and the speed; your application evals tell you whether your prompt and context, running on that provider, hit your quality bar. They answer different questions.

Sizing and capacity planning

Dimension	Typical value	Notes
RAGAS per-example judge latency (frontier judge)	1-4 s	Faithfulness decomposes claims; multiple judge calls per example.
DeepEval AnswerRelevancy / Faithfulness	1-3 s	Per-metric; G-Eval rubric is slower (2-6 s).
Promptfoo per-test	Latency of the model under test + assert overhead	Dominated by the SUT; assertions are cheap.
LM-Eval-Harness MMLU full run	30-90 min	On a single H100; varies with model size and batch.
HELM full task suite	10-40 hours	Full holistic run; usually scoped down.
CI regression (200 ex, parallel 32)	2-5 min	Frontier judge; fits a sensible PR-gate budget.
Nightly full-eval (10K ex, batched)	1-4 hours wall-clock	Anthropic / OpenAI batch API; lowest cost path.
Phoenix UI load on self-host	1-2 vCPU, 2 GB RAM	Per replica; idle most of the time.
LangSmith ingestion rate	Plan-dependent	Free tier caps at 5K runs/month; paid tiers scale to millions.
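
The CI and nightly rows above follow from simple arithmetic: judge calls run in waves bounded by the concurrency limit. The sketch below is a rough estimator under that assumption; the function name and the figure of about 12 judge calls per example (four metrics at roughly three calls each) are illustrative assumptions, not measured values.

```python
import math


def eval_wall_clock_seconds(examples, judge_calls_per_example,
                            seconds_per_call, concurrency):
    """Rough wall-clock estimate: judge calls execute in waves of `concurrency`."""
    total_calls = examples * judge_calls_per_example
    waves = math.ceil(total_calls / concurrency)
    return waves * seconds_per_call


# Assumed: 200 examples, ~12 judge calls each, 2.5 s per call, 32 calls in parallel.
ci_seconds = eval_wall_clock_seconds(200, 12, 2.5, 32)
print(f"CI gate: ~{ci_seconds / 60:.1f} min")
```

The estimate ignores retries and rate-limit backoff, so treat it as a lower bound when sizing a PR gate.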

Limits and quotas

Judge provider rate limit — Anthropic and OpenAI frontier judges typically allow 5-50 RPS at standard tiers, 100+ on enterprise. RAGAS / DeepEval honour 429 Retry-After; tune the framework's concurrency to stay under the cap.
Judge token cost — frontier judges at ~$15-75 per 1M output tokens dominate eval cost; for high-frequency CI use a cheaper judge (Claude Haiku, GPT-4o-mini class) calibrated against the strong judge.
Dataset versioning — LangSmith, Braintrust and Phoenix all store dataset versions; the limit is plan-tier, not framework. Pin dataset versions per release tag to make regressions reproducible.
Trace ingestion rate — LangSmith Plus and Braintrust paid tiers cap per-minute trace ingestion. High-RPS production traffic should sample (e.g. 1-5 percent) rather than ingest every run.
Promptfoo matrix size — the cartesian product of prompts x models x providers x tests x assertions blows up fast; a 4x4x4x500x3 matrix is 96K calls. Use the CLI filters (`--filter-first-n`, `--filter-pattern`, `--filter-providers`) to scope runs intelligently.
Inspect AI evaluation depth — designed for safety-critical eval, so logs include full prompts and completions. Plan storage accordingly.
Capability harness reproducibility — pin the harness version, the model checkpoint hash, and the eval config. Even minor harness version drift can shift MMLU scores by 1-2 points.
Dataset PII — judge calls send the full content to the judge provider. Under UK GDPR, eval datasets containing customer data are subject to the same processing-record obligations as production data.
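
The matrix-size figure in the Promptfoo item can be checked before launching a run. This is a small sketch with a hypothetical helper, `matrix_calls`; it counts every assertion as one call, as the item above does, although deterministic assertions such as `contains` do not call a model.

```python
from math import prod


def matrix_calls(**axes):
    """Number of calls in a full cartesian product of the given axis sizes."""
    return prod(axes.values())


full = matrix_calls(prompts=4, models=4, providers=4, tests=500, assertions=3)
scoped = matrix_calls(prompts=4, models=4, providers=4, tests=50, assertions=3)
print(full, scoped)
```

Cutting the test axis from 500 to 50 rows reduces the run by the same factor of ten, which is usually the cheapest axis to scope for a quick comparison.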

Warning: Calibrate your judge before trusting it in CI. Hand-grade 50-200 examples, run the judge over the same set, compute Cohen's kappa or simple correlation. If agreement is below 0.7, iterate on the rubric or change the judge model. Recalibrate every quarter and every time you change the rubric or the judge.
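
As an illustration of the calibration check in the warning, Cohen's kappa can be computed from two label lists with the standard library alone. This is a minimal sketch; the pass/fail labels are made-up sample data, and the 0.7 cut-off is the one stated in the warning.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of examples where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up sample: 10 hand-graded examples and the judge's labels for the same set.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa={kappa:.2f}", "OK for CI" if kappa >= 0.7 else "iterate on rubric or judge")
```

Raw agreement here is 80 percent, yet kappa comes out well below 0.7 because much of that agreement is expected by chance; this is why kappa is a stricter gate than a plain match rate.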

Observability

Per-eval-run attributes — dataset id and version, model under test, judge model, framework version, threshold, pass/fail count, per-metric distribution.
Per-example attributes — input, expected, actual, retrieved context (for RAG), judge score per metric, judge reasoning text.
Dashboards worth standing up — pass rate over time per metric, judge spend over time, mean latency per provider, regression alerts on threshold breach.
Promote traces to eval datasets — the highest-leverage practice. Sample production runs (errors, low confidence, user thumbs-down, random N), annotate the desired output, append to the next eval cycle's dataset.
InferenceBench feeds — Yobitel publishes InferenceBench provider scores; consume them as an additional metric alongside your own application evals when picking a provider.
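
The trace-promotion practice above can be sketched as a sampler that keeps every flagged run and adds a random slice of the rest. The trace field names (`error`, `thumbs_down`, `confidence`), the 0.5 confidence cut-off and the function name are assumptions for illustration; map them to whatever your tracing backend exports.

```python
import random


def sample_for_annotation(traces, random_n=2, seed=0):
    """Pick traces to hand-annotate: all flagged ones plus a random slice of the rest."""
    flagged = [t for t in traces
               if t.get("error") or t.get("thumbs_down")
               or t.get("confidence", 1.0) < 0.5]
    flagged_ids = {t["id"] for t in flagged}
    rest = [t for t in traces if t["id"] not in flagged_ids]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return flagged + rng.sample(rest, min(random_n, len(rest)))


traces = [
    {"id": 1, "error": True},
    {"id": 2, "thumbs_down": True},
    {"id": 3, "confidence": 0.3},
    {"id": 4}, {"id": 5}, {"id": 6},
]
picked = sample_for_annotation(traces)
print([t["id"] for t in picked])
```

The random slice matters as much as the flagged runs: it keeps the dataset representative of ordinary traffic rather than only of known failures.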

Cost and FinOps

Cost component	Typical USD range	Driver
Framework licences	$0	RAGAS / DeepEval / Promptfoo / Phoenix / Langfuse all open-source.
Judge calls (frontier, 1K ex RAGAS)	$5-25 per run	Anthropic / OpenAI list price; faithfulness decomposes claims (~3-5 judge calls / ex).
Judge calls (cheap-tier, 1K ex)	$0.50-2.50 per run	Haiku / GPT-4o-mini class; calibrate quarterly.
Judge calls (batched API)	~50% of synchronous	Anthropic Batches, OpenAI Batches; nightly full-eval shape.
LangSmith Plus	$39/user/month + per-run	Free tier 5K runs/month; paid tiers scale.
Braintrust paid tier	Custom commercial	Typically $500-5,000/month for production teams.
Phoenix self-host	$50-300/month	Container + Postgres + S3 trace store.
Capability harness compute (MMLU on H100)	$0.50-3 per full run	On-demand H100 pricing; ~1 hour wall-clock.
InferenceBench data feed	$0	Yobitel publishes openly; no API surcharge.

Tip: Eval spend of around 5 percent of inference spend is well-calibrated. Below 1 percent, you are probably not catching regressions you should; above 15 percent, your judge model is probably over-specified for your CI rhythm. The Yobibyte built-in eval harness lands in the 3-7 percent of inference spend range for most customers.
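
A back-of-envelope version of the judge-cost rows and the spend bands in the tip might look like the sketch below. The token counts, prices, run frequency and inference spend are illustrative assumptions, not quoted list prices, and both function names are hypothetical.

```python
def judge_cost_usd(examples, calls_per_example, in_tokens, out_tokens,
                   usd_per_m_in, usd_per_m_out):
    """Judge spend for one eval run from per-call token counts and per-1M-token prices."""
    calls = examples * calls_per_example
    return calls * (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1_000_000


def spend_band(eval_usd, inference_usd):
    """Classify eval spend as a share of inference spend using the 1% and 15% bounds."""
    share = eval_usd / inference_usd
    if share < 0.01:
        return "under-investing"
    if share > 0.15:
        return "over-specified judge"
    return "in range"


# Assumed: 1K examples, 4 judge calls each, 800 input / 150 output tokens per call,
# $3 / $15 per 1M input / output tokens.
run_cost = judge_cost_usd(1000, 4, 800, 150, 3.0, 15.0)

# Assumed: 30 runs per month against $20,000 of monthly inference spend.
print(f"${run_cost:.2f} per run;", spend_band(run_cost * 30, 20_000))
```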

Security and compliance

Judge data path — calling Anthropic / OpenAI as a judge ships dataset content to the judge provider. Under UK GDPR this is processing by a third-party processor; ensure the contract covers it.
Sovereign workloads — for UK Sovereign customers, the judge endpoint should be a UK-resident frontier model (Anthropic UK, OpenAI EU region) or a Yobibyte UK Sovereign tenant serving a frontier-class model. Routing eval data outbound to a US endpoint defeats the sovereign posture of the production stack.
Dataset access control — eval datasets often contain PII; treat the LangSmith / Phoenix / Braintrust dataset store as a production data store with the same RBAC and audit obligations.
Judge calibration audit — keep a record of the calibration set, the judge model, the calibration date, and the agreement metric. This is the evidence that defends an LLM-as-judge metric to a sceptical auditor.
Synthetic data — synthesised eval examples may be safer for sovereign processing, but quality is lower; treat them as the coverage floor, not the ceiling.
Reproducibility — pin framework version, judge model version, dataset version, and prompt version for every run. Article 30 records of processing benefit from the same discipline.
Yobibyte built-in evals — Yobitel-operated; the judge call path stays inside the Yobibyte regulatory boundary unless customers opt to bring their own external judge endpoint.

Migration and alternatives

Most production teams compose several tools; pure-substitution migrations are rare. The list below covers the common migration paths, and the table after it compares the main alternative shapes Yobitel customers weigh.

From hand-rolled judge scripts to DeepEval — wrap existing graders as BaseMetric subclasses; reuse golden examples as LLMTestCase objects; CI integration is deepeval test run.
From DeepEval to RAGAS for the RAG slice — keep DeepEval for general regression; add RAGAS for the faithfulness and context metrics specifically. The two compose cleanly.
From LangSmith to Phoenix self-host — auto-instrumentation translates; the substantive piece is operating Postgres and S3 for the trace store.
From self-hosted evals to Yobibyte built-in — keep your datasets; Yobibyte console runs them on the Yobibyte-served model + judge model combination, returns the same metric shape.
From a single judge model to ensemble judges — average scores across two judges from different families; reduces self-enhancement and verbosity bias.
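
The ensemble-judge step in the last item can be sketched as a plain average over per-judge scores. The `needs_review` flag and its 0.3 spread cut-off are additions for illustration, not something the item prescribes; the judge names are placeholders.

```python
from statistics import mean


def ensemble_score(scores_by_judge, max_spread=0.3):
    """Average per-judge scores; flag examples where the judges disagree widely."""
    values = list(scores_by_judge.values())
    spread = max(values) - min(values)
    return {"score": mean(values), "needs_review": spread > max_spread}


agree = ensemble_score({"judge_family_a": 0.9, "judge_family_b": 0.8})
split = ensemble_score({"judge_family_a": 0.9, "judge_family_b": 0.4})
print(agree, split)
```

Routing wide-spread examples to a human grader is one way to feed the calibration set, since those are the cases where a single judge would be least trustworthy.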

Approach	Strengths	Weaknesses	When to pick
RAGAS only	RAG-focused; principled metric set; reference-free where possible.	RAG only; light on agent / multi-turn metrics.	When the application is exclusively RAG.
DeepEval only	General pytest-style; 20+ metrics; CI-native.	Less RAG-deep than RAGAS; smaller community for RAG-specific rubrics.	When the application is general-purpose.
Promptfoo only	Matrix comparator; CLI + web UI; YAML-driven.	Less production-trace integration.	When the question is prompt or provider A/B.
LangSmith	Hosted; LangChain-native; trace → dataset → eval pipeline.	Commercial; LangChain-shaped abstractions.	When the stack is already LangChain / LangGraph.
Phoenix self-host	Open-source; deep LlamaIndex / LangChain auto-instrumentation.	Self-host operational overhead.	When sovereign or air-gapped, or LlamaIndex-heavy.
Yobibyte built-in evals	Managed; wraps RAGAS + DeepEval; console UI.	Tied to Yobibyte; less flexibility than self-hosted tooling.	When team capacity to operate eval tooling is limited.
Hand-rolled with provider SDK	Total control; no framework overhead.	Reinvent dataset versioning, parallelism, judge calibration.	When the eval shape is very domain-specific.

Troubleshooting

Symptom	Likely cause	Fix
Judge scores wildly inconsistent run-to-run	Judge temperature too high; or judge model itself drifts.	Set judge temperature=0; pin judge model version; recalibrate.
RAGAS faithfulness 0 on perfectly fine answers	Judge cannot decompose multi-sentence claims; or context contains no overlapping vocabulary.	Use a stronger judge; check that retrieved context actually contains the cited content.
DeepEval test_run hangs	Synchronous judge calls; concurrency=1.	Enable async with `run_async=True`; set concurrency 8-32.
Promptfoo cartesian product explodes	Too many providers x prompts x tests.	Scope with `--filter-first-n N`; split into per-axis runs.
CI judge cost exploding	Frontier judge on every PR; large dataset.	Switch CI to cheap-tier judge calibrated quarterly; reserve frontier judge for nightly.
LangSmith dataset ingestion slow	Large per-trace payloads; serial upload.	Use the bulk dataset API; pre-truncate inputs / outputs above a sensible size.
MMLU score 2-3 points off published number	Harness version drift; prompt template difference.	Pin LM-Eval-Harness version; use the canonical prompt template from the model card.
Phoenix UI empty despite instrumented app	OpenInference handler not registered; or wrong project.	Verify `LlamaIndexInstrumentor().instrument()` ran before any traced call.
Inspect AI eval fails to record judgements	Dataset definition missing target fields.	Match the Sample / Task definition to Inspect AI's typed schema; consult the upstream tutorial.
Eval scores good in dev, bad in prod	Dataset not representative; production has long-tail content.	Add production-traffic-sampled examples to the dataset; tag by failure mode.
G-Eval rubric ignored	Rubric too vague ("is it good?").	Rewrite as numbered criteria with examples of pass / fail; recalibrate against human scores.

Note: When CI eval scores drop, check the order: (1) was the dataset version pinned, (2) was the judge model version pinned, (3) was the framework version pinned. Nine times out of ten the regression is silent drift in one of these, not in the system under test.
