TL;DR
- LLM evaluation in 2026 spans three layers: capability benchmarks (MMLU, HumanEval, GSM8K, SWE-bench, AIME, MMMU) run via open harnesses (EleutherAI LM-Eval-Harness, Stanford HELM, OpenCompass, UK AISI Inspect AI), application metric frameworks (RAGAS for RAG, DeepEval for general regression, Promptfoo for matrix comparison), and hosted observability + eval platforms (LangSmith, Braintrust, Phoenix / Arize).
- Each tool has a clear niche. RAGAS is the principled choice for retrieval-augmented generation. DeepEval is a pytest-style general regression harness. Promptfoo is the matrix comparator for prompt and provider A/B. LangSmith is the LangChain-native traces-into-datasets-into-evals pipeline. Phoenix is the open-source self-hostable equivalent.
- Modern application evaluation relies heavily on LLM-as-judge scoring for open-ended outputs; an uncalibrated judge is the largest source of error, and calibrating it against a small hand-graded set is the practice most teams under-invest in.
- Continuous evaluation — running evals on every code change against a versioned dataset — is the single highest-leverage practice for keeping LLM applications stable; the cost is small, the regressions it catches are routinely large.
- Yobibyte ships a built-in eval harness that wraps RAGAS for RAG faithfulness and context metrics and DeepEval for general regression; customers pipe LangSmith and Phoenix traces from Yobibyte endpoints into their own eval pipelines, and Yobitel's InferenceBench publishes provider-level scoring for the same model families using the same harness shape.
Overview
Treat LLM evaluation as three distinct activities. Capability benchmarks (MMLU, HumanEval, GSM8K, MATH, MMMU, SWE-bench, AIME, LiveBench) measure base-model abilities and inform model selection; they are the input to choosing which model your application calls. Evaluation harnesses (EleutherAI LM-Eval-Harness, Stanford HELM, OpenCompass, UK AISI Inspect AI) run those benchmarks reproducibly across model providers and produce the numbers you see in release announcements. Application evaluation frameworks (RAGAS for RAG, DeepEval for general regression, Promptfoo for matrix comparison, LangSmith / Braintrust / Phoenix for hosted continuous evaluation) measure whether your specific system, on your specific data, meets your specific quality bar.
The common mistake is conflating them. A model that scores 88 percent on MMLU does not necessarily perform well on your customer support task; a system that passes your application evals may still have capability gaps that only benchmarks expose. Production teams need both — benchmarks at model-selection time, application evals continuously thereafter. Yobitel maintains both surfaces: InferenceBench publishes capability and serving benchmarks across NeoCloud providers, and Yobibyte ships built-in application-level evals that customers run against their own corpora.
By mid-2026 the field has stabilised on a small number of widely used tools rather than a sprawling list. This entry maps that landscape, explains where each tool fits, walks through a quick-start composition, gives a sizing model for eval at scale, lists the limits and quotas that bite, covers security and FinOps, compares the main alternatives, and shows where Yobitel's built-in harness slots in. The goal is to help you stand up a continuous LLM evaluation practice (capability benchmarks for model selection, RAGAS for RAG faithfulness, DeepEval for general regression, Promptfoo for prompt-and-provider matrix comparison, and LangSmith or Phoenix for trace-driven loops) wired to the Yobibyte or NeoCloud endpoints you are already serving from.
Quick start
The block below installs the four tools most teams adopt (RAGAS, DeepEval, Promptfoo, and Phoenix) and shows a minimal run for each against a Yobibyte (or any OpenAI-compatible) endpoint. None of these tools requires Yobitel-specific configuration: each one targets the same OpenAI-compatible base URL and API key that the application already uses.
```bash
# 0. Common setup: point evals at the same endpoint the app uses
export LLM_BASE_URL=https://api.yobibyte.example/v1
export LLM_API_KEY=sk-yb-...
# Separate judge family, reached through its OpenAI-compatible /v1 path
export JUDGE_BASE_URL=https://api.anthropic.com/v1
export JUDGE_API_KEY=sk-ant-...

# 1. RAGAS: RAG faithfulness, context precision/recall, answer relevancy
pip install "ragas>=0.2" "langchain-openai>=0.2"
cat > rag_eval.py <<'PY'
import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Fill in your own rows; "contexts" holds the retrieved passages per question.
ds = Dataset.from_dict({
    "question": [...], "answer": [...], "contexts": [[...], ...],
    "ground_truth": [...],
})
report = evaluate(
    ds,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    # The judge runs on a different model family than the system under test.
    llm=ChatOpenAI(model="claude-opus-4-7",
                   base_url=os.environ["JUDGE_BASE_URL"],
                   api_key=os.environ["JUDGE_API_KEY"]),
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large",
                                base_url=os.environ["LLM_BASE_URL"],
                                api_key=os.environ["LLM_API_KEY"]),
)
print(report.to_pandas())
PY
# Run with `python rag_eval.py` once the dataset rows are filled in.

# 2. DeepEval: pytest-style general regression
pip install "deepeval>=2.0"
cat > test_agent.py <<'PY'
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# load_golden_cases() and my_agent() are your application's own helpers,
# not DeepEval APIs; import them from wherever your project defines them.

@pytest.mark.parametrize("case", load_golden_cases())
def test_agent_response(case):
    actual = my_agent(case["input"], retrieved=case["context"])
    tc = LLMTestCase(input=case["input"], actual_output=actual,
                     retrieval_context=case["context"],
                     expected_output=case["expected"])
    assert_test(tc, [
        AnswerRelevancyMetric(threshold=0.8),
        FaithfulnessMetric(threshold=0.9),
        GEval(name="Tone", criteria="Polite, concise, cites sources.",
              evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]),
    ])
PY
deepeval test run test_agent.py

# 3. Promptfoo: matrix comparison across prompts / models / providers
npm i -g promptfoo
# Unquoted heredoc delimiter, so the shell expands the *_BASE_URL variables.
cat > promptfooconfig.yaml <<YML
providers:
  - id: openai:chat:claude-opus-4-7          # judge / strong baseline
    config: { apiBaseUrl: "${JUDGE_BASE_URL}" }
  - id: openai:chat:llama-3.1-70b-instruct   # Yobibyte-served
    config: { apiBaseUrl: "${LLM_BASE_URL}" }
prompts: [ file://prompts/v1.txt, file://prompts/v2.txt ]
tests: [ file://datasets/golden.csv ]
defaultTest:
  assert:
    - type: contains
      value: "ORD-"
    - type: llm-rubric
      value: "Answer is grounded in the provided context."
YML
promptfoo eval && promptfoo view

# 4. Phoenix: open-source trace + eval UI; runs locally or self-hosted
pip install "arize-phoenix>=5"
python - <<'PY'
import phoenix as px
from phoenix.evals import RAG_RELEVANCY_PROMPT_TEMPLATE, llm_classify

px.launch_app()
# Phoenix UI now serves at http://localhost:6006 with the LangChain /
# LlamaIndex auto-instrumentation feeding live traces.
PY
```

Use a different model family for the judge than for the system under test. LLM-as-judge is biased toward outputs from its own family (self-enhancement bias). For a Yobibyte-served Llama-70B, judge with a Claude or GPT-4 class model on a separate `JUDGE_BASE_URL`, which is the pattern shown above.
How it works
Capability harnesses iterate a fixed task suite over a model and report aggregate metrics. EleutherAI's LM-Eval-Harness is the de-facto standard: the numbers in most model release announcements are produced by it. HELM is the gold standard for holistic evaluation across accuracy, calibration, robustness, fairness and bias. OpenCompass is the strongest on Chinese plus English reasoning. UK AISI's Inspect AI is the recommended choice for safety-critical evaluation — designed with programmatic control and audit trails in mind, used by the UK AISI for frontier-model safety testing.
Application metric frameworks operate one layer up: instead of asking 'how capable is this model in general', they ask 'how well does this specific pipeline, on this specific dataset, do at this specific task'. RAGAS focuses on RAG (faithfulness, answer relevancy, context precision and recall, noise sensitivity). DeepEval is a general pytest-style framework with a catalogue of metrics (G-Eval rubrics, hallucination, bias, toxicity, task completion, tool correctness). Promptfoo runs a matrix of prompts x models x providers x test cases and reports comparison tables — the natural choice for prompt A/B and provider bake-off (including a 'this prompt on Yobibyte vs Anthropic vs OpenAI' shape).
Hosted observability-plus-eval platforms close the loop. LangSmith captures every LangChain / LangGraph run, lets you promote production traces into versioned datasets, and runs evals on every commit via CI. Braintrust focuses on experiment comparison with strong side-by-side workflows. Phoenix (Arize) is the open-source self-hostable equivalent, with deep LlamaIndex and LangChain auto-instrumentation. Helicone offers a low-friction proxy approach (one HTTP header to enable). All of them speak the same trace-into-dataset-into-eval pattern; the choice is mostly about deployment shape and where your team already lives.
Yobibyte ships a built-in eval harness that wraps RAGAS for RAG metrics and DeepEval for general regression, exposed through the workspace console without customers having to operate the underlying tooling themselves. Customers who run their own LangSmith / Phoenix instance pipe traces directly from Yobibyte endpoints (no Yobitel-specific instrumentation; the OpenAI-compatible surface is the integration). Yobitel's InferenceBench publishes provider-level scores for the same model families using the same harness shape, giving Yobitel customers a continuous third-party signal on the model + provider mix they consume.
- Capability benchmarks — MMLU, HumanEval, GSM8K, MATH, MMMU, SWE-bench, AIME, LiveBench, BIG-Bench Hard.
- Open harnesses — EleutherAI LM-Eval-Harness (de-facto), Stanford HELM (holistic), OpenCompass (CN/EN), UK AISI Inspect AI (safety + control), BIG-Bench (Google + community).
- Preference leaderboards — Chatbot Arena (LMSYS) for pairwise human preference; Artificial Analysis and InferenceBench for serving-side latency / throughput / price-quality.
- Application metric libraries — RAGAS (RAG), DeepEval (general pytest-style), Promptfoo (matrix), Inspect AI (safety), TruLens (RAG + agent observability).
- Hosted observability + eval — LangSmith (LangChain-native), Braintrust (experiment comparison), Phoenix / Arize (open-source self-host), Helicone (proxy-based), Langfuse (open-source self-host).
- Judge models — recommended: a strong frontier model from a different family than the system under test. Claude Opus or GPT-4 class for a Llama-served Yobibyte tenant.
- Dataset sources — hand-curated golden examples (foundation), production traffic samples (bulk), adversarial cases from incidents (edge), synthetic generation (coverage floor only).
Capability benchmarks tell you which model. Application evals tell you whether the model plus your prompt plus your context plus your tools does the job. Production needs both — and InferenceBench publishes the capability + serving signal continuously for the model families Yobitel customers reach through Yobibyte.
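To make the RAG faithfulness metric mentioned above concrete: it is commonly defined as the fraction of claims in the answer that the retrieved context supports. The sketch below is not RAGAS's implementation. `split_claims` and `is_supported` are hypothetical stand-ins for the two judge calls (claim decomposition and per-claim verification), replaced here by naive string heuristics so the example runs offline.

```python
from typing import Callable, List


def split_claims(answer: str) -> List[str]:
    """Hypothetical stand-in for judge-side claim decomposition: one claim per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def is_supported(claim: str, contexts: List[str]) -> bool:
    """Hypothetical stand-in for judge-side verification: case-insensitive substring match."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


def faithfulness_score(
    answer: str,
    contexts: List[str],
    decompose: Callable[[str], List[str]] = split_claims,
    verify: Callable[[str, List[str]], bool] = is_supported,
) -> float:
    """Fraction of answer claims supported by the retrieved contexts (0.0 if no claims)."""
    claims = decompose(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if verify(claim, contexts))
    return supported / len(claims)
```

In a real pipeline both callables would be LLM judge calls, which is why a single faithfulness score costs several judge requests per example.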
Reference and specifications
The table below is the canonical reference for the eval tooling that production teams actually compose in 2026. Each row captures the tool, the layer it occupies, the licence, and the headline shape.
| Tool | Layer | Licence | Headline shape |
|---|---|---|---|
| EleutherAI LM-Eval-Harness | Capability harness | MIT | 100+ academic benchmarks, model-agnostic; the source of most release-announcement numbers. |
| Stanford HELM | Capability harness (holistic) | Apache 2.0 | Accuracy + calibration + robustness + fairness + bias on a fixed task suite. |
| OpenCompass | Capability harness | Apache 2.0 | Strong on Chinese + English reasoning; 100+ datasets. |
| UK AISI Inspect AI | Capability + safety harness | MIT | Programmatic, auditable; UK AISI's own evaluation tool for frontier safety. |
| BIG-Bench / Hard | Capability suite | Apache 2.0 | 200+ hard-reasoning tasks; widely cited. |
| Chatbot Arena (LMSYS) | Preference leaderboard | Apache 2.0 | Pairwise human preference; the most-cited public arena. |
| Artificial Analysis | Serving leaderboard | Commercial | Latency + throughput + price-quality across providers. |
| InferenceBench (Yobitel) | Serving + price-quality leaderboard | Open + free | Independent serving and price-performance for the providers Yobitel customers consume. |
| RAGAS | Application metric library | Apache 2.0 | RAG faithfulness, answer relevancy, context precision/recall, noise sensitivity. |
| DeepEval | Application regression | Apache 2.0 | pytest-style; G-Eval rubrics + 20+ pre-built metrics. |
| Promptfoo | Matrix comparator | MIT | YAML-driven prompts x models x providers x tests; CLI + web UI. |
| TruLens | RAG + agent observability | MIT | Feedback functions over traces; LangChain / LlamaIndex auto-integration. |
| LangSmith | Hosted observability + eval | Commercial (free tier) | LangChain-native traces → datasets → evals; prompt hub. |
| Braintrust | Hosted observability + eval | Commercial | Experiment comparison, side-by-side UI. |
| Phoenix (Arize) | OSS observability + eval | ELv2 / Apache 2.0 | Self-hostable trace + eval UI; deep LlamaIndex / LangChain auto-instrumentation. |
| Langfuse | OSS observability + eval | MIT | Self-hostable; GDPR-friendly default. |
| Helicone | Proxy-based observability + eval | Apache 2.0 | One HTTP header to enable; very low friction. |
| Yobibyte built-in evals | Managed regression for customers | Yobitel-operated | Wraps RAGAS and DeepEval; exposed through workspace console. |
Public capability benchmarks suffer from contamination drift — many are present in modern pre-training corpora, so high scores partly measure memorisation. Cross-reference with held-out community benchmarks (LiveBench refreshes monthly) and with your own application evals before treating a leaderboard number as ground truth.
Workload patterns
Three eval workloads cover the bulk of production usage: (A) CI regression — run a small fast eval on every pull request, fail the build below threshold; (B) Nightly full-eval — run the complete versioned dataset, archive a report, gate on a quality budget; (C) Pre-release bake-off — compare candidate model / prompt / provider matrices side-by-side before promoting. Yobitel customers running on Yobibyte typically use the same harness across all three shapes; the difference is dataset size and judge cost.
- Pattern A — CI regression. 50-200 examples, DeepEval pytest under GitHub Actions or GitLab CI. Runs in 2-10 minutes with a fast frontier-tier judge and fails the build if any metric drops below its per-metric threshold.
- Pattern B — Nightly full-eval. 1,000-10,000 examples, RAGAS + DeepEval composed, report archived into LangSmith or Phoenix, alerts on regression.
- Pattern C — Pre-release bake-off. Promptfoo or LangSmith comparison runs over a curated matrix: (current prompt + candidate prompts) x (current model + candidate models) x (current provider + candidate providers). For Yobitel customers the provider axis is typically Yobibyte vs Anthropic vs OpenAI vs Together.
```yaml
# Pattern A: DeepEval in GitHub Actions
name: llm-regression
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install "deepeval>=2.0" "ragas>=0.2"
      - env:
          LLM_BASE_URL: ${{ secrets.YOBIBYTE_URL }}
          LLM_API_KEY: ${{ secrets.YOBIBYTE_KEY }}
          JUDGE_API_KEY: ${{ secrets.ANTHROPIC_KEY }}
        # Pass/fail thresholds are set per metric inside the test files.
        run: deepeval test run tests/eval/
```

```yaml
# Pattern C: Promptfoo provider bake-off
description: Yobibyte vs Anthropic vs OpenAI for our support agent
providers:
  - id: openai:chat:llama-3.1-70b-instruct
    label: yobibyte-llama70b
    config: { apiBaseUrl: https://api.yobibyte.example/v1 }
  - id: anthropic:messages:claude-opus-4-7
    label: anthropic-opus
  - id: openai:chat:gpt-4.1
    label: openai-gpt41
prompts: [ file://prompts/system_v3.txt ]
tests: [ file://datasets/support_golden_200.csv ]
defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-opus-4-7
      value: "Cites the order id, includes the SLA, polite tone."
    - type: latency
      threshold: 4000   # milliseconds
```

Read InferenceBench's published numbers alongside your application evals when picking a provider. Capability and latency on InferenceBench tell you the ceiling and the speed; your application evals tell you whether your prompt and your context, combined with that provider, hit your quality bar. They answer different questions.
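Pattern A's rule of failing the build when any metric drops below its own threshold can be sketched independently of any framework. This is a minimal illustration, assuming per-example scores have already been collected from RAGAS or DeepEval output; `THRESHOLDS`, `gate`, and `exit_code` are hypothetical names, and the threshold values are examples rather than recommendations.

```python
from statistics import mean
from typing import Dict, List

# Illustrative per-metric thresholds; tune these against your own calibrated judge.
THRESHOLDS: Dict[str, float] = {"faithfulness": 0.90, "answer_relevancy": 0.80}


def gate(scores: Dict[str, List[float]], thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return the metrics whose mean score is below threshold (empty dict means pass)."""
    failures: Dict[str, float] = {}
    for metric, threshold in thresholds.items():
        values = scores.get(metric, [])
        # A metric with no scores counts as a failure rather than silently passing.
        observed = mean(values) if values else 0.0
        if observed < threshold:
            failures[metric] = observed
    return failures


def exit_code(failures: Dict[str, float]) -> int:
    """A non-zero exit code fails the CI job."""
    return 1 if failures else 0
```

Gating on each metric separately, rather than on one blended score, keeps a strong relevancy number from masking a faithfulness regression.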
Sizing and capacity planning
Eval throughput is bounded by the judge model's RPS and the eval framework's parallelism. The table below gives realistic per-example overhead for the most common shapes; treat it as a planning anchor for budget and CI duration.
The single biggest CI-speed lever is async parallelism plus a fast judge. RAGAS and DeepEval both support async judge evaluation; a 200-example CI regression run that takes 20 minutes serially typically completes in 2-3 minutes with parallelism set to 16-32 against a frontier-class judge. For nightly full-evals at 10,000+ examples, batched judge calls via the provider's batch API (Anthropic Batches, OpenAI Batches) cut judge spend by 50 percent at the cost of longer wall-clock.
| Dimension | Typical value | Notes |
|---|---|---|
| RAGAS per-example judge latency (frontier judge) | 1-4 s | Faithfulness decomposes claims; multiple judge calls per example. |
| DeepEval AnswerRelevancy / Faithfulness | 1-3 s | Per-metric; G-Eval rubric is slower (2-6 s). |
| Promptfoo per-test | Latency of the model under test + assert overhead | Dominated by the SUT; assertions are cheap. |
| LM-Eval-Harness MMLU full run | 30-90 min | On a single H100; varies with model size and batch. |
| HELM full task suite | 10-40 hours | Full holistic run; usually scoped down. |
| CI regression (200 ex, parallel 32) | 2-5 min | Frontier judge; fits a sensible PR-gate budget. |
| Nightly full-eval (10K ex, batched) | 1-4 hours wall-clock | Anthropic / OpenAI batch API; lowest cost path. |
| Phoenix UI load on self-host | 1-2 vCPU, 2 GB RAM | Per replica; idle most of the time. |
| LangSmith ingestion rate | Plan-dependent | Free tier caps at 5K runs/month; paid tiers scale to millions. |
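The planning arithmetic behind these figures can be sketched as the slower of two bounds: a latency bound (examples times seconds per example, divided by parallelism) and a rate-limit bound (total judge calls divided by the judge's RPS cap). The function below is an idealised estimate that ignores retries, stragglers, and warm-up; its name and parameters are hypothetical, and the numbers in the usage note are illustrative assumptions.

```python
def estimate_wall_clock_seconds(
    examples: int,
    seconds_per_example: float,
    parallelism: int,
    judge_calls_per_example: int = 1,
    judge_rps_limit: float = float("inf"),
) -> float:
    """Idealised wall-clock: the slower of the latency bound and the rate-limit bound."""
    latency_bound = examples * seconds_per_example / max(parallelism, 1)
    rate_bound = examples * judge_calls_per_example / judge_rps_limit
    return max(latency_bound, rate_bound)
```

For example, 200 examples at an assumed 6 s each take 1,200 s (20 minutes) serially. At parallelism 16 the latency bound falls to 75 s, but with an assumed 4 judge calls per example and a 5 RPS cap the rate-limit bound is 160 s, so the quota rather than the parallelism setting decides the run time.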
Limits and quotas
Eval frameworks place no hard limits on dataset size or run frequency. Every limit you hit will be one set by the judge provider's RPS quota, the hosted eval platform's plan tier, or the size of the eval dataset itself. The list below covers what bites in practice when running CI regression at scale.
- Judge provider rate limit — Anthropic and OpenAI frontier judges typically allow 5-50 RPS at standard tiers, 100+ on enterprise. RAGAS / DeepEval honour `429 Retry-After`; tune the framework's concurrency to stay under the cap.
- Judge token cost — frontier judges at ~$15-75 per 1M output tokens dominate eval cost; for high-frequency CI use a cheaper judge (Claude Haiku, GPT-4o-mini class) calibrated against the strong judge.
- Dataset versioning — LangSmith, Braintrust and Phoenix all store dataset versions; the limit is plan-tier, not framework. Pin dataset versions per release tag to make regressions reproducible.
- Trace ingestion rate — LangSmith Plus and Braintrust paid tiers cap per-minute trace ingestion. High-RPS production traffic should sample (e.g. 1-5 percent) rather than ingest every run.
- Promptfoo matrix size — the cartesian product of prompts x models x providers x tests x assertions blows up fast; a 4x4x4x500x3 matrix is 32K generations and 96K assertion checks. Use the `--filter-first-n` and `--filter-pattern` options of `promptfoo eval` to scope runs intelligently.
- Inspect AI evaluation depth — designed for safety-critical eval, so logs include full prompts and completions. Plan storage accordingly.
- Capability harness reproducibility — pin the harness version, the model checkpoint hash, and the eval config. Even minor harness version drift can shift MMLU scores by 1-2 points.
- Dataset PII — judge calls send the full content to the judge provider. Under UK GDPR, eval datasets containing customer data are subject to the same processing-record obligations as production data.
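Staying under a judge provider's rate limit usually comes down to two mechanisms: a cap on in-flight requests and a pause when the provider answers with HTTP 429. The sketch below shows that shape with `asyncio` only. `RateLimited` is a hypothetical exception standing in for whatever your judge client raises on a 429, and `judge` is any coroutine you supply; this is not how RAGAS or DeepEval implement it internally.

```python
import asyncio
from typing import Awaitable, Callable, List


class RateLimited(Exception):
    """Hypothetical error raised by a judge client on HTTP 429; carries Retry-After."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def judge_all(
    items: List[str],
    judge: Callable[[str], Awaitable[float]],
    concurrency: int = 16,
    max_retries: int = 3,
) -> List[float]:
    """Score every item with at most `concurrency` judge calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def score(item: str) -> float:
        async with semaphore:
            for attempt in range(max_retries + 1):
                try:
                    return await judge(item)
                except RateLimited as err:
                    if attempt == max_retries:
                        raise
                    # Honour the server's Retry-After hint before trying again.
                    await asyncio.sleep(err.retry_after)
        raise AssertionError("unreachable: the loop returns or raises")

    # gather() returns results in the same order as the inputs.
    return list(await asyncio.gather(*(score(item) for item in items)))
```

Sleeping while still holding the semaphore slot is deliberate in this sketch: a rate-limited worker keeps its slot, which lowers pressure on the provider instead of letting another request take its place.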
Calibrate your judge before trusting it in CI. Hand-grade 50-200 examples, run the judge over the same set, compute Cohen's kappa or simple correlation. If agreement is below 0.7, iterate on the rubric or change the judge model. Recalibrate every quarter and every time you change the rubric or the judge.
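The agreement check described above can be computed with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes categorical labels (for example pass / fail) from one human grader and one judge model over the same examples; the function name is illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items with categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With ten examples split evenly between pass and fail, a judge that matches the human on eight of them has 80 percent raw agreement but a kappa of 0.6, which falls below the 0.7 bar and would call for another rubric iteration.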
Observability
Evaluation observability has two flavours: traces of the evaluation runs themselves (what the judge said about each example, why), and aggregate dashboards across runs (regression vs the last release tag). The hosted platforms — LangSmith, Braintrust, Phoenix, Langfuse — provide both. The open-source RAGAS and DeepEval CLIs emit structured JSON suitable for piping into any backend.
For Yobitel customers, the recommended pattern is to keep eval traces in the same observability tier as production traces. If LangChain is the framework, LangSmith is the natural choice (both production and eval runs land in the same project). If LlamaIndex is the framework, Phoenix is the natural choice (both flow through the same OpenInference instrumentation). Yobibyte's gateway emits OTel GenAI spans for the production side, so a customer's eval runs and their Yobibyte-served production runs can be reconciled by `trace_id` in a single Datadog / Honeycomb / Tempo backend.
- Per-eval-run attributes — dataset id and version, model under test, judge model, framework version, threshold, pass/fail count, per-metric distribution.
- Per-example attributes — input, expected, actual, retrieved context (for RAG), judge score per metric, judge reasoning text.
- Dashboards worth standing up — pass rate over time per metric, judge spend over time, mean latency per provider, regression alerts on threshold breach.
- Promote traces to eval datasets — the highest-leverage practice. Sample production runs (errors, low confidence, user thumbs-down, random N), annotate the desired output, append to the next eval cycle's dataset.
- InferenceBench feeds — Yobitel publishes InferenceBench provider scores; consume them as an additional metric alongside your own application evals when picking a provider.
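The trace-promotion practice in the list above (errors, low confidence, user thumbs-down, plus a random N) can be sketched as a simple selection step before annotation. `Trace` is a hypothetical minimal record, not the schema of LangSmith or Phoenix, and the 0.5 confidence cut-off is an assumed default.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical minimal trace record; real platforms expose richer schemas."""
    trace_id: str
    error: bool = False
    thumbs_down: bool = False
    confidence: Optional[float] = None


def select_for_annotation(
    traces: List[Trace],
    random_n: int = 20,
    low_confidence_below: float = 0.5,
    seed: int = 0,
) -> List[Trace]:
    """Pick every flagged trace plus a seeded random sample of the remainder."""
    flagged: List[Trace] = []
    rest: List[Trace] = []
    for trace in traces:
        low_conf = trace.confidence is not None and trace.confidence < low_confidence_below
        if trace.error or trace.thumbs_down or low_conf:
            flagged.append(trace)
        else:
            rest.append(trace)
    rng = random.Random(seed)  # seeded so the same input yields the same selection
    return flagged + rng.sample(rest, min(random_n, len(rest)))
```

The random slice matters as much as the flagged one: a dataset built only from failures drifts away from the real traffic distribution.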
Cost and FinOps
Open-source eval tooling is free; the cost lives in judge calls, hosted-platform plans, and the compute to run capability harnesses. The dominant variable is judge cost: a 10K-example RAGAS run with a frontier judge can cost $50-200 in judge tokens alone; the same run with a cheap judge (Haiku, GPT-4o-mini class) costs $5-20 and, with calibration, achieves comparable quality on most metrics.
The cheapest path is: small-dataset CI regression (200-1,000 examples) with a calibrated cheap-tier judge, medium-dataset nightly full-eval (5-10K) with a frontier judge at the batched API discount, and quarterly recalibration of the cheap judge against the strong judge on a 500-example reference set.
| Cost component | Typical USD range | Driver |
|---|---|---|
| Framework licences | $0 | RAGAS / DeepEval / Promptfoo / Phoenix / Langfuse all open-source. |
| Judge calls (frontier, 1K ex RAGAS) | $5-25 per run | Anthropic / OpenAI list price; faithfulness decomposes claims (~3-5 judge calls / ex). |
| Judge calls (cheap-tier, 1K ex) | $0.50-2.50 per run | Haiku / GPT-4o-mini class; calibrate quarterly. |
| Judge calls (batched API) | ~50% of synchronous | Anthropic Batches, OpenAI Batches; nightly full-eval shape. |
| LangSmith Plus | $39/user/month + per-run | Free tier 5K runs/month; paid tiers scale. |
| Braintrust paid tier | Custom commercial | Typically $500-5,000/month for production teams. |
| Phoenix self-host | $50-300/month | Container + Postgres + S3 trace store. |
| Capability harness compute (MMLU on H100) | $0.50-3 per full run | On-demand H100 pricing; ~1 hour wall-clock. |
| InferenceBench data feed | $0 | Yobitel publishes openly; no API surcharge. |
Eval spend of around 5 percent of inference spend is well-calibrated. Below 1 percent, you are probably not catching regressions you should; above 15 percent, your judge model is over-specified for your CI rhythm. The Yobibyte built-in eval harness lands in the range of 3-7 percent of inference spend for most customers.
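The judge-cost rows and the spend-share rule of thumb reduce to simple arithmetic, sketched below. All function names are hypothetical, and the token counts and per-million prices in the usage note are illustrative assumptions, not quoted list prices.

```python
def judge_cost_usd(
    examples: int,
    judge_calls_per_example: float,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    batch_discount: float = 0.0,
) -> float:
    """Estimated judge spend for one eval run; batch_discount=0.5 models a 50% batch rate."""
    per_call = (
        input_tokens_per_call * usd_per_million_input
        + output_tokens_per_call * usd_per_million_output
    ) / 1_000_000
    return examples * judge_calls_per_example * per_call * (1.0 - batch_discount)


def eval_spend_band(eval_usd: float, inference_usd: float) -> str:
    """Classify eval spend as a share of inference spend using the 1% and 15% bounds."""
    share = eval_usd / inference_usd
    if share < 0.01:
        return "under-investing"
    if share > 0.15:
        return "over-specified judge"
    return "reasonable"
```

As a worked case, 1,000 examples at 4 judge calls each, with an assumed 1,000 input and 100 output tokens per call priced at $3 and $15 per million, come to about $18 per run, inside the frontier-judge range in the table; the same run at a 50 percent batch discount is about $9.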
Security and compliance
Evaluation pipelines touch the same data as production — golden datasets contain example user inputs, retrieved contexts, expected outputs. For regulated workloads they inherit the same handling obligations as production traffic. The checklist below is the production-security stance that aligns with Yobibyte's NCSC Cloud Security Principle alignment and UK GDPR Article 32 evidence requirements.
- Judge data path — calling Anthropic / OpenAI as a judge ships dataset content to the judge provider. Under UK GDPR this is processing by a third-party processor; ensure the contract covers it.
- Sovereign workloads — for UK Sovereign customers, the judge endpoint should be a UK-resident frontier model (Anthropic UK, OpenAI EU region) or a Yobibyte UK Sovereign tenant serving a frontier-class model. Routing eval data outbound to a US endpoint defeats the sovereign posture of the production stack.
- Dataset access control — eval datasets often contain PII; treat the LangSmith / Phoenix / Braintrust dataset store as a production data store with the same RBAC and audit obligations.
- Judge calibration audit — keep a record of the calibration set, the judge model, the calibration date, and the agreement metric. This is the evidence that defends an LLM-as-judge metric to a sceptical auditor.
- Synthetic data — synthesised eval examples may be safer for sovereign processing, but quality is lower; treat them as the coverage floor, not the ceiling.
- Reproducibility — pin framework version, judge model version, dataset version, and prompt version for every run. Article 30 records of processing benefit from the same discipline.
- Yobibyte built-in evals — Yobitel-operated; the judge call path stays inside the Yobibyte regulatory boundary unless customers opt to bring their own external judge endpoint.
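The reproducibility item in the checklist above (pin framework, judge model, dataset, and prompt versions for every run) can be captured as a small manifest stored next to each eval report. This is one possible shape, not a Yobibyte or framework API; `EvalRunManifest` and its field names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunManifest:
    """The versions that must be pinned for an eval run to be reproducible."""
    framework_version: str
    judge_model_version: str
    dataset_version: str
    prompt_version: str

    def fingerprint(self) -> str:
        """Stable SHA-256 over the pinned versions; identical pins give identical digests."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording the fingerprint alongside the judge calibration record gives an auditor a single value to compare across runs.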
Migration and alternatives
Most production teams compose several tools; pure-substitution migrations are rare. The list below covers the common migration paths, and the table after it captures the practical comparison between the main alternative shapes Yobitel customers weigh.
- From hand-rolled judge scripts to DeepEval — wrap existing graders as `BaseMetric` subclasses; reuse golden examples as `LLMTestCase` objects; CI integration is `deepeval test run`.
- From DeepEval to RAGAS for the RAG slice — keep DeepEval for general regression; add RAGAS for the faithfulness and context metrics specifically. The two compose cleanly.
- From LangSmith to Phoenix self-host — auto-instrumentation translates; the substantive piece is operating Postgres and S3 for the trace store.
- From self-hosted evals to Yobibyte built-in — keep your datasets; Yobibyte console runs them on the Yobibyte-served model + judge model combination, returns the same metric shape.
- From a single judge model to ensemble judges — average scores across two judges from different families; reduces self-enhancement and verbosity bias.
| Approach | Strengths | Weaknesses | When to pick |
|---|---|---|---|
| RAGAS only | RAG-focused; principled metric set; reference-free where possible. | RAG only; light on agent / multi-turn metrics. | When the application is exclusively RAG. |
| DeepEval only | General pytest-style; 20+ metrics; CI-native. | Less RAG-deep than RAGAS; smaller community for RAG-specific rubrics. | When the application is general-purpose. |
| Promptfoo only | Matrix comparator; CLI + web UI; YAML-driven. | Less production-trace integration. | When the question is prompt or provider A/B. |
| LangSmith | Hosted; LangChain-native; trace → dataset → eval pipeline. | Commercial; LangChain-shaped abstractions. | When the stack is already LangChain / LangGraph. |
| Phoenix self-host | Open-source; deep LlamaIndex / LangChain auto-instrumentation. | Self-host operational overhead. | When sovereign or air-gapped, or LlamaIndex-heavy. |
| Yobibyte built-in evals | Managed; wraps RAGAS + DeepEval; console UI. | Tied to Yobibyte; less flexibility than self-hosted tooling. | When team capacity to operate eval tooling is limited. |
| Hand-rolled with provider SDK | Total control; no framework overhead. | Reinvent dataset versioning, parallelism, judge calibration. | When the eval shape is very domain-specific. |
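The ensemble-judge migration listed above (average scores across two judges from different families) can be sketched as follows. Each judge is represented as a plain callable returning a score in [0, 1]; in practice each would wrap a call to a different provider. The `disagreement` helper is an added assumption, included because a wide spread between judges is a natural trigger for human review.

```python
from statistics import mean
from typing import Callable, Dict

Judge = Callable[[str], float]  # maps a model output to a score in [0, 1]


def ensemble_score(output: str, judges: Dict[str, Judge]) -> float:
    """Unweighted mean of the scores from judges of different model families."""
    if len(judges) < 2:
        raise ValueError("an ensemble needs at least two judges")
    return mean(judge(output) for judge in judges.values())


def disagreement(output: str, judges: Dict[str, Judge]) -> float:
    """Spread between the highest and lowest judge score for one output."""
    scores = [judge(output) for judge in judges.values()]
    return max(scores) - min(scores)
```

An unweighted mean is the simplest choice; weighting each judge by its calibration agreement with human grades is a reasonable refinement once both have been calibrated.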
Troubleshooting
The failure modes below are the ones eval pipeline operators hit repeatedly when running RAGAS, DeepEval, Promptfoo, LangSmith and Phoenix against managed endpoints — including Yobibyte, Yobitel NeoCloud vLLM, and third-party providers.
| Symptom | Likely cause | Fix |
|---|---|---|
| Judge scores wildly inconsistent run-to-run | Judge temperature too high; or judge model itself drifts. | Set judge temperature=0; pin judge model version; recalibrate. |
| RAGAS faithfulness 0 on perfectly fine answers | Judge cannot decompose multi-sentence claims; or context contains no overlapping vocabulary. | Use a stronger judge; check that retrieved context actually contains the cited content. |
| DeepEval test_run hangs | Synchronous judge calls; concurrency=1. | Enable async with `run_async=True`; set concurrency 8-32. |
| Promptfoo cartesian product explodes | Too many providers x prompts x tests. | Scope with `--filter-first-n N` or `--filter-pattern`; split into per-axis runs. |
| CI judge cost exploding | Frontier judge on every PR; large dataset. | Switch CI to cheap-tier judge calibrated quarterly; reserve frontier judge for nightly. |
| LangSmith dataset ingestion slow | Large per-trace payloads; serial upload. | Use the bulk dataset API; pre-truncate inputs / outputs above a sensible size. |
| MMLU score 2-3 points off published number | Harness version drift; prompt template difference. | Pin LM-Eval-Harness version; use the canonical prompt template from the model card. |
| Phoenix UI empty despite instrumented app | OpenInference handler not registered; or wrong project. | Verify `LlamaIndexInstrumentor().instrument()` ran before any traced call. |
| Inspect AI eval fails to record judgements | Dataset definition missing target fields. | Match the Sample / Task definition to Inspect AI's typed schema; consult the upstream tutorial. |
| Eval scores good in dev, bad in prod | Dataset not representative; production has long-tail content. | Add production-traffic-sampled examples to the dataset; tag by failure mode. |
| G-Eval rubric ignored | Rubric too vague ("is it good?"). | Rewrite as numbered criteria with examples of pass / fail; recalibrate against human scores. |
When CI eval scores drop, check in this order: (1) was the dataset version pinned, (2) was the judge model version pinned, (3) was the framework version pinned. Nine times out of ten the regression is silent drift in one of these, not in the system under test.
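That check order can be automated by comparing the pinned versions of the last good run with those of the failing run. The sketch below assumes each run records its pins as a plain dictionary; the key names are hypothetical.

```python
from typing import Dict, Optional

# Order mirrors the checklist: dataset first, then judge model, then framework.
CHECK_ORDER = ("dataset_version", "judge_model_version", "framework_version")


def first_drift(baseline: Dict[str, str], current: Dict[str, str]) -> Optional[str]:
    """Return the first pinned field that differs between two runs, or None if all match."""
    for field in CHECK_ORDER:
        if baseline.get(field) != current.get(field):
            return field
    return None
```

A `None` result means the pins match, so the system under test becomes the next suspect.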
Where this fits in the Yobitel stack
Yobibyte ships a built-in eval harness that wraps RAGAS for RAG metrics and DeepEval for general regression — Yobitel-operated, exposed through the workspace console, no customer-side eval infrastructure to stand up. Customers who already run LangSmith or Phoenix at scale pipe their traces from Yobibyte endpoints directly into their existing eval loop with no Yobitel-specific instrumentation; the OpenAI-compatible surface and the OTel GenAI spans Yobibyte emits at the gateway are the integration.
Yobitel's InferenceBench is the public companion: an open, free, continuously refreshed leaderboard of serving and price-quality numbers across the providers Yobitel customers consume (Yobibyte, Yobitel NeoCloud, and the third-party providers most teams compare against). Treat InferenceBench as the model-selection signal at the top of the funnel and Yobibyte's built-in evals (or your own LangSmith / Phoenix loop) as the continuous-quality signal below it. For UK Sovereign customers, keep the judge endpoint inside the regulatory boundary, pin every version of dataset, framework, judge and prompt, and let Yobitel Professional Services help wire the reference stack. That is the eval-loop discipline this entry documents.