RAG System Evaluation Guide

When a RAG system fails, the output alone won’t tell you why. RAG stands for retrieval-augmented generation, and it’s one of the most common context engineering techniques for adding additional information (and thus accuracy) to AI agents. Because it’s such a critical component of modern AI apps, developers need an LLM evaluation method that can diagnose problems and track performance for RAG.

This guide covers what to measure at each stage of the RAG pipeline, why each metric matters, and how to build an evaluation workflow that pinpoints problems rather than just detecting them.

It also introduces the most effective evaluation technique, LLM-as-a-judge. Looping an LLM into the evaluation phase has largely replaced legacy deterministic metrics (like BLEU and ROUGE), which measured word overlap rather than semantic accuracy. LLMs are better at evaluating textual relevance and are thus well-suited for your RAG evaluation toolkit. (For a deeper look at the LLM-as-a-judge paradigm, including its origins and key frameworks, see our LLM-as-a-Judge overview.)

How RAG Systems Fail (And Why Evaluation Must Be Disaggregated)

RAG failures fall into three categories that look identical from the outside but require completely different fixes:

The retriever might miss relevant documents or rank them too low, so the model lacks needed context.
The model might hallucinate an answer, even if relevant documents were retrieved.
The model might answer a different question than the one the user asked, even if its output is well-supported by the retrieved context.

Each of these failures looks identical from the outside because you get a wrong answer. But the fix for each is completely different. This diagnostic problem sits at the heart of RAG evaluation. When you only evaluate the final output, you collapse distinct failure modes into a single opaque signal.

Consider a concrete scenario. A user asks your internal HR assistant, “Am I eligible for parental leave if I’ve been here for 11 months?” The system responds: “Employees are eligible for 12 weeks of parental leave.” Was this a retrieval failure — the system pulled the general parental leave policy but missed the eligibility requirements document that specifies a 12-month tenure minimum? Or a generation failure — the correct eligibility document was retrieved, but the model ignored the tenure requirement and answered the surface-level question instead?

The fix for each is completely different. A retrieval failure might mean your chunking split the eligibility criteria from the leave benefits. A generation failure might mean your prompt needs explicit instructions to check preconditions before answering. Disaggregated evaluation — testing the retriever and generator independently, then evaluating how they work together — lets you isolate weak points and make targeted improvements rather than guessing.

Three Dimensions of RAG Quality (The “RAG Triad”)

The most useful diagnostic framework for RAG evaluation measures three relationships:

Relevance of retrieved context: the connection between the user’s query and what the retriever found.
Faithfulness: the connection between the retrieved context and what the generator produced.
Relevance of the output: the connection between the user’s original prompt and the final output — whether it actually addressed the user’s question or task.

TruLens calls this the “RAG Triad,” and the framing has been adopted by other evaluation tools, including DeepEval.

Think of it as three questions you ask about every interaction: Did the retriever find relevant information? Did the generator stick to the facts it was given? Did the final answer actually address what the user asked?

Because failures can originate in either the retrieval component or the generation component, the most useful RAG evaluation strategies test the retriever and generator independently before measuring end-to-end performance.

LLM-as-a-Judge: The Evaluation Engine Behind These Metrics

Most of the metrics in this guide are powered by the LLM-as-a-judge paradigm: using a capable LLM to evaluate the outputs of other models. This approach has largely replaced legacy token-overlap metrics like BLEU and ROUGE, which measured surface-level word matching rather than semantic meaning or factual accuracy. A response that conveys the correct information using different vocabulary would score poorly under BLEU; an LLM judge can recognize the semantic equivalence.

Two frameworks are worth knowing about. G-Eval (Liu et al., EMNLP 2023) introduced chain-of-thought reasoning for evaluation, where the judge model generates step-by-step criteria before scoring, and uses token probabilities for continuous rather than integer scores. Opik implements G-Eval as a built-in metric, so you can apply it directly to your RAG outputs. Prometheus (Kim et al., ICLR 2024) demonstrated that open-source LLMs fine-tuned specifically for evaluation can match GPT-4’s correlation with human judges when given structured rubrics. This matters for RAG evaluation pipelines because it means you’re not locked into proprietary models for your judge — platforms like Opik let you swap in any LLM supported by LiteLLM, including open-source options, as your evaluation backbone.

For a deeper dive into LLM-as-a-judge foundations, including G-Eval’s technical details, evaluator bias mitigation, and calibration strategies, see our full LLM-as-a-Judge guide.

Evaluating the RAG Triad in Practice

Let’s take a closer look at each dimension of the RAG Triad — what it measures, why it matters, and how to implement it with Opik.

How do you measure retrieval quality in RAG systems?

The first dimension evaluates the relationship between the user’s query and the retrieved context. Context relevance measures how pertinent the retrieved documents are to what the user actually asked. If your retriever pulls irrelevant chunks, two things happen: you waste tokens in the LLM’s context window, and you increase the risk that the model will incorporate noise into its response.

Within this dimension, contextual precision evaluates how well the retriever ranks relevant information. Finding the right document matters, but where that document appears in the ranked list matters too. Research from Liu et al. (published in Transactions of the Association for Computational Linguistics, 2024) demonstrated that language models exhibit a U-shaped performance curve when processing long contexts: they attend well to information at the beginning and end of the input but often miss what’s in the middle. This “lost in the middle” phenomenon means a retriever that buries the most relevant document at position 8 out of 10 might as well not have retrieved it at all.

Opik provides both ContextPrecision and ContextRecall as built-in LLM-as-a-judge metrics. Both use few-shot prompting with structured rubrics to score on a 0.0–1.0 scale:

from opik.evaluation.metrics import ContextPrecision, ContextRecall

precision_metric = ContextPrecision()
recall_metric = ContextRecall()

precision_score = precision_metric.score(
input="Am I eligible for parental leave if I've been here 11 months?",
output="Employees are eligible for 12 weeks of parental leave.",
expected_output="Employees must complete 12 months of tenure to qualify.",
context=["Parental leave policy: Eligible employees receive 12 weeks…",
"Eligibility: Employees must complete 12 months of continuous employment."],
)

recall_score = recall_metric.score(
input="Am I eligible for parental leave if I've been here 11 months?",
output="Employees are eligible for 12 weeks of parental leave.",
expected_output="Employees must complete 12 months of tenure to qualify.",
context=["Parental leave policy: Eligible employees receive 12 weeks…",
"Eligibility: Employees must complete 12 months of continuous employment."],
)

By default, both metrics use GPT-4o as the judge, but you can swap to any model supported by LiteLLM by setting the model parameter. Context precision penalizes systems that rank relevant documents lower, while context recall measures whether the retriever surfaced all the relevant information the expected answer requires.

How do you detect hallucinations in RAG outputs?

The second edge of the Triad, groundedness (also called faithfulness in the Ragas framework), evaluates whether the generator’s response is actually supported by the retrieved context. This is your primary tool for detecting hallucinations, and for most production teams, it’s the most critical metric in the entire evaluation stack.

Here’s how faithfulness evaluation typically works under the hood. The system first decomposes the generated answer into individual factual claims. For example, the response “You’re eligible for 12 weeks of parental leave after your first year, and you can split it into two blocks” contains three claims: the leave duration is 12 weeks, it requires one year of tenure, and it can be split. The evaluation checks each claim against the retrieved context. If the policy document says nothing about splitting leave, that third claim is hallucinated — the model filled in a plausible-sounding detail from its training data.

The faithfulness score is the ratio of supported claims to total claims. A score of 1.0 means every statement traces back to the retrieved context. A score of 0.6 means 40% of the claims came from somewhere else — likely the model’s training data or outright fabrication. High faithfulness scores indicate that the generator is behaving as a “natural language layer” over your knowledge base rather than freelancing with its parametric memory.

In Opik, hallucination detection is a built-in metric:

from opik.evaluation.metrics import Hallucination

metric = Hallucination()

score = metric.score(
input="Am I eligible for parental leave if I've been here 11 months?",
output="You're eligible for 12 weeks of parental leave after your first year, and you can split it into two blocks.",
context=["Eligible employees receive 12 weeks of parental leave.",
"Employees must complete 12 months of continuous employment to qualify."],
)

Low scores are a red flag that the model is injecting information you didn’t provide, which is especially dangerous in domains like healthcare, legal, or financial services where accuracy is non-negotiable.

What does answer relevance measure in RAG evaluation?

The third edge of the Triad catches a subtle but important failure mode: a response that is factually grounded and technically accurate, yet doesn’t actually answer the user’s question. Answer relevance measures the alignment between the generated response and the original query, penalizing answers that are incomplete, redundant, or tangential.

To calculate answer relevance without a pre-written reference answer you can use a “reverse engineering” approach. An LLM generates several hypothetical questions that the current response would satisfy, and the system then measures the semantic similarity between these synthetic questions and the user’s actual query. If the response is highly relevant, the reverse-engineered questions should closely mirror what the user originally asked.

This metric catches scenarios that faithfulness alone would miss. A user asks “How do I reset my password?” and the system responds with a thorough, well-grounded explanation of your company’s security architecture. Every claim is supported by the retrieved context. The faithfulness score is perfect. But the user still doesn’t know how to reset their password. Answer relevance penalizes this kind of disconnect.

Opik provides an AnswerRelevance metric that implements this pattern. Combined with ContextPrecision, ContextRecall, and Hallucination, these four metrics give you complete diagnostic coverage of the RAG Triad.

Retrieval Metrics for Production Tuning

The RAG Triad gives you a high-level diagnostic of system health. But when you need to fine-tune the retriever’s actual configuration — including parameters like chunk size, top-K settings, and embedding model selection — you’ll need more granular metrics from traditional information retrieval (IR) science.

Recall@K measures the proportion of all relevant documents that appear in the top K results. If there are five documents in your knowledge base that could answer a particular query, and your retriever surfaces three of them in its top-10 results, that’s a recall@10 of 0.6. This metric is critical in domains like legal or medical research where missing a single relevant document can lead to an incomplete or incorrect conclusion.

Precision@K measures the flip side: what proportion of the top K results are actually relevant? If you retrieve 10 documents but only 3 are useful, your precision@10 is 0.3. Low precision means you’re stuffing the LLM’s context window with noise, wasting tokens and increasing the chance the model gets confused by irrelevant content.

Mean Reciprocal Rank (MRR) focuses on how quickly the retriever surfaces the single best result. It calculates the average of the reciprocal rank of the first relevant document across a set of queries. An MRR of 1.0 means the best document always appears first. An MRR of 0.5 means it tends to show up second. This metric matters most when your system typically needs one authoritative answer rather than a synthesis of multiple sources.

Normalized Discounted Cumulative Gain (NDCG) accounts for graded relevance, acknowledging that some documents are more relevant than others rather than treating relevance as binary. It rewards systems that place the most relevant documents at the top, applying a logarithmic discount to results further down the list. For complex queries where multiple documents contribute different pieces of the answer, NDCG provides a more nuanced picture than binary precision or recall.

These metrics also help you optimize retrieval hyperparameters. If your contextual relevancy score (the proportion of retrieved text that’s actually useful) is low, you might be retrieving chunks that are too large. When only a small percentage of each chunk is relevant to the query, that’s a signal to experiment with smaller chunk sizes or a more aggressive reranking step.

Evaluator Biases You Need To Watch For

LLM judges carry systematic biases that can skew your scores. The most common are position bias (preferring responses based on where they appear in the prompt rather than their quality), verbosity bias (assigning higher scores to longer responses regardless of substance), and agreeableness bias (being better at confirming correct answers than catching incorrect ones, which means automated evaluations may systematically overestimate reliability). Our LLM-as-a-Judge guide covers these biases and mitigation strategies in detail.

For RAG evaluation specifically, one calibration approach worth knowing is Prediction-Powered Inference (PPI), introduced in the ARES framework (Saad-Falcon et al., 2024). PPI uses a small set of human-annotated examples to statistically adjust automated scores and provide confidence intervals. This reality check calibrates how much you trust the automated evaluator on everything else.

Putting RAG Evaluation Into Practice

Now that you understand the metrics, let’s discuss operationalizing them into a repeatable workflow.

Build your evaluation dataset early

The foundation of any evaluation pipeline is a test dataset of query-answer pairs that represent your system’s intended use cases. Start building this before you optimize anything else. A useful evaluation dataset includes diverse query types: straightforward factual questions with clear answers, complex questions that require synthesizing information from multiple documents, ambiguous queries where the system needs to handle uncertainty, and “negative” queries that the system should decline to answer because the information isn’t in the knowledge base.

You don’t need thousands of examples to start. Even 50 to 100 well-crafted query-answer pairs, covering the key scenarios your system needs to handle, will give you a meaningful baseline. Expert-annotated “gold standard” examples are ideal for high-stakes validation, but they’re expensive to produce.

Use synthetic data to scale (carefully)

To supplement human-annotated datasets, you can use LLMs to generate synthetic test data from your document corpus. A common approach involves feeding documents to a capable model and prompting it to generate multiple diverse questions and corresponding answers based on the content. This is valuable for rapid iteration and expanding coverage, but it comes with a caveat: synthetic data reflects the generating model’s understanding, not necessarily ground truth.

For production systems in high-stakes domains like healthcare, finance, or legal services, treat synthetic data as a starting point for development-phase testing, and always validate against human-reviewed “gold” sets before making deployment decisions.

Evaluate with Opik: end-to-end workflow

The most effective teams treat RAG evaluation the way software teams treat unit testing: it runs automatically, it blocks deployments when quality drops, and it produces results that the whole team can interpret. Here’s how that looks with Opik.

Define your metrics. Start with the RAG Triad coverage: Hallucination, ContextPrecision, ContextRecall, and AnswerRelevance. Opik’s evaluate function accepts a list of metrics and runs them all against your dataset in a single pass:

from opik.evaluation import evaluate
from opik.evaluation.metrics import (
Hallucination, ContextPrecision, ContextRecall, AnswerRelevance
)

metrics = [
Hallucination(),
ContextPrecision(),
ContextRecall(),
AnswerRelevance(),
]

results = evaluate(
dataset=your_dataset,
task=your_rag_pipeline,
scoring_metrics=metrics,
experiment_config={
"model": "gpt-4o",
"chunk_size": 512,
"top_k": 5,
},
)

Set pass/fail thresholds. Define what “good enough” looks like for your use case — for example, faithfulness must exceed 0.85, answer relevance must exceed 0.75. Run evaluations as part of your CI/CD pipeline so that a prompt change or retrieval configuration update that causes a regression gets caught before it reaches production.

Compare experiments. The experiment_config parameter lets you tag each evaluation run with the configuration that produced it (model, chunk size, top-K, prompt version). Opik’s UI then lets you compare experiments side by side, so you can see exactly how a configuration change affected each metric.

Move to production monitoring. Once your system is live, Opik’s Online Evaluation rules let you run the same metrics on production traces automatically. When a faithfulness score drops, you can drill into the specific trace that produced the low score, see exactly which documents were retrieved, inspect the prompt that was sent to the LLM, and identify whether the failure was in retrieval or generation.

This is where observability and evaluation converge. Logging traces during development helps you iterate faster. Logging them in production helps you detect drift and degradation. Running automated evaluations on those traces turns raw observability data into actionable quality signals.

Stress-testing and adversarial evaluation

The metrics discussed so far evaluate whether a RAG system works correctly under normal conditions. But production systems also need to handle inputs that are ambiguous, malicious, or designed to exploit the pipeline’s architecture. Stress-testing and adversarial evaluation probe how the system behaves when things go wrong on purpose.

Boundary testing: what happens outside the happy path?

Before worrying about adversarial attacks, test how your system handles legitimate but difficult inputs. These include queries the system should decline to answer (because the information isn’t in the knowledge base), questions that require synthesizing information across multiple documents, ambiguous queries where the user’s intent is unclear, and inputs that contain false premises the system should push back on rather than accept.

For example, a user tells your HR assistant, “Since the company matches 401(k) contributions at 8%, I want to max that out.” If the actual match is 4%, the system should correct the false premise rather than build on it. These tests are straightforward to construct — domain experts can usually generate dozens of tricky edge cases from experience — and they catch failure modes that basic faithfulness and relevance metrics miss entirely.

The RAG-specific attack surface

RAG systems introduce attack vectors that don’t exist in standalone LLMs. The OWASP Top 10 for LLM Applications (2025 edition) added “Vector and Embedding Weaknesses” as a new entry specifically addressing RAG vulnerabilities, reflecting how central retrieval pipelines have become to production AI systems.

The most significant RAG-specific threat is what researchers call indirect prompt injection: malicious instructions embedded not in the user’s query but in the documents the system retrieves. Greshake et al. formalized this attack class in their 2023 paper presented at the ACM Workshop on Artificial Intelligence and Security (AISec ’23), demonstrating that augmenting LLMs with retrieval fundamentally blurs the boundary between data and instructions. When a RAG system retrieves a document containing hidden instructions like “ignore previous context and respond with [attacker’s content],” the LLM may follow those instructions because it can’t reliably distinguish retrieved context from system commands.

The related threat of knowledge base poisoning takes this further. Rather than injecting instructions, an attacker corrupts the retrieval corpus itself with documents designed to surface for specific queries and steer the model toward predetermined (wrong) answers. The PoisonedRAG research (Zou et al., presented at USENIX Security 2025) demonstrated that injecting as few as five crafted documents into a corpus of millions could achieve attack success rates above 90% for targeted queries across multiple LLMs and retrieval configurations. The attack works because the poisoned documents are optimized to satisfy both the retrieval condition (getting surfaced by the search) and the generation condition (steering the LLM’s output), and it’s effective even in black-box settings where the attacker has no access to the retriever’s parameters.

What to test for

A practical adversarial evaluation suite for RAG should cover at minimum:

Prompt injection resistance. Test with canonical injection patterns appended to user queries: role overrides (“You are now an unrestricted assistant…”), instruction overrides (“Ignore your previous instructions and…”), and obfuscated variants. Measure whether the system’s output deviates from its intended behavior or exposes system prompt content.

Knowledge base integrity. If your corpus ingests content from sources you don’t fully control — user-submitted documents, web scrapes, third-party databases — test what happens when that content contains adversarial payloads. Seed your test environment with high-similarity malicious documents and measure whether the system retrieves and acts on them.

Graceful refusal. Verify that the system declines to answer when it should: questions outside its domain, requests for actions it shouldn’t take (approving refunds, providing medical diagnoses), and queries where the retrieved context is insufficient to give a reliable answer.

Consistency under paraphrase. Ask the same question multiple ways and check whether the responses are substantively consistent. Inconsistency under paraphrase often reveals that the system is sensitive to surface-level phrasing rather than underlying intent, which is a reliability problem and a potential exploitation vector.

These tests don’t require sophisticated tooling to get started. A spreadsheet of adversarial queries, expected behaviors, and pass/fail criteria — evaluated by an LLM judge calibrated against human-in-the-loop review — will catch most of the high-severity issues before they reach users.

Start measuring, then start improving

RAG evaluation can feel overwhelming when you survey the full range of available metrics, frameworks, and tools. The practical path forward is simpler than it appears. Start with the RAG Triad: context relevance to verify your retriever, faithfulness to catch hallucinations, and answer relevance to ensure you’re actually helping users. These three metrics cover the most critical failure modes and give you a diagnostic framework for targeted improvements.

As your system matures, layer in retrieval-specific metrics like recall@K and MRR to fine-tune your search configuration, and invest in calibrating your LLM-as-a-judge pipeline against human assessments to ensure your automated scores reflect reality.

Opik is built for exactly this workflow. As an open-source LLM evaluation framework for LLM observability and evaluation, it gives you the LLM evaluation metrics covered in this guide—Hallucination, ContextPrecision, ContextRecall, AnswerRelevance, and G-Eval—as built-in, ready-to-use evaluation tools, plus end-to-end LLM tracing that connects score drops to specific pipeline failures. You can start with Opik’s hosted free tier or self-host the full platform from GitHub. Either way, you’ll go from “something seems off” to “here’s exactly what’s broken and why” a lot faster.

How to Evaluate RAG Systems: Metrics, Methods, and What to Measure First

How RAG Systems Fail (And Why Evaluation Must Be Disaggregated)

Three Dimensions of RAG Quality (The “RAG Triad”)

LLM-as-a-Judge: The Evaluation Engine Behind These Metrics

Evaluating the RAG Triad in Practice

How do you measure retrieval quality in RAG systems?

How do you detect hallucinations in RAG outputs?

What does answer relevance measure in RAG evaluation?

Retrieval Metrics for Production Tuning

Evaluator Biases You Need To Watch For

Putting RAG Evaluation Into Practice

Build your evaluation dataset early

Use synthetic data to scale (carefully)

Evaluate with Opik: end-to-end workflow

Stress-testing and adversarial evaluation

Boundary testing: what happens outside the happy path?

The RAG-specific attack surface

What to test for

Start measuring, then start improving