Evaluation Concepts
Opik provides two complementary approaches to evaluating your LLM application. Understanding when to use each will help you build a robust evaluation strategy.
Test Suites — assertion-based testing
Test Suites let you define expected behaviors as natural-language assertions. An LLM judge checks each assertion against your agent’s output and reports pass/fail results.
Best for:
- Testing specific behaviors (e.g., “the response does not hallucinate”)
- Pass/fail validation of agent outputs
- Iterating on prompts and comparing versions
- Catching regressions after changes
A Test Suite has three main components:
- Test items: Input data for your agent (e.g., questions with context, user scenarios)
- Assertions: Natural-language descriptions of expected behavior, checked by an LLM judge (e.g., “The response is concise”)
- Execution policy: Controls how many times each item is run and how many runs must pass
Assertions can be defined at two levels:
- Suite-level assertions apply to every test item
- Item-level assertions apply only to a specific test item, in addition to suite-level ones
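The two levels compose simply: the effective assertion set for an item is the suite-level list plus that item's own list. A minimal sketch in plain Python (this is conceptual, not the Opik SDK; the class and field names are illustrative):

```python
from dataclasses import dataclass, field


@dataclass
class TestItem:
    input: str
    assertions: list[str] = field(default_factory=list)  # item-level assertions


@dataclass
class TestSuite:
    assertions: list[str]  # suite-level assertions, applied to every item
    items: list[TestItem]

    def effective_assertions(self, item: TestItem) -> list[str]:
        # Suite-level assertions always apply; item-level ones are added on top.
        return self.assertions + item.assertions


suite = TestSuite(
    assertions=["The response is concise"],
    items=[TestItem("What is RAG?", assertions=["The response defines retrieval"])],
)
# Both the suite-level and the item-level assertion apply to this item.
print(suite.effective_assertions(suite.items[0]))
```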
Pass/fail logic
- A run passes if all its assertions pass
- An item passes if the number of passed runs meets the `pass_threshold`
- The pass rate is the ratio of passed items to total items
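The pass/fail rules above can be sketched directly in plain Python (a conceptual illustration, not the Opik SDK; function names are illustrative):

```python
def run_passes(assertion_results: list[bool]) -> bool:
    # A run passes only if every assertion passed.
    return all(assertion_results)


def item_passes(runs: list[list[bool]], pass_threshold: int) -> bool:
    # Each inner list holds one run's assertion results.
    passed_runs = sum(run_passes(r) for r in runs)
    return passed_runs >= pass_threshold


def pass_rate(item_results: list[bool]) -> float:
    # Ratio of passed items to total items.
    return sum(item_results) / len(item_results)


# An item run 3 times, requiring at least 2 passing runs:
runs = [[True, True], [True, False], [True, True]]
print(item_passes(runs, pass_threshold=2))  # True: 2 of 3 runs passed
```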
Datasets & Metrics — quantitative scoring
Dataset-based evaluation scores your agent’s outputs using quantitative metrics. You define a dataset of test cases, run your agent against them, and score the results using pre-built or custom metrics.
Best for:
- Measuring quality across many traces with a common metric (hallucination, relevance, coherence)
- Comparing model or prompt versions with numeric scores
- Evaluating RAG pipelines with context precision/recall metrics
- Building leaderboards across experiments
A dataset-based evaluation has three main components:
- Dataset: A collection of test cases with inputs and optional expected outputs
- Task: A function that takes a dataset item and returns your agent’s output
- Metrics: Scoring functions that evaluate the output (e.g., `Hallucination`, `AnswerRelevance`, custom metrics)
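The three roles fit together in a simple loop: for each dataset item, the task produces an output, and each metric scores it. A self-contained sketch in plain Python (conceptual only; in practice the Opik SDK wires these pieces together for you, and real metrics like `Hallucination` use an LLM judge rather than string comparison):

```python
def exact_match(output: str, expected: str) -> float:
    # Toy stand-in for a scoring metric: 1.0 on a (case-insensitive) match, else 0.0.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Dataset: test cases with inputs and expected outputs.
dataset = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
]


def task(item: dict) -> str:
    # Placeholder agent; replace with a call to your real application.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(item["input"], "")


# Run the task over every item and score each output with the metric.
scores = [exact_match(task(it), it["expected_output"]) for it in dataset]
print(sum(scores) / len(scores))  # mean metric score across the dataset
```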
Each evaluation run creates an Experiment — a record of every dataset item, your agent’s output, and the metric scores. Experiments are stored in Opik so you can compare them side-by-side.
Choosing between the two
You can use both approaches together. For example, use Test Suites during development to validate specific behaviors, and Datasets & Metrics in CI to track quality scores over time.