Evaluation Concepts

Opik provides two complementary approaches to evaluating your LLM application. Understanding when to use each will help you build a robust evaluation strategy.

Test Suites — assertion-based testing

Test Suites let you define expected behaviors as natural-language assertions. An LLM judge checks each assertion against your agent’s output and reports pass/fail results.

Best for:

  • Testing specific behaviors (e.g., “the response does not hallucinate”)
  • Pass/fail validation of agent outputs
  • Iterating on prompts and comparing versions
  • Catching regressions after changes

A Test Suite has three main components:

  1. Test items: Input data for your agent (e.g., questions with context, user scenarios)
  2. Assertions: Natural-language descriptions of expected behavior, checked by an LLM judge (e.g., “The response is concise”)
  3. Execution policy: Controls how many times each item is run and how many runs must pass

Assertions can be defined at two levels:

  • Suite-level assertions apply to every test item
  • Item-level assertions apply only to a specific test item, in addition to suite-level ones
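The way the two assertion levels combine can be sketched in plain Python. This is a minimal illustration, not the Opik SDK: `SuiteItem`, `SuiteSpec`, and `effective_assertions` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SuiteItem:
    """Hypothetical type: one agent input plus assertions specific to it."""
    data: dict
    assertions: list[str] = field(default_factory=list)


@dataclass
class SuiteSpec:
    """Hypothetical type: suite-level assertions plus the test items."""
    assertions: list[str]
    items: list[SuiteItem]

    def effective_assertions(self, item: SuiteItem) -> list[str]:
        # Item-level assertions are checked in addition to suite-level ones.
        return self.assertions + item.assertions


suite = SuiteSpec(
    assertions=["The response is concise"],
    items=[
        SuiteItem({"question": "What is Opik?"}),
        SuiteItem(
            {"question": "Summarize the refund policy"},
            assertions=["The response does not hallucinate"],
        ),
    ],
)

for item in suite.items:
    print(item.data["question"], "->", suite.effective_assertions(item))
```

The first item is checked against the suite-level assertion only; the second is checked against both.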

Pass/fail logic

  • A run passes if all its assertions pass
  • An item passes if the number of passed runs is at least the pass_threshold set in the execution policy
  • The pass rate is the ratio of passed items to total items
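The three rules above can be written out as a short sketch. It assumes that pass_threshold is a count of runs, that "meets" means "greater than or equal to", and that an empty suite has a pass rate of 0.0. The function names are illustrative and are not part of the Opik SDK.

```python
def run_passes(assertion_results: list[bool]) -> bool:
    # A run passes only if every assertion in it passed.
    return all(assertion_results)


def item_passes(runs: list[list[bool]], pass_threshold: int) -> bool:
    # An item passes when enough of its runs pass (assumed: >= threshold).
    passed_runs = sum(run_passes(run) for run in runs)
    return passed_runs >= pass_threshold


def pass_rate(items: list[list[list[bool]]], pass_threshold: int) -> float:
    # Ratio of passed items to total items (assumed: 0.0 for an empty suite).
    if not items:
        return 0.0
    passed_items = sum(item_passes(runs, pass_threshold) for runs in items)
    return passed_items / len(items)


# Two items, three runs each, two assertions per run.
results = [
    [[True, True], [True, False], [True, True]],   # 2 of 3 runs pass
    [[False, True], [True, False], [True, True]],  # 1 of 3 runs pass
]
rate = pass_rate(results, pass_threshold=2)
print(rate)  # 0.5
```

With a threshold of 2, only the first item passes, so the pass rate is 1 of 2 items.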

Datasets & Metrics — quantitative scoring

Dataset-based evaluation scores your agent’s outputs using quantitative metrics. You define a dataset of test cases, run your agent against them, and score the results using pre-built or custom metrics.

Best for:

  • Measuring quality across many traces with a common metric (hallucination, relevance, coherence)
  • Comparing model or prompt versions with numeric scores
  • Evaluating RAG pipelines with context precision/recall metrics
  • Building leaderboards across experiments

A dataset-based evaluation has three main components:

  1. Dataset: A collection of test cases with inputs and optional expected outputs
  2. Task: A function that takes a dataset item and returns your agent’s output
  3. Metrics: Scoring functions that evaluate the output (e.g., Hallucination, AnswerRelevance, custom metrics)

Each evaluation run creates an Experiment — a record of every dataset item, your agent’s output, and the metric scores. Experiments are stored in Opik so you can compare them side-by-side.
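To make the relationship between dataset, task, metrics, and experiment concrete, here is a minimal sketch in plain Python. It deliberately does not call the Opik SDK: `run_experiment`, `exact_match`, and the canned `task` are hypothetical stand-ins, and a real evaluation would store the resulting experiment in Opik rather than in a local list.

```python
from statistics import mean
from typing import Callable

# Dataset: test cases with inputs and optional expected outputs.
dataset = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
]


def task(item: dict) -> str:
    # Task: takes a dataset item and returns the agent's output.
    # A canned lookup stands in for a real agent call here.
    canned = {"2 + 2": "4", "capital of France": "Lyon"}
    return canned[item["input"]]


def exact_match(item: dict, output: str) -> float:
    # Metric: a heuristic scoring function returning a numeric score.
    return 1.0 if output == item.get("expected_output") else 0.0


def run_experiment(
    dataset: list[dict],
    task: Callable[[dict], str],
    metrics: dict[str, Callable[[dict, str], float]],
) -> list[dict]:
    # Experiment: one record per dataset item with its output and scores.
    rows = []
    for item in dataset:
        output = task(item)
        scores = {name: metric(item, output) for name, metric in metrics.items()}
        rows.append({"item": item, "output": output, "scores": scores})
    return rows


experiment = run_experiment(dataset, task, {"exact_match": exact_match})
average = mean(row["scores"]["exact_match"] for row in experiment)
print(average)  # 0.5
```

Because every row keeps the item, the output, and the per-metric scores together, two such experiments can be compared item by item as well as by their averages.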

Choosing between the two

|                   | Test Suites                                  | Datasets & Metrics                           |
| ----------------- | -------------------------------------------- | -------------------------------------------- |
| Output            | Pass/fail per assertion                      | Numeric scores per metric                    |
| Evaluation method | LLM judge checks natural-language assertions | Scoring functions (LLM-based or heuristic)   |
| Best for          | Behavioral testing, regression checks        | Quality measurement, benchmarking            |
| Iteration style   | Update assertions, re-run suite              | Update dataset or metrics, re-run experiment |

You can use both approaches together. For example, use Test Suites during development to validate specific behaviors, and Datasets & Metrics in CI to track quality scores over time.