Getting Started with Evaluation

Opik provides two approaches to evaluation. Choose the one that fits your use case:

Test Suites: Define assertions in natural language and let an LLM judge test them. Best for pass/fail behavioral testing.
Datasets & Metrics: Score outputs against a dataset using quantitative metrics. Best for measuring quality across many traces.
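To make the second approach concrete, here is a minimal sketch of what "scoring outputs against a dataset with a quantitative metric" means. It is plain Python and deliberately does not use Opik's API: the `dataset` contents, `toy_task`, and `exact_match` are hypothetical stand-ins chosen only for illustration.

```python
from statistics import mean

# Hypothetical dataset: each item pairs an input with an expected answer.
dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What color is the sky on a clear day?", "expected": "blue"},
]


def toy_task(item: dict) -> str:
    """Stand-in for an LLM call: returns a canned answer or 'unknown'."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
    }
    return canned.get(item["question"], "unknown")


def exact_match(output: str, expected: str) -> float:
    """Quantitative metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Score every item, then aggregate into a single quality number.
scores = [exact_match(toy_task(item), item["expected"]) for item in dataset]
average_score = mean(scores)
print(f"Exact match: {average_score:.2f}")
```

Unlike a pass/fail assertion, the result here is a numeric score per item that can be averaged and tracked across many outputs.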

Quick start

Test Suites

Test Suites let you define expected behaviors as natural-language assertions and run them against your agent. An LLM judge checks each assertion automatically.

import opik
from openai import OpenAI
from opik.integrations.openai import track_openai

openai_client = track_openai(OpenAI())
opik_client = opik.Opik()

# Create a suite with assertions
suite = opik_client.get_or_create_test_suite(
    name="my-agent-tests",
    project_name="my-agent",
    global_assertions=[
        "The response directly addresses the user's question",
        "The response is concise (3 sentences or fewer)",
    ],
    global_execution_policy={"runs_per_item": 2, "pass_threshold": 2},
)

# Add test cases
suite.insert([
    {"data": {"question": "How do I create a new project?", "context": "Go to Dashboard and click 'New Project'."}},
    {"data": {"question": "What are the pricing tiers?", "context": "Free ($0/month), Pro ($29/month), Enterprise (custom)."}},
])

# Define the task
def task(item):
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context."},
            {"role": "user", "content": f"Question: {item['question']}\n\nContext:\n{item['context']}"},
        ],
    )
    return {"input": item, "output": response.choices[0].message.content}

# Run the evaluation
result = opik.run_tests(test_suite=suite, task=task)
print(f"Pass rate: {result.pass_rate:.0%}")

Each run creates an experiment in the Opik dashboard for easy comparison.

Figure: Test suite experiment results showing pass/fail per item, with assertion details.
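The quick start sets `global_execution_policy={"runs_per_item": 2, "pass_threshold": 2}`. One plausible reading is that each test case is executed `runs_per_item` times and counts as passed only when at least `pass_threshold` of those runs satisfy every assertion. The sketch below illustrates that reading in plain Python. It is an assumption about the semantics, not Opik's implementation, and `ExecutionPolicy`, `item_passes`, and `pass_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ExecutionPolicy:
    # Hypothetical mirror of the policy dict used in the quick start.
    runs_per_item: int
    pass_threshold: int


def item_passes(run_results: list[bool], policy: ExecutionPolicy) -> bool:
    """run_results[i] is True when run i satisfied every assertion."""
    if len(run_results) != policy.runs_per_item:
        raise ValueError("expected one result per configured run")
    # Assumed rule: enough individual runs must pass for the item to pass.
    return sum(run_results) >= policy.pass_threshold


def pass_rate(items: list[list[bool]], policy: ExecutionPolicy) -> float:
    """Fraction of items that pass under the policy (0.0 for no items)."""
    if not items:
        return 0.0
    return sum(item_passes(runs, policy) for runs in items) / len(items)


policy = ExecutionPolicy(runs_per_item=2, pass_threshold=2)
# Two items, two runs each: the second item fails one of its runs.
results = [[True, True], [True, False]]
print(f"Pass rate: {pass_rate(results, policy):.0%}")  # prints "Pass rate: 50%"
```

Under this reading, setting `pass_threshold` equal to `runs_per_item` requires every run to pass, while a lower threshold tolerates occasional failures from a non-deterministic model.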

See the Building Test Suites guide for the full walkthrough.