Evaluate agents

Step-by-step guide on how to evaluate and optimize AI agents throughout their lifecycle

Building AI agents isn’t just about making them work: it’s about making them reliable, intelligent, and scalable.

As agents reason, act, and interact with real users, treating them like black boxes isn’t enough.

To ship production-grade agents, teams need a clear path from development to deployment, grounded in observability, testing, and optimization.

This guide walks you through the agent lifecycle and shows how Opik helps at every stage.

In this guide, we will focus on evaluating complex agentic applications. If you are looking to evaluate single prompts, refer to the Evaluate A Prompt guide.

1. Start with Observability

The first step in agent development is making its behavior transparent. From day one, you should instrument your agent with trace logging — capturing inputs, intermediate steps, tool calls, outputs, and errors.

With just two lines of code, you unlock full visibility into how your agent thinks and acts. Using Opik, you can inspect every step, understand what happened, and quickly debug issues.

[Image: Adding tracing to an agent]

This guide uses a Python agent built with LangGraph to illustrate tracing and evaluation.
If you’re using other frameworks like OpenAI Agents, CrewAI, Haystack, or LlamaIndex,
you can check out our Integrations Overview to get started with tracing in your setup.
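For example, with the LangGraph agent used in this guide, tracing can be enabled by passing Opik's LangChain callback when invoking the graph. The snippet below is a minimal sketch: it assumes your compiled graph is the `agent_executor` object used later in this guide and that your Opik credentials are already configured.

import opik
from opik.integrations.langchain import OpikTracer
from langchain_core.messages import HumanMessage

from agent import agent_executor  # the compiled LangGraph graph used in this guide

opik.configure()  # or set OPIK_API_KEY / OPIK_WORKSPACE in your environment

# The OpikTracer callback logs the graph steps, LLM calls, and tool calls as a trace with spans
tracer = OpikTracer()

result = agent_executor.invoke(
    {"messages": [HumanMessage(content="Multiply 7 by 6")]},
    config={"callbacks": [tracer]},
)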

Once you’ve logged your first traces, Opik gives you immediate access to valuable insights, not just about what your agent did, but how it performed. You can explore detailed trace data, see how many traces and spans your agent is generating, track token usage, and monitor response latency across runs.

[Image: Opik dashboard showing cost, latency, and token usage]

For each interaction with the end user, you can also see how the agent planned, chose tools, and crafted an answer based on the user input, along with the agent graph and much more.

[Image: Detailed view of a single agent trace]

During the development phase, having access to all this information is fundamental for debugging and understanding what is working as expected and what is not.

Error detection

Having immediate access to all traces that returned an error can also be a lifesaver, and Opik makes it extremely easy to achieve:

[Image: List of traces highlighting agent errors]

For each of the errors and exceptions captured, you have access to all the details you need to fix the issue:

[Image: Error details view for a failed agent trace]

2. Evaluate Agent’s End-to-end Behavior

Once you have full visibility into the agent's interactions, memory, and tool usage, and you have made sure everything works at the technical level, the next logical step is to start checking the quality of the responses and actions your agent takes.

Human Feedback

The fastest and easiest way to do this is to provide manual human feedback. Each trace and each span can be rated “Correct” or “Incorrect” by a person (most likely you!), which gives you a baseline for understanding the quality of the responses.

In Opik, you can attach a feedback score and a comment to each trace, and when you are done you can store all the results in a dataset to reuse in the next iterations of agent optimization.

[Image: Human feedback for agent traces]
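If you prefer to attach this feedback from code instead of the UI, the Opik Python SDK can log feedback scores against existing traces. The sketch below assumes the client's log_traces_feedback_scores method and uses a placeholder trace ID.

from opik import Opik

client = Opik()

# Attach a human rating and comment to an already-logged trace
# (assumes the log_traces_feedback_scores helper; the trace ID is a placeholder)
client.log_traces_feedback_scores(
    scores=[
        {
            "id": "YOUR_TRACE_ID",
            "name": "Correct",
            "value": 1,
            "reason": "The agent chose the right tool and the answer was accurate.",
        }
    ]
)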

Online evaluation

Marking an answer as simply “correct” or “incorrect” is a useful first step, but it’s rarely enough. As your agent grows more complex, you’ll want to measure how well it performs across more nuanced dimensions.

That’s where online evaluation becomes essential.

With Opik, you can automatically score traces using a wide range of metrics, such as answer relevance, hallucination detection, agent moderation, user moderation, or even custom criteria tailored to your specific use case. These evaluations run continuously, giving you structured feedback on your agent’s quality without requiring manual review.

Want to dive deeper?
Check out the Metrics Documentation to explore all the heuristic metrics and LLM-as-a-judge evaluations that Opik offers out of the box.
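These online evaluations are typically configured as rules in the Opik UI, but the same LLM-as-a-judge metrics are also available from the SDK. As a rough sketch (assuming the Hallucination metric accepts input, output, and context, and that a judge model such as an OpenAI key is configured), you could score a single response like this:

from opik.evaluation.metrics import Hallucination

metric = Hallucination()  # LLM-as-a-judge metric; assumes a judge model is configured

score = metric.score(
    input="When was Apollo 11 launched?",
    output="Apollo 11 was launched on July 16, 1969.",
    context=["Apollo 11 launched on July 16, 1969, from Kennedy Space Center."],
)
print(score.value, score.reason)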

3. Evaluate Agent’s Steps

When building complex agents, evaluating only the final output isn’t enough. Agents reason through sequences of actions—choosing tools, calling functions, retrieving memories, and generating intermediate messages.

Each of these steps can introduce errors long before they show up in the final answer.

That’s why evaluating agent steps independently is a core best practice.

Without step-level evaluation, you might only notice failures after they impact the final user response, without knowing where things went wrong. With step evaluation, you can catch issues as they occur and identify exactly which part of your agent’s reasoning or architecture needs fixing.

What Steps Should You Evaluate?

Depending on your agent architecture, you might want to score:

  • Tool Calls: Did the agent pick the right tool for the job? Did it provide correct parameters?
  • Memory Retrievals: Was the retrieved memory relevant to the query?
  • Plans: Did the agent generate a coherent, executable plan?
  • Intermediate Messages: Was the internal reasoning logical and consistent?

For each of those steps you can use one of Opik’s predefined metrics or create your own custom metric that adapts to your needs.

4. Example: Evaluating Tool Selection Quality with Opik

When building agents that use tools (like web search, calculators, APIs…), it’s critical to know how well your agent is choosing and using those tools.

Is your agent picking the right tool? Is it using it correctly? Is it wasting time or making mistakes?

The easiest way to measure this in Opik is by running a custom evaluation experiment.

What We’ll Do

In this example, we’ll use Opik’s SDK to create a script that will run an experiment to measure how well an agent selects tools.

When you run the experiment, Opik will:

  • Execute the agent against every item in a dataset of examples.
  • Evaluate each agent interaction using a custom metric.
  • Log results (scores and reasoning) into a dashboard you can explore.

This will give you a clear, data-driven view of how good (or bad!) your agent’s tool selection behavior really is.

What We Need

For every experiment we want to run, there are three key elements to create:

  1. A Dataset: a set of example user queries and the expected correct tool usage.

  2. A Metric: a way to automatically decide whether the agent's behavior was correct or not (we'll create a custom one).

  3. An Evaluation Task: a function that tells Opik how to run your agent on each dataset item.
Full Example: Tool Selection Evaluation Script

Here’s the full example:

import os
from opik import Opik
from opik.evaluation import evaluate
from agent import agent_executor
from langchain_core.messages import HumanMessage
from experiments.tool_selection_metric import ToolSelectionQuality

os.environ["OPIK_API_KEY"] = "YOUR_API_KEY"
os.environ["OPIK_WORKSPACE"] = "YOUR_WORKSPACE"

client = Opik()

# This is the dataset with the examples of good tool selection
dataset = client.get_dataset(name="Your_Dataset")

"""
Note: if you don't have a dataset yet, you can easily create it this way:
dataset = client.get_or_create_dataset(name="My_Dataset")

# Define the items
items = [
    {
        "input": "Find information about adding numbers.",
        "expected_output": "tavily_search_results_json"
    },
    {
        "input": "Multiply 7×6",
        "expected_output": "simple_math_tool"
    }
    [...]
]

# Insert the dataset items
dataset.insert(items)
"""

# This function defines how each item in the dataset will be evaluated.
# For each dataset item:
# - It sends the `input` as a message to the agent (`agent_executor`).
# - It captures the agent's actual tool calls from its outputs.
# - It packages the original input, the agent's outputs, the detected tool calls, and the expected tool calls.
# This structured output is what the evaluation framework will use to compare expected vs actual behavior using the custom metric(s) you define.

def evaluation_task(dataset_item):
    try:
        user_message_content = dataset_item["input"]
        expected_tool = dataset_item["expected_output"]

        # This is where you call your agent with the input message and get the real execution results.
        result = agent_executor.invoke({"messages": [HumanMessage(content=user_message_content)]})

        tool_calls = []

        # Here we extract the tool calls the agent actually made.
        # We loop through the agent's messages, check tool calls,
        # and for each tool call, we capture its metadata.
        for msg in result.get("messages", []):
            if hasattr(msg, "tool_calls") and msg.tool_calls:
                for tool_call in msg.tool_calls:
                    tool_calls.append({
                        "function_name": tool_call.get("name"),
                        "function_parameters": tool_call.get("args", {})
                    })

        return {
            "input": user_message_content,
            "output": result,
            "tool_calls": tool_calls,
            "expected_tool_calls": [{"function_name": expected_tool, "function_parameters": {}}]
        }

    except Exception as e:
        return {
            "input": dataset_item.get("input", {}),
            "output": "Error processing input.",
            "tool_calls": [],
            "expected_tool_calls": [{"function_name": "unknown", "function_parameters": {}}],
            "error": str(e)
        }

# This is the custom metric we have defined
metrics = [ToolSelectionQuality()]

# This function runs the full evaluation process.
# It loops over each dataset item and applies the `evaluation_task` function to generate outputs.
# It then applies the custom `ToolSelectionQuality` metric (or any provided metrics) to score each result.
# It logs the evaluation results to Opik under the specified experiment name ("AgentToolSelectionExperiment").
# This allows tracking, comparing, and analyzing your agent's tool selection quality over time in Opik.

eval_results = evaluate(
    experiment_name="AgentToolSelectionExperiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics
)

The custom ToolSelectionQuality metric looks like this:

from opik.evaluation.metrics import base_metric, score_result

class ToolSelectionQuality(base_metric.BaseMetric):
    def __init__(self, name: str = "tool_selection_quality"):
        self.name = name

    def score(self, tool_calls, expected_tool_calls, **kwargs):
        try:
            actual_tool = tool_calls[0]["function_name"]
            expected_tool = expected_tool_calls[0]["function_name"]

            if actual_tool == expected_tool:
                return score_result.ScoreResult(
                    name=self.name,
                    value=1,
                    reason=f"Correct tool selected: {actual_tool}"
                )
            else:
                return score_result.ScoreResult(
                    name=self.name,
                    value=0,
                    reason=f"Wrong tool. Expected {expected_tool}, got {actual_tool}"
                )
        except Exception as e:
            return score_result.ScoreResult(
                name=self.name,
                value=0,
                reason=f"Scoring error: {e}"
            )
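The same pattern extends to other checks. For example, if your dataset also stores the expected parameters, a variation of this metric could score parameter extraction as well. The sketch below reuses the tool_calls structure produced by the evaluation task above; the 50/50 weighting is just one possible choice.

from opik.evaluation.metrics import base_metric, score_result

class ToolCallQuality(base_metric.BaseMetric):
    """Hypothetical variant: scores both tool choice and parameter extraction."""

    def __init__(self, name: str = "tool_call_quality"):
        self.name = name

    def score(self, tool_calls, expected_tool_calls, **kwargs):
        if not tool_calls:
            return score_result.ScoreResult(name=self.name, value=0, reason="No tool was called.")

        actual = tool_calls[0]
        expected = expected_tool_calls[0]

        tool_ok = actual["function_name"] == expected["function_name"]
        params_ok = actual.get("function_parameters", {}) == expected.get("function_parameters", {})

        # Half credit for choosing the right tool, half for passing the expected parameters
        value = 0.5 * tool_ok + 0.5 * params_ok
        return score_result.ScoreResult(
            name=self.name,
            value=value,
            reason=f"Tool match: {tool_ok}, parameter match: {params_ok}",
        )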

After running this script:

  • You will see a new experiment in Opik.
  • Each item will have a tool selection score and a reason explaining why it was correct or incorrect.
  • You can then analyze results, filter mistakes, and build better training data for your agent.

This method is a scalable way to move from gut feelings to hard evidence when improving your agent’s behavior.

[Image: Experiments Dashboard in Opik]

What Happens Next? Iterate, Improve, and Compare

Running the experiment once gives you a baseline: a first measurement of how good (or bad) your agent’s tool selection behavior is.

But the real power comes from using these results to improve your agent — and then re-running the experiment to measure progress.

Here’s how you can use this workflow:

  1. Run the initial evaluation experiment: see where your agent is making tool selection mistakes.

  2. Analyze the results in Opik: look at the most common errors and read the reasoning behind low scores.

  3. Make improvements to your agent: update the system prompt to improve instructions, refine tool descriptions, and adjust tool names or input formats to be more intuitive.

  4. Re-run the evaluation experiment: use the same dataset to measure how your changes affected tool selection quality.

  5. Compare the results: review improvements in score, spot reductions in errors, and identify new patterns or regressions.

[Image: Comparing results between experiments]

  6. Repeat the cycle until quality is met: iterate as many times as needed to reach the level of performance you want from your agent.

And this is just one module! You can then move on to the next component of your agent.

You can evaluate each module with metrics like the following (a sketch of a simple path-level metric follows the list):

  • Router: tool selection and parameter extraction
  • Tools: output accuracy, hallucinations
  • Planner: plan length, validity, sufficiency
  • Paths: looping, redundant steps
  • Reflection: output quality, retry logic
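As an illustration of a path-level check, here is a heuristic sketch that flags repeated identical tool calls as a possible loop. It assumes the same tool_calls structure used in the tool selection example above.

from opik.evaluation.metrics import base_metric, score_result

class NoRedundantToolCalls(base_metric.BaseMetric):
    """Hypothetical path metric: penalizes repeated identical tool calls."""

    def __init__(self, name: str = "no_redundant_tool_calls"):
        self.name = name

    def score(self, tool_calls, **kwargs):
        seen = set()
        repeats = 0
        for call in tool_calls:
            key = (call["function_name"], str(call.get("function_parameters", {})))
            if key in seen:
                repeats += 1
            seen.add(key)

        if repeats == 0:
            return score_result.ScoreResult(name=self.name, value=1, reason="No repeated tool calls.")
        return score_result.ScoreResult(
            name=self.name,
            value=0,
            reason=f"{repeats} repeated tool call(s) detected, possible loop.",
        )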

5. Wrapping Up: Where to Go From Here

Building great agents is a journey that doesn’t stop at getting them to “work.” It’s about creating agents you can trust, understand, and continuously improve.

In this guide, you’ve learned how to make agent behavior observable, how to evaluate outputs and reasoning steps, and how to design experiments that drive real, measurable improvements.

But this is just the beginning.

From here, you might want to:

  • Optimize your prompts to drive better agent behavior with Prompt Optimization.
  • Monitor agents in production to catch regressions, errors, and drift in real time with Production Monitoring.
  • Add guardrails for security, content safety, and sensitive data leakage prevention, ensuring your agents behave responsibly even in dynamic environments.

Each of these steps builds on the foundation you’ve set: observability, evaluation, and continuous iteration.

By combining them, you’ll be ready to take your agents from early prototypes to production-grade systems that are powerful, safe, and scalable.