Evaluate agent trajectories | Opik Documentation

In Opik 2.0, datasets and experiments are project-scoped. Make sure to specify a project_name when creating datasets and running experiments so they are associated with the correct project.

Evaluating agents requires more than checking the final output. You also need to assess the trajectory: the steps your agent takes to reach an answer, including tool selection, reasoning chains, and intermediate decisions.

Agent trajectory evaluation helps you catch tool selection errors, identify inefficient reasoning paths, and optimize agent behavior before it reaches production.

Agent trajectory showing multiple steps and tool calls

Prerequisites

Before evaluating agent trajectories, you need:

Opik SDK installed and configured — See Quickstart for setup
Agent with observability enabled — Your agent must be instrumented with Opik tracing
Test dataset — Examples with expected agent behavior

If your agent isn’t traced yet, see Log Traces to add observability first.

Installing the Opik SDK

To install the Opik Python SDK you can run the following command:

$ pip install opik

Then you can configure the SDK by running the following command:

$ opik configure

This prompts you for your API key and workspace, or for your instance URL if you are self-hosting.

Adding observability to your agent

To evaluate the agent’s trajectory, you first need to add tracing to your agent so that Opik can capture each step it takes. The following LangChain example passes an OpikTracer callback when invoking the agent:

from langchain.agents import create_agent
from opik.integrations.langchain import OpikTracer

opik_tracer = OpikTracer()

def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

agent = create_agent(
    model="openai:gpt-4o",
    tools=[get_weather],
    system_prompt="You are a helpful assistant"
)

# Run the agent
agent.invoke(
    {"messages": [{
        "role": "user",
        "content": "what is the weather in sf"
    }]},
    config={"callbacks": [opik_tracer]}
)

If you’re using specific agent frameworks like CrewAI, LangGraph, or OpenAI Agents, check our integrations for framework-specific setup instructions.

Evaluating your agent’s trajectory

To evaluate the agent’s trajectory, we need to create a dataset, define an evaluation metric, and then run the evaluation.

Creating a dataset

We will create a dataset of user questions, each paired with the tools the agent is expected to call:

from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="agent_tool_selection", project_name="my-project")
dataset.insert([
    {
        "input": "What is 25 * 17?",
        "expected_tool": []
    },
    {
        "input": "What is the weather in SF?",
        "expected_tool": ["get_weather"]
    },
    {
        "input": "What is the weather in NY?",
        "expected_tool": ["get_weather"]
    }
])

The format of dataset items is very flexible: you can include any fields you want in each item.
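
Because items are free-form, a quick check before inserting can catch a missing or mistyped field that a metric later relies on. The sketch below is plain Python and does not call Opik. `REQUIRED_FIELDS` and `validate_items` are hypothetical names introduced for illustration, and the sketch assumes the `input` and `expected_tool` fields used in the dataset above.

```python
from typing import Any, Dict, List

# Fields this guide's metric and task read from each item (an assumption of this sketch).
REQUIRED_FIELDS = {"input": str, "expected_tool": list}


def validate_items(items: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means every item looks fine."""
    problems = []
    for index, item in enumerate(items):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in item:
                problems.append(f"item {index}: missing field '{field}'")
            elif not isinstance(item[field], expected_type):
                problems.append(
                    f"item {index}: field '{field}' should be {expected_type.__name__}"
                )
    return problems


items = [
    {"input": "What is 25 * 17?", "expected_tool": []},
    {"input": "What is the weather in SF?", "expected_tool": "get_weather"},  # wrong type
    {"input": "What is the weather in NY?"},  # missing field
]

for problem in validate_items(items):
    print(problem)
```

Running a check like this before `dataset.insert(...)` keeps malformed items out of the dataset, where they would otherwise only surface as confusing scores during evaluation.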

Defining the evaluation metric

In this task, we measure Strict Tool Adherence: whether the agent called exactly the expected tools, in the expected order.

The key to this metric is the optional task_span parameter. It is available to all custom metrics and gives access to the agent’s trajectory:

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel
from typing import List

class StrictToolAdherenceMetric(BaseMetric):
    def __init__(self, name: str = "strict_tool_adherence"):
        self.name = name

    def find_tools(self, task_span):
        """Find all tool spans in the SpanModel hierarchy."""
        tools_used = []

        def extract_tools_from_spans(spans):
            """Recursively extract tools from spans list."""
            for span in spans:
                # Check if this span is a tool
                if span.type == "tool" and span.name:
                    tools_used.append(span.name)

                # Recursively check nested spans
                if span.spans:
                    extract_tools_from_spans(span.spans)

        # Start the recursive search from the top level spans
        if task_span.spans:
            extract_tools_from_spans(task_span.spans)

        return tools_used

    def score(self, task_span: SpanModel,
              expected_tool: List[str], **kwargs):
        # Find tool calls in trajectory
        tool_used = self.find_tools(task_span)

        if tool_used == expected_tool:
            return score_result.ScoreResult(
                value=1.0,
                name=self.name,
                reason=f"Correct: used {tool_used}"
            )
        else:
            return score_result.ScoreResult(
                value=0.0,
                name=self.name,
                reason=f"Used {tool_used}, expected {expected_tool}"
            )
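
To see how the traversal and the strict comparison behave without running an agent, the same logic can be exercised on hand-built spans. This is a minimal sketch: `FakeSpan` is a hypothetical stand-in that exposes only the `name`, `type`, and `spans` attributes the metric reads. It is not Opik's `SpanModel`, and the span tree below is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in with only the attributes the metric reads."""
    name: Optional[str] = None
    type: str = "general"
    spans: List["FakeSpan"] = field(default_factory=list)


def find_tools(task_span: FakeSpan) -> List[str]:
    """Collect tool span names depth-first, in the order they appear."""
    tools_used: List[str] = []

    def walk(spans: List[FakeSpan]) -> None:
        for span in spans:
            if span.type == "tool" and span.name:
                tools_used.append(span.name)
            walk(span.spans)  # descend into nested spans before the next sibling

    walk(task_span.spans)
    return tools_used


# Invented trajectory: a model call, a nested tool call, then a final model call.
root = FakeSpan(name="agent", spans=[
    FakeSpan(name="model", type="llm"),
    FakeSpan(name="tools", spans=[FakeSpan(name="get_weather", type="tool")]),
    FakeSpan(name="model", type="llm"),
])

used = find_tools(root)
print(used)                                      # ['get_weather']
print(used == ["get_weather"])                   # True: same tools, same order
print(used == ["get_weather", "get_weather"])    # False: a repeated call is a mismatch
print(find_tools(FakeSpan(name="agent")) == [])  # True: no tools used, none expected
```

Because the comparison is plain list equality, an extra call, a missing call, or a different order all score 0.0. A more lenient metric would need a different comparison.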

Running the evaluation

Let’s define our evaluation task that will run our agent and return the assistant’s response:

def evaluation_task(dataset_item: dict) -> dict:
    res = agent.invoke(
        {"messages": [{
            "role": "user",
            "content": dataset_item["input"]
        }]},
        config={"callbacks": [opik_tracer]}
    )

    return {"output": res['messages'][-1].content}
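
Before spending model calls, it can help to check the shape of the task's return value against a stub. The sketch below is an assumption-laden illustration: `StubAgent` and `make_evaluation_task` are hypothetical helpers, the stub returns a simple object with a `content` attribute in place of a real message, and no tracing callback is attached, so no trajectory is recorded.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Optional


class StubAgent:
    """Hypothetical stand-in for the agent: answers without calling a model."""

    def invoke(self, payload: Dict[str, Any],
               config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        question = payload["messages"][-1]["content"]
        reply = SimpleNamespace(content=f"stub answer to: {question}")
        # Mimic the response shape the task relies on: the reply is the last message.
        return {"messages": [*payload["messages"], reply]}


def make_evaluation_task(agent: Any) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a task bound to a given agent so a stub can be swapped in for tests."""
    def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
        res = agent.invoke(
            {"messages": [{"role": "user", "content": dataset_item["input"]}]}
        )
        return {"output": res["messages"][-1].content}
    return evaluation_task


task = make_evaluation_task(StubAgent())
result = task({"input": "What is the weather in SF?", "expected_tool": ["get_weather"]})
print(result)
```

This only confirms that the task reads `input` from the item and returns a dictionary with an `output` key. Trajectory scoring still requires the traced agent.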

Now that we have our dataset and metric, we can run the evaluation:

from opik.evaluation import evaluate

# Run the evaluation
experiment = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[StrictToolAdherenceMetric()],
    project_name="my-project"
)

Analyzing the results

The Opik experiment dashboard provides a rich set of tools to help you analyze the results of the trajectory evaluation.

You can see the results of the evaluation in the Opik UI:

Experiment results showing tool selection scores

If you click on a specific test case row, you can view the full trajectory of the agent’s execution using the Trace button.

Next Steps

Now that you can evaluate agent trajectories:

Learn about Task Span Metrics for advanced trajectory analysis patterns
Optimize your agent with Agent Optimization
Monitor agents in production with Production Monitoring