Evaluate agent trajectories

Step-by-step guide to evaluating agent trajectories

Evaluating agents requires more than checking the final output. You also need to assess the trajectory: the steps your agent takes to reach an answer, including tool selection, reasoning chains, and intermediate decisions.

Agent trajectory evaluation helps you catch tool selection errors, identify inefficient reasoning paths, and optimize agent behavior before it reaches production.

Agent trajectory showing multiple steps and tool calls

Prerequisites

Before evaluating agent trajectories, you need:

  1. Opik SDK installed and configured — See Quickstart for setup
  2. Agent with observability enabled — Your agent must be instrumented with Opik tracing
  3. Test dataset — Examples with expected agent behavior

If your agent isn’t traced yet, see Log Traces to add observability first.

Installing the Opik SDK

To install the Opik Python SDK, run the following command:

$ pip install opik

Then you can configure the SDK by running the following command:

$ opik configure

This will prompt you for your API key and workspace, or for your instance URL if you are self-hosting.
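
If you prefer to configure the SDK non-interactively (for example in CI), you can pass the same values in code instead. The snippet below is a minimal sketch; the placeholder API key and workspace are hypothetical values you would replace with your own, and it assumes the SDK’s opik.configure() helper accepts them as keyword arguments.

import opik

# Non-interactive configuration (sketch). Replace the placeholder values
# with your own API key and workspace, or point the SDK at your own
# instance URL instead if you are self-hosting.
opik.configure(
    api_key="YOUR_API_KEY",      # placeholder
    workspace="YOUR_WORKSPACE",  # placeholder
)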

Adding observability to your agent

To evaluate the agent’s trajectory, you first need to add tracing to your agent. This allows Opik to capture each step of the trajectory so it can be scored later.

from langchain.agents import create_agent
from opik.integrations.langchain import OpikTracer

opik_tracer = OpikTracer()

def get_weather(city: str) -> str:
    """Get weather for a given city."""
    return f"It's always sunny in {city}!"

agent = create_agent(
    model="openai:gpt-4o",
    tools=[get_weather],
    system_prompt="You are a helpful assistant"
)

# Run the agent
agent.invoke(
    {"messages": [{
        "role": "user",
        "content": "what is the weather in sf"
    }]},
    config={"callbacks": [opik_tracer]}
)

If you’re using specific agent frameworks like CrewAI, LangGraph, or OpenAI Agents, check our integrations for framework-specific setup instructions.

Evaluating your agent’s trajectory

To evaluate the agent’s trajectory, we need to create a dataset, define an evaluation metric, and then run the evaluation.

Creating a dataset

We are going to create a dataset of user questions paired with the tools the agent is expected to call:

from opik import Opik

client = Opik()
dataset = client.get_or_create_dataset(name="agent_tool_selection")
dataset.insert([
    {
        "input": "What is 25 * 17?",
        "expected_tool": []
    },
    {
        "input": "What is the weather in SF?",
        "expected_tool": ["get_weather"]
    },
    {
        "input": "What is the weather in NY?",
        "expected_tool": ["get_weather"]
    }
])

The format of dataset items is very flexible; you can include any fields you want in each item.
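
For example, a single item could also carry an expected final answer or arbitrary metadata alongside the expected tools. The extra field names below (expected_output, tags) are purely illustrative, not fields Opik requires:

# Illustrative only: "expected_output" and "tags" are arbitrary field names
# chosen for this example; Opik stores whatever fields you insert.
dataset.insert([
    {
        "input": "What is the weather in Paris?",
        "expected_tool": ["get_weather"],
        "expected_output": "A short weather summary for Paris",
        "tags": ["weather", "smoke-test"]
    }
])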

Defining the evaluation metric

In this task, we are going to measure Strict Tool Adherence: whether the agent calls exactly the expected tools, in the order they are expected.

The key to this metric is the optional task_span parameter. It is available for all custom metrics and can be used to access the agent’s trajectory:

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel
from typing import List

class StrictToolAdherenceMetric(BaseMetric):
    def __init__(self, name: str = "strict_tool_adherence"):
        self.name = name

    def find_tools(self, task_span):
        """Find all tool spans in the SpanModel hierarchy."""
        tools_used = []

        def extract_tools_from_spans(spans):
            """Recursively extract tools from spans list."""
            for span in spans:
                # Check if this span is a tool
                if span.type == "tool" and span.name:
                    tools_used.append(span.name)

                # Recursively check nested spans
                if span.spans:
                    extract_tools_from_spans(span.spans)

        # Start the recursive search from the top level spans
        if task_span.spans:
            extract_tools_from_spans(task_span.spans)

        return tools_used

    def score(self, task_span: SpanModel,
              expected_tool: List[str], **kwargs):
        # Find tool calls in trajectory
        tool_used = self.find_tools(task_span)

        if tool_used == expected_tool:
            return score_result.ScoreResult(
                value=1.0,
                name=self.name,
                reason=f"Correct: used {tool_used}"
            )
        else:
            return score_result.ScoreResult(
                value=0.0,
                name=self.name,
                reason=f"Used {tool_used}, expected {expected_tool}"
            )
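
Strict equality is deliberately unforgiving: the agent must call exactly the expected tools, in the expected order, and nothing else. If ordering does not matter for your use case, a relaxed variant can compare the two lists as multisets instead. The sketch below reuses the find_tools helper from the metric above; the class name and behavior are an assumption about what you might want, not part of the metric itself.

from collections import Counter

class UnorderedToolAdherenceMetric(StrictToolAdherenceMetric):
    """Sketch of a relaxed variant: ignores tool-call order, keeps counts."""

    def __init__(self, name: str = "unordered_tool_adherence"):
        self.name = name

    def score(self, task_span: SpanModel,
              expected_tool: List[str], **kwargs):
        tool_used = self.find_tools(task_span)

        # Compare as multisets so ordering differences are not penalized
        matched = Counter(tool_used) == Counter(expected_tool)
        return score_result.ScoreResult(
            value=1.0 if matched else 0.0,
            name=self.name,
            reason=f"Used {tool_used}, expected {expected_tool} (order ignored)"
        )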

Running the evaluation

Let’s define an evaluation task that runs our agent and returns the assistant’s response:

def evaluation_task(dataset_item: dict) -> dict:
    res = agent.invoke(
        {"messages": [{
            "role": "user",
            "content": dataset_item["input"]
        }]},
        config={"callbacks": [opik_tracer]}
    )

    return {"output": res['messages'][-1].content}

Now that we have our dataset and metric, we can run the evaluation:

from opik.evaluation import evaluate

# Run the evaluation
experiment = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[StrictToolAdherenceMetric()]
)
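
You can also give the experiment a recognizable name and attach metadata so runs are easier to compare later. The snippet below is a sketch that assumes evaluate() accepts the experiment_name and experiment_config parameters; the values themselves are placeholders.

# Sketch: name the experiment and attach metadata (placeholder values).
experiment = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[StrictToolAdherenceMetric()],
    experiment_name="tool-selection-baseline",
    experiment_config={"model": "openai:gpt-4o", "agent_version": "v1"},
)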

Analyzing the results

The Opik experiment dashboard provides a rich set of tools to help you analyze the results of the trajectory evaluation.

You can see the results of the evaluation in the Opik UI:

Experiment results showing tool selection scores

If you click on a specific test case row, you can view the full trajectory of the agent’s execution using the Trace button.

Next Steps

Now that you can evaluate agent trajectories: