Evaluate agents
Step-by-step guide on how to evaluate and optimize AI agents throughout their lifecycle
Building AI agents isn’t just about making them work: it’s about making them reliable, intelligent, and scalable.
As agents reason, act, and interact with real users, treating them like black boxes isn’t enough.
To ship production-grade agents, teams need a clear path from development to deployment, grounded in observability, testing, and optimization.
This guide walks you through the agent lifecycle and shows how Opik helps at every stage.
In this guide, we will focus on evaluating complex agentic applications. If you are looking to evaluate single prompts, refer to the Evaluate A Prompt guide.
1. Start with Observability
The first step in agent development is making its behavior transparent. From day one, you should instrument your agent with trace logging — capturing inputs, intermediate steps, tool calls, outputs, and errors.
With just two lines of code, you unlock full visibility into how your agent thinks and acts. Using Opik, you can inspect every step, understand what happened, and quickly debug issues.

This guide uses a Python agent built with LangGraph to illustrate tracing and evaluation.
If you’re using other frameworks like OpenAI Agents, CrewAI, Haystack, or LlamaIndex,
you can check out our Integrations Overview to get started with tracing in your setup.
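As a quick illustration, here is a minimal sketch of what tracing a LangGraph agent with Opik can look like. The toy graph and its single node are placeholders for your own agent, and the exact LangGraph imports may vary slightly between versions:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from opik.integrations.langchain import OpikTracer


class AgentState(TypedDict):
    question: str
    answer: str


def answer_node(state: AgentState) -> dict:
    # Placeholder node: a real agent would call an LLM and tools here.
    return {"answer": f"You asked: {state['question']}"}


# Build and compile a minimal LangGraph graph.
graph = StateGraph(AgentState)
graph.add_node("answer", answer_node)
graph.add_edge(START, "answer")
graph.add_edge("answer", END)
app = graph.compile()

# The two lines that enable tracing: create the Opik tracer and pass it as a callback.
opik_tracer = OpikTracer(graph=app.get_graph(xray=True))
result = app.invoke(
    {"question": "What is the weather in Paris?"},
    config={"callbacks": [opik_tracer]},
)
```

Every invocation made with the `opik_tracer` callback is logged as a trace, with spans for the graph nodes and any tool or LLM calls they make.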
Once you’ve logged your first traces, Opik gives you immediate access to valuable insights, not just about what your agent did, but how it performed. You can explore detailed trace data, see how many traces and spans your agent is generating, track token usage, and monitor response latency across runs.

For each interaction with the end user, you can also see how the agent planned, chose tools, or crafted an answer based on the user input, inspect the agent graph, and much more.

During the development phase, having access to all this information is fundamental for debugging and understanding what is working as expected and what is not.
Error detection
Having immediate access to all traces that returned an error can also be a lifesaver, and Opik makes this extremely easy:

For each of the errors and exceptions captured, you have access to all the details you need to fix the issue:

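As an example of how errors end up in those traces: when you decorate your own functions with Opik’s track decorator, an exception raised inside them is recorded on the corresponding span before being re-raised. The tool below is hypothetical and only serves to show the mechanism:

```python
import opik


@opik.track
def get_exchange_rate(currency: str) -> float:
    # Hypothetical tool: fails for unsupported currencies. The exception is
    # captured on the span created by @opik.track and shown in the trace.
    rates = {"EUR": 1.0, "USD": 1.08}
    if currency not in rates:
        raise ValueError(f"Unsupported currency: {currency}")
    return rates[currency]


@opik.track
def agent_step(user_input: str) -> str:
    rate = get_exchange_rate(user_input)
    return f"The exchange rate is {rate}"


try:
    agent_step("GBP")  # Raises, and the error appears on the trace in Opik.
except ValueError:
    pass
```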
2. Evaluate Agent’s End-to-end Behavior
Once you have full visibility into the agent's interactions, memory, and tool usage, and you have made sure everything works at the technical level, the next logical step is to start checking the quality of the responses and the actions your agent takes.
Human Feedback
The fastest and easiest way to do this is to provide manual human feedback. Each trace and each span can be rated “Correct” or “Incorrect” by a person (most probably you!), which gives you a baseline for understanding the quality of the responses.
You can provide human feedback and a comment for each trace’s score in Opik, and when you’re done, store all the results in a dataset that you will use in the next iterations of agent optimization.

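If you prefer to record that feedback programmatically instead of through the UI, the Opik SDK can attach feedback scores to existing traces. A minimal sketch, assuming you already collected the IDs of the traces you reviewed (the IDs and score name below are placeholders):

```python
import opik

client = opik.Opik()

# Attach a human correctness rating and a comment to traces you have reviewed.
client.log_traces_feedback_scores(
    scores=[
        {
            "id": "trace-id-1",  # placeholder: use a real trace ID from Opik
            "name": "correctness",
            "value": 1.0,
            "reason": "The agent answered the question accurately.",
        },
        {
            "id": "trace-id-2",  # placeholder: use a real trace ID from Opik
            "name": "correctness",
            "value": 0.0,
            "reason": "The agent picked the wrong tool for the request.",
        },
    ]
)
```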
Online evaluation
Marking an answer as simply “correct” or “incorrect” is a useful first step, but it’s rarely enough. As your agent grows more complex, you’ll want to measure how well it performs across more nuanced dimensions.
That’s where online evaluation becomes essential.
With Opik, you can automatically score traces using a wide range of metrics, such as answer relevance, hallucination detection, agent moderation, user moderation, or even custom criteria tailored to your specific use case. These evaluations run continuously, giving you structured feedback on your agent’s quality without requiring manual review.
Want to dive deeper?
Check out the Metrics Documentation to explore all the heuristic metrics and LLM-as-a-judge evaluations that Opik offers out of the box.
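The same metrics are also available in the SDK, which is handy for trying a judge locally before enabling it as an online evaluation rule. A minimal sketch using the built-in Hallucination metric (it calls an LLM judge under the hood, so a provider API key must be configured; the question and answer below are made up):

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination()

# Score a single agent answer against the context it was given.
result = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    context=["France is a country in Western Europe. Its capital is Paris."],
)
print(result.value, result.reason)
```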
3. Evaluate Agent’s Steps
When building complex agents, evaluating only the final output isn’t enough. Agents reason through sequences of actions—choosing tools, calling functions, retrieving memories, and generating intermediate messages.
Each of these steps can introduce errors long before they show up in the final answer.
That’s why evaluating agent steps independently is a core best practice.
Without step-level evaluation, you might only notice failures after they impact the final user response, without knowing where things went wrong. With step evaluation, you can catch issues as they occur and identify exactly which part of your agent’s reasoning or architecture needs fixing.
What Steps Should You Evaluate?
Depending on your agent architecture, you might want to score steps such as:
- Tool selection and parameter extraction
- Tool output accuracy and hallucinations
- Plan generation and validity
- Memory retrieval and reflection quality
For each of those steps, you can use one of Opik’s predefined metrics or create your own custom metric that adapts to your needs.
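One lightweight way to do this is to attach a feedback score directly to the span of the step you care about, so the result is visible next to that step in the trace. A minimal sketch, where the routing logic and the expected tool are placeholders:

```python
import opik
from opik import opik_context


@opik.track
def select_tool(user_input: str) -> str:
    # Hypothetical router step: a real agent would ask an LLM to pick a tool.
    selected_tool = "web_search" if "latest" in user_input else "calculator"

    # Score this individual step and attach the result to its span.
    expected_tool = "web_search"  # placeholder: would come from your test data
    opik_context.update_current_span(
        feedback_scores=[
            {
                "name": "tool_selection_correct",
                "value": 1.0 if selected_tool == expected_tool else 0.0,
                "reason": f"Expected {expected_tool}, got {selected_tool}.",
            }
        ]
    )
    return selected_tool


select_tool("What is the latest news about AI agents?")
```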
4. Example: Evaluating Tool Selection Quality with Opik
When building agents that use tools (like web search, calculators, APIs…), it’s critical to know how well your agent is choosing and using those tools.
Is the agent picking the right tool? Is it using the tool correctly? Is it wasting time or making mistakes?
The easiest way to measure this in Opik is by running a custom evaluation experiment.
What We’ll Do
In this example, we’ll use Opik’s SDK to create a script that will run an experiment to measure how well an agent selects tools.
When you run the experiment, Opik will:
- Execute the agent against every item in a dataset of examples.
- Evaluate each agent interaction using a custom metric.
- Log results (scores and reasoning) into a dashboard you can explore.
This will give you a clear, data-driven view of how good (or bad!) your agent’s tool selection behavior really is.
What We Need
For every Experiment we want to run, the most important elements we need to create are:
- A dataset of examples, pairing user inputs with the tool we expect the agent to select.
- An evaluation task that runs the agent on each dataset item and captures which tool it chose.
- A scoring metric that compares the selected tool against the expected one.
Full Example: Tool Selection Evaluation Script
Here’s the full example, with the custom Tool Selection metric included:
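The following is a minimal, self-contained sketch of what such a script can look like with the Opik SDK. The agent is replaced by a placeholder function, and the dataset name, the field names (user_query, expected_tool), and the ToolSelectionQuality metric are illustrative choices rather than built-in Opik components:

```python
from typing import Any, Dict

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import base_metric, score_result


class ToolSelectionQuality(base_metric.BaseMetric):
    """Scores 1.0 when the agent selected the expected tool, 0.0 otherwise."""

    def __init__(self, name: str = "tool_selection_quality"):
        super().__init__(name=name)

    def score(self, expected_tool: str, selected_tool: str, **kwargs) -> score_result.ScoreResult:
        correct = expected_tool == selected_tool
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if correct else 0.0,
            reason=(
                f"Correct tool selected ({selected_tool})."
                if correct
                else f"Expected {expected_tool} but the agent selected {selected_tool}."
            ),
        )


def run_agent_and_get_tool(user_query: str) -> str:
    # Placeholder for your real agent: invoke it, then extract which tool it
    # called (for example from the returned messages or the logged trace).
    return "web_search" if "latest" in user_query else "calculator"


def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
    selected_tool = run_agent_and_get_tool(dataset_item["user_query"])
    return {
        "selected_tool": selected_tool,
        "expected_tool": dataset_item["expected_tool"],
    }


client = opik.Opik()
dataset = client.get_or_create_dataset(name="tool-selection-examples")
dataset.insert(
    [
        {"user_query": "What is the latest news about AI agents?", "expected_tool": "web_search"},
        {"user_query": "What is 1234 * 5678?", "expected_tool": "calculator"},
    ]
)

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[ToolSelectionQuality()],
    experiment_name="tool_selection_baseline",
)
```

The evaluation task returns both the selected and the expected tool so that Opik can pass them to the metric’s score method as keyword arguments for each dataset item.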
After running this script:
- You will see a new experiment in Opik.
- Each item will have a tool selection score and a reason explaining why it was correct or incorrect.
- You can then analyze results, filter mistakes, and build better training data for your agent.
This method is a scalable way to move from gut feelings to hard evidence when improving your agent’s behavior.

What Happens Next? Iterate, Improve, and Compare
Running the experiment once gives you a baseline: a first measurement of how good (or bad) your agent’s tool selection behavior is.
But the real power comes from using these results to improve your agent — and then re-running the experiment to measure progress.
Here’s how you can use this workflow:
Make improvements to your agent
Update the system prompt to improve instructions, refine tool descriptions, and adjust tool names or input formats to be more intuitive.
Re-run the evaluation experiment
Use the same dataset to measure how your changes affected tool selection quality.
Compare the results
Review improvements in score, spot reductions in errors, and identify new patterns or regressions.

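For example, after updating the prompt or the tool descriptions, you can re-run the same dataset, task, and metric from the sketch above under a new experiment name, optionally recording what changed in the experiment metadata so the runs are easy to tell apart when you compare them in Opik (the names and config values below are placeholders):

```python
evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[ToolSelectionQuality()],
    experiment_name="tool_selection_v2_improved_prompt",
    experiment_config={"system_prompt_version": "v2", "tool_descriptions": "rewritten"},
)
```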
And this is just one module! You can then move on to the next component of your agent.
You can evaluate modules with metrics like the following:
- Router: Tool selection and parameter extraction
- Tools: Output accuracy, hallucinations
- Planner: Plan length, validity, sufficiency
- Paths: Looping, redundant steps
- Reflection: Output quality, retry logic
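As one more illustration, a simple heuristic metric for the Paths row above could flag traces in which the agent repeated the exact same tool call. The metric name and the shape of the tool_calls input are assumptions made for this sketch:

```python
from typing import Any, Dict, List

from opik.evaluation.metrics import base_metric, score_result


class RedundantToolCalls(base_metric.BaseMetric):
    """Heuristic: penalizes traces that repeat the exact same tool call."""

    def __init__(self, name: str = "redundant_tool_calls"):
        super().__init__(name=name)

    def score(self, tool_calls: List[Dict[str, Any]], **kwargs) -> score_result.ScoreResult:
        seen = set()
        redundant = 0
        for call in tool_calls:
            key = (call["tool"], str(call.get("arguments")))
            if key in seen:
                redundant += 1
            seen.add(key)
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if redundant == 0 else 0.0,
            reason=f"{redundant} redundant tool call(s) out of {len(tool_calls)}.",
        )


# Example: two identical web_search calls count as one redundant step.
print(
    RedundantToolCalls().score(
        tool_calls=[
            {"tool": "web_search", "arguments": {"query": "AI agents"}},
            {"tool": "web_search", "arguments": {"query": "AI agents"}},
            {"tool": "calculator", "arguments": {"expression": "2 + 2"}},
        ]
    )
)
```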
5. Wrapping Up: Where to Go From Here
Building great agents is a journey that doesn’t stop at getting them to “work.” It’s about creating agents you can trust, understand, and continuously improve.
In this guide, you’ve learned how to make agent behavior observable, how to evaluate outputs and reasoning steps, and how to design experiments that drive real, measurable improvements.
But this is just the beginning.
From here, you might want to:
- Optimize your prompts to drive better agent behavior with Prompt Optimization.
- Monitor agents in production to catch regressions, errors, and drift in real-time with Production Monitoring.
- Add guardrails for security, content safety, and sensitive data leakage prevention, ensuring your agents behave responsibly even in dynamic environments.
Each of these steps builds on the foundation you’ve set: observability, evaluation, and continuous iteration.
By combining them, you’ll be ready to take your agents from early prototypes to production-grade systems that are powerful, safe, and scalable.