Evaluate multi-turn agents

Step-by-step guide to evaluating multi-turn agents

When working on chatbots or multi-turn agents, evaluating the agent’s behavior over multiple turns is a challenge: you don’t know what the user would have asked as a follow-up to the agent’s previous responses.

To evaluate multi-turn behavior, we can turn to simulation: the core idea is to use an LLM to generate what the user would plausibly have said next based on the previous turns, and to run this loop for a number of turns.

Once we have this conversation, we can use Opik’s evaluation features to score the agent’s behavior.

Creating the user simulator

To perform multi-turn evaluation, we first need a user simulator that generates the user’s next message based on the previous turns.

User simulator
```python
from opik.simulation import SimulatedUser

user_simulator = SimulatedUser(
    persona="You are a frustrated user who wants a refund",
    model="openai/gpt-4.1",
)

# Generate a user message that will start the conversation
print(user_simulator.generate_response([
    {"role": "assistant", "content": "Hello, how can I help you today?"}
]))

# Generate a user message based on a couple of back and forth turns
print(user_simulator.generate_response([
    {"role": "assistant", "content": "Hello, how can I help you today?"},
    {"role": "user", "content": "My product just broke 2 days after I bought it, I want a refund."},
    {"role": "assistant", "content": "I'm sorry to hear that. What happened?"}
]))
```

Now that we have a way to simulate the user, we can create multiple simulations that we will then evaluate.

Running simulations

We will do this in three steps: define the scenarios, wrap our agent in a function, and run the simulation loop.

1. Create a list of scenarios

To keep track of the scenarios more easily, let’s create a dataset with the user personas we will be using:

Running simulations
```python
import opik

opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset(name="Multi-turn evaluation")
dataset.insert([
    {"user_persona": "You are a frustrated user who wants a refund"},
    {"user_persona": "You are a user who is happy with your product and wants to buy more"},
    {"user_persona": "You are a user who is having trouble with your product and wants to get help"}
])
```

2. Create our agent app

To run the simulations, we need to wrap our existing agent in a function. The run_agent function we create should have the following signature:

Run agent function signature
```python
from langchain.agents import create_agent
from opik.integrations.langchain import OpikTracer

opik_tracer = OpikTracer()

agent = create_agent(
    model="openai:gpt-4.1",
    tools=[],
    system_prompt="You are a helpful assistant",
)

# In-memory store of per-thread conversation history
agent_history = {}

def run_agent(user_message: str, *, thread_id: str, **kwargs) -> str:
    if thread_id not in agent_history:
        agent_history[thread_id] = []

    agent_history[thread_id].append({"role": "user", "content": user_message})
    messages = agent_history[thread_id]

    response = agent.invoke({"messages": messages}, config={"callbacks": [opik_tracer]})
    agent_history[thread_id] = response["messages"]

    return response["messages"][-1].content
```
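Before wiring this into a simulation, you can sanity-check the wrapper by calling it directly. The thread id below is an arbitrary string of our own choosing; any stable identifier works:

```python
# Two calls with the same thread_id share conversation history
print(run_agent("My product broke 2 days after I bought it.", thread_id="test-thread-1"))
print(run_agent("What are my options for a refund?", thread_id="test-thread-1"))
```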

3. Run the simulations

Now that we have a dataset with the user personas, we can run the simulations:

Running simulations
```python
import opik
from opik.simulation import SimulatedUser, run_simulation

# Fetch the user personas
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset(name="Multi-turn evaluation")

# Run the simulations
all_simulations = []
for item in dataset.get_items():
    user_persona = item["user_persona"]
    user_simulator = SimulatedUser(
        persona=user_persona,
        model="openai/gpt-4.1",
    )
    simulation = run_simulation(
        app=run_agent,
        user_simulator=user_simulator,
        max_turns=5,
    )

    all_simulations.append(simulation)
```

The run_simulation function keeps track of the conversation state internally by building a list of messages, recording the result of the run_agent function as an assistant message and the SimulatedUser’s response as a user message.

If you need more complex conversation state, you can manage threads yourself by calling the SimulatedUser’s generate_response method directly, as shown in the sketch below.
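To illustrate, here is a minimal sketch of that loop, driving the conversation manually. It assumes generate_response returns the simulated user’s next message as a string, as the print examples above suggest; the turn limit and uuid-based thread id are our own choices rather than part of the Opik API:

```python
import uuid

# Manually alternate between the simulated user and the agent,
# mirroring what run_simulation does internally
thread_id = str(uuid.uuid4())
conversation = []

for _ in range(5):  # maximum number of turns
    # Assumption: generate_response returns the user's message as a string
    user_message = user_simulator.generate_response(conversation)
    conversation.append({"role": "user", "content": user_message})

    # Record the agent's reply as an assistant message
    agent_reply = run_agent(user_message, thread_id=thread_id)
    conversation.append({"role": "assistant", "content": agent_reply})
```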

The simulated threads will be available in the Opik thread UI.

Scoring threads

When evaluating multi-turn conversations, you can use one of Opik’s built-in conversation metrics or create your own.

If you’ve used the run_simulation function, you already have a list of conversation messages that you can pass directly to the metrics; otherwise, you can use the evaluate_threads function. The direct approach looks like this:

```python
import opik
from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric

opik_client = opik.Opik()

# Define the metrics you want to use
conversation_coherence_metric = ConversationalCoherenceMetric()
user_frustration_metric = UserFrustrationMetric()

for simulation in all_simulations:
    conversation = simulation["conversation_history"]

    coherence_score = conversation_coherence_metric.score(conversation)
    frustration_score = user_frustration_metric.score(conversation)

    opik_client.log_threads_feedback_scores(
        scores=[
            {
                "id": simulation["thread_id"],
                "name": "conversation_coherence",
                "value": coherence_score.value,
                "reason": coherence_score.reason
            },
            {
                "id": simulation["thread_id"],
                "name": "user_frustration",
                "value": frustration_score.value,
                "reason": frustration_score.reason
            }
        ]
    )
```

You can learn more about the evaluate_threads function in the evaluate_threads guide.
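For reference, a sketch of what an evaluate_threads call can look like; the project names, filter string, and transforms below are placeholder assumptions, so check the guide for the exact parameters:

```python
from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric

# Score inactive threads in a project (placeholder names and filter)
results = evaluate_threads(
    project_name="Default Project",
    filter_string='status = "inactive"',
    eval_project_name="multi_turn_evaluation",
    metrics=[ConversationalCoherenceMetric()],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
```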

Once the threads have been scored, you can view the results in the Opik thread UI.

Next steps