Evaluate threads

Step-by-step guide on how to evaluate conversation threads

When you run multi-turn conversations using frameworks that support LLM agents, the Opik integration automatically groups related traces into conversation threads, using the identifier appropriate to each framework.

This guide will walk you through the process of evaluating and optimizing conversation threads in Opik using the evaluate_threads function in the Python SDK.

For complete API reference documentation, see the evaluate_threads API reference.

Using the Python SDK

The Python SDK provides a simple and efficient way to evaluate and optimize conversation threads using the evaluate_threads function. It lets you specify a filter string to select which threads to evaluate and a list of metrics to apply to each thread, and it returns a ThreadsEvaluationResult object containing the evaluation results and feedback scores.

Most importantly, this function automatically uploads the feedback scores to your traces in Opik, so once the evaluation completes you can also see the results in the UI.

To run the threads evaluation, you can use the following code:

from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric

# Initialize the evaluation metrics
conversation_coherence_metric = ConversationalCoherenceMetric()
user_frustration_metric = UserFrustrationMetric()

# Run the threads evaluation
results = evaluate_threads(
    project_name="ai_team",
    filter_string='id = "0197ad2a"',
    eval_project_name="ai_team_evaluation",
    metrics=[
        conversation_coherence_metric,
        user_frustration_metric,
    ],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)

Want to create your own custom conversation metrics? Check out the Custom Conversation Metrics guide to learn how to build specialized metrics for evaluating multi-turn dialogues.

Understanding the Transform Arguments

Threads consist of multiple traces, and each trace has an input and output. In practice, these typically contain user messages and agent responses. However, trace inputs and outputs are rarely just simple strings—they are usually complex data structures whose exact format depends on your agent framework.

To handle this complexity, you need to provide trace_input_transform and trace_output_transform functions. These are critical parameters that tell Opik how to extract the actual message content from your framework-specific trace structure.

Why Transform Functions Are Needed

Different agent frameworks structure their trace data differently:

  • LangChain might store messages in {"messages": [{"content": "..."}]}
  • CrewAI might use {"task": {"description": "..."}}
  • Custom implementations can have any structure you’ve defined

Without transform functions, Opik wouldn’t know where to find the actual user questions and agent responses within your trace data.
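
To make this concrete, here is what the corresponding transforms might look like for the hypothetical structures listed above. The exact keys are assumptions; they depend entirely on your framework and instrumentation:

# Hypothetical LangChain-style structure: {"messages": [{"content": "..."}]}
langchain_input_transform = lambda x: x["messages"][-1]["content"]

# Hypothetical CrewAI-style structure: {"task": {"description": "..."}}
crewai_input_transform = lambda x: x["task"]["description"]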

How Transform Functions Work

Using these functions, the Opik evaluation engine converts the threads selected for evaluation into the standardized format expected by all Opik thread evaluation metrics:

[
  {
    "role": "user",
    "content": "input string from trace 1"
  },
  {
    "role": "assistant",
    "content": "output string from trace 1"
  },
  {
    "role": "user",
    "content": "input string from trace 2"
  },
  {
    "role": "assistant",
    "content": "output string from trace 2"
  }
]
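
To illustrate the idea, here is a simplified sketch of what the engine does with your transforms. This is not the actual implementation, just the conceptual mapping; it assumes each trace is a dict with raw "input" and "output" fields:

def build_conversation(traces, input_transform, output_transform):
    """Convert a thread's traces (in chronological order) into the standard format."""
    conversation = []
    for trace in traces:
        # Each trace contributes one user turn and one assistant turn
        conversation.append({"role": "user", "content": input_transform(trace["input"])})
        conversation.append({"role": "assistant", "content": output_transform(trace["output"])})
    return conversation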

Example:

If your trace input has the following structure:

{
  "content": {
    "user_question": "Tell me about your service?"
  },
  "metadata": {...}
}

Then your trace_input_transform should be:

lambda x: x["content"]["user_question"]
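
If some traces in a thread might be missing these keys, a more defensive variant of the same transform (still assuming the hypothetical structure above) avoids a KeyError during evaluation:

# Falls back to an empty string when a key is absent
lambda x: x.get("content", {}).get("user_question", "")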

Don’t want to deal with transformations because your traces don’t have a consistent format? Try using LLM-based transformations; language models are good at this!

Using a filter string

The evaluate_threads function takes a filter string as an argument. This string is used to select the threads that should be evaluated. For example, if you want to evaluate only threads that have a specific ID, you can use the following filter string:

filter_string='id = "0197ad2a"'

You can combine multiple conditions using the AND operator. For example, if you want to evaluate only threads that have a specific ID and a specific status, you can use the following filter string:

filter_string='id = "0197ad2a" AND status = "inactive"'

Supported filter fields and operators

The evaluate_threads function supports the following filter fields in the filter_string using Opik Query Language (OQL). All fields and operators are the same as those supported by search_traces and search_spans:

| Field                   | Type       | Operators                                                   |
| ----------------------- | ---------- | ----------------------------------------------------------- |
| id                      | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| name                    | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| created_by              | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| thread_id               | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| type                    | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| model                   | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| provider                | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| status                  | String     | =, contains, not_contains                                   |
| start_time              | DateTime   | =, >, <, >=, <=                                             |
| end_time                | DateTime   | =, >, <, >=, <=                                             |
| input                   | String     | =, contains, not_contains                                   |
| output                  | String     | =, contains, not_contains                                   |
| metadata                | Dictionary | =, contains, >, <                                           |
| feedback_scores         | Numeric    | =, >, <, >=, <=                                             |
| tags                    | List       | contains                                                    |
| usage.total_tokens      | Numeric    | =, !=, >, <, >=, <=                                         |
| usage.prompt_tokens     | Numeric    | =, !=, >, <, >=, <=                                         |
| usage.completion_tokens | Numeric    | =, !=, >, <, >=, <=                                         |
| duration                | Numeric    | =, !=, >, <, >=, <=                                         |
| number_of_messages      | Numeric    | =, !=, >, <, >=, <=                                         |
| total_estimated_cost    | Numeric    | =, !=, >, <, >=, <=                                         |

Rules:

  • String values must be wrapped in double quotes
  • DateTime fields require ISO 8601 format (e.g., "2024-01-01T00:00:00Z")
  • Use dot notation for nested objects: metadata.model, feedback_scores.accuracy
  • Multiple conditions can be combined with AND (OR is not supported)
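
For example, a filter that follows these rules, combining a DateTime condition with a dot-notation metadata condition (the values are illustrative):

filter_string='start_time >= "2024-01-01T00:00:00Z" AND metadata.model contains "gpt"'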

The feedback_scores field is a dictionary where the keys are the metric names and the values are the metric values. You can use it to filter threads based on their feedback scores. For example, if you want to evaluate only threads that have a specific user frustration score, you can use the following filter string:

filter_string='feedback_scores.user_frustration_score >= 0.5'

Here, user_frustration_score is the name of the user frustration metric and 0.5 is the threshold value to filter by.

Best practice: If you are using the SDK for thread evaluation, automate it by setting up a scheduled cron job with filters to regularly generate feedback scores for specific threads.
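
A minimal sketch of such a scheduled job follows; the project names, metric, filter window, and transforms are placeholders to adapt to your setup:

# evaluate_recent_threads.py
# Example crontab entry (runs daily at 2am): 0 2 * * * python evaluate_recent_threads.py
from datetime import datetime, timedelta, timezone

from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import UserFrustrationMetric

# Only consider threads started in the last 24 hours that are already inactive
since = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")

evaluate_threads(
    project_name="ai_team",  # placeholder project name
    filter_string=f'status = "inactive" AND start_time >= "{since}"',
    eval_project_name="ai_team_evaluation",  # placeholder evaluation project
    metrics=[UserFrustrationMetric()],
    trace_input_transform=lambda x: x["input"],    # adapt to your trace structure
    trace_output_transform=lambda x: x["output"],  # adapt to your trace structure
)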

Using the Opik UI to view results

Once the evaluation is complete, you can access the results in the Opik UI. Not only will you be able to see the score values, but also the LLM-judge reasoning behind them!

Important: The status field represents the status of the thread. A thread is marked inactive when it has not received any new traces in the last 15 minutes (the default timeout, which can be changed). Threads are automatically marked inactive after the timeout period, and you can also manually mark a thread as inactive via the UI or the SDK.

You can only evaluate/score threads that are inactive.

Multi-Value Feedback Scores for Threads

Team-based thread evaluation enables multiple evaluators to score conversation threads independently, providing more reliable assessment of multi-turn dialogue quality.

Key benefits for thread evaluation:

  • Conversation complexity scoring - Multiple reviewers can assess different aspects like coherence, user satisfaction, and goal completion across conversation turns
  • Reduced evaluation bias - Individual subjectivity in judging conversational quality is mitigated through team consensus
  • Thread-specific metrics - Teams can collaboratively evaluate conversation-specific aspects like frustration levels, topic drift, and resolution success

This collaborative approach is especially valuable for conversational threads where dialogue quality, context maintenance, and user experience assessment often require multiple expert perspectives.

Next steps

For more details on what metrics can be used to score conversational threads, refer to the conversational metrics page.