Evaluate threads

Step-by-step guide on how to evaluate conversation threads

When you are running multi-turn conversations using frameworks that support LLM agents, the Opik integration will automatically group related traces into conversation threads using parameters suitable for each framework.

This guide will walk you through the process of evaluating and optimizing conversation threads in Opik using the evaluate_threads function in the Python SDK.

Using the Python SDK

The Python SDK provides a simple and efficient way to evaluate and optimize conversation threads using the evaluate_threads function. This function allows you to specify a filter string to select specific threads for evaluation, a list of metrics to apply to each thread, and it returns a ThreadsEvaluationResult object containing the evaluation results and feedback scores.

To run the threads evaluation, you can use the following code:

1from opik.evaluation import evaluate_threads
2from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric
3
4# Initialize the evaluation metrics
5conversation_coherence_metric = ConversationalCoherenceMetric()
6user_frustration_metric = UserFrustrationMetric()
7
8# Run the threads evaluation
9results = evaluate_threads(
10 project_name="ai_team",
11 filter_string='id = "0197ad2a"',
12 eval_project_name="ai_team_evaluation",
13 metrics=[
14 conversation_coherence_metric,
15 user_frustration_metric,
16 ],
17 trace_input_transform=lambda x: x["input"],
18 trace_output_transform=lambda x: x["output"],
19)

Using filter string

The evaluate_threads function takes a filter string as an argument. This string is used to select the threads that should be evaluated. For example, if you want to evaluate only threads that have a specific ID, you can use the following filter string:

1filter_string='id = "0197ad2a"'

You can combine multiple filter strings using the AND operator. For example, if you want to evaluate only threads that have a specific ID and have a specific status, you can use the following filter string:

1filter_string='id = "0197ad2a" AND status = "active"'

Supported filter fields and operators

The evaluate_threads function supports the following filter fields in the filter_string and operators to be applied to the corresponding fields:

FieldTypeOperators
idstring=, contains, not_contains
statusstring=, contains, not_contains
start_timedatetime=, >, <, >=, <=
end_timedatetime=, >, <, >=, <=
feedback_scoresdict=, >, <, >=, <=
tagslistcontains
durationnumber=, >, <, >=, <=
number_of_messagesnumber=, >, <, >=, <=
created_bystring=, contains, not_contains

The feedback_scores field is a dictionary where the keys are the metric names and the values are the metric values. You can use it to filter threads based on their feedback scores. For example, if you want to evaluate only threads that have a specific user frustration score, you can use the following filter string:

1filter_string='feedback_scores.user_frustration_score >= 0.5'

Where user_frustration_score is the name of the user frustration metric and 0.5 is the threshold value to filter by.

Using Opik UI to view results

Once the evaluation is complete, you can access the evaluation results in the Opik UI.

Next steps

For more details on what metrics can be used to score conversational threads, refer to the conversational metrics page.