Evaluate threads

Step-by-step guide on how to evaluate conversation threads

When you run multi-turn conversations using frameworks that support LLM agents, the Opik integration automatically groups related traces into conversation threads, using the identifier appropriate to each framework.

This guide will walk you through the process of evaluating and optimizing conversation threads in Opik using the evaluate_threads function in the Python SDK.

For complete API reference documentation, see the evaluate_threads API reference.

Using the Python SDK

The Python SDK provides a simple and efficient way to evaluate and optimize conversation threads using the evaluate_threads function. It lets you specify a filter string to select which threads to evaluate and a list of metrics to apply to each thread, and it returns a ThreadsEvaluationResult object containing the evaluation results and feedback scores.

Most importantly, this function automatically uploads the feedback scores to your traces in Opik, so once the evaluation completes you can also see the results in the UI.

To run the threads evaluation, you can use the following code:

from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric

# Initialize the evaluation metrics
conversation_coherence_metric = ConversationalCoherenceMetric()
user_frustration_metric = UserFrustrationMetric()

# Run the threads evaluation
results = evaluate_threads(
    project_name="ai_team",
    filter_string='id = "0197ad2a"',
    eval_project_name="ai_team_evaluation",
    metrics=[
        conversation_coherence_metric,
        user_frustration_metric,
    ],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)

Want to create your own custom conversation metrics? Check out the Custom Conversation Metrics guide to learn how to build specialized metrics for evaluating multi-turn dialogues.

Understanding the Transform Arguments

Threads consist of multiple traces, and each trace has an input and output. In practice, these typically contain user messages and agent responses. However, trace inputs and outputs are rarely just simple strings—they are usually complex data structures whose exact format depends on your agent framework.

To handle this complexity, you need to provide trace_input_transform and trace_output_transform functions. These are critical parameters that tell Opik how to extract the actual message content from your framework-specific trace structure.

Why Transform Functions Are Needed

Different agent frameworks structure their trace data differently:

  • LangChain might store messages in {"messages": [{"content": "..."}]}
  • CrewAI might use {"task": {"description": "..."}}
  • Custom implementations can have any structure you’ve defined

Without transform functions, Opik wouldn’t know where to find the actual user questions and agent responses within your trace data.
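
To make this concrete, here is what the corresponding transforms might look like for the hypothetical structures listed above. The exact keys are assumptions; they depend entirely on your framework and instrumentation:

# Hypothetical LangChain-style structure: {"messages": [{"content": "..."}]}
langchain_input_transform = lambda x: x["messages"][-1]["content"]

# Hypothetical CrewAI-style structure: {"task": {"description": "..."}}
crewai_input_transform = lambda x: x["task"]["description"]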

How Transform Functions Work

Using these functions, the Opik evaluation engine converts the threads selected for evaluation into the standardized format expected by all Opik thread evaluation metrics:

[
  {
    "role": "user",
    "content": "input string from trace 1"
  },
  {
    "role": "assistant",
    "content": "output string from trace 1"
  },
  {
    "role": "user",
    "content": "input string from trace 2"
  },
  {
    "role": "assistant",
    "content": "output string from trace 2"
  }
]
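
To illustrate the idea, here is a simplified sketch of what the engine does with your transforms. This is not the actual implementation, just the conceptual mapping; it assumes each trace is a dict with raw "input" and "output" fields:

def build_conversation(traces, input_transform, output_transform):
    """Convert a thread's traces (in chronological order) into the standard format."""
    conversation = []
    for trace in traces:
        # Each trace contributes one user turn and one assistant turn
        conversation.append({"role": "user", "content": input_transform(trace["input"])})
        conversation.append({"role": "assistant", "content": output_transform(trace["output"])})
    return conversation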

Example:

If your trace input has the following structure:

{
  "content": {
    "user_question": "Tell me about your service?"
  },
  "metadata": {...}
}

Then your trace_input_transform should be:

lambda x: x["content"]["user_question"]
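
If some traces in a thread might be missing these keys, a more defensive variant of the same transform (still assuming the hypothetical structure above) avoids a KeyError during evaluation:

# Falls back to an empty string when a key is absent
lambda x: x.get("content", {}).get("user_question", "")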

Don’t want to deal with transformations because your traces don’t have a consistent format? Try using LLM-based transformations; language models are good at this!

Using a filter string

The evaluate_threads function takes a filter string as an argument. This string is used to select the threads that should be evaluated. For example, if you want to evaluate only threads that have a specific ID, you can use the following filter string:

filter_string='id = "0197ad2a"'

You can combine multiple conditions using the AND operator. For example, if you want to evaluate only threads that have a specific ID and a specific status, you can use the following filter string:

filter_string='id = "0197ad2a" AND status = "inactive"'

Supported filter fields and operators

The evaluate_threads function supports the following filter fields in the filter_string using Opik Query Language (OQL). All fields and operators are the same as those supported by search_traces and search_spans:

| Field                   | Type       | Operators                                                   |
| ----------------------- | ---------- | ----------------------------------------------------------- |
| id                      | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| name                    | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| created_by              | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| thread_id               | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| type                    | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| model                   | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| provider                | String     | =, !=, contains, not_contains, starts_with, ends_with, >, < |
| status                  | String     | =, contains, not_contains                                   |
| start_time              | DateTime   | =, >, <, >=, <=                                             |
| end_time                | DateTime   | =, >, <, >=, <=                                             |
| input                   | String     | =, contains, not_contains                                   |
| output                  | String     | =, contains, not_contains                                   |
| metadata                | Dictionary | =, contains, >, <                                           |
| feedback_scores         | Numeric    | =, >, <, >=, <=                                             |
| tags                    | List       | contains                                                    |
| usage.total_tokens      | Numeric    | =, !=, >, <, >=, <=                                         |
| usage.prompt_tokens     | Numeric    | =, !=, >, <, >=, <=                                         |
| usage.completion_tokens | Numeric    | =, !=, >, <, >=, <=                                         |
| duration                | Numeric    | =, !=, >, <, >=, <=                                         |
| number_of_messages      | Numeric    | =, !=, >, <, >=, <=                                         |
| total_estimated_cost    | Numeric    | =, !=, >, <, >=, <=                                         |

Rules:

  • String values must be wrapped in double quotes
  • DateTime fields require ISO 8601 format (e.g., "2024-01-01T00:00:00Z")
  • Use dot notation for nested objects: metadata.model, feedback_scores.accuracy
  • Multiple conditions can be combined with AND (OR is not supported)
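
For example, a filter that follows these rules, combining a DateTime condition with a dot-notation metadata condition (the values are illustrative):

filter_string='start_time >= "2024-01-01T00:00:00Z" AND metadata.model contains "gpt"'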

The feedback_scores field is a dictionary where the keys are the metric names and the values are the metric values. You can use it to filter threads based on their feedback scores. For example, if you want to evaluate only threads that have a specific user frustration score, you can use the following filter string:

filter_string='feedback_scores.user_frustration_score >= 0.5'

Here, user_frustration_score is the name of the user frustration metric and 0.5 is the threshold value to filter by.

Best practice: If you are using the SDK for thread evaluation, automate it by setting up a scheduled cron job with filters to regularly generate feedback scores for specific threads.
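
A minimal sketch of such a scheduled job follows; the project names, metric, filter window, and transforms are placeholders to adapt to your setup:

# evaluate_recent_threads.py
# Example crontab entry (runs daily at 2am): 0 2 * * * python evaluate_recent_threads.py
from datetime import datetime, timedelta, timezone

from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import UserFrustrationMetric

# Only consider threads started in the last 24 hours that are already inactive
since = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")

evaluate_threads(
    project_name="ai_team",  # placeholder project name
    filter_string=f'status = "inactive" AND start_time >= "{since}"',
    eval_project_name="ai_team_evaluation",  # placeholder evaluation project
    metrics=[UserFrustrationMetric()],
    trace_input_transform=lambda x: x["input"],    # adapt to your trace structure
    trace_output_transform=lambda x: x["output"],  # adapt to your trace structure
)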

Using the Opik UI to view results

Once the evaluation is complete, you can access the results in the Opik UI. Not only will you be able to see the score values, but also the LLM-judge reasoning behind them!

Important: The status field represents the status of the thread. A thread is marked inactive when it has not received any new traces in the last 15 minutes (the default timeout, which can be changed). Threads are automatically marked inactive after the timeout period, and you can also manually mark a thread as inactive via the UI or the SDK.

You can only evaluate/score threads that are inactive.

Multi-Value Feedback Scores for Threads

Team-based thread evaluation enables multiple evaluators to score conversation threads independently, providing more reliable assessment of multi-turn dialogue quality.

Key benefits for thread evaluation:

  • Conversation complexity scoring - Multiple reviewers can assess different aspects like coherence, user satisfaction, and goal completion across conversation turns
  • Reduced evaluation bias - Individual subjectivity in judging conversational quality is mitigated through team consensus
  • Thread-specific metrics - Teams can collaboratively evaluate conversation-specific aspects like frustration levels, topic drift, and resolution success

This collaborative approach is especially valuable for conversational threads where dialogue quality, context maintenance, and user experience assessment often require multiple expert perspectives.

Next steps

For more details on what metrics can be used to score conversational threads, refer to the conversational metrics page.