evaluate_threads

opik.evaluation.evaluate_threads(project_name: str, filter_string: str | None, eval_project_name: str | None, metrics: List[ConversationThreadMetric], trace_input_transform: Callable[[Dict[str, Any | None] | List[Dict[str, Any | None]] | str], str], trace_output_transform: Callable[[Dict[str, Any | None] | List[Dict[str, Any | None]] | str], str], verbose: int = 1, num_workers: int = 8, max_traces_per_thread: int = 1000) -> ThreadsEvaluationResult

Evaluate conversation threads using specified metrics.

This function evaluates conversation threads from a project using the provided metrics. It creates a ThreadsEvaluationEngine to fetch threads matching the filter string, converts them to conversation threads, applies the metrics, and logs feedback scores.

Parameters:
  • project_name – The name of the project containing the threads to evaluate.

  • filter_string

    Optional filter string to select specific threads for evaluation using Opik Query Language (OQL). The format is: "<COLUMN> <OPERATOR> <VALUE> [AND <COLUMN> <OPERATOR> <VALUE>]*"

    Supported columns include:

    - id, name, created_by, thread_id, type, model, provider: String fields with full operator support
    - status: String field (=, contains, not_contains only)
    - start_time, end_time: DateTime fields (use ISO 8601 format, e.g., "2024-01-01T00:00:00Z")
    - input, output: String fields for content (=, contains, not_contains only)
    - metadata: Dictionary field (use dot notation, e.g., "metadata.model")
    - feedback_scores: Numeric field (use dot notation, e.g., "feedback_scores.accuracy")
    - tags: List field (use "contains" operator only)
    - usage.total_tokens, usage.prompt_tokens, usage.completion_tokens: Numeric usage fields
    - duration, number_of_messages, total_estimated_cost: Numeric fields

    Examples: 'status = "inactive"', 'id = "thread_123"', 'duration > 300' (see the filter-string sketch after the parameter list for compound filters). If None, all threads in the project will be evaluated.

  • eval_project_name – Optional name for the evaluation project where evaluation traces will be stored. If None, the same project_name will be used.

  • metrics – List of ConversationThreadMetric instances to apply to each thread. Must contain at least one metric.

  • trace_input_transform

    Function to transform trace input JSON to string representation. This function extracts the relevant user message from your trace’s input structure. The function receives the raw trace input as a dictionary and should return a string.

    Example: If your trace input is {"content": {"user_question": "Hello"}}, use: lambda x: x["content"]["user_question"]

    This transformation is essential because trace inputs vary by framework, but metrics expect a standardized string format representing the user’s message.

  • trace_output_transform

    Function to transform trace output JSON to string representation. This function extracts the relevant agent response from your trace’s output structure. The function receives the raw trace output as a dictionary and should return a string.

    Example: If your trace output is {"response": {"text": "Hi there"}}, use: lambda x: x["response"]["text"]

    This transformation is essential because trace outputs vary by framework, but metrics expect a standardized string format representing the agent’s response. A sketch of named transform functions for nested payloads follows the parameter list.

  • verbose – Verbosity level for progress reporting (0=silent, 1=progress). Default is 1.

  • num_workers – Number of concurrent workers for thread evaluation. Default is 8.

  • max_traces_per_thread – Maximum number of traces to fetch per thread. Default is 1000.
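
A minimal sketch of composing OQL filter strings per the grammar above and passing one to evaluate_threads. The project name, date, and thresholds are illustrative assumptions; the columns and operators are the documented ones:

>>> from opik.evaluation import evaluate_threads
>>> from opik.evaluation.metrics import ConversationalCoherenceMetric
>>>
>>> # Single-condition filters using documented columns
>>> inactive_only = 'status = "inactive"'
>>> long_threads = 'number_of_messages > 10'
>>>
>>> # Conditions chain with AND, per the grammar above
>>> recent_and_slow = 'start_time > "2024-01-01T00:00:00Z" AND duration > 300'
>>>
>>> results = evaluate_threads(
...     project_name="ai_team",  # assumed project name
...     filter_string=recent_and_slow,
...     eval_project_name=None,  # reuse project_name for evaluation traces
...     metrics=[ConversationalCoherenceMetric()],
...     trace_input_transform=lambda x: x["input"],
...     trace_output_transform=lambda x: x["output"],
... )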
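
When traces hold nested payloads, the transforms can be named functions instead of lambdas. A hedged sketch assuming an OpenAI-style message structure; the key paths are assumptions and must be adapted to how your framework logs trace input and output:

>>> from typing import Any, Dict
>>>
>>> def input_transform(trace_input: Dict[str, Any]) -> str:
...     # Assumed shape: {"messages": [{"role": "user", "content": "..."}]}
...     return trace_input["messages"][-1]["content"]
...
>>> def output_transform(trace_output: Dict[str, Any]) -> str:
...     # Assumed shape: {"choices": [{"message": {"content": "..."}}]}
...     return trace_output["choices"][0]["message"]["content"]
...
>>> # Passed as trace_input_transform=input_transform,
>>> # trace_output_transform=output_transform in the evaluate_threads call.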

Returns:

ThreadsEvaluationResult containing evaluation scores for each thread.
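
A hedged sketch of reading per-thread scores from the returned object; the field names results, thread_id, scores, name, and value are assumptions about the ThreadsEvaluationResult model, so verify them against your SDK version:

>>> # Field names below are assumptions; check ThreadsEvaluationResult in your SDK version.
>>> for thread_result in results.results:
...     print(thread_result.thread_id)
...     for score in thread_result.scores:
...         print(f"  {score.name}: {score.value}")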

Raises:
  • ValueError – If no metrics are provided.

  • MetricComputationError – If no threads are found or if evaluation fails.
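
A hedged sketch of guarding the call against the errors above; it assumes MetricComputationError is importable from opik.exceptions (adjust the import if your SDK version exposes it elsewhere):

>>> from opik.evaluation import evaluate_threads
>>> from opik.evaluation.metrics import ConversationalCoherenceMetric
>>> from opik.exceptions import MetricComputationError  # assumed import path
>>>
>>> try:
...     results = evaluate_threads(
...         project_name="ai_team",
...         filter_string='status = "inactive"',
...         eval_project_name=None,
...         metrics=[ConversationalCoherenceMetric()],
...         trace_input_transform=lambda x: x["input"],
...         trace_output_transform=lambda x: x["output"],
...     )
... except MetricComputationError:
...     # Raised when no threads match the filter or evaluation fails
...     print("No threads matched the filter or evaluation failed")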

Example

>>> from opik.evaluation import evaluate_threads
>>> from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric
>>>
>>> # Initialize the evaluation metrics
>>> conversation_coherence_metric = ConversationalCoherenceMetric()
>>> user_frustration_metric = UserFrustrationMetric()
>>>
>>> # Run the threads evaluation
>>> results = evaluate_threads(
...     project_name="ai_team",
...     filter_string='thread_id = "0197ad2a-cf5c-75af-be8b-20e8a23304fe"',
...     eval_project_name="ai_team_evaluation",
...     metrics=[
...         conversation_coherence_metric,
...         user_frustration_metric,
...     ],
...     trace_input_transform=lambda x: x["input"],
...     trace_output_transform=lambda x: x["output"],
... )