evaluate_threads¶
- opik.evaluation.evaluate_threads(project_name: str, filter_string: str | None, eval_project_name: str | None, metrics: List[ConversationThreadMetric], trace_input_transform: Callable[[Dict[str, Any | None] | List[Dict[str, Any | None]] | str], str], trace_output_transform: Callable[[Dict[str, Any | None] | List[Dict[str, Any | None]] | str], str], verbose: int = 1, num_workers: int = 8, max_traces_per_thread: int = 1000) → ThreadsEvaluationResult¶
Evaluate conversation threads using specified metrics.
This function evaluates conversation threads from a project using the provided metrics. It creates a ThreadsEvaluationEngine to fetch threads matching the filter string, converts them to conversation threads, applies the metrics, and logs feedback scores.
- Parameters:
project_name – The name of the project containing the threads to evaluate.
filter_string –
Optional filter string to select specific threads for evaluation using Opik Query Language (OQL). The format is: "<COLUMN> <OPERATOR> <VALUE> [AND <COLUMN> <OPERATOR> <VALUE>]*"
Supported columns include:
- id, name, created_by, thread_id, type, model, provider: String fields with full operator support
- status: String field (=, contains, not_contains only)
- start_time, end_time: DateTime fields (use ISO 8601 format, e.g., "2024-01-01T00:00:00Z")
- input, output: String fields for content (=, contains, not_contains only)
- metadata: Dictionary field (use dot notation, e.g., "metadata.model")
- feedback_scores: Numeric field (use dot notation, e.g., "feedback_scores.accuracy")
- tags: List field (use "contains" operator only)
- usage.total_tokens, usage.prompt_tokens, usage.completion_tokens: Numeric usage fields
- duration, number_of_messages, total_estimated_cost: Numeric fields
Examples: 'status = "inactive"', 'id = "thread_123"', 'duration > 300'. If None, all threads in the project will be evaluated. (A short sketch of composing filter strings appears after the parameter list.)
eval_project_name – Optional name for the evaluation project where evaluation traces will be stored. If None, the same project_name will be used.
metrics – List of ConversationThreadMetric instances to apply to each thread. Must contain at least one metric.
trace_input_transform –
Function to transform trace input JSON to string representation. This function extracts the relevant user message from your trace’s input structure. The function receives the raw trace input as a dictionary and should return a string.
Example: If your trace input is {"content": {"user_question": "Hello"}}, use: lambda x: x["content"]["user_question"]
This transformation is essential because trace inputs vary by framework, but metrics expect a standardized string format representing the user’s message.
trace_output_transform –
Function to transform trace output JSON to string representation. This function extracts the relevant agent response from your trace’s output structure. The function receives the raw trace output as a dictionary and should return a string.
Example: If your trace output is {"response": {"text": "Hi there"}}, use: lambda x: x["response"]["text"]
This transformation is essential because trace outputs vary by framework, but metrics expect a standardized string format representing the agent’s response. (A sketch covering both transforms appears after the parameter list.)
verbose – Verbosity level for progress reporting (0=silent, 1=progress). Default is 1.
num_workers – Number of concurrent workers for thread evaluation. Default is 8.
max_traces_per_thread – Maximum number of traces to fetch per thread. Default is 1000.
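To make the OQL format above concrete, here is a minimal sketch of filter strings built only from the columns and operators documented for the filter_string parameter; the thread ids, tags, and thresholds are illustrative placeholders, not values taken from the library.

# Illustrative OQL filter strings (placeholder values throughout):
inactive_threads = 'status = "inactive"'      # string field, "=" operator
single_thread = 'id = "thread_123"'           # select one thread by id
long_threads = 'duration > 300'               # numeric comparison
support_threads = 'tags contains "support"'   # tags only support "contains"

# Conditions can be chained with AND, per the format above. This assumes
# DateTime columns accept the same comparison operators as numeric ones:
long_recent_threads = 'duration > 300 AND start_time > "2024-01-01T00:00:00Z"'

Any of these strings can then be passed via filter_string= in the call shown in the Example below.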
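Because both transforms are plain callables, named functions work just as well as lambdas when the trace payload is nested. The sketch below assumes a hypothetical chat-style payload; the "messages", "role", "content", and "choices" keys are illustrative only and depend entirely on how your framework logs traces.

from typing import Any, Dict

def input_transform(trace_input: Dict[str, Any]) -> str:
    # Hypothetical input payload: {"messages": [{"role": "user", "content": "..."}, ...]}
    user_messages = [
        message.get("content", "")
        for message in trace_input.get("messages", [])
        if message.get("role") == "user"
    ]
    # Fall back to the raw payload if no user message is found.
    return user_messages[-1] if user_messages else str(trace_input)

def output_transform(trace_output: Dict[str, Any]) -> str:
    # Hypothetical output payload: {"choices": [{"message": {"content": "..."}}]}
    choices = trace_output.get("choices", [])
    if choices:
        return choices[0].get("message", {}).get("content", "")
    return str(trace_output)

These would then be passed as trace_input_transform=input_transform and trace_output_transform=output_transform in place of the lambdas shown in the Example below.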
- Returns:
ThreadsEvaluationResult containing evaluation scores for each thread.
- Raises:
ValueError – If no metrics are provided.
MetricComputationError – If no threads are found or if evaluation fails.
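As an illustration of handling these errors, the sketch below wraps the call in a try/except block. It assumes MetricComputationError is importable from opik.exceptions; verify that import path against your installed version.

from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric
from opik.exceptions import MetricComputationError  # assumed import path

try:
    results = evaluate_threads(
        project_name="ai_team",                      # placeholder project name
        filter_string=None,                          # evaluate all threads
        eval_project_name=None,
        metrics=[ConversationalCoherenceMetric()],
        trace_input_transform=lambda x: x["input"],
        trace_output_transform=lambda x: x["output"],
    )
except ValueError as exc:
    # Raised when no metrics are provided.
    print(f"Configuration error: {exc}")
except MetricComputationError as exc:
    # Raised when no threads are found or when evaluation fails.
    print(f"Evaluation error: {exc}")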
Example
>>> from opik.evaluation import evaluate_threads
>>> from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric
>>>
>>> # Initialize the evaluation metrics
>>> conversation_coherence_metric = ConversationalCoherenceMetric()
>>> user_frustration_metric = UserFrustrationMetric()
>>>
>>> # Run the threads evaluation
>>> results = evaluate_threads(
>>>     project_name="ai_team",
>>>     filter_string='thread_id = "0197ad2a-cf5c-75af-be8b-20e8a23304fe"',
>>>     eval_project_name="ai_team_evaluation",
>>>     metrics=[
>>>         conversation_coherence_metric,
>>>         user_frustration_metric,
>>>     ],
>>>     trace_input_transform=lambda x: x["input"],
>>>     trace_output_transform=lambda x: x["output"],
>>> )