In Opik 2.0, datasets and experiments are project-scoped. Make sure to specify a project_name when creating datasets and running experiments so they are associated with the correct project.
Task span metrics are a powerful type of evaluation metric in Opik that can analyze the detailed execution information of your LLM tasks. Unlike traditional metrics that only evaluate input-output pairs, task span metrics have access to the complete execution context, including intermediate steps, metadata, timing information, and hierarchical structure.
Important: only spans created with the @track decorator or with native Opik integrations are available to task span metrics.
Task span metrics are evaluation metrics whose score method includes a task_span parameter. The Opik evaluation engine detects this parameter automatically.
When a metric has a task_span parameter, it receives a SpanModel object containing the complete execution context of your task.
The task_span parameter provides:
Task span metrics are particularly valuable for:
To create a task span metric, define a class that inherits from BaseMetric and implements a score method that accepts a task_span parameter. You can still add other parameters, as in regular metrics; Opik checks for the presence of the task_span argument separately:
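As a minimal sketch of the shape described above, the following defines a hypothetical metric, NestedSpanCount, whose score method accepts task_span. So that it runs without the SDK, it uses small stand-in classes for SpanModel, ScoreResult, and BaseMetric; with Opik installed you would presumably import BaseMetric and score_result from opik.evaluation.metrics and use the SDK's own SpanModel. The spans attribute and the max_spans threshold are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


# Stand-ins so the sketch runs without the SDK installed. They only mimic the
# parts used below and are NOT the real Opik classes.
@dataclass
class SpanModel:
    name: str
    spans: List["SpanModel"] = field(default_factory=list)  # assumed attribute


@dataclass
class ScoreResult:
    name: str
    value: float
    reason: Optional[str] = None


class BaseMetric:
    def __init__(self, name: str) -> None:
        self.name = name


class NestedSpanCount(BaseMetric):
    """Hypothetical metric: passes when the task has at most `max_spans` direct child spans."""

    def __init__(self, max_spans: int = 5, name: str = "nested_span_count") -> None:
        super().__init__(name=name)
        self.max_spans = max_spans

    # The task_span parameter is what marks this as a task span metric.
    # Extra keyword arguments (input, output, ...) are accepted and ignored.
    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> ScoreResult:
        count = len(task_span.spans)
        return ScoreResult(
            name=self.name,
            value=1.0 if count <= self.max_spans else 0.0,
            reason=f"{count} direct child span(s), limit {self.max_spans}",
        )
```

Calling NestedSpanCount(max_spans=2).score(task_span=some_span) returns a ScoreResult whose value is 1.0 or 0.0 depending on how many direct children the span has.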
The SpanModel object provides rich information about task execution:
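To illustrate how the hierarchical structure and timing information might be used, here is a rough sketch that walks a span tree and finds the slowest span. The SpanModel class below is a stand-in, and the attribute names name, start_time, end_time, and spans are assumptions; check the SDK's SpanModel definition for the actual fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterator, List, Optional


@dataclass
class SpanModel:  # stand-in; attribute names are assumptions
    name: str
    start_time: datetime
    end_time: Optional[datetime] = None
    spans: List["SpanModel"] = field(default_factory=list)


def iter_spans(span: SpanModel) -> Iterator[SpanModel]:
    """Yield the span itself, then all of its descendants, depth first."""
    yield span
    for child in span.spans:
        yield from iter_spans(child)


def duration_seconds(span: SpanModel) -> Optional[float]:
    """Return the span duration in seconds, or None if it has no end time."""
    if span.end_time is None:
        return None
    return (span.end_time - span.start_time).total_seconds()


# Example tree: a root task with a retrieval step and a generation step.
t0 = datetime(2024, 1, 1, 12, 0, 0)
root = SpanModel(
    name="answer_question",
    start_time=t0,
    end_time=t0 + timedelta(seconds=3),
    spans=[
        SpanModel("retrieve", t0, t0 + timedelta(seconds=1)),
        SpanModel("generate", t0 + timedelta(seconds=1), t0 + timedelta(seconds=3)),
    ],
)

names = [s.name for s in iter_spans(root)]
# Spans without an end time are treated as zero-length for the comparison.
slowest_child = max(root.spans, key=lambda s: duration_seconds(s) or 0.0)
```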
Task span metrics can analyze execution failures and errors:
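One possible approach, sketched under assumptions: collect every span in the tree that carries error information and score the task as failed if any exist. The stand-in SpanModel below assumes an optional error_info dictionary with exception_type and message keys; the real attribute name and structure may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpanModel:  # stand-in; `error_info` and its keys are assumptions
    name: str
    error_info: Optional[Dict[str, str]] = None
    spans: List["SpanModel"] = field(default_factory=list)


def collect_errors(span: SpanModel) -> List[Tuple[str, str]]:
    """Return (span name, exception type) for every failed span in the tree."""
    found: List[Tuple[str, str]] = []
    if span.error_info is not None:
        found.append((span.name, span.error_info.get("exception_type", "unknown")))
    for child in span.spans:
        found.extend(collect_errors(child))
    return found


def error_free_score(span: SpanModel) -> float:
    """Score 1.0 when no span in the tree recorded an error, otherwise 0.0."""
    return 0.0 if collect_errors(span) else 1.0


task = SpanModel(
    name="agent_run",
    spans=[
        SpanModel(
            name="search_tool",
            error_info={"exception_type": "TimeoutError", "message": "timed out"},
        ),
        SpanModel(name="summarize"),
    ],
)
errors = collect_errors(task)
```

Inside a metric, the list returned by collect_errors could be placed in the score's reason so that failed steps are visible next to the score.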
Task span metrics can be used alongside regular evaluation metrics in the same evaluation. The Opik evaluation engine detects task span metrics by checking whether the score method includes a task_span parameter, and passes the span only to those metrics:
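The detection idea can be illustrated with Python's standard inspect module. This is only a sketch of the general technique of checking a method signature for a parameter name, not Opik's actual implementation; the metric classes and the split_metrics helper are hypothetical, and a plain dictionary stands in for the span object.

```python
import inspect
from typing import Any, Dict, List, Tuple


def accepts_task_span(metric: Any) -> bool:
    """Return True if the metric's score method declares a task_span parameter."""
    return "task_span" in inspect.signature(metric.score).parameters


class OutputLength:
    """Hypothetical regular metric: looks only at the task output."""

    def score(self, output: str, **kwargs: Any) -> float:
        return float(len(output))


class SpanCount:
    """Hypothetical task span metric: looks at the execution span."""

    def score(self, task_span: Dict[str, Any]) -> float:
        return float(len(task_span["spans"]))


def split_metrics(metrics: List[Any]) -> Tuple[List[Any], List[Any]]:
    """Partition metrics into (regular, task-span) groups by signature."""
    regular = [m for m in metrics if not accepts_task_span(m)]
    span_based = [m for m in metrics if accepts_task_span(m)]
    return regular, span_based


regular, span_based = split_metrics([OutputLength(), SpanCount()])
```

Note that a catch-all such as **kwargs does not count as a task_span parameter under this check; only an explicitly named parameter does.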
You can validate a task span metric without running a full evaluation by recording spans locally. The SDK provides a context manager that captures all spans and traces created inside its block and exposes them in memory.
Note:
Always check for None values in optional span attributes:
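A small sketch of defensive access, assuming that attributes such as usage and metadata may be None on some spans. The stand-in SpanModel and the total_tokens and model keys are illustrative assumptions, not confirmed field names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SpanModel:  # stand-in; optional attributes are assumptions
    name: str
    usage: Optional[Dict[str, int]] = None
    metadata: Optional[Dict[str, Any]] = None
    spans: List["SpanModel"] = field(default_factory=list)


def total_tokens(span: SpanModel) -> int:
    """Sum token usage over the tree, treating missing usage as zero."""
    own = (span.usage or {}).get("total_tokens", 0)
    return own + sum(total_tokens(child) for child in span.spans)


def model_name(span: SpanModel, default: str = "unknown") -> str:
    """Read the model name from metadata without failing when metadata is None."""
    return (span.metadata or {}).get("model", default)


root = SpanModel(
    name="task",
    spans=[
        SpanModel(name="llm_call", usage={"total_tokens": 120}, metadata={"model": "example-model"}),
        SpanModel(name="postprocess"),  # no usage, no metadata
    ],
)
```

The pattern (value or {}).get(key, default) keeps the metric from raising AttributeError on spans that did not record the optional field.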
Use task span metrics to evaluate how your application executes, not just the final output:
Task span metrics provide the most value when combined with traditional output-based metrics:
Be mindful of sensitive data in span information:
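One way to act on this, sketched here as an assumption rather than an SDK feature, is to redact known sensitive keys from span inputs or outputs before copying them into a score's reason or into logs. The key names in SENSITIVE_KEYS and the redact helper are hypothetical.

```python
from typing import Any

# Hypothetical set of keys to mask; adjust to your application's data.
SENSITIVE_KEYS = {"api_key", "password", "email"}


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


span_input = {
    "question": "Reset my account",
    "user": {"email": "person@example.com", "plan": "pro"},
    "history": [{"password": "hunter2"}],
}
safe_input = redact(span_input)
```

Because redact builds new containers, the original span data is left unchanged.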
Here’s a comprehensive example that analyzes agent decision-making:
For a complete guide on using task span metrics in LLM evaluation workflows, see the Using task span evaluation metrics section in the LLM evaluation guide.