Task Span Metrics
Task span metrics are a powerful type of evaluation metric in Opik that can analyze the detailed execution information of your LLM tasks. Unlike traditional metrics that only evaluate input-output pairs, task span metrics have access to the complete execution context, including intermediate steps, metadata, timing information, and hierarchical structure.
Important: only spans created with the @track decorator and native Opik integrations are available to task span metrics.
What are Task Span Metrics?
Task span metrics are evaluation metrics whose score method includes a task_span parameter. The Opik evaluation engine automatically detects this parameter.
When a metric has a task_span parameter, it receives a SpanModel object containing the complete execution context of your task.
The task_span parameter provides the following (a minimal sketch follows this list):
- Execution Details: Input, output, start/end times, and execution metadata
- Nested Operations: Hierarchical structure of sub-operations and function calls
- Performance Data: Timing, cost, usage statistics, and resource consumption
- Error Information: Detailed error context and diagnostic information
- Provider Metadata: Model information, API provider details, and configuration
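As a minimal sketch of what this looks like in practice (the class name below is hypothetical; attribute access mirrors the SpanModel fields used throughout this page), a task span metric simply reads these properties inside its score method:

```python
from typing import Any

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel


class HasNestedSpansMetric(BaseMetric):
    # Hypothetical example: pass if the task produced at least one nested span.
    def __init__(self, name: str = "has_nested_spans"):
        super().__init__(name=name)

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        nested_count = len(task_span.spans)  # hierarchical structure of sub-operations
        return score_result.ScoreResult(
            value=1.0 if nested_count > 0 else 0.0,
            name=self.name,
            reason=f"Root span '{task_span.name}' has {nested_count} nested span(s)",
        )
```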
When to Use Task Span Metrics
Task span metrics are particularly valuable for:
- Performance Analysis: Evaluating execution speed, resource usage, and efficiency
- Quality Assessment: Analyzing the quality of intermediate steps and decision-making
- Cost Optimization: Tracking and optimizing API costs and resource consumption
- Agent Evaluation: Assessing agent trajectories and decision-making patterns
- Debugging: Understanding execution flows and identifying performance bottlenecks
- Compliance: Ensuring tasks execute within expected parameters and constraints
Creating Task Span Metrics
To create a task span metric, define a class that inherits from BaseMetric and implements a score method that accepts a task_span parameter (you can still add other parameters as in regular metrics; Opik performs a separate check for the presence of the task_span argument):
```python
from typing import Any, Dict, Optional

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel


class TaskExecutionQualityMetric(BaseMetric):
    def __init__(
        self,
        name: str = "task_execution_quality",
        track: bool = True,
        project_name: Optional[str] = None,
    ):
        super().__init__(name=name, track=track, project_name=project_name)

    def _check_execution_success_recursively(self, span: SpanModel) -> Dict[str, Any]:
        """Recursively check execution success across the span tree."""
        execution_stats = {
            'has_errors': False,
            'error_count': 0,
            'failed_spans': [],
            'total_spans_checked': 0
        }

        # Check current span for errors
        execution_stats['total_spans_checked'] += 1
        if span.error_info:
            execution_stats['has_errors'] = True
            execution_stats['error_count'] += 1
            execution_stats['failed_spans'].append(span.name)

        # Recursively check nested spans
        for nested_span in span.spans:
            nested_stats = self._check_execution_success_recursively(nested_span)
            execution_stats['has_errors'] = execution_stats['has_errors'] or nested_stats['has_errors']
            execution_stats['error_count'] += nested_stats['error_count']
            execution_stats['failed_spans'].extend(nested_stats['failed_spans'])
            execution_stats['total_spans_checked'] += nested_stats['total_spans_checked']

        return execution_stats

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Check execution success across the entire span tree.
        # Only for illustrative purposes.
        # Please adjust for your specific use case!
        execution_stats = self._check_execution_success_recursively(task_span)
        execution_successful = not execution_stats['has_errors']

        # Check output availability
        has_output = task_span.output is not None

        # Calculate execution time
        execution_time = None
        if task_span.start_time and task_span.end_time:
            execution_time = (task_span.end_time - task_span.start_time).total_seconds()

        # Custom scoring logic based on execution characteristics
        if not execution_successful:
            error_count = execution_stats['error_count']
            failed_spans_count = len(execution_stats['failed_spans'])
            total_spans = execution_stats['total_spans_checked']

            if error_count == 1 and total_spans > 5:
                score_value = 0.4
                reason = f"Minor execution issues: 1 error in {total_spans} spans ({execution_stats['failed_spans'][0]})"
            elif failed_spans_count <= 2:
                score_value = 0.2
                reason = f"Limited execution failures: {failed_spans_count} failed spans out of {total_spans}"
            else:
                score_value = 0.0
                reason = f"Major execution failures: {failed_spans_count} failed spans across {total_spans} operations"
        elif not has_output:
            score_value = 0.3
            reason = f"Task completed without errors across {execution_stats['total_spans_checked']} spans but produced no output"
        elif execution_time and execution_time > 30.0:
            score_value = 0.6
            reason = f"Task executed successfully across {execution_stats['total_spans_checked']} spans but took too long: {execution_time:.2f}s"
        else:
            score_value = 1.0
            span_count = execution_stats['total_spans_checked']
            reason = f"Task executed successfully across all {span_count} spans with good performance"

        return score_result.ScoreResult(
            value=score_value,
            name=self.name,
            reason=reason
        )
```
Accessing Span Properties
The SpanModel object provides rich information about task execution:
Basic Properties
```python
class BasicSpanAnalysisMetric(BaseMetric):
    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Basic span information
        span_id = task_span.id
        span_name = task_span.name
        span_type = task_span.type  # "general", "llm", "tool", etc.

        # Input/Output analysis
        input_data = task_span.input
        output_data = task_span.output

        # Metadata and tags
        metadata = task_span.metadata
        tags = task_span.tags

        # Your scoring logic here
        return score_result.ScoreResult(value=1.0, name=self.name)
```
Performance Metrics
```python
class PerformanceMetric(BaseMetric):
    def _find_model_and_provider_recursively(self, span: SpanModel, model_found: str = None, provider_found: str = None):
        """Recursively search through span tree to find model and provider information."""
        # Check current span
        if not model_found and span.model:
            model_found = span.model
        if not provider_found and span.provider:
            provider_found = span.provider

        # If both found, return early
        if model_found and provider_found:
            return model_found, provider_found

        # Recursively search nested spans
        for nested_span in span.spans:
            model_found, provider_found = self._find_model_and_provider_recursively(
                nested_span, model_found, provider_found
            )
            # If both found, return early
            if model_found and provider_found:
                return model_found, provider_found

        return model_found, provider_found

    def _calculate_usage_recursively(self, span: SpanModel, usage_summary: dict = None):
        """Recursively calculate usage statistics from the entire span tree."""
        if usage_summary is None:
            usage_summary = {
                'total_prompt_tokens': 0,
                'total_completion_tokens': 0,
                'total_tokens': 0,
                'total_spans_count': 0,
                'llm_spans_count': 0,
                'tool_spans_count': 0
            }

        # Count current span
        usage_summary['total_spans_count'] += 1

        # Count span types
        if span.type == 'llm':
            usage_summary['llm_spans_count'] += 1
        elif span.type == 'tool':
            usage_summary['tool_spans_count'] += 1

        # Add usage from current span
        if span.usage and isinstance(span.usage, dict):
            usage_summary['total_prompt_tokens'] += span.usage.get('prompt_tokens', 0)
            usage_summary['total_completion_tokens'] += span.usage.get('completion_tokens', 0)
            usage_summary['total_tokens'] += span.usage.get('total_tokens', 0)

        # Recursively process nested spans
        for nested_span in span.spans:
            self._calculate_usage_recursively(nested_span, usage_summary)

        return usage_summary

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Timing analysis.
        # Only for illustrative purposes.
        # Please adjust for your specific use case!
        start_time = task_span.start_time
        end_time = task_span.end_time
        duration = (end_time - start_time).total_seconds() if start_time and end_time else None

        # Get model and provider from anywhere in the span tree
        model_used, provider = self._find_model_and_provider_recursively(
            task_span, task_span.model, task_span.provider
        )

        # Calculate comprehensive usage statistics from entire span tree
        usage_info = self._calculate_usage_recursively(task_span)

        # Performance-based scoring with enhanced analysis
        if duration and duration < 2.0:
            score_value = 1.0
            reason = f"Excellent performance: {duration:.2f}s"
            if model_used:
                reason += f" using {model_used}"
            if provider:
                reason += f" ({provider})"
            if usage_info['total_tokens'] > 0:
                reason += f", {usage_info['total_tokens']} total tokens across {usage_info['llm_spans_count']} LLM calls"
        elif duration and duration < 10.0:
            score_value = 0.7
            reason = f"Good performance: {duration:.2f}s"
            if usage_info['total_spans_count'] > 1:
                reason += f" with {usage_info['total_spans_count']} operations"
        else:
            score_value = 0.5
            reason = "Performance could be improved"
            if duration:
                reason += f" (took {duration:.2f}s)"
            if usage_info['llm_spans_count'] > 5:
                reason += f" - consider optimizing {usage_info['llm_spans_count']} LLM calls"

        return score_result.ScoreResult(
            value=score_value,
            name=self.name,
            reason=reason
        )
```
Error Analysis
Task span metrics can analyze execution failures and errors:
```python
class ErrorAnalysisMetric(BaseMetric):
    def _collect_errors_recursively(self, span: SpanModel, errors: list = None):
        """Recursively collect all errors from the span tree."""
        if errors is None:
            errors = []

        # Check current span for errors
        if span.error_info:
            error_entry = {
                'span_id': span.id,
                'span_name': span.name,
                'span_type': span.type,
                'error_info': span.error_info
            }
            errors.append(error_entry)

        # Recursively check nested spans
        for nested_span in span.spans:
            self._collect_errors_recursively(nested_span, errors)

        return errors

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Collect all errors from the entire span tree
        all_errors = self._collect_errors_recursively(task_span)

        if not all_errors:
            return score_result.ScoreResult(
                value=1.0,
                name=self.name,
                reason="No errors detected in any span"
            )

        reason = f"Found {len(all_errors)} error(s) across multiple spans"
        return score_result.ScoreResult(
            value=0.0,
            name=self.name,
            reason=reason
        )
```
Using Task Span Metrics in Evaluation
Task span metrics work seamlessly with regular evaluation metrics. The Opik evaluation engine automatically detects task span metrics by checking if the score method includes a task_span parameter, and handles them appropriately:
```python
from opik import evaluate
from opik.evaluation.metrics import Equals

# Mix regular and task span metrics
equals_metric = Equals()
quality_metric = TaskExecutionQualityMetric()
performance_metric = PerformanceMetric()

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        equals_metric,       # Regular metric (input/output)
        quality_metric,      # Task span metric (execution analysis)
        performance_metric,  # Task span metric (performance analysis)
    ],
    experiment_name="Comprehensive Task Analysis"
)
```
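Conceptually, this detection amounts to inspecting the score signature. The snippet below is only a sketch of the idea, not the evaluation engine's actual implementation:

```python
import inspect


def uses_task_span(metric) -> bool:
    """Rough illustration: a metric counts as a task span metric
    if its score method declares a parameter named "task_span"."""
    return "task_span" in inspect.signature(metric.score).parameters


print(uses_task_span(equals_metric))       # False - scored on input/output only
print(uses_task_span(performance_metric))  # True - receives the SpanModel
```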
Quickly testing task span metrics locally
You can validate a task span metric without running a full evaluation by recording spans locally. The SDK provides a context manager that captures all spans/traces created inside its block and exposes them in memory.
```python
import opik
from opik import track
from opik.evaluation.metrics import score_result
from opik.message_processing.emulation.models import SpanModel

# Example metric under test
class ExecutionTimeMetric:
    def __init__(self, name: str = "execution_time_metric"):
        self.name = name

    def score(self, task_span: SpanModel, **_):
        if task_span.start_time and task_span.end_time:
            duration = (task_span.end_time - task_span.start_time).total_seconds()
            value = 1.0 if duration < 2.0 else 0.5
            reason = f"Duration: {duration:.2f}s"
        else:
            value = 0.0
            reason = "Missing timing information"
        return score_result.ScoreResult(value=value, name=self.name, reason=reason)

@track
def my_tracked_function(question: str) -> str:
    # Your LLM/tool code here that produces spans
    return f"Answer to: {question}"

with opik.record_traces_locally() as storage:
    # Execute tracked code that creates spans
    _ = my_tracked_function("What is the capital of France?")

    # Access the in-memory span tree (flush is automatic before reading)
    span_trees = storage.span_trees
    assert len(span_trees) > 0, "No spans recorded"
    root_span = span_trees[0]

    # Evaluate your task span metric directly
    metric = ExecutionTimeMetric()
    result = metric.score(task_span=root_span)
    print(result)
```
Note:
- Local recording cannot be nested. If a recording block is already active, entering another will raise an error.
- See the Python SDK reference for more details: Local Recording Context Manager
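For example, the pattern above can be wrapped in a unit test so the metric is validated on every CI run. This is a sketch assuming pytest-style test discovery; the my_app and my_metrics module names are hypothetical placeholders for wherever my_tracked_function and ExecutionTimeMetric live:

```python
import opik

from my_app import my_tracked_function      # hypothetical module with the @track-decorated code
from my_metrics import ExecutionTimeMetric  # hypothetical module with the metric under test


def test_execution_time_metric_scores_recorded_span():
    metric = ExecutionTimeMetric()

    with opik.record_traces_locally() as storage:
        my_tracked_function("What is the capital of France?")

        # Flush is automatic before reading span_trees
        span_trees = storage.span_trees
        assert span_trees, "expected at least one recorded span tree"
        result = metric.score(task_span=span_trees[0])

    assert 0.0 <= result.value <= 1.0
    assert result.name == "execution_time_metric"
```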
Best Practices
1. Handle Missing Data Gracefully
Always check for None values in optional span attributes:
```python
def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
    # Safe access to optional fields
    duration = None
    if task_span.start_time and task_span.end_time:
        duration = (task_span.end_time - task_span.start_time).total_seconds()

    cost = task_span.total_cost if task_span.total_cost else 0.0
    metadata = task_span.metadata or {}
```
2. Focus on Execution Patterns
Use task span metrics to evaluate how your application executes, not just the final output:
```python
# Good: Analyzing execution patterns
def _analyze_caching_efficiency_recursively(self, span: SpanModel, cache_stats: Dict[str, Any] = None) -> Dict[str, Any]:
    """Recursively analyze caching efficiency across the span tree."""
    if cache_stats is None:
        cache_stats = {
            'total_llm_calls': 0,
            'llm_cache_hits': 0,
            'llm_cache_misses': 0,
            'other_cache_hits': 0,
            'cached_llm_spans': [],
            'cached_other_spans': [],
            'llm_spans': []
        }

    # Track LLM calls and their caching status
    if span.type == "llm":
        cache_stats['total_llm_calls'] += 1
        cache_stats['llm_spans'].append(span.name)

        # Check for caching indicators in metadata
        metadata = span.metadata or {}
        tags = span.tags or []

        is_cached = (
            any(cache_key in metadata for cache_key in ["cache_hit", "cached", "from_cache"]) or
            any(cache_tag in tags for cache_tag in ["cache_hit", "cached"]) or
            metadata.get("cache_hit", False) or
            metadata.get("cached", False)
        )

        if is_cached:
            cache_stats['llm_cache_hits'] += 1
            cache_stats['cached_llm_spans'].append(span.name)
        else:
            cache_stats['llm_cache_misses'] += 1

    # Track non-LLM spans for caching indicators (e.g., database queries, API calls)
    else:
        metadata = span.metadata or {}
        tags = span.tags or []

        if (any(cache_key in metadata for cache_key in ["cache_hit", "cached", "from_cache"]) or
                any(cache_tag in tags for cache_tag in ["cache_hit", "cached"])):
            cache_stats['other_cache_hits'] += 1
            cache_stats['cached_other_spans'].append(span.name)

    # Recursively check nested spans
    for nested_span in span.spans:
        self._analyze_caching_efficiency_recursively(nested_span, cache_stats)

    return cache_stats

def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
    # Analyze caching efficiency across an entire span tree.
    # Only for illustrative purposes.
    # Please adjust for your specific use case!
    cache_stats = self._analyze_caching_efficiency_recursively(task_span)

    llm_cache_hits = cache_stats['llm_cache_hits']
    total_llm_calls = cache_stats['total_llm_calls']
    other_cache_hits = cache_stats['other_cache_hits']

    # Calculate a cache hit ratio specifically for LLM calls
    llm_cache_hit_ratio = llm_cache_hits / max(1, total_llm_calls) if total_llm_calls > 0 else 0

    # Score based on LLM caching efficiency and total call volume
    if total_llm_calls == 0:
        # Consider other cache hits for non-LLM operations
        if other_cache_hits > 0:
            return score_result.ScoreResult(
                value=0.7,
                name=self.name,
                reason=f"No LLM calls, but {other_cache_hits} other operations cached"
            )
        else:
            return score_result.ScoreResult(
                value=0.5,
                name=self.name,
                reason="No LLM calls detected"
            )
    elif llm_cache_hit_ratio >= 0.8:
        reason = f"Excellent LLM caching: {llm_cache_hits}/{total_llm_calls} LLM calls cached ({llm_cache_hit_ratio:.1%})"
        if other_cache_hits > 0:
            reason += f" + {other_cache_hits} other cached operations"
        return score_result.ScoreResult(value=1.0, name=self.name, reason=reason)
    elif llm_cache_hit_ratio >= 0.5:
        reason = f"Good LLM caching: {llm_cache_hits}/{total_llm_calls} LLM calls cached ({llm_cache_hit_ratio:.1%})"
        if other_cache_hits > 0:
            reason += f" + {other_cache_hits} other cached operations"
        return score_result.ScoreResult(value=0.9, name=self.name, reason=reason)
    elif llm_cache_hit_ratio > 0:
        reason = f"Some LLM caching: {llm_cache_hits}/{total_llm_calls} LLM calls cached ({llm_cache_hit_ratio:.1%})"
        if other_cache_hits > 0:
            reason += f" + {other_cache_hits} other cached operations"
        return score_result.ScoreResult(value=0.7, name=self.name, reason=reason)
    elif total_llm_calls > 5:
        return score_result.ScoreResult(
            value=0.2,
            name=self.name,
            reason=f"No caching with {total_llm_calls} LLM calls - high cost/latency risk"
        )
    elif total_llm_calls > 3:
        return score_result.ScoreResult(
            value=0.4,
            name=self.name,
            reason=f"No caching with {total_llm_calls} LLM calls - consider adding cache"
        )
    else:
        return score_result.ScoreResult(
            value=0.8,
            name=self.name,
            reason=f"Efficient execution: {total_llm_calls} LLM calls (caching not critical)"
        )
```
3. Combine with Regular Metrics
Task span metrics provide the most value when combined with traditional output-based metrics:
```python
# Comprehensive evaluation approach
scoring_metrics = [
    # Output quality metrics
    Equals(),
    Hallucination(),

    # Execution analysis metrics
    TaskExecutionQualityMetric(),
    PerformanceMetric(),

    # Cost optimization metrics
    CostEfficiencyMetric(),
]
```
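CostEfficiencyMetric is not defined elsewhere on this page; a hypothetical sketch might aggregate total_cost across the span tree (the $0.05 budget below is an arbitrary assumption, so adjust it for your use case):

```python
from typing import Any

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel


class CostEfficiencyMetric(BaseMetric):
    def __init__(self, cost_budget: float = 0.05, name: str = "cost_efficiency"):
        super().__init__(name=name)
        self.cost_budget = cost_budget  # assumed per-task budget in USD

    def _total_cost_recursively(self, span: SpanModel) -> float:
        """Sum the estimated cost of a span and all of its nested spans."""
        cost = span.total_cost or 0.0
        for nested_span in span.spans:
            cost += self._total_cost_recursively(nested_span)
        return cost

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        total_cost = self._total_cost_recursively(task_span)
        # Full score within budget, linearly decreasing to 0 at twice the budget
        value = 1.0 if total_cost <= self.cost_budget else max(
            0.0, 1.0 - (total_cost - self.cost_budget) / self.cost_budget
        )
        return score_result.ScoreResult(
            value=value,
            name=self.name,
            reason=f"Estimated cost ${total_cost:.4f} vs budget ${self.cost_budget:.4f}",
        )
```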
4. Security Considerations
Be mindful of sensitive data in span information:
```python
def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
    # Avoid logging sensitive input data
    input_size = len(str(task_span.input)) if task_span.input else 0

    # Use aggregated information instead of raw data
    return score_result.ScoreResult(
        value=1.0 if input_size < 1000 else 0.5,
        name=self.name,
        reason=f"Input size: {input_size} characters"
    )
```
Complete Example: Agent Trajectory Analysis Metric
Here’s a comprehensive example that analyzes agent decision-making:
```python
class AgentTrajectoryMetric(BaseMetric):
    def __init__(self, max_steps: int = 10, name: str = "agent_trajectory_quality"):
        super().__init__(name=name)
        self.max_steps = max_steps

    def _analyze_trajectory_recursively(self, span: SpanModel, trajectory_stats: Dict[str, Any] = None) -> Dict[str, Any]:
        """Recursively analyze agent trajectory across the span tree."""
        if trajectory_stats is None:
            trajectory_stats = {
                'total_steps': 0,
                'tool_uses': 0,
                'llm_reasoning': 0,
                'other_steps': 0,
                'tool_spans': [],
                'llm_spans': [],
                'step_names': [],
                'max_depth': 0,
                'current_depth': 0
            }

        # Count current span as a step
        trajectory_stats['total_steps'] += 1
        trajectory_stats['step_names'].append(span.name)
        trajectory_stats['max_depth'] = max(trajectory_stats['max_depth'], trajectory_stats['current_depth'])

        # Categorize span types for agent decision analysis
        if span.type == "tool":
            trajectory_stats['tool_uses'] += 1
            trajectory_stats['tool_spans'].append(span.name)
        elif span.type == "llm":
            trajectory_stats['llm_reasoning'] += 1
            trajectory_stats['llm_spans'].append(span.name)
        else:
            trajectory_stats['other_steps'] += 1

        # Recursively analyze nested spans with depth tracking
        for nested_span in span.spans:
            trajectory_stats['current_depth'] += 1
            self._analyze_trajectory_recursively(nested_span, trajectory_stats)
            trajectory_stats['current_depth'] -= 1

        return trajectory_stats

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Analyze agent trajectory across an entire span tree
        trajectory_stats = self._analyze_trajectory_recursively(task_span)

        total_steps = trajectory_stats['total_steps']
        tool_uses = trajectory_stats['tool_uses']
        llm_reasoning = trajectory_stats['llm_reasoning']
        max_depth = trajectory_stats['max_depth']

        # Check for an efficient path
        if total_steps == 0:
            return score_result.ScoreResult(
                value=0.0, name=self.name,
                reason="No decision steps found"
            )

        # Analyze trajectory quality with enhanced metrics.
        # Only for illustrative purposes.
        # Please adjust for your specific use case!
        if tool_uses == 0 and llm_reasoning == 0:
            score = 0.1
            reason = f"Poor trajectory: {total_steps} steps with no tools or reasoning"
        elif tool_uses == 0:
            score = 0.3
            reason = f"Agent used {llm_reasoning} reasoning steps but no tools across {total_steps} operations"
        elif llm_reasoning == 0:
            score = 0.4
            reason = f"Agent used {tool_uses} tools but no reasoning across {total_steps} operations"
        elif total_steps > self.max_steps:
            # Penalize excessive steps but consider tool/reasoning balance
            efficiency_penalty = max(0.1, 1.0 - (total_steps - self.max_steps) * 0.05)
            balance_ratio = min(tool_uses, llm_reasoning) / max(tool_uses, llm_reasoning)
            score = min(0.6, efficiency_penalty * balance_ratio)
            reason = f"Excessive steps: {total_steps} > {self.max_steps} (depth: {max_depth}, tools: {tool_uses}, reasoning: {llm_reasoning})"
        else:
            # Calculate a comprehensive score based on multiple factors.
            # Only for illustrative purposes.
            # Please adjust for your specific use case!

            # 1. Step efficiency (fewer steps = better)
            step_efficiency = min(1.0, self.max_steps / total_steps)

            # 2. Tool-reasoning balance (closer to 1:1 ratio = better)
            balance_ratio = min(tool_uses, llm_reasoning) / max(tool_uses, llm_reasoning) if max(tool_uses, llm_reasoning) > 0 else 0

            # 3. Depth complexity (moderate depth suggests good decomposition)
            depth_score = 1.0 if max_depth <= 3 else max(0.7, 1.0 - (max_depth - 3) * 0.1)

            # 4. Decision density (good ratio of reasoning to total steps)
            decision_density = llm_reasoning / total_steps if total_steps > 0 else 0
            density_score = 1.0 if decision_density >= 0.3 else decision_density / 0.3

            # Combine all factors
            score = (step_efficiency * 0.3 + balance_ratio * 0.3 + depth_score * 0.2 + density_score * 0.2)

            if score >= 0.8:
                reason = f"Excellent trajectory: {total_steps} steps (depth: {max_depth}), {tool_uses} tools, {llm_reasoning} reasoning - well balanced"
            elif score >= 0.6:
                reason = f"Good trajectory: {total_steps} steps (depth: {max_depth}), {tool_uses} tools, {llm_reasoning} reasoning"
            else:
                reason = f"Acceptable trajectory: {total_steps} steps (depth: {max_depth}), {tool_uses} tools, {llm_reasoning} reasoning - could be optimized"

        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=reason
        )
```
Integration with LLM Evaluation
For a complete guide on using task span metrics in LLM evaluation workflows, see the Using task span evaluation metrics section in the LLM evaluation guide.
Related Documentation
- Custom Metrics - Creating traditional input/output evaluation metrics
- SpanModel API Reference - Complete SpanModel documentation
- Evaluation Overview - Understanding Opik’s evaluation system