Evaluate Your Agent with Opik

In Opik 2.0, Experiments and Evaluation Suites are project-scoped. Make sure to specify a project_name when creating datasets and running experiments.

Evaluating your LLM application gives you confidence in its performance. This guide walks through the process of evaluating complex applications such as LLM chains or agents.

The evaluation is done in five steps:

1. (Optional) Add tracking to your LLM application
2. Define the evaluation task
3. Choose the Dataset that you would like to evaluate your application on
4. Choose the metrics that you would like to evaluate your application with
5. Create and run the evaluation experiment

Running an offline evaluation

1. (Optional) Add tracking to your LLM application

While not required, we recommend adding tracking to your LLM application. This allows you to have full visibility into each evaluation run. In the example below we will use a combination of the track decorator and the track_openai function to trace the LLM application.

Python

from opik import track
from opik.integrations.openai import track_openai
import openai

openai_client = track_openai(openai.OpenAI())

# This function is the LLM application that you want to evaluate
# Typically this is not updated when creating evaluations
@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": input}],
    )

    return response.choices[0].message.content

Here we have added the track decorator so that this trace and all its nested steps are logged to the platform for further analysis.

2. Define the evaluation task

Once you have added instrumentation to your LLM application, you can define the evaluation task. The evaluation task takes a dataset item as input and must return a dictionary whose keys match the parameters expected by the metrics you are using. In this example, the evaluation task is defined as follows:

TypeScript

import { EvaluationTask } from "opik";
import { OpenAI } from "openai";

// Define dataset item type
type DatasetItem = {
    input: string;
    expected: string;
};

const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
    const { input } = datasetItem;

    const openai = new OpenAI();
    const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
            { role: "system", content: "You are a coding assistant" },
            { role: "user", content: input }
        ],
    });

    return { output: response.choices[0].message.content };
};

If the returned dictionary does not match the parameters expected by the metrics, scoring can fail with a missing-arguments error (see Missing arguments for scoring methods) or produce inconsistent evaluation results.
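For comparison, a Python evaluation task is just a function that takes a dataset item and returns a dictionary. The sketch below is a minimal illustration rather than the canonical form: your_llm_application is replaced by a stub so the snippet runs without an API key, and the output key is an assumption chosen to suit a metric that expects an output argument.

```python
from typing import Any, Dict


def your_llm_application(user_input: str) -> str:
    # Hypothetical stub standing in for the real application from step 1.
    return f"Echo: {user_input}"


def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
    # Read the fields the application needs from the dataset item.
    answer = your_llm_application(dataset_item["input"])
    # Return keys that line up with the arguments your metrics expect.
    return {"output": answer}


print(evaluation_task({"input": "Hello, world!", "expected": "Hello, world!"}))
```

Keeping the task this thin makes it easy to see which keys reach the metrics.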

3. Choose the evaluation Dataset

In order to create an evaluation experiment, you will need to have a Dataset that includes all your test cases.

If you have already created a Dataset, you can fetch it with the get-or-create dataset method, which also creates the Dataset if it does not exist yet.

TypeScript

import { Opik } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset<DatasetItem>("Example dataset", "Evaluation dataset", "my-project");

// Opik deduplicates items that are inserted into a dataset, so it is safe to
// insert the same items multiple times
await dataset.insert([
    {
        input: "Hello, world!",
        expected: "Hello, world!"
    },
    {
        input: "What is the capital of France?",
        expected: "Paris"
    },
]);

4. Choose evaluation metrics

Opik provides a set of built-in evaluation metrics that you can choose from. These are broken down into two main categories:

Heuristic metrics: These metrics are deterministic in nature, for example equals or contains.
LLM-as-a-judge: These metrics use an LLM to judge the quality of the output; typically these are used for detecting hallucinations or context relevance.
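To make the idea of a deterministic heuristic concrete, here is a rough sketch of what equals-style and contains-style checks compute. These are plain functions with hypothetical names, not Opik's built-in implementations; the point is only that the same inputs always give the same score.

```python
def equals_score(output: str, reference: str) -> float:
    """Return 1.0 when the output matches the reference exactly, else 0.0."""
    return 1.0 if output == reference else 0.0


def contains_score(output: str, reference: str, case_sensitive: bool = False) -> float:
    """Return 1.0 when the reference appears anywhere in the output, else 0.0."""
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    return 1.0 if reference in output else 0.0


print(equals_score("Paris", "Paris"))
print(contains_score("The capital of France is Paris.", "paris"))
```

An LLM-as-a-judge metric, by contrast, calls a model to produce the score, so repeated runs can differ.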

In the same evaluation experiment, you can use multiple metrics to evaluate your application:

TypeScript

import { ExactMatch } from "opik";

const exact_match_metric = new ExactMatch();

Each metric expects the data in a certain format. You will need to ensure that the task you have defined in step 2 returns the data in the correct format.

5. Run the evaluation

Now that we have the task we want to evaluate, the dataset to evaluate on, and the metrics we want to evaluate with, we can run the evaluation:

TypeScript

import { EvaluationTask, Opik, ExactMatch, evaluate } from "opik";
import { OpenAI } from "openai";

// Define dataset item type
type DatasetItem = {
    input: string;
    expected: string;
};

// Define the evaluation task
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
    const { input } = datasetItem;

    const openai = new OpenAI();
    const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
            { role: "system", content: "You are a coding assistant" },
            { role: "user", content: input }
        ],
    });

    return { output: response.choices[0].message.content };
};

// Get or create the dataset - items are automatically deduplicated
const client = new Opik();
const dataset = await client.getOrCreateDataset<DatasetItem>("Example dataset", "Evaluation dataset", "my-project");
await dataset.insert([
    {
        input: "Hello, world!",
        expected: "Hello, world!"
    },
    {
        input: "What is the capital of France?",
        expected: "Paris"
    },
]);

// Define the metric
const exact_match_metric = new ExactMatch();

// Run the evaluation
const result = await evaluate({
    dataset,
    task: llmTask,
    scoringMetrics: [exact_match_metric],
    experimentName: "Example Evaluation",
    projectName: "my-project",
});
console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Experiment Name: ${result.experimentName}`);
console.log(`Total test cases: ${result.testResults.length}`);

You can use the experiment_config parameter to store information about your evaluation task. Typically we see teams store information about the prompt template, the model used and model parameters used to evaluate the application.
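As a hedged illustration, the configuration can be a plain dictionary. The keys and values below are examples only, not a schema that Opik requires; store whatever helps you tell experiments apart later.

```python
# Illustrative values only; none of these keys are mandated by Opik.
experiment_config = {
    "model": "gpt-4o",
    "temperature": 0.0,
    "prompt_template": "You are a coding assistant",
}

# In the Python SDK this dictionary would be passed as
# evaluate(..., experiment_config=experiment_config).
print(sorted(experiment_config))
```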

6. Analyze the evaluation results

Once the evaluation is complete, you will get a link to the Opik UI where you can analyze the evaluation results. In addition to being able to deep dive into each test case, you will also be able to compare multiple experiments side by side.

Advanced usage

Missing arguments for scoring methods

The opik.exceptions.ScoreMethodMissingArguments exception means that the dataset item and task output dictionaries do not contain all the arguments expected by the scoring method. The evaluate function merges the dataset item and task output dictionaries and then passes the result to the scoring method. For example, if the dataset item contains the keys user_question and context while the evaluation task returns a dictionary with the key output, the scoring method will be called as scoring_method.score(user_question='...', context='...', output='...'). This fails if the scoring method expects a different set of arguments.

You can solve this either by updating the dataset item or evaluation task to return the missing arguments, or by using the scoring_key_mapping parameter of the evaluate function (scoringKeyMapping in the TypeScript SDK). In the example above, if the scoring method expects input as an argument, you can map the user_question key to the input key as follows:

TypeScript

const evaluation = await evaluate({
    dataset,
    task: evaluation_task,
    scoringMetrics: [hallucination_metric],
    scoringKeyMapping: {"input": "user_question"},
});
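The merge-then-map behaviour described in this section can be approximated in a few lines of plain Python. This is a sketch of the idea only, not Opik's internal code: build_score_kwargs and the toy score function are hypothetical, and the rule that task output keys win on a name clash is an assumption of the sketch.

```python
from typing import Any, Dict, Optional


def build_score_kwargs(
    dataset_item: Dict[str, Any],
    task_output: Dict[str, Any],
    scoring_key_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    # Merge both dictionaries; task output keys win on a name clash here
    # (an assumption of this sketch).
    merged = {**dataset_item, **task_output}
    # Each mapping entry reads "metric argument name" -> "existing key".
    for metric_arg, source_key in (scoring_key_mapping or {}).items():
        merged[metric_arg] = merged[source_key]
    return merged


def score(input: str, output: str, **ignored: Any) -> float:
    # Toy scoring method that requires `input` and `output`.
    return 1.0 if input and output else 0.0


item = {"user_question": "What is the capital of France?", "context": "Geography"}
task_output = {"output": "Paris"}

kwargs = build_score_kwargs(item, task_output, {"input": "user_question"})
print(score(**kwargs))
```

Calling score(**build_score_kwargs(item, task_output)) without the mapping raises a TypeError because input is missing, which mirrors the situation the exception reports.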

Linking prompts to experiments

The Opik prompt library can be used to version your prompt templates.

When creating an Experiment, you can link the Experiment to a specific prompt version:

TypeScript

import { Opik, evaluatePrompt, Hallucination } from 'opik';

const client = new Opik();

// Create a prompt
const prompt = await client.createPrompt({
    name: "My prompt",
    prompt: "Translate to French: {{input}}",
    projectName: "my-project",
});

// Link prompt to evaluation experiment
// (myDataset is a dataset you have already created or fetched)
await evaluatePrompt({
    dataset: myDataset,
    messages: [
        { role: "user", content: "Translate to French: {{input}}" },
    ],
    model: "gpt-4o",
    scoringMetrics: [new Hallucination()],
    prompts: [prompt],
    projectName: "my-project",
});

The experiment will now be linked to the prompt, allowing you to view all experiments that use a specific prompt.

Logging traces to a specific project

You can use the project_name parameter of the evaluate function (projectName in the TypeScript SDK) to log evaluation traces to a specific project:

TypeScript

const evaluation = await evaluate({
    dataset,
    task: evaluation_task,
    scoringMetrics: [hallucination_metric],
    projectName: "hallucination-detection",
});

Evaluating a subset of the dataset

You can use the nb_samples parameter (nbSamples in the TypeScript SDK) to specify the number of samples to use for the evaluation. This is useful if you only want to evaluate a subset of the dataset.

TypeScript

const evaluation = await evaluate({
    dataset,
    task: evaluation_task,
    scoringMetrics: [hallucination_metric],
    nbSamples: 10,
});

Evaluating a filtered subset of the dataset

You can evaluate only a subset of your dataset items by using the dataset_filter_string parameter. This is useful when you want to run experiments on specific categories of data or test particular scenarios:

Python

from opik.evaluation import evaluate

# Evaluate only items with specific tags
evaluation = evaluate(
    experiment_name="Production test cases",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_filter_string='tags contains "production"',
)

# Evaluate items matching multiple conditions
evaluation = evaluate(
    experiment_name="Hard finance questions",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_filter_string='data.category = "finance" AND data.difficulty = "hard"',
)

# Filter by date range
evaluation = evaluate(
    experiment_name="Recent test cases",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_filter_string='created_at >= "2024-06-01T00:00:00Z"',
)

The filter uses Opik Query Language (OQL) syntax. For more details on filter syntax and supported columns, see Filtering syntax.

You can combine filtering with other parameters like nb_samples to evaluate a specific number of items from a filtered subset.

Sampling the dataset for evaluation

You can use the dataset_sampler parameter to specify the instance of dataset sampler to use for sampling the dataset. This is useful if you want to sample the dataset differently than the default sampling strategy (accept all items).

For example, you can use the RandomDatasetSampler to sample the dataset randomly:

Python

from opik.evaluation import samplers

evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_sampler=samplers.RandomDatasetSampler(max_samples=10),
)

In the example above, the evaluation will sample 10 random items from the dataset.

Also, you can implement your own dataset sampler by extending the BaseDatasetSampler and overriding the sample method.

Python

import re
from typing import List

from opik.api_objects.dataset import dataset_item
from opik.evaluation import samplers

class MyDatasetSampler(samplers.BaseDatasetSampler):

    def __init__(self, filter_string: str, field_name: str) -> None:
        self.filter_regex = re.compile(filter_string)
        self.field_name = field_name

    def sample(self, dataset: List[dataset_item.DatasetItem]) -> List[dataset_item.DatasetItem]:
        # Keep only the items whose 'field_name' field matches the filter regex
        return [item for item in dataset if self.filter_regex.search(item[self.field_name])]

# Example usage: re.search matches anywhere in the field, so no surrounding
# wildcards are needed in the pattern
evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_sampler=MyDatasetSampler(filter_string=r"SUCCESS", field_name="output"),
)

Implementing your own dataset sampler is useful if you want to implement a custom sampling strategy. For instance, you can implement a dataset sampler that samples the dataset using some filtering criteria as in the example above.
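Before wiring a custom sampler into evaluate, it can help to try the selection logic on plain data. The sketch below uses ordinary dictionaries as stand-ins for dataset items and a hypothetical filter_items helper; it does not depend on Opik and treats a missing field as a non-match instead of raising.

```python
import re
from typing import Any, Dict, List


def filter_items(items: List[Dict[str, Any]], pattern: str, field_name: str) -> List[Dict[str, Any]]:
    """Keep the items whose `field_name` value matches `pattern` anywhere."""
    regex = re.compile(pattern)
    return [item for item in items if regex.search(str(item.get(field_name, "")))]


items = [
    {"input": "q1", "output": "SUCCESS: done"},
    {"input": "q2", "output": "FAILURE: timeout"},
    {"input": "q3"},  # Missing field: skipped instead of raising KeyError
]

print(filter_items(items, r"SUCCESS", "output"))
```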

Analyzing the evaluation results

The evaluate function returns an EvaluationResult object that contains the evaluation results. You can create aggregated statistics for each metric by calling its aggregate_evaluation_scores method:

Python

evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
)

# Retrieve and print the aggregated scores statistics (mean, min, max, std) per metric
scores = evaluation.aggregate_evaluation_scores()
for metric_name, statistics in scores.aggregated_scores.items():
    print(f"{metric_name}: {statistics}")

Aggregated statistics can help analyze evaluation results and are useful for comparing the performance of different models or different versions of the same model, for example.
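To show what these statistics represent, the following sketch computes mean, min, max, and standard deviation by hand from a list of per-item scores. The summarize helper and the sample scores are hypothetical, and the use of the population standard deviation is an assumption; Opik's own calculation may differ.

```python
import statistics
from typing import Dict, List


def summarize(scores_by_metric: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute mean, min, max, and standard deviation for each metric."""
    summary = {}
    for metric_name, values in scores_by_metric.items():
        summary[metric_name] = {
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
            # Population standard deviation; an assumption of this sketch.
            "std": statistics.pstdev(values),
        }
    return summary


scores = {"hallucination_metric": [0.0, 1.0, 0.0, 1.0]}
print(summarize(scores))
```

Note that the helper expects at least one score per metric; an empty list would raise a statistics.StatisticsError.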

Computing experiment-level metrics

In addition to per-item metrics, you can compute experiment-level aggregate metrics that are calculated across all test results. These experiment scores are displayed in the Opik UI alongside feedback scores and can be used for sorting and filtering experiments.

Experiment scores are computed after all test results are collected. You define experiment score functions that take a list of TestResult objects and return a list of ScoreResult objects representing aggregate metrics.

Python

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Hallucination, score_result

# Define an experiment score function
def compute_hallucination_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute the maximum hallucination score across all test results."""
    hallucination_scores = [
        result.score_results[0].value
        for result in test_results
        if result.score_results
    ]

    if not hallucination_scores:
        return []

    return [
        score_result.ScoreResult(
            name="hallucination_metric (max)",
            value=max(hallucination_scores),
            reason=f"Maximum hallucination score across {len(hallucination_scores)} test cases"
        )
    ]

# Run evaluation with experiment scores
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    experiment_scoring_functions=[compute_hallucination_max],
    experiment_name="My experiment"
)

# Access experiment scores from the result
print(f"Experiment scores: {evaluation.experiment_scores}")

Experiment scores are displayed in the Opik UI in the experiments table alongside feedback scores. They can be used for sorting and filtering experiments, making it easy to compare experiments based on aggregate metrics.

You can define multiple experiment score functions to compute different aggregate metrics:

Python

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Equals, score_result

def compute_accuracy_stats(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute accuracy statistics across all test results."""
    accuracy_scores = [
        result.score_results[0].value
        for result in test_results
        if result.score_results
    ]

    if not accuracy_scores:
        return []

    return [
        score_result.ScoreResult(
            name="accuracy (mean)",
            value=sum(accuracy_scores) / len(accuracy_scores),
            reason=f"Mean accuracy across {len(accuracy_scores)} test cases"
        ),
        score_result.ScoreResult(
            name="accuracy (min)",
            value=min(accuracy_scores),
            reason=f"Minimum accuracy across {len(accuracy_scores)} test cases"
        ),
        score_result.ScoreResult(
            name="accuracy (max)",
            value=max(accuracy_scores),
            reason=f"Maximum accuracy across {len(accuracy_scores)} test cases"
        ),
    ]

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Equals()],
    experiment_scoring_functions=[compute_accuracy_stats],
    experiment_name="My experiment"
)

Experiment score functions receive all test results after evaluation completes. Make sure your functions handle edge cases like empty test results or missing score values gracefully.
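One way to handle those edge cases is to look scores up by metric name, skip missing values, and guard the final division. The sketch below uses small hypothetical dataclasses (FakeScore, FakeTestResult) as stand-ins for Opik's result objects so that it runs on its own; only the score_results, name, and value attribute names are borrowed from the examples above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeScore:
    # Hypothetical stand-in for a per-item score result.
    name: str
    value: Optional[float]


@dataclass
class FakeTestResult:
    # Hypothetical stand-in for a test result holding several scores.
    score_results: List[FakeScore] = field(default_factory=list)


def collect_values(test_results: List[FakeTestResult], metric_name: str) -> List[float]:
    """Gather the non-missing values of one metric, looked up by name."""
    values = []
    for result in test_results:
        for score in result.score_results:
            if score.name == metric_name and score.value is not None:
                values.append(score.value)
    return values


def mean_or_none(values: List[float]) -> Optional[float]:
    # Avoid ZeroDivisionError when no usable scores were collected.
    return sum(values) / len(values) if values else None


results = [
    FakeTestResult([FakeScore("equals_metric", 1.0)]),
    FakeTestResult([FakeScore("equals_metric", None)]),  # scoring failed
    FakeTestResult([]),                                  # no scores at all
    FakeTestResult([FakeScore("equals_metric", 0.0)]),
]

print(mean_or_none(collect_values(results, "equals_metric")))
print(mean_or_none(collect_values([], "equals_metric")))
```

Selecting by name rather than by position also keeps the function correct when several scoring metrics are used in the same experiment.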

Python SDK

Using async evaluation tasks

The evaluate function does not support async evaluation tasks. If you pass an async task, you will get an error similar to:

Input should be a valid dictionary [type=dict_type, input_value='<coroutine object kyc_qu...ng_task at 0x3336d0a40>', input_type=str]

As it might not always be possible to convert all your LLM logic to not rely on async logic, we recommend using asyncio.run within the evaluation task:

import asyncio

async def your_llm_application(input: str) -> str:
    return "Hello, World"

def evaluation_task(x):
    # your_llm_application here is an async function
    result = asyncio.run(your_llm_application(x['input']))
    return {
        "output": result
    }

This should solve the issue and allow you to run the evaluation.

If you are running in a Jupyter notebook, add the following lines to the top of your notebook; otherwise you might get the error RuntimeError: asyncio.run() cannot be called from a running event loop.

import nest_asyncio
nest_asyncio.apply()

The evaluate function uses multi-threading under the hood to speed up the evaluation run. Using both asyncio and multi-threading can lead to unexpected behavior and hard-to-debug errors.

If you run into any issues, you can disable the multi-threading in the SDK by setting task_threads to 1:

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    task_threads=1
)

Disabling threading

To evaluate datasets more efficiently, Opik uses multiple background threads. If this is causing issues, you can disable them by setting task_threads and scoring_threads to 1, which makes Opik run all calculations in the main thread.

Passing additional arguments to `evaluation_task`

Sometimes your evaluation task needs extra context besides the dataset item (commonly referred to as x). For example, you may want to pass a model name, a system prompt, or a pre-initialized client. Since evaluate calls the task as task(x) for each dataset item, the recommended pattern is to create a wrapper (or use functools.partial) that closes over any additional arguments.

Using a wrapper function:

# Extra dependencies you want to provide to the task
# (bot, system_prompt, and levenshteinratio_metric are assumed to be defined elsewhere)
MODEL = "gpt-4o"
IMAGE_TYPE = "thumbnail"

def evaluation_task(x, model, image_type, client, prompt):
    full_response = client.get_answer(
        x["question"],
        x["image_paths"][image_type],
        prompt.format(),
        model=model,
    )
    response = full_response["response"]
    return {
        "response": response,
        "bbox": full_response.get("bounding_boxes"),
        "image_url": full_response.get("image_url"),
    }

def make_task(model, image_type, client, prompt):
    # Return a unary function that evaluate() can call as task(x)
    def _task(x):
        return evaluation_task(x, model, image_type, client, prompt)
    return _task

task = make_task(MODEL, IMAGE_TYPE, bot, system_prompt)

evaluation = evaluate(
    dataset=dataset,
    task=task,  # evaluate will call task(x) for each item
    scoring_metrics=[levenshteinratio_metric],
    scoring_key_mapping={
        "input": "question",
        "output": "response",
        "reference": "expected_answer",
    },
)
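The functools.partial alternative mentioned earlier in this section looks roughly like this. The task body is a stub and the model and system_prompt arguments are illustrative assumptions; the point is that partial binds the extra arguments so that only the dataset item remains.

```python
from functools import partial
from typing import Any, Dict


def evaluation_task(x: Dict[str, Any], model: str, system_prompt: str) -> Dict[str, Any]:
    # Stubbed logic; a real task would call your LLM application here.
    return {"response": f"[{model}] {system_prompt} {x['question']}"}


# Bind the extra arguments by keyword so the only remaining positional
# parameter is the dataset item.
task = partial(evaluation_task, model="gpt-4o", system_prompt="Answer briefly:")

print(task({"question": "What is 2 + 2?"}))
```

The resulting task can be called as task(x), the same shape as the wrapper function, with less boilerplate when the extra arguments are simple values.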

Using Scoring Functions

In addition to using built-in metrics, Opik allows you to define custom scoring functions to evaluate your LLM applications. Scoring functions give you complete control over how your outputs are evaluated and can be tailored to your specific use cases.

There are two types of scoring functions you can use:

Plain Scoring Functions: Use dataset_item and task_outputs parameters
Task Span Scoring Functions: Use a task_span parameter for advanced evaluation

Using Plain Scoring Functions in Evaluation

Plain scoring functions receive dataset inputs and task outputs, making them ideal for evaluating the final results of your LLM application:

Python

from typing import Dict, Any
from opik.evaluation.metrics import score_result

def custom_equals_scorer(
    dataset_item: Dict[str, Any],
    task_outputs: Dict[str, Any]
) -> score_result.ScoreResult:
    """
    Custom scoring function that compares expected output with actual output.

    Args:
        dataset_item: Data from the dataset item (includes expected outputs)
        task_outputs: Outputs from the evaluation task
    """
    expected = dataset_item.get("expected_output")
    actual = task_outputs.get("output")

    if expected == actual:
        score = 1.0
        reason = "Perfect match"
    else:
        score = 0.0
        reason = f"Mismatch: expected '{expected}', got '{actual}'"

    return score_result.ScoreResult(
        name="custom_equals_scorer",
        value=score,
        reason=reason
    )

You can use your custom scoring functions alongside built-in metrics:

Python

import opik
from opik import evaluate
from opik.evaluation.metrics import Hallucination

opik_client = opik.Opik()

# Create dataset
dataset = opik_client.create_dataset("custom_evaluation_dataset", project_name="my-project")
dataset.insert([
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "input": "What is 2 + 2?",
        "expected_output": "4"
    }
])

# Define evaluation task
def evaluation_task(item):
    # Your LLM application logic here
    return {"output": your_llm_application(item["input"])}

# Run evaluation with custom scoring functions
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_functions=[
        custom_equals_scorer
    ],
    scoring_metrics=[
        Hallucination()  # Mix with built-in metrics
    ],
    experiment_name="Custom Scoring Experiment"
)

Task Span Scoring Functions

Task span scoring functions provide access to detailed execution information about your LLM tasks. These functions receive a task_span parameter containing structured data about the task execution, including input, output, metadata, and nested operations.

Task span functions are particularly useful for evaluating:

The internal structure and behavior of your LLM applications
Performance characteristics like execution patterns
Quality of intermediate steps in complex workflows
Cost and usage optimization opportunities
Agent trajectory analysis

Creating Task Span Scoring Functions

Task span scoring functions accept a task_span parameter which is a SpanModel object:

Python

from opik.evaluation.metrics import score_result
from opik.message_processing.emulation.models import SpanModel

def execution_time_scorer(
    task_span: SpanModel
) -> score_result.ScoreResult:
    """
    Scoring function that evaluates based on execution time.

    Args:
        task_span: Complete execution information including timing
    """
    if task_span.start_time and task_span.end_time:
        duration = (task_span.end_time - task_span.start_time).total_seconds()

        # Score based on execution speed
        if duration < 1.0:
            score = 1.0
            reason = f"Fast execution: {duration:.2f}s"
        elif duration < 5.0:
            score = 0.8
            reason = f"Acceptable execution time: {duration:.2f}s"
        else:
            score = 0.5
            reason = f"Slow execution: {duration:.2f}s"
    else:
        score = 0.0
        reason = "Cannot determine execution time"

    return score_result.ScoreResult(
        name="execution_time_scorer",
        value=score,
        reason=reason
    )

def task_name_scorer(
    task_span: SpanModel
) -> score_result.ScoreResult:
    """
    Scoring function that validates the task span name.
    """
    expected_name = "your_llm_application"  # Adjust to your function name

    score = 1.0 if task_span.name == expected_name else 0.0
    reason = f"Task name: '{task_span.name}'"

    return score_result.ScoreResult(
        name="task_name_scorer",
        value=score,
        reason=reason
    )

Combined Scoring Functions

You can also create scoring functions that use both dataset inputs/outputs AND task span information:

Python

from typing import Any, Dict
from opik.evaluation.metrics import score_result
from opik.message_processing.emulation.models import SpanModel

def comprehensive_scorer(
    dataset_item: Dict[str, Any],
    task_outputs: Dict[str, Any],
    task_span: SpanModel
) -> score_result.ScoreResult:
    """
    Comprehensive scoring function using all available information.

    Args:
        dataset_item: Dataset item data
        task_outputs: Task execution outputs
        task_span: Detailed execution information
    """
    # Check output correctness
    expected = dataset_item.get("expected_output")
    actual = task_outputs.get("output")
    correctness_score = 1.0 if expected == actual else 0.0

    # Check execution efficiency
    if task_span.start_time and task_span.end_time:
        duration = (task_span.end_time - task_span.start_time).total_seconds()
        efficiency_score = 1.0 if duration < 2.0 else 0.5
    else:
        efficiency_score = 0.0

    # Combined score (weighted average)
    final_score = (correctness_score * 0.7) + (efficiency_score * 0.3)

    return score_result.ScoreResult(
        name="comprehensive_scorer",
        value=final_score,
        reason=f"Correctness: {correctness_score}, Efficiency: {efficiency_score}"
    )

Using Task Span Scoring Functions in Evaluation

Task span scoring functions work seamlessly with the evaluation framework:

Python

from opik import evaluate, track

@track  # Enable span collection for task span metrics
def evaluation_task(item):
    return {"output": your_llm_application(item["input"])}

# Run evaluation with task span scoring functions
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,  # Must be decorated with @track
    scoring_functions=[
        execution_time_scorer,
        task_name_scorer,
        comprehensive_scorer  # Mix different types
    ],
    experiment_name="Task Span Evaluation"
)

When you use task span scoring functions, Opik automatically enables span collection and analysis. You don’t need to configure anything special - the system will detect functions with task_span parameters and handle them appropriately.
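The source does not describe how this detection is implemented. As a purely illustrative sketch of the general idea, a parameter named task_span can be found by inspecting a function's signature; uses_task_span below is a hypothetical helper, not part of Opik.

```python
import inspect
from typing import Any, Callable


def uses_task_span(scoring_function: Callable[..., Any]) -> bool:
    """Return True when the callable declares a `task_span` parameter."""
    return "task_span" in inspect.signature(scoring_function).parameters


def plain_scorer(dataset_item, task_outputs):
    return 1.0


def span_scorer(task_span):
    return 1.0


print(uses_task_span(plain_scorer))
print(uses_task_span(span_scorer))
```

The practical consequence is that the parameter name matters: a function that should receive span data needs a parameter called task_span.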

Task span scoring functions have access to detailed execution information including inputs, outputs, and metadata. Be mindful of sensitive data and ensure your functions handle this information appropriately.

Using task span evaluation metrics

Opik supports advanced evaluation metrics that can analyze the detailed execution information of your LLM tasks. These metrics receive a task_span parameter containing structured data about the task execution, including input, output, metadata, and nested operations.

Task span metrics are particularly useful for evaluating:

The internal structure and behavior of your LLM applications
Performance characteristics like execution patterns
Quality of intermediate steps in complex workflows
Cost and usage optimization opportunities
Agent trajectory

Creating task span metrics

To create a task span evaluation metric, define a metric class that accepts a task_span parameter in its score method. The task_span parameter is a SpanModel object that contains detailed information about the task execution:

Python

from typing import Any
from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel

class ExecutionTimeMetric(BaseMetric):
    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Calculate execution duration
        if task_span.start_time and task_span.end_time:
            duration = (task_span.end_time - task_span.start_time).total_seconds()

            # Score based on execution speed
            if duration < 1.0:
                score = 1.0
                reason = f"Fast execution: {duration:.2f}s"
            elif duration < 5.0:
                score = 0.8
                reason = f"Acceptable execution time: {duration:.2f}s"
            else:
                score = 0.5
                reason = f"Slow execution: {duration:.2f}s"
        else:
            score = 0.0
            reason = "Cannot determine execution time"

        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=reason
        )

Using task span metrics in evaluation

Task span metrics work alongside regular evaluation metrics and are automatically detected by the evaluation engine:

Python

from opik import evaluate
from opik.evaluation.metrics import Equals

# Create both regular and task span metrics
equals_metric = Equals()
timing_metric = ExecutionTimeMetric()

# Run evaluation with mixed metric types
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        equals_metric,        # Regular metric
        timing_metric,        # Task span metric
    ],
    experiment_name="Comprehensive Evaluation"
)

When you use task span metrics, Opik automatically enables span collection and analysis. You don’t need to configure anything special - the system will detect metrics with task_span parameters and handle them appropriately.

Accessing span hierarchy

Task spans can contain nested spans representing sub-operations. You can analyze the complete execution hierarchy.

Here’s an example of a tracked function that produces nested spans:

Python

from opik import track
from opik.integrations.openai import track_openai
import openai

openai_client = track_openai(openai.OpenAI())

@track
def research_topic(topic: str) -> str:
    """Main research function that creates nested spans."""

    # This will create a nested span for gathering context
    context = gather_context(topic)

    # This will create another nested span for analysis
    analysis = analyze_information(context, topic)

    # Final span for generating summary
    summary = generate_summary(analysis, topic)

    return summary

@track
def gather_context(topic: str) -> str:
    """Gather background context - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Provide background context about: {topic}"
        }]
    )
    return response.choices[0].message.content

@track
def analyze_information(context: str, topic: str) -> str:
    """Analyze the gathered information - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Analyze this context about {topic}: {context}"
        }]
    )
    return response.choices[0].message.content

@track
def generate_summary(analysis: str, topic: str) -> str:
    """Generate final summary - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Create a summary for {topic} based on: {analysis}"
        }]
    )
    return response.choices[0].message.content

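When you call research_topic("artificial intelligence"), Opik will create a hierarchy of seven spans: the root research_topic span, one child span per tracked helper function, and one nested LLM span for each OpenAI call made through the tracked client:

research_topic
    gather_context
        OpenAI chat completion (LLM span)
    analyze_information
        OpenAI chat completion (LLM span)
    generate_summary
        OpenAI chat completion (LLM span)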

You can then analyze this complete execution hierarchy using task span metrics:

Python

from typing import Any, Optional

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel

class HierarchyAnalysisMetric(BaseMetric):
    def _analyze_hierarchy_recursively(self, span: SpanModel, hierarchy_stats: Optional[dict] = None) -> dict:
        """Recursively analyze span hierarchy across the entire span tree."""
        if hierarchy_stats is None:
            hierarchy_stats = {
                'total_spans': 0,
                'llm_spans': 0,
                'tool_spans': 0,
                'other_spans': 0,
                'max_depth': 0,
                'current_depth': 0,
                'llm_span_names': [],
                'tool_span_names': []
            }

        # Count current span
        hierarchy_stats['total_spans'] += 1
        hierarchy_stats['max_depth'] = max(hierarchy_stats['max_depth'], hierarchy_stats['current_depth'])

        # Categorize span types
        if span.type == "llm":
            hierarchy_stats['llm_spans'] += 1
            hierarchy_stats['llm_span_names'].append(span.name)
        elif span.type == "tool":
            hierarchy_stats['tool_spans'] += 1
            hierarchy_stats['tool_span_names'].append(span.name)
        else:
            hierarchy_stats['other_spans'] += 1

        # Recursively analyze nested spans with depth tracking
        for nested_span in span.spans:
            hierarchy_stats['current_depth'] += 1
            self._analyze_hierarchy_recursively(nested_span, hierarchy_stats)
            hierarchy_stats['current_depth'] -= 1

        return hierarchy_stats

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Analyze hierarchy across the entire span tree
        # Only for illustrative purposes.
        # Please adjust for your specific use case!
        hierarchy_stats = self._analyze_hierarchy_recursively(task_span)

        total_operations = hierarchy_stats['total_spans']
        llm_operations = hierarchy_stats['llm_spans']
        tool_operations = hierarchy_stats['tool_spans']
        max_depth = hierarchy_stats['max_depth']

        # Analyze the complexity and structure of the operation
        if llm_operations > 5:
            # Many LLM calls might indicate inefficient processing
            if tool_operations == 0:
                score = 0.4
                reason = f"Over-complex operation: {llm_operations} LLM calls with no tool usage (depth: {max_depth})"
            else:
                score = 0.6
                reason = f"Complex operation: {llm_operations} LLM calls, {tool_operations} tool calls (depth: {max_depth})"
        elif llm_operations == 0:
            # No reasoning might indicate a purely mechanical process
            if tool_operations > 0:
                score = 0.3
                reason = f"No reasoning detected: {tool_operations} tool calls only"
            else:
                score = 0.1
                reason = "No LLM or tool operations detected"
        else:
            # Balanced approach with reasonable LLM usage
            balance_ratio = min(llm_operations, tool_operations) / max(llm_operations, tool_operations) if tool_operations > 0 else 0.8
            depth_bonus = 1.0 if max_depth <= 3 else max(0.8, 1.0 - (max_depth - 3) * 0.05)

            score = min(1.0, 0.7 + balance_ratio * 0.2 + depth_bonus * 0.1)

            if tool_operations > 0:
                reason = f"Well-structured operation: {llm_operations} LLM calls, {tool_operations} tool calls across {total_operations} spans (depth: {max_depth})"
            else:
                reason = f"Reasoning-focused operation: {llm_operations} LLM calls across {total_operations} spans (depth: {max_depth})"

        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=reason
        )

For the span hierarchy given above, HierarchyAnalysisMetric returns the following score:

    Score: 0.96, Reason: Reasoning-focused operation: 3 LLM calls across 7 spans (depth: 2)
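The arithmetic behind that result can be checked without calling any LLM. The sketch below rebuilds the research_topic hierarchy with a hypothetical FakeSpan dataclass (a stand-in that models only the name, type and spans fields, not the real SpanModel), walks it, and applies the same formula as the final branch of the score method. The span name openai_chat_completion is a placeholder, not the name Opik assigns.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in for SpanModel with only the fields used here."""
    name: str
    type: str = "general"
    spans: List["FakeSpan"] = field(default_factory=list)


def llm_call() -> FakeSpan:
    # Placeholder for the span created by the tracked OpenAI client
    return FakeSpan(name="openai_chat_completion", type="llm")


root = FakeSpan(
    name="research_topic",
    spans=[
        FakeSpan(name="gather_context", spans=[llm_call()]),
        FakeSpan(name="analyze_information", spans=[llm_call()]),
        FakeSpan(name="generate_summary", spans=[llm_call()]),
    ],
)


def walk(span: FakeSpan, depth: int = 0, stats: Optional[dict] = None) -> dict:
    """Count spans by type and track the deepest nesting level."""
    if stats is None:
        stats = {"total": 0, "llm": 0, "tool": 0, "max_depth": 0}
    stats["total"] += 1
    stats["max_depth"] = max(stats["max_depth"], depth)
    if span.type in ("llm", "tool"):
        stats[span.type] += 1
    for child in span.spans:
        walk(child, depth + 1, stats)
    return stats


stats = walk(root)

# Same formula as the final (else) branch of HierarchyAnalysisMetric.score
llm, tool, depth = stats["llm"], stats["tool"], stats["max_depth"]
balance_ratio = min(llm, tool) / max(llm, tool) if tool > 0 else 0.8
depth_bonus = 1.0 if depth <= 3 else max(0.8, 1.0 - (depth - 3) * 0.05)
score = min(1.0, 0.7 + balance_ratio * 0.2 + depth_bonus * 0.1)

print(stats)
print(round(score, 2))
```

With no tool calls the balance ratio falls back to 0.8, and a depth of 2 earns the full depth bonus, so the score is 0.7 + 0.8 × 0.2 + 1.0 × 0.1 = 0.96.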

Quickly testing task span metrics locally

You can validate a task span metric without running a full evaluation by recording spans locally. The SDK provides a context manager that captures all spans/traces created in the block and exposes them in-memory:

Python

import opik
from opik import track
from opik.evaluation.metrics import score_result
from opik.message_processing.emulation.models import SpanModel

# Example metric under test
class ExecutionTimeMetric:
    def __init__(self, name: str = "execution_time_metric"):
        self.name = name

    def score(self, task_span: SpanModel, **_):
        if task_span.start_time and task_span.end_time:
            duration = (task_span.end_time - task_span.start_time).total_seconds()
            value = 1.0 if duration < 2.0 else 0.5
            reason = f"Duration: {duration:.2f}s"
        else:
            value = 0.0
            reason = "Missing timing information"
        return score_result.ScoreResult(value=value, name=self.name, reason=reason)

@track
def my_tracked_function(question: str) -> str:
    # Your LLM/tool code here that produces spans
    return f"Answer to: {question}"

with opik.record_traces_locally() as storage:
    # Execute tracked code that creates spans
    _ = my_tracked_function("What is the capital of France?")

    # Access the in-memory span tree (flush is automatic before reading)
    span_trees = storage.span_trees
    assert len(span_trees) > 0, "No spans recorded"
    root_span = span_trees[0]

    # Evaluate your task span metric directly
    metric = ExecutionTimeMetric()
    result = metric.score(task_span=root_span)
    print(result)

Local recording cannot be nested. If a recording block is already active, entering another will raise an error.

Best practices for task span metrics

Focus on execution patterns: Use task span metrics to evaluate how your application executes, not just the final output
Combine with regular metrics: Mix task span metrics with traditional output-based metrics for comprehensive evaluation
Analyze performance: Leverage timing, cost, and usage information for optimization insights
Handle missing data gracefully: Always check for None values in optional span attributes
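As an illustration of the last point, the sketch below guards a duration calculation against missing timestamps. FakeSpan is a hypothetical stand-in that models only the start_time and end_time fields, and safe_duration_seconds is a helper name introduced here; neither is part of the SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in for SpanModel; only the timing fields are modelled."""
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None


def safe_duration_seconds(span: FakeSpan) -> Optional[float]:
    """Return the span duration in seconds, or None if either timestamp is missing."""
    if span.start_time is None or span.end_time is None:
        return None
    return (span.end_time - span.start_time).total_seconds()


start = datetime(2024, 1, 1, 12, 0, 0)
complete = FakeSpan(start_time=start, end_time=start + timedelta(seconds=1.5))
unfinished = FakeSpan(start_time=start)  # end_time was never recorded

print(safe_duration_seconds(complete))    # 1.5
print(safe_duration_seconds(unfinished))  # None
```

Returning None instead of raising lets the metric decide how to score an incomplete span, for example with a value of 0.0 and an explanatory reason.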

Task span metrics have access to detailed execution information including inputs, outputs, and metadata. Be mindful of sensitive data and ensure your metrics handle this information appropriately.

Accessing logged experiments

You can retrieve the experiments logged to the platform from the SDK by name, using the getExperimentsByName method:

TypeScript

import { Opik } from "opik";

const client = new Opik({
    apiKey: "your-api-key",
    apiUrl: "https://www.comet.com/opik/api",
    projectName: "your-project-name",
    workspaceName: "your-workspace-name",
});
const experiments = await client.getExperimentsByName("My experiment");

// Access the first experiment content
const items = await experiments[0].getItems();
console.log(items);
