Evaluate your agent
Evaluating your LLM application gives you confidence in its performance. In this guide, we will walk through the process of evaluating complex applications such as LLM chains or agents.
This guide focuses on evaluating complex LLM applications. If you are looking to evaluate a single prompt, refer to the Evaluate A Prompt guide.
The evaluation is done in five steps:
- Add tracing to your LLM application
- Define the evaluation task
- Choose the Dataset that you would like to evaluate your application on
- Choose the metrics that you would like to evaluate your application with
- Create and run the evaluation experiment
Running an offline evaluation
1. (Optional) Add tracing to your LLM application
While not required, we recommend adding tracing to your LLM application. This gives you full visibility into each evaluation run. In the example below, we will use a combination of the track decorator and the track_openai function to trace the LLM application.
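Below is a minimal sketch of such an application; the model name and the application logic are placeholders for your own code:

```python
from openai import OpenAI
from opik import track
from opik.integrations.openai import track_openai

# Wrap the OpenAI client so every LLM call is logged as a span
openai_client = track_openai(OpenAI())

MODEL = "gpt-4o-mini"  # placeholder model name, swap in your own

@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content
```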
Here we have added the track decorator so that this trace and all its nested
steps are logged to the platform for further analysis.
2. Define the evaluation task
Once you have added instrumentation to your LLM application, we can define the evaluation task. The evaluation task takes a dataset item as input and must return a dictionary whose keys match the parameters expected by the metrics you are using. In this example, we can define the evaluation task as follows:
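A minimal sketch, assuming the dataset items have a user_question field and the metrics expect an output key:

```python
from typing import Any, Dict

def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
    # The keys returned here must match the parameters expected by your metrics
    result = your_llm_application(dataset_item["user_question"])
    return {"output": result}
```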
If the dictionary returned does not match with the parameters expected by the metrics, you will get inconsistent evaluation results.
3. Choose the evaluation Dataset
In order to create an evaluation experiment, you will need to have a Dataset that includes all your test cases.
If you have already created a Dataset, you can use the get_or_create_dataset method to fetch it.
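For example (the dataset name and item fields are illustrative):

```python
import opik

client = opik.Opik()

# Fetches the dataset if it already exists, otherwise creates it
dataset = client.get_or_create_dataset(name="Example dataset")

# Insert test cases if the dataset is empty
dataset.insert([
    {"user_question": "What is the capital of France?"},
    {"user_question": "What is the capital of Germany?"},
])
```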
4. Choose evaluation metrics
Opik provides a set of built-in evaluation metrics that you can choose from. These are broken down into two main categories:
- Heuristic metrics: These metrics are deterministic in nature, for example Equals or Contains
- LLM-as-a-judge: These metrics use an LLM to judge the quality of the output; typically these are used for detecting hallucinations or context relevance
In the same evaluation experiment, you can use multiple metrics to evaluate your application:
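For example, combining two built-in LLM-as-a-judge metrics (the metric choice is illustrative, and each metric expects specific fields as described below):

```python
from opik.evaluation.metrics import AnswerRelevance, Hallucination

# Both metrics will be computed for every item in the experiment
metrics = [Hallucination(), AnswerRelevance()]
```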
Each metric expects the data in a certain format. You will need to ensure that the task you have defined in step 2 returns the data in the correct format.
5. Run the evaluation
Now that we have the task we want to evaluate, the dataset to evaluate on, and the metrics we want to evaluate with, we can run the evaluation:
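Reusing the dataset, task, and metrics from the previous steps, a run might look like this (the experiment_config contents are illustrative):

```python
from opik.evaluation import evaluate

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    experiment_config={
        "model": MODEL,
        "prompt_template": "Answer the user question: {{user_question}}",
    },
)
```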
You can use the experiment_config parameter to store information about your
evaluation task. Typically we see teams store information about the prompt
template, the model used and model parameters used to evaluate the
application.
6. Analyze the evaluation results
Once the evaluation is complete, you will get a link to the Opik UI where you can analyze the evaluation results. In addition to being able to deep dive into each test case, you will also be able to compare multiple experiments side by side.

Advanced usage
Missing arguments for scoring methods
When you face the opik.exceptions.ScoreMethodMissingArguments exception, it means that the dataset
item and task output dictionaries do not contain all the arguments expected by the scoring method.
The way the evaluate function works is by merging the dataset item and task output dictionaries and
then passing the result to the scoring method. For example, if the dataset item contains the keys
user_question and context while the evaluation task returns a dictionary with the key output,
the scoring method will be called as scoring_method.score(user_question='...', context='...', output='...').
This can be an issue if the scoring method expects a different set of arguments.
You can solve this by either updating the dataset item or evaluation task to return the missing
arguments or by using the scoring_key_mapping parameter of the evaluate function. In the example
above, if the scoring method expects input as an argument, you can map the user_question key to
the input key as follows:
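A sketch of that mapping, using the metric and field names from the example above:

```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    # Map the metric's expected `input` argument to the dataset's `user_question` field
    scoring_key_mapping={"input": "user_question"},
)
```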
Linking prompts to experiments
The Opik prompt library can be used to version your prompt templates.
When creating an Experiment, you can link the Experiment to a specific prompt version:
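A sketch, assuming the evaluate function accepts a prompt argument in your SDK version; the prompt name and template are illustrative:

```python
import opik
from opik.evaluation import evaluate

opik_client = opik.Opik()

# Create (or version) the prompt in the Opik prompt library
prompt = opik_client.create_prompt(
    name="qa-prompt",
    prompt="Answer the question: {{user_question}}",
)

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    prompt=prompt,  # links this experiment to the prompt version above
)
```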
The experiment will now be linked to the prompt allowing you to view all experiments that use a specific prompt:

Logging traces to a specific project
You can use the project_name parameter of the evaluate function to log evaluation traces to a specific project:
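For example (the project name is illustrative):

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    project_name="evaluation-traces",
)
```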
Evaluating a subset of the dataset
You can use the nb_samples parameter to specify the number of samples to use for the evaluation. This is useful if you only want to evaluate a subset of the dataset.
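For example, to evaluate only the first 10 items:

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    nb_samples=10,
)
```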
Sampling the dataset for evaluation
You can use the dataset_sampler parameter to specify the dataset sampler instance used to sample the dataset.
This is useful if you want to sample the dataset differently from the default strategy, which accepts all items.
For example, you can use the RandomDatasetSampler to sample the dataset randomly:
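A sketch; the import path and constructor argument are assumptions, so check your SDK version for the exact names:

```python
from opik.evaluation import evaluate
# Import path and argument name are assumptions - check your SDK version
from opik.evaluation.samplers import RandomDatasetSampler

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    # Randomly pick 10 items instead of evaluating the whole dataset
    dataset_sampler=RandomDatasetSampler(max_samples=10),
)
```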
In the example above, the evaluation will sample 10 random items from the dataset.
You can also implement your own dataset sampler by extending BaseDatasetSampler and overriding its sample method.
Implementing your own dataset sampler is useful if you want a custom sampling strategy, for instance a sampler that selects dataset items based on some filtering criteria, as in the sketch below.
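A sketch of such a sampler; the base class import path and the sample signature are assumptions:

```python
from typing import Any, Dict, List

# Import path is an assumption - check your SDK version
from opik.evaluation.samplers import BaseDatasetSampler

class FilteringDatasetSampler(BaseDatasetSampler):
    """Hypothetical sampler that keeps only items matching a filtering criterion."""

    def sample(self, dataset_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Keep only items that include a non-empty "context" field
        return [item for item in dataset_items if item.get("context")]
```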
Analyzing the evaluation results
The evaluate function returns an EvaluationResult object that contains the evaluation results.
You can create aggregated statistics for each metric by calling its aggregate_evaluation_scores method:
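For example (the exact shape of the returned statistics may differ between SDK versions):

```python
from opik.evaluation import evaluate

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
)

# Aggregated statistics (for example mean, min, max) per metric
stats = evaluation.aggregate_evaluation_scores()
print(stats)
```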
Aggregated statistics can help analyze evaluation results and are useful for comparing the performance of different models or different versions of the same model, for example.
Using async evaluation tasks
The evaluate function does not support async evaluation tasks; if you pass an async task, you will get an error.
As it might not always be possible to remove async logic from your LLM application,
we recommend using asyncio.run within the evaluation task:
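A sketch, with your_async_llm_application standing in for your existing async logic:

```python
import asyncio
from typing import Any, Dict

async def your_async_llm_application(input: str) -> str:
    ...  # your existing async LLM logic

def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
    # Run the async call to completion inside the synchronous task
    result = asyncio.run(your_async_llm_application(dataset_item["user_question"]))
    return {"output": result}
```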
This should solve the issue and allow you to run the evaluation.
If you are running in a Jupyter notebook, you will need to add the following line to the top of your notebook:
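One common approach (an assumption here; it requires the nest_asyncio package to be installed) is:

```python
# Allows asyncio.run() to be used inside the notebook's already-running event loop
import nest_asyncio
nest_asyncio.apply()
```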
otherwise you might get the error RuntimeError: asyncio.run() cannot be called from a running event loop
The evaluate function uses multi-threading under the hood to speed up the evaluation run. Using both
asyncio and multi-threading can lead to unexpected behavior and hard-to-debug errors.
If you run into any issues, you can disable the multi-threading in the SDK by setting task_threads to 1:
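For example, reusing the objects from the earlier steps:

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    task_threads=1,  # evaluation tasks run sequentially
)
```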
Disabling threading
In order to evaluate datasets more efficiently, Opik uses multiple background threads to evaluate the dataset. If this is causing issues, you can disable these by setting task_threads and scoring_threads to 1 which will lead Opik to run all calculations in the main thread.
Passing additional arguments to evaluation_task
Sometimes your evaluation task needs extra context besides the dataset item (commonly referred to as x). For example, you may want to pass a model name, a system prompt, or a pre-initialized client.
Since evaluate calls the task as task(x) for each dataset item, the recommended pattern is to create a wrapper (or use functools.partial) that closes over any additional arguments.
Using a wrapper function:
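A sketch of the wrapper pattern; the extra arguments and the call_llm helper that accepts them are hypothetical:

```python
from typing import Any, Dict

from opik.evaluation import evaluate

def build_evaluation_task(model_name: str, system_prompt: str):
    """Returns a task with the signature evaluate expects: task(dataset_item)."""

    def evaluation_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
        # `model_name` and `system_prompt` are closed over by the wrapper
        result = call_llm(  # hypothetical helper that accepts the extra arguments
            question=dataset_item["user_question"],
            model=model_name,
            system_prompt=system_prompt,
        )
        return {"output": result}

    return evaluation_task

evaluation = evaluate(
    dataset=dataset,
    task=build_evaluation_task("gpt-4o-mini", "You are a helpful assistant."),
    scoring_metrics=metrics,
)
```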
Using Scoring Functions
In addition to using built-in metrics, Opik allows you to define custom scoring functions to evaluate your LLM applications. Scoring functions give you complete control over how your outputs are evaluated and can be tailored to your specific use cases.
There are two types of scoring functions you can use:
- Plain Scoring Functions: use the dataset_item and task_outputs parameters
- Task Span Scoring Functions: use a task_span parameter for advanced evaluation
Using Plain Scoring Functions in Evaluation
Plain scoring functions receive dataset inputs and task outputs, making them ideal for evaluating the final results of your LLM application:
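A sketch of a plain scoring function; the parameter names follow the description above, and the ScoreResult import path may differ between SDK versions:

```python
from typing import Any, Dict

from opik.evaluation.metrics.score_result import ScoreResult

def exact_match(dataset_item: Dict[str, Any], task_outputs: Dict[str, Any]) -> ScoreResult:
    """Hypothetical scoring function: 1.0 when the output matches the expected answer."""
    expected = str(dataset_item.get("expected_output", "")).strip()
    actual = str(task_outputs.get("output", "")).strip()
    return ScoreResult(name="exact_match", value=1.0 if expected == actual else 0.0)
```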
You can use your custom scoring functions alongside built-in metrics:
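A sketch, assuming scoring functions can be passed in the same scoring_metrics list as metric instances (check your SDK version):

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), exact_match],
)
```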
Task Span Scoring Functions
Task span scoring functions provide access to detailed execution information about your LLM tasks. These functions receive a task_span parameter containing structured data about the task execution, including input, output, metadata, and nested operations.
Task span functions are particularly useful for evaluating:
- The internal structure and behavior of your LLM applications
- Performance characteristics like execution patterns
- Quality of intermediate steps in complex workflows
- Cost and usage optimization opportunities
- Agent trajectory analysis
Creating Task Span Scoring Functions
Task span scoring functions accept a task_span parameter which is a SpanModel object:
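A sketch; the SpanModel attribute names used here (type and the nested spans list) are assumptions, so adapt them to the actual SpanModel fields in your SDK version:

```python
from opik.evaluation.metrics.score_result import ScoreResult

def tool_usage_score(task_span) -> ScoreResult:
    """Hypothetical task span scoring function: fraction of child spans that are tool calls."""
    child_spans = getattr(task_span, "spans", []) or []  # assumed attribute name
    if not child_spans:
        return ScoreResult(name="tool_usage", value=0.0, reason="No nested spans recorded")
    tool_spans = [s for s in child_spans if getattr(s, "type", None) == "tool"]
    return ScoreResult(name="tool_usage", value=len(tool_spans) / len(child_spans))
```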
Combined Scoring Functions
You can also create scoring functions that use both dataset inputs/outputs AND task span information:
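A sketch of a combined function; again, the span attribute names are assumptions:

```python
from typing import Any, Dict

from opik.evaluation.metrics.score_result import ScoreResult

def combined_score(dataset_item: Dict[str, Any], task_outputs: Dict[str, Any], task_span) -> ScoreResult:
    """Hypothetical combined scoring function using outputs and span information."""
    answered = bool(task_outputs.get("output"))
    has_nested_steps = bool(getattr(task_span, "spans", []))  # assumed attribute name
    return ScoreResult(name="combined_check", value=1.0 if answered and has_nested_steps else 0.0)
```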
Using Task Span Scoring Functions in Evaluation
Task span scoring functions work seamlessly with the evaluation framework:
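For example, reusing the sketches above (again assuming they can be passed via scoring_metrics):

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    # Functions that declare a task_span parameter are detected automatically
    scoring_metrics=[tool_usage_score, combined_score],
)
```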
When you use task span scoring functions, Opik automatically enables span collection and analysis. You don’t need to configure anything special - the system will detect functions with task_span parameters and handle them appropriately.
Task span scoring functions have access to detailed execution information including inputs, outputs, and metadata. Be mindful of sensitive data and ensure your functions handle this information appropriately.
Using task span evaluation metrics
Opik supports advanced evaluation metrics that can analyze the detailed execution information of your LLM tasks. These metrics receive a task_span parameter containing structured data about the task execution, including input, output, metadata, and nested operations.
Task span metrics are particularly useful for evaluating:
- The internal structure and behavior of your LLM applications
- Performance characteristics like execution patterns
- Quality of intermediate steps in complex workflows
- Cost and usage optimization opportunities
- Agent trajectory
Creating task span metrics
To create a task span evaluation metric, define a metric class that accepts a task_span parameter in its score method. The task_span parameter is a SpanModel object that contains detailed information about the task execution:
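A sketch using the SDK's BaseMetric and ScoreResult classes; the attribute used to access nested spans is an assumption:

```python
from opik.evaluation.metrics import base_metric, score_result

class NestedSpanCountMetric(base_metric.BaseMetric):
    """Hypothetical metric that scores a task by how many nested spans it produced."""

    def __init__(self, name: str = "nested_span_count"):
        super().__init__(name=name)

    def score(self, task_span, **ignored_kwargs) -> score_result.ScoreResult:
        child_spans = getattr(task_span, "spans", []) or []  # assumed attribute name
        return score_result.ScoreResult(
            name=self.name,
            value=float(len(child_spans)),
            reason=f"Task produced {len(child_spans)} nested spans",
        )
```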
Using task span metrics in evaluation
Task span metrics work alongside regular evaluation metrics and are automatically detected by the evaluation engine:
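For example, mixing the sketch above with a built-in metric:

```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    # Task span metrics can be mixed with regular output-based metrics
    scoring_metrics=[Hallucination(), NestedSpanCountMetric()],
)
```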
When you use task span metrics, Opik automatically enables span collection and
analysis. You don’t need to configure anything special - the system will
detect metrics with task_span parameters and handle them appropriately.
Accessing span hierarchy
Task spans can contain nested spans representing sub-operations. You can analyze the complete execution hierarchy.
Here’s an example of a tracked function that produces nested spans:
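A sketch; search_web and summarize are hypothetical helper functions:

```python
from opik import track

@track
def search_web(query: str) -> str:
    return f"Results for {query}"

@track
def summarize(text: str) -> str:
    return text[:100]

@track
def research_topic(topic: str) -> str:
    # Each call to a @track-decorated function becomes a nested span
    results = search_web(topic)
    return summarize(results)
```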
When you call research_topic("artificial intelligence"), Opik will create a hierarchy of spans:
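With the sketch above, the root research_topic span would contain two nested child spans, one for search_web and one for summarize.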
You can then analyze this complete execution hierarchy using task span metrics:
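A sketch of such a metric; the recursive traversal assumes each span exposes its children via a spans attribute:

```python
from opik.evaluation.metrics import base_metric, score_result

class HierarchyAnalysisMetric(base_metric.BaseMetric):
    """Hypothetical metric that counts every span in the execution tree."""

    def __init__(self, name: str = "hierarchy_analysis"):
        super().__init__(name=name)

    def _count_spans(self, span) -> int:
        children = getattr(span, "spans", []) or []  # assumed attribute name
        return 1 + sum(self._count_spans(child) for child in children)

    def score(self, task_span, **ignored_kwargs) -> score_result.ScoreResult:
        total = self._count_spans(task_span)
        return score_result.ScoreResult(
            name=self.name,
            value=float(total),
            reason=f"Execution tree contains {total} spans",
        )
```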
For the span hierarchy shown above, the HierarchyAnalysisMetric score would be:
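With the hypothetical counting logic in the sketch above, the root research_topic span plus its two child spans give a value of 3.0.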
Quickly testing task span metrics locally
You can validate a task span metric without running a full evaluation by recording spans locally. The SDK provides a context manager that captures all spans/traces created in the block and exposes them in-memory:
Local recording cannot be nested. If a recording block is already active, entering another will raise an error.
Best practices for task span metrics
- Focus on execution patterns: Use task span metrics to evaluate how your application executes, not just the final output
- Combine with regular metrics: Mix task span metrics with traditional output-based metrics for comprehensive evaluation
- Analyze performance: Leverage timing, cost, and usage information for optimization insights
- Handle missing data gracefully: Always check for None values in optional span attributes
Task span metrics have access to detailed execution information including inputs, outputs, and metadata. Be mindful of sensitive data and ensure your metrics handle this information appropriately.
Accessing logged experiments
You can access all the experiments logged to the platform from the SDK with the get_experiment_by_name method:
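For example (the experiment name is illustrative, and the item-retrieval method name is an assumption for your SDK version):

```python
import opik

client = opik.Opik()

# Fetch a previously logged experiment by its name
experiment = client.get_experiment_by_name("My experiment")

# Retrieve the experiment items for further analysis (method name assumed)
items = experiment.get_items()
print(len(items))
```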