Multi-Metric Optimization

When optimizing AI agents, you often need to balance multiple quality dimensions simultaneously. Multi-metric optimization allows you to combine several evaluation metrics with customizable weights to create a composite objective function.

Why Use Multi-Metric Optimization?

While you can implement metric combinations within a custom metric function, using Opik Optimizer’s MultiMetricObjective API provides additional benefits:

  • Automatic logging of all component metrics to the Opik platform
  • Individual tracking of each sub-metric alongside the composite score
  • Detailed visibility into how each metric contributes to optimization
  • Trial-level insights for both aggregate and individual trace performance

This visibility helps you understand trade-offs between different quality dimensions during optimization.
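For comparison, here is a sketch of the hand-rolled alternative: a hypothetical combined_metric helper built from the two metrics defined later in this guide. Because the sub-scores are blended into a single number inside the function, Opik only ever sees the composite value:

from typing import Any, Dict
from opik.evaluation.metrics import LevenshteinRatio, AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult

def combined_metric(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Hand-rolled composite: sub-scores are folded into one value and not tracked individually."""
    similarity = LevenshteinRatio().score(
        reference=dataset_item["answer"], output=llm_output
    ).value
    relevance = AnswerRelevance().score(
        context=[dataset_item["answer"]],
        output=llm_output,
        input=dataset_item["question"],
    ).value
    return ScoreResult(name="combined_metric", value=0.4 * similarity + 0.6 * relevance)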

Basic Setup

In this guide, we’ll use the HotpotQA dataset to demonstrate multi-metric optimization. The example optimizes a question-answering agent that searches Wikipedia, balancing string-level accuracy against answer relevance.

To use multi-metric optimization, you need to:

  1. Define multiple metric functions
  2. Create a MultiMetricObjective instance from your metric functions and weights
  3. Pass it to your optimizer as the metric to optimize for

Step 1: Define Your Metrics

Create individual metric functions that evaluate different aspects of your agent’s output:

from typing import Any, Dict
from opik.evaluation.metrics import LevenshteinRatio, AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult

def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Measures string similarity between output and reference answer."""
    metric = LevenshteinRatio()
    return metric.score(reference=dataset_item["answer"], output=llm_output)

def answer_relevance_score(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Evaluates how relevant the answer is to the question and context."""
    metric = AnswerRelevance()
    return metric.score(
        context=[dataset_item["answer"]],
        output=llm_output,
        input=dataset_item["question"],
    )
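
Each metric function takes a dataset item and the model output and returns a ScoreResult. You can sanity-check the deterministic metric before starting a run; the sample item below is made up purely for illustration:

# Hypothetical dataset item, used only to verify the metric wiring
sample_item = {"question": "What is the capital of France?", "answer": "Paris"}
score = levenshtein_ratio(sample_item, "Paris")
print(score.value)  # 1.0 for an exact match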

Step 2: Create a Multi-Metric Objective

Combine your metrics with weights using MultiMetricObjective:

import opik_optimizer

multi_metric_objective = opik_optimizer.MultiMetricObjective(
    weights=[0.4, 0.6],
    metrics=[levenshtein_ratio, answer_relevance_score],
    name="my_composite_metric",
)

Understanding Weights:

The weights parameter controls the relative importance of each metric:

  • weights=[0.4, 0.6] → First metric contributes 40%, second contributes 60%
  • Higher weights emphasize those metrics during optimization
  • Weights don’t need to sum to 1; use any values that reflect your priorities (see the sketch below)
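
The exact combination logic lives inside MultiMetricObjective, but conceptually the composite behaves like a weighted sum of the sub-metric scores. A minimal sketch, assuming a plain (unnormalized) weighted sum:

# Conceptual sketch only -- an assumption, not the library's internals
scores = [0.8, 0.5]   # e.g. levenshtein_ratio = 0.8, answer_relevance_score = 0.5
weights = [0.4, 0.6]
composite = sum(w * s for w, s in zip(weights, scores))
print(composite)      # 0.4 * 0.8 + 0.6 * 0.5 = 0.62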

Step 3: Use with Optimizer

Pass the multi-metric objective to your optimizer:

from opik_optimizer.gepa_optimizer import GepaOptimizer

optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",        # model that executes the prompt
    reflection_model="openai/gpt-4o",  # model used for GEPA's reflection step
    project_name="GEPA-Hotpot",
    temperature=0.7,
    max_tokens=400,
)

# `prompt` and `dataset` are defined as in the complete example below
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=multi_metric_objective,  # use the composite metric
    max_metric_calls=60,            # total metric-evaluation budget
    reflection_minibatch_size=5,
    candidate_selection_strategy="best",
    n_samples=12,
)

result.display()

Complete Example

Here’s a full working example using multi-metric optimization for a question-answering task with tool usage:

from typing import Any, Dict
import opik
import opik_optimizer
from opik_optimizer import ChatPrompt
from opik_optimizer.gepa_optimizer import GepaOptimizer
from opik_optimizer.datasets import hotpot_300
from opik_optimizer.utils import search_wikipedia
from opik.evaluation.metrics import LevenshteinRatio, AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult

# Load dataset
dataset = hotpot_300()

# Define metric functions
def levenshtein_ratio(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Measures string similarity between output and reference answer."""
    metric = LevenshteinRatio()
    return metric.score(reference=dataset_item["answer"], output=llm_output)

def answer_relevance_score(dataset_item: Dict[str, Any], llm_output: str) -> ScoreResult:
    """Evaluates how relevant the answer is to the question and context."""
    metric = AnswerRelevance()
    return metric.score(
        context=[dataset_item["answer"]],
        output=llm_output,
        input=dataset_item["question"],
    )

# Define prompt template with Wikipedia search tool
prompt = ChatPrompt(
    system="Answer the question",
    user="{question}",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "search_wikipedia",
                "description": "This function is used to search wikipedia abstracts.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The query parameter is the term or phrase to search for.",
                        },
                    },
                    "required": ["query"],
                },
            },
        },
    ],
    function_map={"search_wikipedia": opik.track(type="tool")(search_wikipedia)},
)

# Create optimizer
optimizer = GepaOptimizer(
    model="openai/gpt-4o-mini",
    reflection_model="openai/gpt-4o",
    project_name="GEPA-Hotpot",
    temperature=0.7,
    max_tokens=400,
)

# Create multi-metric objective
multi_metric_objective = opik_optimizer.MultiMetricObjective(
    weights=[0.4, 0.6],
    metrics=[levenshtein_ratio, answer_relevance_score],
    name="my_composite_metric",
)

# Optimize with multi-metric objective
result = optimizer.optimize_prompt(
    prompt=prompt,
    dataset=dataset,
    metric=multi_metric_objective,
    max_metric_calls=60,
    reflection_minibatch_size=5,
    candidate_selection_strategy="best",
    n_samples=12,
)

result.display()
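
To run this end to end, Opik needs to be configured (for example via the opik configure CLI command), and because the model names are LiteLLM-style identifiers, openai/gpt-4o-mini and openai/gpt-4o expect an OPENAI_API_KEY in your environment. Note that both the agent and the AnswerRelevance judge make LLM calls, so max_metric_calls effectively caps the evaluation cost of a run.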

Viewing Results

When you run multi-metric optimization, Opik tracks and displays both the composite metric and individual sub-metrics throughout the optimization process.

Progress Chart

The optimization progress chart shows how your composite metric and individual metrics evolve over trials:

[Screenshot: multi-metric optimization progress chart]

What you’ll see:

  • Composite metric (my_composite_metric) — The weighted combination of all metrics
  • Individual metrics (levenshtein_ratio, answer_relevance_score) — Each component tracked separately
  • Trial progression — Metric evolution over time

This lets you see not just overall optimization progress, but how each metric contributes to the final score.

Trial Items View

View individual trial items with detailed metric breakdowns:

[Screenshot: trial items with multi-metric scores]

What you’ll see:

  • Composite metric score for each trial
  • Individual scores for each component metric
  • Performance comparison across different prompts

Insights you’ll gain:

  • Which metrics are improving or degrading
  • Trade-offs between different quality dimensions
  • Whether your weight balance is appropriate
