Custom metrics

Use custom metrics when built-in metrics are not enough (domain-specific scoring, precise safety checks, unique multimodal checks). Start with the core Opik evaluation docs so you know what already exists:

  • Evaluation concepts – terminology and lifecycle.
  • Metrics overview – default heuristic metrics (ROUGE, BLEU, Hallucination, etc.).
  • LLM-as-a-judge patterns – how Opik runs judge models against multi-turn traces.

Design principles

  • Deterministic – cache external model calls. Where the model supports it, set temperature to 0 and fix a seed so repeated runs are more likely to match. Not all models guarantee deterministic output even with these settings.
  • Explainable – always set reason on ScoreResult so dashboards can show why a score was given.
  • Composable – wrap helpers into utility modules so multiple optimizers share them.
  • Layered – start with single metrics, then combine them via MultiMetricObjective when you need trade-offs.
  • Cost – account for compute and API costs when your evaluations rely on external model calls.
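
The "Deterministic" and "Explainable" principles can be sketched as follows. This is a minimal illustration, not Opik's implementation: `judge_politeness` is a hypothetical stand-in for an expensive judge call, and the local `ScoreResult` dataclass only mirrors the name, value, and reason fields so the snippet runs without Opik installed.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass
class ScoreResult:
    """Local stand-in (assumption) mirroring the name/value/reason fields."""
    name: str
    value: float
    reason: str = ""


CALLS = {"count": 0}  # tracks how often the "expensive" judge really runs


@lru_cache(maxsize=4096)
def judge_politeness(text: str) -> float:
    """Hypothetical judge. A real one would call a model with
    temperature 0 and a fixed seed where the provider supports it."""
    CALLS["count"] += 1
    return 1.0 if "please" in text.lower() else 0.0


def politeness_metric(dataset_item: dict, llm_output: str) -> ScoreResult:
    # Identical outputs hit the cache, so repeated runs give the same score.
    score = judge_politeness(llm_output)
    reason = "contains 'please'" if score == 1.0 else "no polite marker found"
    return ScoreResult(name="politeness", value=score, reason=reason)
```

Caching on the output text means a repeated evaluation of the same candidate neither costs another call nor introduces new noise. An in-memory cache only lasts for one process, so a persistent cache may be a better fit for long optimization runs.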

Example: safety + completeness metric

```python
from opik.evaluation.metrics import AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult
from some_safety_model import classify_risk  # placeholder for your own safety classifier

# Create the clients once so they are reused across metric calls.
safety_model = classify_risk.Client()
relevance_metric = AnswerRelevance()

def safety_and_completeness(item, output):
    # Judge how well the output answers the question, using the reference answer as context.
    relevance = relevance_metric.score(
        context=[item["answer"]], output=output, input=item["question"]
    )
    safety = safety_model.score(text=output)

    # Pass only when the output is both relevant enough and labeled safe.
    passed = relevance.value > 0.75 and safety["label"] == "safe"
    value = 1.0 if passed else 0.0
    reason = f"Relevant={relevance.value:.2f}, safety={safety['label']}"

    return ScoreResult(name="safety_completeness", value=value, reason=reason)
```

Metric building blocks

  • Single metrics – implement one callable per concern (accuracy, tone, cost). Keep them reusable across prompts.
  • Multi-metric objectives – combine single metrics with weights when you need to balance competing goals, e.g., accuracy (0.7) + style (0.3). See Multi-metric optimization for templates.
  • LLM-as-a-judge – call out to an evaluation model (OpenAI, Anthropic, etc.) inside the metric. Write detailed judge prompts so results stay stable, and remember that reflective optimizers inherit any noise from these judge calls.
  • Heuristics – leverage built-ins from /evaluation/metrics instead of reinventing classic scores. You can compose heuristics with custom logic as shown above.
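
The weighting idea behind multi-metric objectives can be illustrated in plain Python. This is a sketch of the arithmetic only, not the MultiMetricObjective API: `exact_match`, `brevity`, and `weighted_objective` are hypothetical helpers, and the 20/40-word thresholds are arbitrary choices for the example.

```python
from typing import Callable, Sequence

Metric = Callable[[dict, str], float]


def exact_match(dataset_item: dict, llm_output: str) -> float:
    """1.0 when the output equals the reference answer (case-insensitive)."""
    expected = dataset_item["answer"].strip().lower()
    return 1.0 if llm_output.strip().lower() == expected else 0.0


def brevity(dataset_item: dict, llm_output: str) -> float:
    """1.0 up to 20 words, decaying linearly to 0.0 at 40 words."""
    words = len(llm_output.split())
    return max(0.0, min(1.0, (40 - words) / 20))


def weighted_objective(metrics: Sequence[Metric], weights: Sequence[float]) -> Metric:
    """Combine single metrics into one score as a normalized weighted average."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def objective(dataset_item: dict, llm_output: str) -> float:
        weighted = sum(w * m(dataset_item, llm_output) for m, w in zip(metrics, weights))
        return weighted / total

    return objective


# Accuracy weighted 0.7, style (brevity here) weighted 0.3.
objective = weighted_objective([exact_match, brevity], [0.7, 0.3])
```

Keeping each concern as its own callable means the same `exact_match` or `brevity` function can be reused with different weights across prompts.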

Testing

  • Write pytest cases that feed canned dataset items into the metric and assert expected scores.
  • Run metrics against a golden dataset in CI to catch regressions.
  • For multi-metric objectives, add tests that verify weight changes behave as expected (e.g., higher weight increases sensitivity).
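
The testing advice above might look like the following in practice. This is a minimal sketch with plain `assert`-based test functions that pytest would collect; `keyword_coverage`, `combined`, and the canned cases are hypothetical examples, not part of Opik.

```python
def keyword_coverage(dataset_item: dict, llm_output: str) -> float:
    """Fraction of required keywords that appear in the output."""
    keywords = dataset_item["keywords"]
    if not keywords:
        return 1.0
    text = llm_output.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Canned dataset items paired with an output and the score we expect.
CANNED_CASES = [
    ({"keywords": ["refund", "30 days"]}, "Refunds are accepted within 30 days.", 1.0),
    ({"keywords": ["refund", "30 days"]}, "We accept refunds.", 0.5),
    ({"keywords": ["refund", "30 days"]}, "Contact support.", 0.0),
]


def test_keyword_coverage_on_canned_items():
    for item, output, expected in CANNED_CASES:
        assert keyword_coverage(item, output) == expected


def combined(accuracy: float, style: float, accuracy_weight: float) -> float:
    """Two-metric weighted score used to check weight behavior."""
    return accuracy_weight * accuracy + (1 - accuracy_weight) * style


def test_higher_weight_increases_sensitivity():
    # Measure how much the combined score drops when accuracy falls
    # from 1.0 to 0.0 while style is held fixed.
    drop_low = combined(1.0, 0.5, 0.3) - combined(0.0, 0.5, 0.3)
    drop_high = combined(1.0, 0.5, 0.7) - combined(0.0, 0.5, 0.7)
    assert drop_high > drop_low
```

Metrics that call a judge model are harder to test this way. One option is to stub the model call so the test exercises only the scoring logic.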

Related docs

  • Define metrics
  • Evaluation concepts
  • LLM judge workflows
  • Metrics overview