Advanced configuration

Opik’s metrics expose several power-user controls so you can tailor evaluations to your workflows. This guide covers the most common tweaks: asynchronous scoring, evaluator randomness, log-probability handling, and tracking controls.

Asynchronous scoring with ascore

Every built-in metric inherits from BaseMetric, which defines an async counterpart to score named ascore. Use it when you need to run evaluations inside an async pipeline or when the underlying provider (e.g., LangChain, Ragas) requires an event loop.

Awaiting an async metric
```python
import asyncio

from opik.evaluation.metrics import Hallucination

metric = Hallucination()

async def evaluate_async():
    result = await metric.ascore(
        input="What is the capital of France?",
        output="The capital is Berlin.",
    )
    return result

score = asyncio.run(evaluate_async())
print(score.value, score.reason)
```

Within synchronous code you can still call score—Opik will run the async implementation under the hood when needed. When integrating with async frameworks (FastAPI endpoints, streaming agents, or notebooks using nest_asyncio), prefer the explicit await metric.ascore(...) form.
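
One reason to prefer the explicit await form is that several evaluations can then share the event loop. The following is a minimal sketch of that pattern, not Opik code: StubMetric, StubResult, and score_batch are hypothetical stand-ins so the example runs without an API key. In practice you would pass a real metric instance whose ascore accepts the same keyword arguments.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for the result object a metric returns."""
    value: float
    reason: str


class StubMetric:
    """Hypothetical stand-in exposing an async ascore entry point."""

    async def ascore(self, input: str, output: str) -> StubResult:
        await asyncio.sleep(0.01)  # simulates the round trip to a judge model
        flagged = "Berlin" in output
        return StubResult(
            value=1.0 if flagged else 0.0,
            reason="mentions Berlin" if flagged else "no issue found",
        )


async def score_batch(metric, rows):
    # Schedule every ascore call at once; gather preserves input order.
    return await asyncio.gather(
        *(metric.ascore(input=r["input"], output=r["output"]) for r in rows)
    )


rows = [
    {"input": "What is the capital of France?", "output": "The capital is Berlin."},
    {"input": "What is the capital of France?", "output": "The capital is Paris."},
]
results = asyncio.run(score_batch(StubMetric(), rows))
values = [r.value for r in results]
print(values)
```

Inside an already running loop (for example a FastAPI endpoint), call `await score_batch(...)` directly instead of `asyncio.run`.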

Controlling evaluator temperature

GEval-based judges accept a temperature argument. Lower temperatures improve reproducibility by keeping the evaluator close to deterministic; higher values explore more rubric variations and can surface edge cases.

Custom temperature
```python
from opik.evaluation.metrics import ComplianceRiskJudge

deterministic = ComplianceRiskJudge(temperature=0.0)
exploratory = ComplianceRiskJudge(temperature=0.4)
```

Opik caches evaluator chain-of-thought prompts per (task, criteria, model, completion_kwargs) combination. Changing temperature or other LiteLLM keyword arguments (e.g., top_p) produces a fresh cache entry so experiments stay isolated.

Log probabilities and evaluator models

When the LiteLLM backend supports logprobs and top_logprobs, Opik automatically requests them to stabilise GEval scores (mirroring the original paper). If you switch to a model that does not expose log probabilities, the metric still works—the score is computed from the raw judgement only.
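
As a rough illustration of why log probabilities help, the G-Eval approach weights each candidate score by the probability the model assigns to it instead of trusting the single sampled token. The sketch below assumes a simple token-to-logprob mapping and renormalises over the score tokens it sees; weighted_score and its input shape are hypothetical and may differ from what Opik does internally.

```python
import math


def weighted_score(top_logprobs, raw_score):
    """Probability-weighted score over integer score tokens.

    top_logprobs maps candidate tokens to log probabilities (hypothetical
    input shape). Falls back to raw_score when no numeric tokens are present.
    """
    if not top_logprobs:
        return float(raw_score)

    total_prob = 0.0
    expected = 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if not token.isdecimal():
            continue  # ignore non-score tokens such as whitespace or punctuation
        prob = math.exp(logprob)
        total_prob += prob
        expected += prob * int(token)

    if total_prob == 0.0:
        return float(raw_score)
    # Renormalise over the score tokens that were actually observed.
    return expected / total_prob


with_logprobs = weighted_score(
    {"8": math.log(0.6), "7": math.log(0.3), "\n": math.log(0.1)},
    raw_score=8,
)
without_logprobs = weighted_score(None, raw_score=8)
print(round(with_logprobs, 3), without_logprobs)
```

With probabilities available, the result lands between the candidate scores; without them, the function returns the raw judgement unchanged.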

You can inspect the evaluator’s capabilities at runtime:

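```python
from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(model="gpt-4o-mini")
print("logprobs" in metric._model.supported_params)  # _model is a private attribute
```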

If you need to propagate additional LiteLLM options (for example, response_format or frequency_penalty), instantiate LiteLLMChatModel manually and pass it to the metric:

Custom LiteLLM configuration
```python
from opik.evaluation.models.litellm import LiteLLMChatModel
from opik.evaluation.metrics import Hallucination

custom_provider = LiteLLMChatModel(
    model_name="gpt-4o-mini",
    temperature=0.2,
    frequency_penalty=0.3,
)

metric = Hallucination(model=custom_provider)
```

Because the model fingerprint is part of the cache key, changing these kwargs forces a new evaluator rubric to be generated.

Tracking controls

Most metrics accept track and project_name keyword arguments so you can decide whether each run writes to Opik and which project it belongs to:

```python
from opik.evaluation.metrics import DialogueHelpfulnessJudge
metric = DialogueHelpfulnessJudge(track=False)
```

When track=False is set on an LLM judge metric (such as Hallucination, AnswerRelevance, or DialogueHelpfulnessJudge), it disables tracing for both the metric’s score method and the underlying LLM model calls used by the judge. This ensures consistent tracking behavior—if you disable tracking for a metric, all related LLM calls are also excluded from traces.

Disable tracking when running quick, ad-hoc experiments locally, or set project_name="llm-migration" to group evaluations by initiative.
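
One way to switch between those two modes is to build the keyword arguments in a single place and unpack them into the metric constructor. This is a small sketch under that assumption; tracking_kwargs is a hypothetical helper, not part of Opik.

```python
def tracking_kwargs(ad_hoc, project_name=None):
    """Build the tracking keyword arguments for a metric constructor."""
    if ad_hoc:
        # Quick local experiment: do not write anything to Opik.
        return {"track": False}
    kwargs = {}
    if project_name is not None:
        kwargs["project_name"] = project_name
    return kwargs


local_kwargs = tracking_kwargs(ad_hoc=True)
migration_kwargs = tracking_kwargs(ad_hoc=False, project_name="llm-migration")
print(local_kwargs, migration_kwargs)
# A metric could then be constructed as, for example:
#   DialogueHelpfulnessJudge(**migration_kwargs)
```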