Advanced configuration

Opik’s metrics expose several power-user controls so you can tailor evaluations to your workflows. This guide covers the most common tweaks: asynchronous scoring, evaluator randomness, and log-probability handling.

Asynchronous scoring with `ascore`

Every built-in metric inherits from BaseMetric, which defines ascore, an async counterpart to score. Use it when evaluations run inside an async pipeline or when the underlying provider (e.g., LangChain, Ragas) requires an event loop.

Awaiting an async metric

import asyncio

from opik.evaluation.metrics import Hallucination

metric = Hallucination()

async def evaluate_async():
    result = await metric.ascore(
        input="What is the capital of France?",
        output="The capital is Berlin.",
    )
    return result

score = asyncio.run(evaluate_async())
print(score.value, score.reason)

Within synchronous code you can still call score; Opik runs the async implementation under the hood when needed. When integrating with async frameworks (FastAPI endpoints, streaming agents, or notebooks using nest_asyncio), prefer the explicit await metric.ascore(...) form.
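When many samples need scoring inside an async pipeline, the awaitable form can be combined with asyncio.gather and a semaphore to bound concurrency. The following is a minimal sketch, not part of Opik: StubMetric and StubResult are hypothetical stand-ins so the example runs without a provider key, and score_many is an illustrative helper name. In practice you would pass a real metric such as Hallucination() instead of the stub.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for a metric result (value + reason)."""
    value: float
    reason: str


class StubMetric:
    """Hypothetical stand-in exposing an async `ascore` entry point."""

    async def ascore(self, input: str, output: str) -> StubResult:
        await asyncio.sleep(0)  # simulates awaiting a remote judge call
        matched = input.lower() in output.lower()
        return StubResult(value=1.0 if matched else 0.0, reason="stub comparison")


async def score_many(metric, samples, max_concurrency=4):
    """Score all samples concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(sample):
        async with semaphore:
            return await metric.ascore(**sample)

    # gather returns results in the same order as `samples`
    return await asyncio.gather(*(score_one(s) for s in samples))


samples = [
    {"input": "paris", "output": "The capital is Paris."},
    {"input": "paris", "output": "The capital is Berlin."},
]
results = asyncio.run(score_many(StubMetric(), samples))
print([r.value for r in results])
```

Bounding concurrency with a semaphore is a design choice that keeps the number of in-flight judge calls below provider rate limits; adjust max_concurrency to suit your quota.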

Controlling evaluator temperature

GEval-based judges accept a temperature argument. Lower temperatures improve reproducibility by keeping the evaluator's output close to deterministic; higher values explore more rubric variations and can surface edge cases.

Custom temperature

from opik.evaluation.metrics import ComplianceRiskJudge

deterministic = ComplianceRiskJudge(temperature=0.0)
exploratory = ComplianceRiskJudge(temperature=0.4)

Opik caches evaluator chain-of-thought prompts per (task, criteria, model, completion_kwargs) combination. Changing temperature or other LiteLLM keyword arguments (e.g., top_p) produces a fresh cache entry so experiments stay isolated.
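To make the cache behaviour concrete, here is a minimal sketch of a cache keyed on the (task, criteria, model, completion_kwargs) combination described above. It is an illustration under assumptions, not Opik's actual implementation: cache_key, get_rubric, and fake_build are hypothetical names, and serialising the keyword arguments as sorted JSON is just one way to make them hashable.

```python
import json

_rubric_cache = {}


def cache_key(task, criteria, model, completion_kwargs):
    """Build a hashable key; kwargs are serialised with sorted keys
    so that argument order does not affect the key."""
    return (task, criteria, model, json.dumps(completion_kwargs, sort_keys=True))


def get_rubric(task, criteria, model, completion_kwargs, build_rubric):
    """Return the cached rubric for this combination, building it on a miss."""
    key = cache_key(task, criteria, model, completion_kwargs)
    if key not in _rubric_cache:
        _rubric_cache[key] = build_rubric()
    return _rubric_cache[key]


calls = []


def fake_build():
    """Hypothetical builder that records how often it is invoked."""
    calls.append(1)
    return f"rubric #{len(calls)}"


a = get_rubric("risk", "compliance", "gpt-4o-mini", {"temperature": 0.0}, fake_build)
b = get_rubric("risk", "compliance", "gpt-4o-mini", {"temperature": 0.0}, fake_build)  # cache hit
c = get_rubric("risk", "compliance", "gpt-4o-mini", {"temperature": 0.4}, fake_build)  # new entry
print(a, b, c)
```

The second call reuses the first entry, while the changed temperature triggers a fresh build, which is the isolation property the paragraph above describes.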

Log probabilities and evaluator models

When the LiteLLM backend supports logprobs and top_logprobs, Opik automatically requests them to stabilise GEval scores (mirroring the original paper). If you switch to a model that does not expose log probabilities, the metric still works, but the score is computed from the raw judgement only.
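The idea of stabilising a score with log probabilities can be sketched as a probability-weighted average over the candidate score tokens, with a fallback to the raw judgement when no log probabilities are available. This is a hedged illustration of the general technique, not Opik's exact computation: weighted_score, its parameters, and the 0 to 10 score range are assumptions made for the example.

```python
import math


def weighted_score(top_logprobs, raw_score, min_score=0, max_score=10):
    """Return the probability-weighted expected score.

    `top_logprobs` maps candidate tokens to log probabilities
    (assumed shape). Falls back to `raw_score` when it is missing
    or contains no valid score tokens.
    """
    if not top_logprobs:
        return float(raw_score)

    weights = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        # Keep only tokens that parse as an in-range integer score
        if token.isdecimal() and min_score <= int(token) <= max_score:
            score = int(token)
            weights[score] = weights.get(score, 0.0) + math.exp(logprob)

    total = sum(weights.values())
    if total == 0:
        return float(raw_score)
    # Normalise so the weights over valid score tokens sum to 1
    return sum(score * w for score, w in weights.items()) / total


with_logprobs = weighted_score({"8": math.log(0.75), "7": math.log(0.25)}, raw_score=8)
without_logprobs = weighted_score(None, raw_score=8)
print(with_logprobs, without_logprobs)
```

With log probabilities the result lands between the candidate scores in proportion to their likelihood; without them the raw judgement is returned unchanged.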

You can inspect the evaluator’s capabilities at runtime:

from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(model="gpt-4o-mini")
print("logprobs" in metric._model.supported_params)

If you need to propagate additional LiteLLM options (for example, response_format or frequency_penalty), instantiate LiteLLMChatModel manually and pass it to the metric:

Custom LiteLLM configuration

from opik.evaluation.models.litellm import LiteLLMChatModel
from opik.evaluation.metrics import Hallucination

custom_provider = LiteLLMChatModel(
    model_name="gpt-4o-mini",
    temperature=0.2,
    frequency_penalty=0.3,
)

metric = Hallucination(model=custom_provider)

Because the model fingerprint is part of the cache key, changing these kwargs forces a new evaluator rubric to be generated.

Tracking controls

Most metrics accept track and project_name keyword arguments so you can decide whether each run writes to Opik and which project it belongs to:

from opik.evaluation.metrics import DialogueHelpfulnessJudge
metric = DialogueHelpfulnessJudge(track=False)

When track=False is set on an LLM judge metric (such as Hallucination, AnswerRelevance, or DialogueHelpfulnessJudge), it disables tracing for both the metric’s score method and the underlying LLM model calls used by the judge. Tracking therefore stays consistent: once it is disabled for a metric, all related LLM calls are excluded from traces as well.

Disable tracking when running quick, ad-hoc experiments locally, or set project_name="llm-migration" to group evaluations by initiative.
