# Agent Tool Correctness Judge

`AgentToolCorrectnessJudge` checks whether an agent called the right tools with valid arguments and interpreted their outputs accurately. It is invaluable for diagnosing production agents that orchestrate APIs, databases, or internal services.

## Inspect tool usage
```python
from opik.evaluation.metrics import AgentToolCorrectnessJudge

payload = """TOOL weather_api(city='Paris') -> 12°C and raining.
AGENT: Responded "Sunny and warm".
"""

metric = AgentToolCorrectnessJudge()
score = metric.score(output=payload)

print(score.value)   # 0.0–1.0 after normalisation
print(score.reason)
```
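The payload is plain free-form text, so if your agent traces live as structured records, a small helper can render them into the `TOOL ... -> ...` / `AGENT: ...` layout used above. A minimal sketch (`build_payload` is a hypothetical helper, not part of Opik):

```python
def build_payload(tool_calls: list[tuple[str, str]], agent_response: str) -> str:
    """Assemble a judge payload from (call, result) pairs and the agent's reply.

    Hypothetical helper: the judge accepts any free-text description of the
    trace, so any consistent layout works; this one mirrors the example above.
    """
    lines = [f"TOOL {call} -> {result}" for call, result in tool_calls]
    lines.append(f'AGENT: Responded "{agent_response}".')
    return "\n".join(lines)


payload = build_payload(
    [("weather_api(city='Paris')", "12°C and raining.")],
    "Sunny and warm",
)
print(payload)
```

Keeping the rendering in one place makes it easy to feed the same traces to other trace-level judges later.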

## Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| `output` | `str` | Yes | Payload describing the task, tool calls, and observed behaviour. |

## Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| `model` | `gpt-5-nano` | Upgrade to a larger evaluator when analysing lengthy traces. |
| `temperature` | `0.0` | Keep low for repeatable scoring. |
| `track` | `True` | Controls whether results are tracked in Opik. |
| `project_name` | `None` | Override the logging destination. |

The judge emits an integer between 0 and 10, which Opik scales to 0–1; read `score.reason` to pinpoint incorrect calls, missing validations, or misinterpreted outputs.
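The scaling described above can be sketched as a simple division. This is an assumption about how the raw verdict maps to the reported value (Opik's actual implementation may differ), shown here only to make the two scales concrete:

```python
# Sketch of the scoring scale: assumes the judge's raw integer verdict
# (0-10) becomes the reported 0.0-1.0 value by dividing by ten.
def normalise(raw: int) -> float:
    if not 0 <= raw <= 10:
        raise ValueError("raw judge scores are integers from 0 to 10")
    return raw / 10


print(normalise(8))  # 0.8
```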