Agent Tool Correctness Judge

AgentToolCorrectnessJudge checks whether an agent called the right tools with valid arguments and interpreted their outputs accurately. It is useful for diagnosing production agents that orchestrate APIs, databases, or internal services.

Inspect tool usage

from opik.evaluation.metrics import AgentToolCorrectnessJudge

# Trace to evaluate: the tool call with its result, then the agent's response.
payload = """TOOL weather_api(city='Paris') -> 12°C and raining.
AGENT: Responded "Sunny and warm".
"""

metric = AgentToolCorrectnessJudge()
score = metric.score(output=payload)

print(score.value)   # 0.0–1.0 after normalisation
print(score.reason)  # the judge's explanation for the score

Inputs

Argument	Type	Required	Description
`output`	`str`	Yes	Payload describing the task, tool calls, and observed behaviour.
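
Because `output` is a single string, structured trace data has to be flattened before scoring. The sketch below shows one possible way to do that. `ToolCall`, `format_payload`, and the `TASK:` line are hypothetical names and conventions introduced here for illustration; only the `TOOL ... -> ...` and `AGENT: ...` lines mirror the example above, and nothing here implies that the judge requires this exact layout.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation as recorded in a trace (hypothetical helper type)."""
    name: str
    arguments: dict
    result: str


def format_payload(task: str, calls: list[ToolCall], agent_response: str) -> str:
    """Flatten a task, its tool calls, and the agent's reply into one string."""
    lines = [f"TASK: {task}"]  # assumed convention, not part of the original example
    for call in calls:
        # repr() keeps string arguments quoted, e.g. city='Paris'
        args = ", ".join(f"{key}={value!r}" for key, value in call.arguments.items())
        lines.append(f"TOOL {call.name}({args}) -> {call.result}")
    lines.append(f'AGENT: Responded "{agent_response}".')
    return "\n".join(lines) + "\n"


payload = format_payload(
    task="Report the current weather in Paris.",
    calls=[ToolCall("weather_api", {"city": "Paris"}, "12°C and raining.")],
    agent_response="Sunny and warm",
)
print(payload)
```

The resulting string can then be passed as `metric.score(output=payload)`.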

Configuration

Parameter	Default	Notes
`model`	`gpt-5-nano`	Upgrade to a larger evaluator if analysing lengthy traces.
`temperature`	`0.0`	Keep low for repeatable scoring.
`track`	`True`	Controls Opik tracking.
`project_name`	`None`	Override logging destination.
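
As a rough sketch of how these settings might be overridden, the helper below merges caller overrides into the defaults from the table and rejects unknown names. It assumes the parameters are accepted as keyword arguments by the `AgentToolCorrectnessJudge` constructor, which the table suggests but does not spell out. `judge_kwargs`, `make_judge`, and the project name `agent-tool-audits` are hypothetical.

```python
# Defaults copied from the configuration table above.
DEFAULTS = {
    "model": "gpt-5-nano",
    "temperature": 0.0,
    "track": True,
    "project_name": None,
}


def judge_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown judge parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def make_judge(**overrides):
    """Build the judge; assumes the constructor takes these keyword arguments."""
    # Imported lazily so the helper above can be used without opik installed.
    from opik.evaluation.metrics import AgentToolCorrectnessJudge

    return AgentToolCorrectnessJudge(**judge_kwargs(**overrides))


# Example: log to a dedicated (hypothetical) project, keeping the other defaults.
config = judge_kwargs(project_name="agent-tool-audits")
print(config)
```

Calling `make_judge(project_name="agent-tool-audits")` would then construct the metric with those settings, provided the constructor signature matches the assumption above.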

The judge emits an integer between 0 and 10 (scaled to 0–1 by Opik); read score.reason to pinpoint incorrect calls, missing validations, or misinterpreted outputs.
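
To make the scaling concrete, here is a minimal sketch that assumes the 0–10 integer is mapped to 0–1 by simple division by 10; the text above states the ranges but not the exact formula. `normalise_judge_score`, `needs_review`, and the 0.7 threshold are hypothetical and not part of Opik.

```python
def normalise_judge_score(raw: int) -> float:
    """Map a 0-10 judge rating onto 0.0-1.0 (assumes linear scaling)."""
    if not 0 <= raw <= 10:
        raise ValueError(f"Judge rating must be between 0 and 10, got {raw}")
    return raw / 10


def needs_review(value: float, threshold: float = 0.7) -> bool:
    """Flag a normalised score that falls below a chosen (hypothetical) threshold."""
    return value < threshold


value = normalise_judge_score(2)
print(value, needs_review(value))
```

A check like `needs_review(score.value)` could be used to decide which traces warrant reading `score.reason` in detail.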