AgentToolCorrectnessJudge checks whether an agent called the right tools, passed valid arguments, and interpreted the tool outputs accurately. It is most useful for diagnosing production agents that orchestrate APIs, databases, or internal services.
The judge emits an integer from 0 to 10, which Opik scales to the 0–1 range. Read score.reason to pinpoint incorrect calls, missing validations, or misinterpreted outputs.
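The scoring behavior described above can be illustrated with a minimal, self-contained sketch. It does not call Opik: `JudgeScore`, `scale_raw_score`, `needs_review`, the `value` field, and the 0.7 threshold are hypothetical names and values chosen for illustration, and dividing by 10 is an assumption about how the raw score maps onto 0–1. Only `reason` comes from the text above.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the result object returned by the judge.
# Only `reason` is named in the text; `value` is an assumed field name.
@dataclass
class JudgeScore:
    value: float  # score already scaled to the 0-1 range
    reason: str   # the judge's explanation of what went wrong (or right)


def scale_raw_score(raw: int) -> float:
    """Map a raw 0-10 integer onto 0-1 (assumed to be a plain division by 10)."""
    if not 0 <= raw <= 10:
        raise ValueError(f"raw score must be in 0..10, got {raw}")
    return raw / 10


def needs_review(score: JudgeScore, threshold: float = 0.7) -> bool:
    """Flag a run for manual inspection when the scaled score is below the threshold."""
    return score.value < threshold


# Example: a raw score of 4 for a run where the agent omitted a required argument.
score = JudgeScore(
    value=scale_raw_score(4),
    reason="Called search_orders without the required customer_id argument.",
)

if needs_review(score):
    # Surface the judge's explanation so the faulty tool call can be traced.
    print(f"Tool-use issue (score {score.value:.1f}): {score.reason}")
```

In practice the threshold would be tuned per agent; the point of the sketch is that the scaled value decides which runs to inspect, while the reason text tells you what to fix.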