Agent tool correctness

Agent Tool Correctness Judge

AgentToolCorrectnessJudge checks whether an agent called the right tools with valid arguments and interpreted their outputs accurately. It is useful for diagnosing production agents that orchestrate APIs, databases, or internal services.

Inspect tool usage

from opik.evaluation.metrics import AgentToolCorrectnessJudge

# Transcript of one tool call and the agent's (contradictory) reply.
payload = """TOOL weather_api(city='Paris') -> 12°C and raining.
AGENT: Responded "Sunny and warm".
"""

metric = AgentToolCorrectnessJudge()
score = metric.score(output=payload)

print(score.value)   # 0.0–1.0 after normalisation
print(score.reason)  # the judge's explanation of the score

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| output | str | Yes | Payload describing the task, tool calls, and observed behaviour. |
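
Because `output` is a single string, structured traces have to be flattened before scoring. The sketch below shows one possible way to do that; `ToolCall` and `render_payload` are hypothetical helpers (not part of Opik), and the `TASK` / `TOOL` / `AGENT` line layout simply mirrors the example above rather than any format the judge is known to require.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation (not an Opik type)."""
    name: str
    arguments: dict = field(default_factory=dict)
    result: str = ""


def render_payload(task: str, calls: list[ToolCall], agent_reply: str) -> str:
    """Flatten a task, its tool calls, and the agent's reply into one string."""
    lines = [f"TASK: {task}"]
    for call in calls:
        # Render arguments as key=value pairs, e.g. city='Paris'.
        args = ", ".join(f"{key}={value!r}" for key, value in call.arguments.items())
        lines.append(f"TOOL {call.name}({args}) -> {call.result}")
    lines.append(f"AGENT: Responded {agent_reply!r}.")
    return "\n".join(lines)


payload = render_payload(
    task="Report the current weather in Paris.",
    calls=[ToolCall("weather_api", {"city": "Paris"}, "12°C and raining.")],
    agent_reply="Sunny and warm",
)
print(payload)
```

The resulting string can then be passed as `metric.score(output=payload)`.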

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| model | gpt-5-nano | Upgrade to a larger evaluator if analysing lengthy traces. |
| temperature | 0.0 | Keep low for repeatable scoring. |
| track | True | Controls Opik tracking. |
| project_name | None | Override logging destination. |
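
As a minimal sketch, the parameters above could be collected in one place and passed to the judge as keyword arguments. This assumes the constructor accepts them under exactly the names listed in the table; the project name is a made-up placeholder, and `build_judge` is a hypothetical wrapper that imports Opik lazily so the settings can be inspected without the package installed.

```python
# Keyword arguments mirroring the parameter names in the Configuration table.
judge_kwargs = {
    "model": "gpt-5-nano",                # documented default; use a larger evaluator for long traces
    "temperature": 0.0,                   # keep low for repeatable scoring
    "track": True,                        # log judge calls to Opik
    "project_name": "agent-tool-audits",  # hypothetical project name
}


def build_judge():
    """Create the judge from judge_kwargs (assumes these are constructor keywords)."""
    from opik.evaluation.metrics import AgentToolCorrectnessJudge

    return AgentToolCorrectnessJudge(**judge_kwargs)
```

Calling `build_judge()` requires the `opik` package and credentials for the evaluator model.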

The judge emits an integer between 0 and 10 (scaled to 0–1 by Opik); read score.reason to pinpoint incorrect calls, missing validations, or misinterpreted outputs.
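
To make the scaling concrete, the sketch below assumes the 0–10 integer is mapped to 0–1 by simple division by 10; the page does not state the exact formula, so treat `normalise` as an illustration rather than Opik's implementation. `needs_review` and its 0.5 threshold are likewise hypothetical, shown only as one way to flag low scores for a closer look at `score.reason`.

```python
def normalise(raw_score: int) -> float:
    """Map the judge's 0-10 integer onto 0.0-1.0 (assumed linear scaling)."""
    if not 0 <= raw_score <= 10:
        raise ValueError(f"raw score must be between 0 and 10, got {raw_score}")
    return raw_score / 10


def needs_review(value: float, threshold: float = 0.5) -> bool:
    """Flag normalised scores below a (hypothetical) review threshold."""
    return value < threshold


for raw in (0, 3, 10):
    value = normalise(raw)
    print(raw, value, needs_review(value))
```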