Agent task completion


Agent Task Completion Judge

AgentTaskCompletionJudge reviews an agent run (often a natural-language summary of what happened) and decides whether the high-level objective was met. It is particularly helpful for multi-step agents where success cannot be inferred from the final response alone.

Did the agent finish the job?
from opik.evaluation.metrics import AgentTaskCompletionJudge

metric = AgentTaskCompletionJudge()

payload = """TASK: Extract company name, address, and tax ID from the invoice.
OUTCOME: Agent retrieved company name and address but failed to extract the tax ID.
"""

score = metric.score(output=payload)

print(score.value)  # 0.0–1.0 after normalisation
print(score.reason)

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| output | str | Yes | Payload describing the task, evidence, and outcome for the judge. |
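
Since `output` is a single string, it can help to assemble the payload from its parts. The sketch below uses a hypothetical helper, `build_payload`, which is not part of Opik. It reproduces the `TASK:` / `OUTCOME:` layout from the example above; those labels are taken from that example and are assumed to be a convention, not a schema the judge requires.

```python
def build_payload(task: str, outcome: str) -> str:
    """Format a task and its observed outcome as one judge payload string.

    Hypothetical helper: the TASK/OUTCOME labels mirror the example payload
    on this page and are an assumption, not a documented requirement.
    """
    return f"TASK: {task.strip()}\nOUTCOME: {outcome.strip()}\n"


payload = build_payload(
    "Extract company name, address, and tax ID from the invoice.",
    "Agent retrieved company name and address but failed to extract the tax ID.",
)
print(payload)
```

The resulting string can then be passed as `metric.score(output=payload)`.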

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| model | gpt-5-nano | Switch to heavier evaluators for complex workflows. |
| temperature | 0.0 | Increase slightly if you want more creative feedback. |
| track | True | Toggle evaluation logging. |
| project_name | None | Override project for logging. |
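
A minimal sketch of how these settings might be supplied, assuming each parameter in the table is accepted as a keyword argument by the `AgentTaskCompletionJudge` constructor. The project name `invoice-agent-evals` and the helper `build_judge` are illustrative and not part of Opik.

```python
# Settings taken from the configuration table; only `track` and
# `project_name` differ from the listed defaults.
judge_config = {
    "model": "gpt-5-nano",                  # default model from the table
    "temperature": 0.0,                     # default temperature from the table
    "track": False,                         # turn evaluation logging off
    "project_name": "invoice-agent-evals",  # hypothetical project name
}


def build_judge(config: dict):
    """Create the judge from a settings dict.

    Assumes the constructor accepts the table's parameters as keyword
    arguments. The import is deferred so this module loads without opik.
    """
    from opik.evaluation.metrics import AgentTaskCompletionJudge

    return AgentTaskCompletionJudge(**config)
```

Calling `build_judge(judge_config)` requires the `opik` package and credentials for the chosen model.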

The evaluator returns an integer between 0 and 10; Opik divides it by 10 so score.value falls in the 0.0–1.0 range, while score.reason summarises which sub-tasks were completed or missed.
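
The normalisation described above amounts to a single division. The following sketch illustrates the arithmetic only; `normalise_judge_score` is a hypothetical name and is not how Opik necessarily implements it internally.

```python
def normalise_judge_score(raw: int) -> float:
    """Map a raw judge score on the 0-10 integer scale to the 0.0-1.0 range."""
    if not 0 <= raw <= 10:
        raise ValueError(f"raw score must be between 0 and 10, got {raw}")
    return raw / 10


# A raw score of 7 out of 10 becomes 0.7.
print(normalise_judge_score(7))
```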