Agent Task Completion Judge

AgentTaskCompletionJudge reviews an agent run (often a natural-language summary of what happened) and decides whether the high-level objective was met. It is particularly helpful for multi-step agents where success cannot be inferred from the final response alone.

Did the agent finish the job?

```python
from opik.evaluation.metrics import AgentTaskCompletionJudge

metric = AgentTaskCompletionJudge()

payload = """TASK: Extract company name, address, and tax ID from the invoice.
OUTCOME: Agent retrieved company name and address but failed to extract the tax ID.
"""

score = metric.score(output=payload)

print(score.value)   # 0.0–1.0 after normalisation
print(score.reason)
```

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| `output` | `str` | Yes | Payload describing the task, evidence, and outcome for the judge. |

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| `model` | `gpt-5-nano` | Switch to heavier evaluators for complex workflows. |
| `temperature` | `0.0` | Increase slightly if you want more creative feedback. |
| `track` | `True` | Toggle evaluation logging. |
| `project_name` | `None` | Override the project used for logging. |
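Putting the table together, a sketch of a customised judge. The parameter names come from the table above; the model identifier and project name are illustrative assumptions, not defaults:

```python
from opik.evaluation.metrics import AgentTaskCompletionJudge

# Heavier evaluator model and a dedicated logging project.
# "gpt-4o" and "agent-evals" are example values, not Opik defaults.
metric = AgentTaskCompletionJudge(
    model="gpt-4o",              # assumption: any model identifier your provider setup supports
    temperature=0.0,
    track=True,
    project_name="agent-evals",  # hypothetical project name
)
```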

The evaluator returns an integer between 0 and 10; Opik divides it by 10 so `score.value` falls in the 0.0–1.0 range, while `score.reason` summarises which sub-tasks were completed or missed.
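The normalisation is plain division, so gating a run on the judge's verdict is straightforward. A sketch, where the 0.7 pass threshold is an arbitrary example, not an Opik convention:

```python
# The judge emits a raw integer in 0-10; Opik stores raw / 10 as score.value.
raw_score = 7
value = raw_score / 10  # normalised to the 0.0-1.0 range

PASS_THRESHOLD = 0.7  # arbitrary example threshold, tune per workflow
passed = value >= PASS_THRESHOLD
print(f"value={value}, passed={passed}")
```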