Agent Task Completion Judge

AgentTaskCompletionJudge reviews an agent run (often a natural-language summary of what happened) and decides whether the high-level objective was met. It is particularly helpful for multi-step agents where success cannot be inferred from the final response alone.

Did the agent finish the job?

```python
from opik.evaluation.metrics import AgentTaskCompletionJudge

metric = AgentTaskCompletionJudge()

payload = """TASK: Extract company name, address, and tax ID from the invoice.
OUTCOME: Agent retrieved company name and address but failed to extract the tax ID.
"""

score = metric.score(output=payload)

print(score.value)   # 0.0–1.0 after normalisation
print(score.reason)
```

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| `output` | `str` | Yes | Payload describing the task, evidence, and outcome for the judge. |

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| `model` | `gpt-5-nano` | Switch to heavier evaluators for complex workflows. |
| `temperature` | `0.0` | Increase slightly if you want more creative feedback. |
| `track` | `True` | Toggle evaluation logging. |
| `project_name` | `None` | Override the project used for logging. |
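Putting the table together, a sketch of a customised judge. The parameter names come from the table above; the model identifier and project name are illustrative assumptions, not defaults:

```python
from opik.evaluation.metrics import AgentTaskCompletionJudge

# Heavier evaluator model and a dedicated logging project.
# "gpt-4o" and "agent-evals" are example values, not Opik defaults.
metric = AgentTaskCompletionJudge(
    model="gpt-4o",              # assumption: any model identifier your provider setup supports
    temperature=0.0,
    track=True,
    project_name="agent-evals",  # hypothetical project name
)
```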

The evaluator returns an integer between 0 and 10; Opik divides it by 10 so `score.value` falls in the 0.0–1.0 range, while `score.reason` summarises which sub-tasks were completed or missed.
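The normalisation is plain division, so gating a run on the judge's verdict is straightforward. A sketch, where the 0.7 pass threshold is an arbitrary example, not an Opik convention:

```python
# The judge emits a raw integer in 0-10; Opik stores raw / 10 as score.value.
raw_score = 7
value = raw_score / 10  # normalised to the 0.0-1.0 range

PASS_THRESHOLD = 0.7  # arbitrary example threshold, tune per workflow
passed = value >= PASS_THRESHOLD
print(f"value={value}, passed={passed}")
```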