Summarization Consistency Judge

SummarizationConsistencyJudge compares a generated summary with the original document (or transcript) and scores how faithfully key facts were preserved. It follows the GEval method: expanding your instructions into a chain-of-thought rubric, then grading on a 0.0–1.0 scale (derived from a raw 0–10 judgement) with detailed explanations.

Use it when you automatically summarise support tickets, research reports, or call transcripts and want to catch hallucinations before they reach end users.

Checking summary faithfulness
from opik.evaluation.metrics import SummarizationConsistencyJudge

metric = SummarizationConsistencyJudge(model="gpt-4o")

payload = """CONTEXT: Acme's Q2 revenue grew 12% thanks to the launch of Product Vega.
CONTEXT: Operating margin declined to 14% because of R&D hiring.
SUMMARY: Acme's revenue was flat but margins improved due to new hires.
"""

score = metric.score(output=payload)

print(score.value)  # 0.0–1.0 after normalisation
print(score.reason)
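
In this example the summary contradicts both context lines (revenue is described as flat rather than up 12%, and margins as improved rather than declining), so you should expect a value close to 0.0 and a reason that points at the mismatched facts.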

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| input | str | Optional | Source document or context. |
| output | str | Yes | Payload combining the source material and the candidate summary. |
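
The output payload is a single newline-separated string. Below is a minimal sketch of a helper that assembles it in the same CONTEXT/SUMMARY layout used in the example above; build_summary_payload is a hypothetical name, not part of the Opik API.

def build_summary_payload(context_lines, summary):
    # Hypothetical helper: prefix each source sentence with CONTEXT and
    # append the candidate summary under SUMMARY, mirroring the example payload.
    parts = [f"CONTEXT: {line}" for line in context_lines]
    parts.append(f"SUMMARY: {summary}")
    return "\n".join(parts)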

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| model | gpt-5-nano | Swap to a larger evaluator for longer or more technical content. |
| temperature | 0.0 | Keep low for deterministic scoring; raise slightly to sample different critiques. |
| track | True | Disable to skip sending traces to Opik. |
| project_name | None | Override when logging scores. |
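
A minimal configuration sketch using the parameters above; the exact constructor signature is an assumption based on this table, and the project name is only an example.

metric = SummarizationConsistencyJudge(
    model="gpt-4o",                   # larger evaluator for long or technical content
    temperature=0.0,                  # keep scoring deterministic
    track=True,                       # send traces to Opik
    project_name="summarisation-qa",  # example project to log scores under
)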

The evaluator emits an integer between 0 and 10 that Opik normalises to 0.0–1.0; the reason field captures the rubric notes explaining the judgement.
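
If you gate summaries before publishing, you can threshold the normalised value; the 0.7 cut-off below is an arbitrary example rather than a recommended setting.

result = metric.score(output=payload)

if result.value < 0.7:  # example threshold, tune for your content
    raise ValueError(f"Summary failed the faithfulness check: {result.reason}")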