Summarization Consistency Judge

SummarizationConsistencyJudge compares a generated summary with the original document (or transcript) and scores how faithfully key facts were preserved. It follows the GEval method: expanding your instructions into a chain-of-thought rubric, then grading on a 0.0–1.0 scale (derived from a raw 0–10 judgement) with detailed explanations.

Use it when you automatically summarise support tickets, research reports, or call transcripts and want to catch hallucinations before they reach end users.

Checking summary faithfulness
from opik.evaluation.metrics import SummarizationConsistencyJudge

metric = SummarizationConsistencyJudge(model="gpt-4o")

payload = """CONTEXT: Acme's Q2 revenue grew 12% thanks to the launch of Product Vega.
CONTEXT: Operating margin declined to 14% because of R&D hiring.
SUMMARY: Acme's revenue was flat but margins improved due to new hires.
"""

score = metric.score(output=payload)

print(score.value)  # 0.0–1.0 after normalisation
print(score.reason)
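
In this example the summary contradicts both context lines (revenue is described as flat rather than up 12%, and margins as improved rather than declining), so you should expect a value close to 0.0 and a reason that points at the mismatched facts.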

Inputs

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| input | str | Optional | Source document or context. |
| output | str | Yes | Payload combining the source material and the candidate summary. |
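
The output payload is a single newline-separated string. Below is a minimal sketch of a helper that assembles it in the same CONTEXT/SUMMARY layout used in the example above; build_summary_payload is a hypothetical name, not part of the Opik API.

def build_summary_payload(context_lines, summary):
    # Hypothetical helper: prefix each source sentence with CONTEXT and
    # append the candidate summary under SUMMARY, mirroring the example payload.
    parts = [f"CONTEXT: {line}" for line in context_lines]
    parts.append(f"SUMMARY: {summary}")
    return "\n".join(parts)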

Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| model | gpt-5-nano | Swap to a larger evaluator for longer or more technical content. |
| temperature | 0.0 | Keep low for deterministic scoring; raise slightly to sample different critiques. |
| track | True | Disable to skip sending traces to Opik. |
| project_name | None | Override when logging scores. |
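
A minimal configuration sketch using the parameters above; the exact constructor signature is an assumption based on this table, and the project name is only an example.

metric = SummarizationConsistencyJudge(
    model="gpt-4o",                   # larger evaluator for long or technical content
    temperature=0.0,                  # keep scoring deterministic
    track=True,                       # send traces to Opik
    project_name="summarisation-qa",  # example project to log scores under
)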

The evaluator emits an integer between 0 and 10 that Opik normalises to 0.0–1.0; the reason field captures the rubric notes explaining the judgement.
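
If you gate summaries before publishing, you can threshold the normalised value; the 0.7 cut-off below is an arbitrary example rather than a recommended setting.

result = metric.score(output=payload)

if result.value < 0.7:  # example threshold, tune for your content
    raise ValueError(f"Summary failed the faithfulness check: {result.reason}")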