Meaning Match

The Meaning Match metric evaluates whether an LLM’s output semantically matches a ground truth answer, regardless of phrasing or formatting. This metric is particularly useful for evaluating question-answering systems where the same answer can be expressed in different ways.

How to use the Meaning Match metric

The Meaning Match metric is available as an LLM-as-a-Judge metric in automation rules. You can use it to automatically evaluate traces in your project by creating a new rule.

Creating a rule with Meaning Match

  1. Navigate to your project in Opik
  2. Click on “Rules” in the sidebar
  3. Click “Create new rule”
  4. Select “LLM-as-judge” as the metric type
  5. Choose “Meaning Match” from the prompt dropdown
  6. Configure the variable mapping (a sketch of an example mapping follows this list):
    • input: The original question or prompt
    • ground_truth: The expected correct answer
    • output: The LLM’s generated response
  7. Select your preferred LLM model for evaluation
  8. Configure sampling rate and filters as needed
  9. Click “Create rule”
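
The variable mapping simply tells the rule which fields of each trace to substitute into the prompt template's `{{input}}`, `{{ground_truth}}`, and `{{output}}` placeholders. The sketch below shows the general idea; the trace field names (`question`, `expected_answer`, `answer`) are hypothetical and depend entirely on how your application logs traces.

```python
# Hypothetical trace, shaped the way your application might log it.
trace = {
    "input": {"question": "What's the capital of France?"},
    "expected_output": {"expected_answer": "Paris"},
    "output": {"answer": "It's Paris"},
}

# Variable mapping configured in the rule UI: prompt placeholder -> trace field path.
# The exact paths depend on your own trace schema.
variable_mapping = {
    "input": "input.question",
    "ground_truth": "expected_output.expected_answer",
    "output": "output.answer",
}
```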

Understanding the scores

The Meaning Match metric returns a boolean score:

  • true (1.0): The output conveys the same essential answer as the ground truth, even if worded differently
  • false (0.0): The output contradicts, differs from, or fails to include the core answer in the ground truth

Each score includes a detailed reason explaining the judgment.
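
As a rough illustration of how the boolean verdict maps to the numeric score and reason shown in the UI, the sketch below parses a judge response in the prompt's output format. The parsing code is illustrative only, not Opik's internal implementation.

```python
import json

def to_score(judge_response: str) -> tuple[float, str]:
    """Convert the judge's JSON verdict into a numeric score and a reason string."""
    verdict = json.loads(judge_response)
    score = 1.0 if verdict["score"] else 0.0          # true -> 1.0, false -> 0.0
    reason = "; ".join(verdict.get("reason", []))     # reasons come back as a list
    return score, reason

score, reason = to_score('{"score": true, "reason": ["Output matches the ground truth."]}')
print(score, "-", reason)  # 1.0 - Output matches the ground truth.
```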

Evaluation Guidelines

The Meaning Match metric follows these rules when evaluating responses:

  1. Focus on factual equivalence - Only the core answer matters; style, grammar, and verbosity are ignored
  2. Accept aliases and synonyms - “NYC” ≈ “New York City”; “Da Vinci” ≈ “Leonardo da Vinci”
  3. Ignore formatting - Case, punctuation, and whitespace differences are acceptable (see the sketch after this list)
  4. Allow extra context - Additional details are acceptable as long as they don’t contradict the main answer
  5. Reject hedging - Uncertain or incomplete answers score as false
  6. Treat numeric and textual forms as equivalent - “100” ≈ “one hundred”
  7. Reject multiple alternatives - If the output includes the correct answer alongside incorrect or unrelated alternatives, it scores as false

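To build intuition for the formatting and numeric rules (3 and 6 above), the sketch below shows the kind of normalization the judge effectively applies. The actual evaluation is performed by the LLM judge, not by string matching like this, so treat it purely as an illustration.

```python
import re
import string

# Tiny illustrative lookup; the judge handles arbitrary number words.
_NUMBER_WORDS = {"twenty": "20", "one hundred": "100"}

def normalize(text: str) -> str:
    """Mimic rules 3 and 6: ignore case, punctuation, whitespace, and articles,
    and treat simple textual numbers as their numeric form."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    for word, digits in _NUMBER_WORDS.items():
        text = text.replace(word, digits)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(normalize("The answer is Twenty!") == normalize("the answer is 20"))  # True
```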
Example evaluations

| Input | Ground Truth | Output | Score | Reason |
| --- | --- | --- | --- | --- |
| What’s the capital of France? | Paris | It’s Paris | ✅ true | Output conveys the same factual answer as the ground truth |
| Who painted the Mona Lisa? | Leonardo da Vinci | Da Vinci | ✅ true | “Da Vinci” is an accepted alias for “Leonardo da Vinci” |
| Who painted the Mona Lisa? | Leonardo da Vinci | Pablo Picasso | ❌ false | Output names a different painter than the ground truth |
| What’s 10 + 10? | 20 | The answer is twenty | ✅ true | Numeric and textual forms are treated as equivalent |

Meaning Match Prompt

Opik uses an LLM as a judge to evaluate semantic equivalence. By default, the evaluation uses the model you select when creating the rule. The prompt template used for evaluation is shown below; a sketch of how it can be filled in and sent to a judge model follows the template:

You are an expert semantic equivalence judge. Your task is to decide whether the OUTPUT conveys the same essential answer as the GROUND_TRUTH, regardless of phrasing or formatting.
## What to judge
- TRUE if the OUTPUT expresses the same core fact/entity/value as the GROUND_TRUTH.
- FALSE if the OUTPUT contradicts, differs from, or fails to include the core fact/value in GROUND_TRUTH.
## Rules
1. Focus only on the factual equivalence of the core answer. Ignore style, grammar, or verbosity.
2. Accept aliases, synonyms, paraphrases, or equivalent expressions.
Examples: "NYC" ≈ "New York City"; "Da Vinci" ≈ "Leonardo da Vinci".
3. Ignore case, punctuation, and formatting differences.
4. Extra contextual details are acceptable **only if they don't change or contradict** the main answer.
5. If the OUTPUT includes the correct answer along with additional unrelated or incorrect alternatives → FALSE.
6. Uncertain, hedged, or incomplete answers → FALSE.
7. Treat numeric and textual forms as equivalent (e.g., "100" = "one hundred").
8. Ignore whitespace, articles, and small typos that don't change meaning.
## Output Format
Your response **must** be a single JSON object in the following format:
{
"score": true or false,
"reason": ["short reason for the response"]
}
## Example
INPUT: "Who painted the Mona Lisa?"
GROUND_TRUTH: "Leonardo da Vinci"
OUTPUT: "It was painted by Leonardo da Vinci."
→ {"score": true, "reason": ["Output conveys the same factual answer as the ground truth."]}
OUTPUT: "Pablo Picasso"
→ {"score": false, "reason": ["Output names a different painter than the ground truth."]}
INPUT:
{{input}}
GROUND_TRUTH:
{{ground_truth}}
OUTPUT:
{{output}}
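
As a sketch of what happens when the rule runs, the snippet below fills the three placeholders and asks a judge model for a verdict. Opik performs this call for you; the OpenAI client, the `gpt-4o-mini` model name, and the `MEANING_MATCH_TEMPLATE` variable are assumptions standing in for whichever judge model you selected in the rule.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_meaning_match(template: str, input_text: str, ground_truth: str, output_text: str) -> dict:
    """Fill the Meaning Match template and ask a judge model for a verdict."""
    prompt = (
        template.replace("{{input}}", input_text)
        .replace("{{ground_truth}}", ground_truth)
        .replace("{{output}}", output_text)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute the model chosen in your rule
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example (MEANING_MATCH_TEMPLATE holds the prompt template shown above):
# judge_meaning_match(MEANING_MATCH_TEMPLATE, "What's the capital of France?", "Paris", "It's Paris")
# -> {"score": True, "reason": ["Output conveys the same factual answer as the ground truth."]}
```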

Use cases

The Meaning Match metric is ideal for:

  • Question-answering systems - Evaluate if answers are semantically correct
  • Information extraction - Verify extracted entities match expected values
  • Knowledge base validation - Check if responses align with ground truth knowledge
  • RAG systems - Assess if retrieved information correctly answers questions
  • Multi-language systems - Compare answers across translations (when ground truth is translated)

Best practices

  • Provide clear ground truth - The more specific the ground truth, the more accurate the evaluation
  • Use with other metrics - Combine with metrics such as Hallucination or Answer Relevance for a more comprehensive evaluation
  • Monitor false positives/negatives - Periodically review evaluation results to confirm the metric works well for your use case (a spot-checking sketch follows this list)
  • Test with edge cases - Try the metric with ambiguous or borderline cases to understand its behavior
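
One lightweight way to monitor false positives and negatives is to keep a small, hand-labeled set of (input, ground truth, output) triples and periodically compare the judge's verdicts against your own labels. The sketch below computes simple agreement; it reuses the hypothetical `judge_meaning_match` helper and `MEANING_MATCH_TEMPLATE` from the earlier sketch.

```python
# Hand-labeled spot-check set: (input, ground_truth, output, human_label)
spot_checks = [
    ("What's the capital of France?", "Paris", "It's Paris", True),
    ("Who painted the Mona Lisa?", "Leonardo da Vinci", "Pablo Picasso", False),
]

def agreement(template: str) -> float:
    """Fraction of spot-check cases where the judge agrees with the human label."""
    hits = 0
    for question, truth, answer, human_label in spot_checks:
        verdict = judge_meaning_match(template, question, truth, answer)
        hits += verdict["score"] == human_label
    return hits / len(spot_checks)

# print(f"Judge/human agreement: {agreement(MEANING_MATCH_TEMPLATE):.0%}")
```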