Meaning Match
The Meaning Match metric evaluates whether an LLM’s output semantically matches a ground truth answer, regardless of phrasing or formatting. This metric is particularly useful for evaluating question-answering systems where the same answer can be expressed in different ways.
How to use the Meaning Match metric
The Meaning Match metric is available as an LLM-as-a-Judge metric in automation rules. You can use it to automatically evaluate traces in your project by creating a new rule.
Creating a rule with Meaning Match
- Navigate to your project in Opik
- Click on “Rules” in the sidebar
- Click “Create new rule”
- Select “LLM-as-judge” as the metric type
- Choose “Meaning Match” from the prompt dropdown
- Configure the variable mapping (an example mapping follows these steps):
  - input: The original question or prompt
  - ground_truth: The expected correct answer
  - output: The LLM’s generated response
- Select your preferred LLM model for evaluation
- Configure sampling rate and filters as needed
- Click “Create rule”
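The variable mapping tells the judge where to find each value on your traces. As a rough sketch (the trace field names on the right are hypothetical; use whatever fields your application actually logs), a mapping might look like this:

```python
# Hypothetical variable mapping for a Meaning Match rule.
# Keys are the prompt variables; values are illustrative trace fields,
# substitute the fields your application actually logs.
variable_mapping = {
    "input": "input.question",                    # the original question sent to the LLM
    "ground_truth": "metadata.expected_answer",   # the reference answer stored with the trace
    "output": "output.answer",                    # the LLM's generated response
}
```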
Understanding the scores
The Meaning Match metric returns a boolean score:
- true (1.0): The output conveys the same essential answer as the ground truth, even if worded differently
- false (0.0): The output contradicts, differs from, or fails to include the core answer in the ground truth
Each score includes a detailed reason explaining the judgment.
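As a rough illustration (the payload below is simplified and the field names are not the exact ones Opik stores), a Meaning Match result attached to a trace might look like this:

```python
# Illustrative example of a Meaning Match result (simplified, not the exact payload).
meaning_match_score = {
    "name": "Meaning Match",
    "value": 1.0,  # true: the output conveys the same essential answer as the ground truth
    "reason": (
        "The ground truth is 'Paris' and the output states that the capital of "
        "France is Paris; the extra wording does not change the core answer."
    ),
}
```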
Evaluation Guidelines
The Meaning Match metric follows these rules when evaluating responses:
- Focus on factual equivalence - Ignores style, grammar, or verbosity
- Accept aliases and synonyms - “NYC” ≈ “New York City”; “Da Vinci” ≈ “Leonardo da Vinci”
- Ignore formatting - Case, punctuation, and whitespace differences are acceptable
- Allow extra context - Additional details are okay if they don’t contradict the main answer
- Reject hedging - Uncertain or incomplete answers score as false
- Treat numeric forms as equivalent - “100” = “one hundred”
- Reject multiple alternatives - If the output lists the correct answer alongside incorrect alternatives, it scores as false
Example evaluations
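For illustration, the hypothetical cases below show how the guidelines in the previous section typically apply; they are not taken from Opik itself:

```python
# Hypothetical evaluations illustrating the guidelines above.
example_evaluations = [
    {
        "input": "What is the capital of France?",
        "ground_truth": "Paris",
        "output": "The capital city of France is Paris.",
        "expected_score": True,   # extra context, same core answer
    },
    {
        "input": "Who painted the Mona Lisa?",
        "ground_truth": "Leonardo da Vinci",
        "output": "Da Vinci",
        "expected_score": True,   # accepted alias
    },
    {
        "input": "How many states are in the USA?",
        "ground_truth": "50",
        "output": "fifty",
        "expected_score": True,   # numeric equivalent
    },
    {
        "input": "What is the boiling point of water at sea level?",
        "ground_truth": "100 °C",
        "output": "It might be around 90 or 100 degrees Celsius.",
        "expected_score": False,  # hedging and multiple alternatives
    },
]
```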
Meaning Match Prompt
Opik uses an LLM as a Judge to evaluate semantic equivalence. By default, the evaluation uses the model you select when creating the rule, together with a built-in prompt template that instructs the judge to apply the guidelines listed above.
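The sketch below is only an illustration of what such an equivalence-judging prompt generally contains; it is not the actual template shipped with Opik:

```python
# Illustrative sketch only; not the actual prompt template shipped with Opik.
# Placeholders in curly braces correspond to the mapped variables.
MEANING_MATCH_PROMPT_SKETCH = """\
You are an impartial judge. Decide whether the OUTPUT conveys the same essential
answer as the GROUND TRUTH for the given INPUT. Ignore style, formatting, and
extra context; accept aliases, synonyms, and numeric equivalents; reject hedged,
incomplete, or contradictory answers, and answers that list multiple alternatives.

INPUT: {input}
GROUND TRUTH: {ground_truth}
OUTPUT: {output}

Respond with "true" or "false" and a short reason for your judgment.
"""
```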
Use cases
The Meaning Match metric is ideal for:
- Question-answering systems - Evaluate if answers are semantically correct
- Information extraction - Verify extracted entities match expected values
- Knowledge base validation - Check if responses align with ground truth knowledge
- RAG systems - Assess if retrieved information correctly answers questions
- Multi-language systems - Compare answers across translations (when ground truth is translated)
Best practices
- Provide clear ground truth - The more specific the ground truth, the more accurate the evaluation
- Use with other metrics - Combine Meaning Match with metrics like hallucination or answer relevance for a more comprehensive evaluation
- Monitor false positives/negatives - Review evaluation results periodically to ensure the metric works well for your use case
- Test with edge cases - Try the metric with ambiguous or borderline cases to understand its behavior (a few such cases are sketched below)
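For example, a small set of borderline cases you might run through the metric (the cases and the notes on why they are borderline are illustrative assumptions, not Opik outputs):

```python
# Hypothetical borderline cases for probing the metric's behavior.
edge_cases = [
    {   # correct core answer plus an incorrect extra detail
        "input": "When did World War II end?",
        "ground_truth": "1945",
        "output": "It ended in 1945, right after the moon landing.",
    },
    {   # hedged answer that still names the correct value
        "input": "What is the chemical symbol for gold?",
        "ground_truth": "Au",
        "output": "I believe it is probably Au, but I'm not certain.",
    },
    {   # correct value expressed in different units
        "input": "How long is a marathon?",
        "ground_truth": "42.195 kilometers",
        "output": "About 26.2 miles.",
    },
]
```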