Meaning Match
The Meaning Match metric evaluates whether an LLM's output semantically matches a ground truth answer, regardless of phrasing or formatting. This metric is particularly useful for evaluating question-answering systems where the same answer can be expressed in different ways.
How to use the Meaning Match metric
The Meaning Match metric is available as an LLM-as-a-Judge metric in automation rules. You can use it to automatically evaluate traces in your project by creating a new rule.
Creating a rule with Meaning Match
- Navigate to your project in Opik
- Click on "Rules" in the sidebar
- Click "Create new rule"
- Select "LLM-as-judge" as the metric type
- Choose "Meaning Match" from the prompt dropdown
- Configure the variable mapping (an example mapping follows these steps):
  - input: The original question or prompt
  - ground_truth: The expected correct answer
  - output: The LLM's generated response
- Select your preferred LLM model for evaluation
- Configure sampling rate and filters as needed
- Click "Create rule"
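For example, if your application logs the user question in the trace input, the model response in the trace output, and the expected answer in the trace metadata, the mapping might look like the following (the field paths are hypothetical and depend on how your traces are structured):
- input → input.question
- ground_truth → metadata.expected_answer
- output → output.response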
Understanding the scores
The Meaning Match metric returns a boolean score:
- true (1.0): The output conveys the same essential answer as the ground truth, even if worded differently
- false (0.0): The output contradicts, differs from, or fails to include the core answer in the ground truth
Each score includes a detailed reason explaining the judgment.
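As an illustration, with ground truth "Paris" and output "The capital of France is Paris, a city of about two million people.", the metric would be expected to return true, with a reason along the lines of "The output names Paris as the capital, matching the ground truth; the extra population detail does not contradict it."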
Evaluation Guidelines
The Meaning Match metric follows these rules when evaluating responses:
- Focus on factual equivalence - Ignore style, grammar, and verbosity
- Accept aliases and synonyms - "NYC" = "New York City"; "Da Vinci" = "Leonardo da Vinci"
- Ignore formatting - Case, punctuation, and whitespace differences are acceptable
- Allow extra context - Additional details are okay if they don't contradict the main answer
- Reject hedging - Uncertain or incomplete answers score as false
- Treat numeric equivalents as equal - "100" = "one hundred"
- Reject multiple alternatives - If the output includes the correct answer with incorrect alternatives, it scores as false
Example evaluations
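The examples below are illustrative applications of the guidelines above, not an exhaustive list:
- Ground truth "William Shakespeare", output "The play was written by Shakespeare." → true (accepted alias)
- Ground truth "100", output "The answer is one hundred." → true (numeric equivalent)
- Ground truth "Paris", output "It is probably Paris, but it could also be Lyon." → false (hedging plus an incorrect alternative)
- Ground truth "1492", output "Columbus first crossed the Atlantic in 1592." → false (contradicts the ground truth)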
Meaning Match Prompt
Opik uses an LLM as a Judge to evaluate semantic equivalence. By default, the evaluation uses the model you select when creating the rule. The judge receives the mapped input, ground_truth, and output variables and is asked to return a true/false verdict together with a supporting reason.
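The sketch below is an illustrative approximation of such a judge prompt, not Opik's verbatim template; the {{input}}, {{ground_truth}}, and {{output}} placeholders stand for the variables configured in the rule's mapping:

```
You are comparing a model's answer to a ground truth answer.

Question: {{input}}
Ground truth answer: {{ground_truth}}
Model answer: {{output}}

Decide whether the model answer conveys the same essential answer as the
ground truth. Ignore style, formatting, aliases, and numeric spelling.
Treat hedged, incomplete, or multi-alternative answers as not matching.

Return "true" or "false", followed by a short reason for your judgment.
```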
Use cases
The Meaning Match metric is ideal for:
- Question-answering systems - Evaluate if answers are semantically correct
- Information extraction - Verify extracted entities match expected values
- Knowledge base validation - Check if responses align with ground truth knowledge
- RAG systems - Assess if retrieved information correctly answers questions
- Multi-language systems - Compare answers across translations (when ground truth is translated)
Best practices
- Provide clear ground truth - The more specific the ground truth, the more accurate the evaluation
- Use with other metrics - Combine with other metrics like hallucination or answer relevance for comprehensive evaluation
- Monitor false positives/negatives - Review evaluation results periodically to ensure the metric works well for your use case
- Test with edge cases - Try the metric with ambiguous or borderline cases to understand its behavior (a small offline sketch for doing this follows below)
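One way to explore edge cases before enabling a rule is to run a stand-in judge over a handful of borderline examples. The sketch below assumes the openai Python package and an OPENAI_API_KEY environment variable; it is not Opik's internal judge, the prompt is a simplified version of the guidelines above, and the model name is only an example.

```python
# Illustrative stand-in for exploring meaning-match behaviour on edge cases.
# This is not Opik's internal judge; the prompt is a simplified version of
# the guidelines above and "gpt-4o-mini" is only an example model name.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing a model answer against a ground truth answer.

Question: {input}
Ground truth: {ground_truth}
Model answer: {output}

Reply with the single word "true" if the model answer conveys the same
essential answer as the ground truth, otherwise reply "false"."""

# Borderline cases worth reviewing: aliases, numeric spelling, hedging.
EDGE_CASES = [
    ("Who painted the Mona Lisa?", "Leonardo da Vinci", "Da Vinci painted it."),
    ("What is 10 squared?", "100", "Ten squared is one hundred."),
    ("What is the capital of France?", "Paris", "It might be Paris or Lyon."),
]


def judge(question: str, ground_truth: str, output: str) -> bool:
    """Ask the judge model for a true/false verdict on semantic equivalence."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                input=question, ground_truth=ground_truth, output=output
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("true")


for question, ground_truth, output in EDGE_CASES:
    print(f"{question!r} -> {judge(question, ground_truth, output)}")
```

Reviewing the verdicts and reasons on a small set like this makes it easier to decide whether the metric's behavior matches your expectations before applying it to live traces.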