The Usefulness metric evaluates how useful an LLM response is for a given input. It uses a language model as a judge and returns a score between 0.0 and 1.0, where higher values indicate a more useful response. Along with the score, it provides a detailed explanation of why that score was assigned.
You can use the Usefulness metric as follows:
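A minimal sketch of basic usage follows. It assumes the metric is importable as `Usefulness` from `opik.evaluation.metrics`, that `score` accepts `input` and `output` keyword arguments, and that the result exposes `value` alongside the `reason` field described on this page. The wrapper `score_usefulness` and its `metric` argument are hypothetical helpers added for illustration, not part of Opik.

```python
from typing import Any, Optional, Tuple


def score_usefulness(
    question: str, answer: str, metric: Optional[Any] = None
) -> Tuple[float, str]:
    """Return (score, explanation) for `answer` given `question`."""
    if metric is None:
        # Assumed import path. Imported lazily so that a stub metric can be
        # passed in when Opik or judge-model credentials are unavailable.
        from opik.evaluation.metrics import Usefulness

        metric = Usefulness()
    # The judge LLM rates the output against the input.
    result = metric.score(input=question, output=answer)
    return result.value, result.reason


# Example call (requires Opik and credentials for the judge model):
# value, reason = score_usefulness(
#     "How can I optimize the performance of my Python web application?",
#     "Profile the application first, then cache expensive queries.",
# )
```

Passing the metric in as an argument keeps the helper easy to exercise in unit tests, where calling a real judge model is usually undesirable.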
Asynchronous scoring is also supported: use the ascore method in Python, or the score method in TypeScript (which is always async).
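As a rough illustration of the asynchronous path in Python, the sketch below scores several input/output pairs concurrently. It assumes `ascore` takes the same `input` and `output` keyword arguments as `score`; the helper name `score_many` is hypothetical.

```python
import asyncio
from typing import Any, List, Sequence, Tuple


async def score_many(metric: Any, pairs: Sequence[Tuple[str, str]]) -> List[Any]:
    """Score (input, output) pairs concurrently via the metric's ascore method."""
    tasks = [
        metric.ascore(input=question, output=answer)
        for question, answer in pairs
    ]
    # asyncio.gather returns results in the same order as `pairs`.
    return await asyncio.gather(*tasks)


# Example call (requires Opik and credentials for the judge model):
# from opik.evaluation.metrics import Usefulness  # assumed import path
# results = asyncio.run(score_many(Usefulness(), [("question", "answer")]))
```

Because each call waits on a remote judge model, running them concurrently is likely to reduce total wall-clock time compared with scoring the pairs one by one.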
The usefulness score ranges from 0.0 to 1.0:
Each score comes with a detailed explanation (result.reason) that helps you understand why that particular score was assigned.
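One way to use the score and its explanation together is to collect low-scoring responses for manual review. The sketch below is illustrative only: `flag_low_scores` is a hypothetical helper, and the default threshold of 0.5 is an arbitrary value chosen for the example, not a cutoff defined by Opik.

```python
from typing import Any, Iterable, List, Tuple


def flag_low_scores(
    results: Iterable[Any], threshold: float = 0.5
) -> List[Tuple[float, str]]:
    """Collect (value, reason) for every result scoring below `threshold`."""
    # `threshold` is a caller-chosen cutoff; 0.5 is only a placeholder default.
    return [(r.value, r.reason) for r in results if r.value < threshold]
```

Keeping the reason next to the value means a reviewer can see why the judge considered a response unhelpful without re-running the metric.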
Opik uses an LLM as a Judge to evaluate usefulness. A prompt template is used to generate the prompt sent to the judge LLM. By default, the gpt-4o model evaluates responses, but you can switch to any model supported by LiteLLM by setting the model parameter. You can learn more about customizing models in the Customize models for LLM as a Judge metrics section.
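Switching the judge model might look like the sketch below. It assumes the `model` parameter is passed to the `Usefulness` constructor as a LiteLLM model name and that the class is importable from `opik.evaluation.metrics`. The factory `build_usefulness_metric` and its `metric_cls` argument are hypothetical, and the model name in the example call is only an illustration.

```python
from typing import Any, Callable, Optional


def build_usefulness_metric(
    model: str, metric_cls: Optional[Callable[..., Any]] = None
) -> Any:
    """Create a usefulness metric that uses `model` as the judge."""
    if metric_cls is None:
        # Assumed import path. Imported lazily so a stand-in class can be
        # supplied when Opik is not installed.
        from opik.evaluation.metrics import Usefulness

        metric_cls = Usefulness
    # `model` is expected to be a model name that LiteLLM recognizes.
    return metric_cls(model=model)


# Example call (model name chosen for illustration):
# metric = build_usefulness_metric("gpt-4o-mini")
```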
The template is as follows: