Overview
Opik provides a set of built-in metrics that you can use to evaluate the output of your LLM calls. These metrics fall into two main categories:
- Heuristic metrics
- LLM as a Judge metrics
Heuristic metrics are deterministic and often statistical in nature, while LLM as a Judge metrics are non-deterministic and rely on an LLM to evaluate the output of another LLM.
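To illustrate the difference, here is a minimal sketch using the Python SDK. It assumes the `Equals` (heuristic) and `Hallucination` (LLM as a Judge) metrics from `opik.evaluation.metrics`, and that an OpenAI API key is configured in the environment for the judge call:

```python
from opik.evaluation.metrics import Equals, Hallucination

# Heuristic metric: a deterministic string comparison, no LLM call involved.
equals_metric = Equals()
print(equals_metric.score(output="Paris", reference="Paris").value)  # 1.0 on every run

# LLM as a Judge metric: an LLM scores the response, so results can vary between runs.
# Assumes an OpenAI API key is available in the environment (GPT-4o is the default judge).
hallucination_metric = Hallucination()
result = hallucination_metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    context=["France is a country in Europe. Its capital is Paris."],
)
print(result.value, result.reason)
```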
Opik ships with a number of built-in evaluation metrics in both categories. You can also create your own custom metric; learn more about this in the Custom Metric section.
Customizing LLM as a Judge metrics
By default, Opik uses OpenAI's GPT-4o as the LLM that evaluates the output of other LLMs. However, you can easily switch to another LLM provider by passing a different value for the model parameter.
For Python, this functionality is built on the LiteLLM framework. You can find the full list of supported LLM providers, and how to configure them, in the LiteLLM Providers guide.
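As a sketch of what switching the judge model can look like in Python, assuming the metric constructors accept a model string in LiteLLM's `<provider>/<model>` format and that the matching provider API key (Anthropic is used here purely as an illustrative choice) is set in the environment:

```python
import os
from opik.evaluation.metrics import Hallucination

# Assumes the provider API key is configured where LiteLLM expects it.
os.environ["ANTHROPIC_API_KEY"] = "<your-api-key>"

# Pass a LiteLLM model string instead of relying on the default GPT-4o judge.
hallucination_metric = Hallucination(model="anthropic/claude-3-5-sonnet-20241022")

result = hallucination_metric.score(
    input="What is the capital of France?",
    output="The capital of France is Lyon.",
    context=["France is a country in Europe. Its capital is Paris."],
)
print(result.value, result.reason)
```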
For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.