Custom model

Opik provides a set of LLM-as-a-Judge metrics that are designed to be model-agnostic and can be used with any LLM. To achieve this, Opik uses the LiteLLM library to abstract the LLM calls.

By default, Opik uses the gpt-4o model. However, you can change this by setting the model parameter to any model supported by LiteLLM when initializing your metric:

from opik.evaluation.metrics import Hallucination

hallucination_metric = Hallucination(
    model="gpt-4-turbo"
)

Using a model supported by LiteLLM

Many of the models supported by LiteLLM require additional parameters. For these, you can use the LiteLLMChatModel class and pass it to the metric:

from opik.evaluation.metrics import Hallucination
from opik.evaluation import models

model = models.LiteLLMChatModel(
    name="<model_name>",
    base_url="<base_url>"
)

hallucination_metric = Hallucination(
    model=model
)
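
For example, to point the metric at a local, OpenAI-compatible server such as Ollama, the placeholders might be filled in as follows. The model name and URL here are illustrative assumptions only; substitute the values for your own deployment:

from opik.evaluation.metrics import Hallucination
from opik.evaluation import models

# Illustrative values only: assumes a local Ollama server serving a llama3 model
model = models.LiteLLMChatModel(
    name="ollama/llama3",
    base_url="http://localhost:11434"
)

hallucination_metric = Hallucination(
    model=model
)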

Creating Your Own Custom Model Class

Opik’s LLM-as-a-Judge metrics, such as Hallucination, are designed to work with various language models. While Opik supports many models out-of-the-box via LiteLLM, you can integrate any LLM by creating a custom model class. This involves subclassing opik.evaluation.models.OpikBaseModel and implementing its required methods.

The OpikBaseModel Interface

OpikBaseModel is an abstract base class that defines the interface Opik metrics use to interact with LLMs. To create a compatible custom model, you must implement the following methods (a minimal stub is sketched after the list):

  1. __init__(self, model_name: str): Initializes the base model with a given model name.
  2. generate_string(self, input: str, **kwargs: Any) -> str: Simplified interface to generate a string output from the model.
  3. generate_provider_response(self, **kwargs: Any) -> Any: Generate a provider-specific response. Can be used to interface with the underlying model provider (e.g., OpenAI, Anthropic) and get raw output.
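
As a minimal sketch of this interface, the stub below implements all three methods with placeholder names and hard-coded return values. It is not tied to any real provider, and a real LLM-as-a-Judge metric expects properly formatted model output, so a stub like this is only useful for checking that your wiring works:

from typing import Any

from opik.evaluation.models import OpikBaseModel

class StubModel(OpikBaseModel):
    """Minimal placeholder implementation of the OpikBaseModel interface."""

    def __init__(self, model_name: str = "stub-model"):
        super().__init__(model_name)

    def generate_string(self, input: str, **kwargs: Any) -> str:
        # A real implementation would send `input` to an LLM and return its reply.
        return "placeholder response"

    def generate_provider_response(self, **kwargs: Any) -> Any:
        # A real implementation would return the provider's raw response object.
        return {"choices": [{"message": {"content": "placeholder response"}}]}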

Implementing a Custom Model for an OpenAI-like API

Here’s an example of a custom model class that interacts with an LLM service exposing an OpenAI-compatible API endpoint.

import requests
from typing import Any

from opik.evaluation.models import OpikBaseModel

class CustomOpenAICompatibleModel(OpikBaseModel):
    def __init__(self, model_name: str, api_key: str, base_url: str):
        super().__init__(model_name)
        self.api_key = api_key
        self.base_url = base_url  # e.g., "https://api.openai.com/v1/chat/completions"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def generate_string(self, input: str, **kwargs: Any) -> str:
        """
        This method is used as part of LLM as a Judge metrics to take a string prompt, pass it to
        the model as a user message and return the model's response as a string.
        """
        conversation = [
            {
                "content": input,
                "role": "user",
            },
        ]

        provider_response = self.generate_provider_response(messages=conversation, **kwargs)
        return provider_response["choices"][0]["message"]["content"]

    def generate_provider_response(self, messages: list[dict[str, Any]], **kwargs: Any) -> Any:
        """
        This method is used as part of LLM as a Judge metrics to take a list of AI messages, pass it to
        the model and return the full model response.
        """
        payload = {
            "model": self.model_name,
            "messages": messages,
        }

        response = requests.post(self.base_url, headers=self.headers, json=payload)

        response.raise_for_status()
        return response.json()

Key considerations for the implementation:

  • API Endpoint and Payload: Adjust base_url and the JSON payload to match your specific LLM provider’s requirements if they deviate from the common OpenAI structure; see the sketch after this list.
  • Model Name: The model_name passed to __init__ is used as the model parameter in the API call. Ensure this matches an available model on your LLM service.
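
As one illustration of that adjustment, the variant of generate_provider_response below merges any extra keyword arguments (for example temperature or max_tokens) into the request body. This is a sketch that assumes your endpoint accepts those fields alongside model and messages, and it can be dropped into the CustomOpenAICompatibleModel class above in place of the original method:

def generate_provider_response(self, messages: list[dict[str, Any]], **kwargs: Any) -> Any:
    # Extra keyword arguments (e.g., temperature=0.0, max_tokens=512) are merged into
    # the request body; filter or rename them to match what your provider accepts.
    payload = {
        "model": self.model_name,
        "messages": messages,
        **kwargs,
    }

    response = requests.post(self.base_url, headers=self.headers, json=payload)
    response.raise_for_status()
    return response.json()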

Using the Custom Model with the Hallucination Metric

To run an evaluation using your custom model with the Hallucination metric, first instantiate the CustomOpenAICompatibleModel class and pass it to the Hallucination metric. The evaluation can then be kicked off by calling the Hallucination.score() method.

import os

from opik.evaluation.metrics import Hallucination

# Ensure these are set securely, e.g., via environment variables
API_KEY = os.getenv("MY_CUSTOM_LLM_API_KEY")
BASE_URL = "YOUR_LLM_CHAT_COMPLETIONS_ENDPOINT"  # e.g., "https://api.openai.com/v1/chat/completions"
MODEL_NAME = "your-model-name"  # e.g., "gpt-3.5-turbo"

# Initialize your custom model
my_custom_model = CustomOpenAICompatibleModel(
    model_name=MODEL_NAME,
    api_key=API_KEY,
    base_url=BASE_URL
)

# Initialize the Hallucination metric with the custom model
hallucination_metric = Hallucination(
    model=my_custom_model
)

# Example usage:
evaluation = hallucination_metric.score(
    input="What is the capital of Mars?",
    output="The capital of Mars is Ares City, a bustling metropolis.",
    context=["Mars is a planet in our solar system. It does not currently have any established cities or a designated capital."]
)
print(f"Hallucination Score: {evaluation.value}")  # Expected: 1.0 (hallucination detected)
print(f"Reason: {evaluation.reason}")

Key considerations for the output:

  • ScoreResult Output: Hallucination.score() returns a ScoreResult object containing the metric name (name), score value (value), optional explanation (reason), metadata (metadata), and a failure flag (scoring_failed), as shown in the snippet after this list.
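
For illustration, the snippet below reuses the hallucination_metric from the example above and inspects the remaining ScoreResult fields; the failure check is only a sketch of how scoring_failed might be handled in your own code:

result = hallucination_metric.score(
    input="What is the capital of Mars?",
    output="The capital of Mars is Ares City, a bustling metropolis.",
    context=["Mars is a planet in our solar system. It does not currently have any established cities or a designated capital."]
)

# Check the failure flag before trusting the score
if result.scoring_failed:
    print(f"Scoring failed for metric '{result.name}'")
else:
    print(f"{result.name}: {result.value}")
    print(f"Reason: {result.reason}")
    print(f"Metadata: {result.metadata}")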