Evaluate LLM Applications with Ragas Metrics in Opik

The Opik SDK provides a simple way to integrate with Ragas, a framework for evaluating RAG systems.

There are two main ways to use Ragas with Opik:

  1. Using Ragas to score traces or spans.
  2. Using Ragas to evaluate a RAG pipeline.

Account Setup

Comet provides a hosted version of the Opik platform; simply create an account and grab your API key.

You can also run the Opik platform locally; see the installation guide for more information.

Getting Started

Installation

You will first need to install the opik and ragas packages:

$ pip install opik ragas

Configuring Opik

Configure the Opik Python SDK for your deployment type. See the Python SDK Configuration guide for detailed instructions on the following (a minimal configuration sketch follows this list):

  • CLI configuration: opik configure
  • Code configuration: opik.configure()
  • Self-hosted vs Cloud vs Enterprise setup
  • Configuration files and environment variables
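As a quick illustration, a code-based configuration might look like the sketch below. This assumes an Opik Cloud deployment; the workspace name is a placeholder, and calling opik.configure() with no arguments will instead prompt you for these values interactively.

import os
import opik

# Minimal sketch: configure the SDK for Opik Cloud using placeholder values.
# For a local deployment you would pass use_local=True instead.
opik.configure(
    api_key=os.environ.get("OPIK_API_KEY"),  # your Comet/Opik API key
    workspace="your-workspace-name",         # placeholder workspace name
)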

Configuring Ragas

In order to use Ragas, you will need to configure your LLM provider API keys. For this example, we’ll use OpenAI; you can find or create your API key in your OpenAI account settings.

You can set them as environment variables:

$ export OPENAI_API_KEY="YOUR_API_KEY"

Or set them programmatically:

import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Using Ragas to score traces or spans

Ragas provides a set of metrics that can be used to evaluate the quality of a RAG pipeline; a full list of the supported metrics can be found in the Ragas documentation.

You can use the RagasMetricWrapper to easily integrate Ragas metrics with Opik tracking:

# Import the required dependencies
from ragas.metrics import AnswerRelevancy
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from opik.evaluation.metrics import RagasMetricWrapper

# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)

# Wrap the Ragas metric with RagasMetricWrapper for Opik integration
answer_relevancy_metric = RagasMetricWrapper(
    ragas_answer_relevancy,
    track=True,  # This enables automatic tracing in Opik
)

Once the metric wrapper is set up, you can use it to score traces or spans:

from opik import track
from opik.opik_context import update_current_trace

@track
def retrieve_contexts(question):
    # Define the retrieval function; in this case we hard-code the contexts
    return ["Paris is the capital of France.", "Paris is in France."]

@track
def answer_question(question, contexts):
    # Define the answer function; in this case we hard-code the answer
    return "Paris"

@track
def rag_pipeline(question):
    # Define the pipeline
    contexts = retrieve_contexts(question)
    answer = answer_question(question, contexts)

    # Score the pipeline using the RagasMetricWrapper
    score_result = answer_relevancy_metric.score(
        user_input=question,
        response=answer,
        retrieved_contexts=contexts,
    )

    # Add the score to the current trace
    update_current_trace(
        feedback_scores=[{"name": score_result.name, "value": score_result.value}]
    )

    return answer

print(rag_pipeline("What is the capital of France?"))

In the Opik UI, you will be able to see the full trace, including the score calculation.

Comprehensive Example: Dataset Evaluation

For more advanced use cases, you can evaluate entire datasets using Ragas metrics with the Opik evaluation platform:

1. Create a Dataset

from datasets import load_dataset
import opik

opik_client = opik.Opik()

# Create a small dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
hf_dataset = fiqa_eval["baseline"].select(range(3))
dataset_items = hf_dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)
dataset = opik_client.get_or_create_dataset("ragas-demo-dataset")
dataset.insert(dataset_items)

2. Define Evaluation Task

# Create an evaluation task
def evaluation_task(x):
    return {
        "user_input": x["question"],
        "response": x["answer"],
        "retrieved_contexts": x["contexts"],
    }

3. Run Evaluation

# Use the RagasMetricWrapper directly with Opik's evaluate function
opik.evaluation.evaluate(
    dataset,
    evaluation_task,
    scoring_metrics=[answer_relevancy_metric],
    task_threads=1,
)

4. Alternative: Using Ragas Native Evaluation

You can also use Ragas’ native evaluation function with Opik tracing:

from datasets import load_dataset
from opik.integrations.langchain import OpikTracer
from ragas.metrics import context_precision, answer_relevancy, faithfulness
from ragas import evaluate

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

# Reformat the dataset to match the schema expected by the Ragas evaluate function
dataset = fiqa_eval["baseline"].select(range(3))

dataset = dataset.map(
    lambda x: {
        "user_input": x["question"],
        "reference": x["ground_truths"][0],
        "retrieved_contexts": x["contexts"],
    }
)

opik_tracer_eval = OpikTracer(tags=["ragas_eval"], metadata={"evaluation_run": True})

result = evaluate(
    dataset,
    metrics=[context_precision, faithfulness, answer_relevancy],
    callbacks=[opik_tracer_eval],
)

print(result)

Using Ragas metrics to evaluate a RAG pipeline

The RagasMetricWrapper can also be used directly within the Opik evaluation platform. This approach is much simpler than creating custom wrappers:

1. Define the Ragas metric

We will start by defining the Ragas metric; in this example we will use AnswerRelevancy:

from ragas.metrics import AnswerRelevancy
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from opik.evaluation.metrics import RagasMetricWrapper

# Initialize the Ragas metric
llm = LangchainLLMWrapper(ChatOpenAI())
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

ragas_answer_relevancy = AnswerRelevancy(llm=llm, embeddings=emb)

2. Create the metric wrapper

Simply wrap the Ragas metric with RagasMetricWrapper:

# Create the answer relevancy scoring metric
answer_relevancy = RagasMetricWrapper(
    ragas_answer_relevancy,
    track=True,  # Enable tracing for the metric computation
)

If you are running within a Jupyter notebook, you will need to add the following lines to the top of your notebook:

import nest_asyncio
nest_asyncio.apply()

3. Use the metric wrapper within the Opik evaluation platform

You can now use the metric wrapper directly within the Opik evaluation platform:

from opik.evaluation import evaluate

evaluation_results = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[answer_relevancy],
    nb_samples=10,
)

The RagasMetricWrapper automatically handles:

  • Field mapping between Opik and Ragas (e.g., input → user_input, output → response); see the sketch after this list
  • Async execution of Ragas metrics
  • Integration with Opik’s tracing system when track=True
  • Proper error handling for missing required fields
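As a rough illustration of the field mapping above, the sketch below scores the wrapped metric using Opik-style field names. That score() accepts these names directly and translates them to the Ragas names is an assumption inferred from the mapping described in this list; exact keyword handling may differ between opik and ragas versions.

# Sketch only: assumes the wrapper maps Opik-style names (input, output)
# to the Ragas names (user_input, response) when scoring directly.
score_result = answer_relevancy.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    retrieved_contexts=["Paris is the capital of France."],
)
print(score_result.name, score_result.value)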