evaluatePrompt Function

The evaluatePrompt function provides a streamlined way to evaluate prompt templates against a dataset. It automatically formats message templates with dataset variables, generates LLM responses, and evaluates the results using specified metrics.

Overview

evaluatePrompt is a convenience wrapper around the evaluate function that handles:

  • Template formatting: Automatically formats message templates with dataset item variables
  • Model invocation: Generates LLM responses using your specified model
  • Experiment tracking: Creates experiments linked to specific prompt versions
  • Metric evaluation: Scores outputs using the specified metrics

This is particularly useful for prompt engineering workflows where you want to quickly test different prompt templates against a dataset.
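
At a glance, a call looks like the minimal sketch below (the dataset name and the question placeholder are illustrative only; each parameter is described in detail in the sections that follow):

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset({ name: "my-dataset" });

// Each dataset item's variables (e.g. `question`) fill the {{placeholders}}
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  model: "gpt-4o",
});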

Function Signature

function evaluatePrompt(
  options: EvaluatePromptOptions
): Promise<EvaluationResult>;

EvaluatePromptOptions

interface EvaluatePromptOptions extends Omit<EvaluateOptions, "task"> {
  // Required parameters
  dataset: Dataset;
  messages: OpikMessage[];

  // Optional parameters
  model?: SupportedModelId | LanguageModel | OpikBaseModel;
  templateType?: "mustache" | "jinja2";
  scoringMetrics?: BaseMetric[];
  experimentName?: string;
  experimentConfig?: Record<string, unknown>;
  prompts?: Prompt[];
  projectName?: string;
  nbSamples?: number;
  scoringKeyMapping?: Record<string, string>;
}

Parameters

Required Parameters

dataset

  • Type: Dataset
  • Description: The dataset to evaluate prompts against. Each dataset item will be used to format the message templates and generate responses.
const dataset = await client.getOrCreateDataset({
  name: "my-dataset",
});

messages

  • Type: OpikMessage[]
  • Description: Array of message templates with {{placeholders}} that will be formatted with dataset variables.
messages: [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Translate to {{language}}: {{text}}" },
];

Optional Parameters

model

  • Type: SupportedModelId | LanguageModel | OpikBaseModel
  • Default: "gpt-4o"
  • Description: The language model to use for generation. Can be:
    • Model ID string (e.g., "gpt-4o", "claude-3-5-sonnet-latest", "gemini-2.0-flash")
    • Pre-configured LanguageModel instance from Vercel AI SDK
    • Custom OpikBaseModel implementation
// Using a model ID string
model: "gpt-4o";

// Using a LanguageModel instance
import { openai } from "@ai-sdk/openai";
const customModel = openai("gpt-4o");
model: customModel;

templateType

  • Type: "mustache" | "jinja2"
  • Default: "mustache"
  • Description: Template engine to use for variable substitution in message content.
// Mustache syntax (default)
templateType: "mustache";
messages: [{ role: "user", content: "Hello {{name}}" }];

// Jinja2 syntax
templateType: "jinja2";
messages: [{ role: "user", content: "Hello {{ name }}" }];

scoringMetrics

  • Type: BaseMetric[]
  • Description: Array of metrics to evaluate the generated outputs. Can include both heuristic and LLM Judge metrics.
import { ExactMatch, Hallucination } from "opik";

scoringMetrics: [new ExactMatch(), new Hallucination()];

experimentName

  • Type: string
  • Description: Name for the experiment. If not provided, a name will be auto-generated.
experimentName: "Prompt Evaluation - Translation Task";

experimentConfig

  • Type: Record<string, unknown>
  • Description: Additional metadata to store with the experiment. The function automatically adds prompt_template and model to this configuration.
experimentConfig: {
  temperature: 0.7,
  max_tokens: 1000,
  version: "v2",
};

prompts

  • Type: Prompt[]
  • Description: Array of Opik Prompt objects to link to this experiment. Useful for tracking which prompt versions were used.
const prompt = await client.createPrompt({
  name: "translation-prompt",
  prompt: "Translate to {{language}}: {{text}}",
});

prompts: [prompt];

projectName

  • Type: string
  • Description: Name of the Opik project to log traces to.
projectName: "prompt-engineering";

nbSamples

  • Type: number
  • Description: Maximum number of dataset items to evaluate. Useful for quick testing.
nbSamples: 10; // Only evaluate the first 10 items

scoringKeyMapping

  • Type: Record<string, string>
  • Description: Maps metric parameter names to dataset/output field names when they don’t match.
scoringKeyMapping: {
  input: "question", // Map the 'input' param to the 'question' field
  expected: "reference_answer", // Map the 'expected' param to the 'reference_answer' field
};

Return Value

Returns a Promise<EvaluationResult> containing:

interface EvaluationResult {
  experimentId: string; // ID of the created experiment
  experimentName: string; // Name of the experiment
  testResults: TestResult[]; // Results for each dataset item
}
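
For example, you might loop over testResults to print each item's scores. This is a sketch only: it assumes dataset is in scope from the earlier examples, and that each TestResult exposes its metric scores via a scoreResults array with name and value fields; check the SDK's exported types for the exact shape.

import { evaluatePrompt, ExactMatch } from "opik";

const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  scoringMetrics: [new ExactMatch()],
});

// Assumed shape: each TestResult exposes its metric scores via `scoreResults`
for (const testResult of result.testResults) {
  for (const score of testResult.scoreResults ?? []) {
    console.log(`${score.name}: ${score.value}`);
  }
}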

Examples

Basic Usage

Simple prompt evaluation with default settings:

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset({ name: "qa-dataset" });

await dataset.insert([
  {
    question: "What is the capital of France?",
    expected_answer: "Paris",
  },
  {
    question: "How do you calculate the area of a circle?",
    expected_answer: "π × radius²",
  },
]);

const result = await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "system",
      content:
        "You are a helpful assistant. Answer questions accurately and concisely.",
    },
    { role: "user", content: "{{question}}" },
  ],
  model: "gpt-4o",
});

console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Evaluated ${result.testResults.length} items`);

With Scoring Metrics

Evaluate prompts with automatic scoring:

import { evaluatePrompt, Hallucination, ExactMatch } from "opik";

// Create a dataset with expected answers
const dataset = await client.getOrCreateDataset({ name: "geography-qa" });
await dataset.insert([
  {
    country: "France",
    expected_answer: "Paris",
  },
  {
    country: "Japan",
    expected_answer: "Tokyo",
  },
]);

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "What is the capital of {{country}}?",
    },
  ],
  model: "gpt-4o",
  scoringMetrics: [
    new ExactMatch(), // Check exact match with expected output
    new Hallucination(), // Check for hallucinations
  ],
  experimentName: "Geography Quiz Evaluation",
});

Using LanguageModel Instances

Use LanguageModel instances for provider-specific features:

import { openai } from "@ai-sdk/openai";
import { evaluatePrompt } from "opik";

// Create a model instance
const customModel = openai("gpt-4o");

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Summarize: {{text}}" }],
  model: customModel,
  experimentConfig: {
    model_provider: "openai",
    model_name: "gpt-4o",
  },
});

Multi-Provider Model Support

The function supports models from multiple providers:

// OpenAI
model: "gpt-4o";

// Anthropic
model: "claude-3-5-sonnet-latest";

// Google Gemini
model: "gemini-2.0-flash";

// Or use provider-specific LanguageModel instances
import { anthropic } from "@ai-sdk/anthropic";
const claude = anthropic("claude-3-5-sonnet-latest");
model: claude;

Linking to Prompt Versions

Track which prompt versions are used in evaluations:

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();

// Create or get a prompt
const prompt = await client.createPrompt({
  name: "customer-support-prompt",
  prompt: "{{system_message}}\n\nUser: {{user_query}}",
});

// Link the prompt to the evaluation
await evaluatePrompt({
  dataset,
  messages: [
    { role: "system", content: "{{system_message}}" },
    { role: "user", content: "{{user_query}}" },
  ],
  model: "gpt-4o",
  prompts: [prompt], // Link to prompt
  experimentName: "Customer Support - v2.1",
});

Template Types

Mustache Templates (Default)

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "Hello {{name}}, your order #{{order_id}} is ready.",
    },
  ],
  templateType: "mustache", // This is the default
});

Jinja2 Templates

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "Hello {{ name }}, your order #{{ order_id }} is ready.",
    },
  ],
  templateType: "jinja2",
});

Scoring Key Mapping

Map metric parameter names to dataset field names when they don't match:

// Dataset has: { question: "...", reference_answer: "..." }
// Metric expects: { input: "...", expected: "..." }

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    input: "question",
    expected: "reference_answer",
  },
});

Subset Evaluation

Evaluate only a subset of the dataset for quick iteration:

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{prompt}}" }],
  nbSamples: 5, // Only evaluate the first 5 items
  experimentName: "Quick Test",
});

How It Works

When you call evaluatePrompt, the following happens (a simplified sketch follows this list):

  1. Template Formatting: For each dataset item, message templates are formatted with item variables
  2. Model Invocation: The formatted messages are sent to the specified model to generate a response
  3. Experiment Creation: An experiment is created (or updated) with metadata
  4. Metric Scoring: If metrics are provided, each output is scored
  5. Result Aggregation: Results are collected and returned
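
Conceptually, the per-item flow looks roughly like the sketch below. This is an illustration only, not the actual implementation: formatTemplate, generateResponse, and scoreOutput are hypothetical stand-ins for SDK internals, and getItems is an assumed Dataset accessor.

// Hypothetical stand-ins for SDK internals (declared so the sketch type-checks):
declare function formatTemplate(template: string, vars: Record<string, unknown>, engine: "mustache" | "jinja2"): string;
declare function generateResponse(model: unknown, messages: { role: string; content: string }[]): Promise<string>;
declare function scoreOutput(metric: unknown, item: Record<string, unknown>, output: string, mapping?: Record<string, string>): { name: string; value: number };

// Simplified conceptual sketch of what evaluatePrompt does (illustration only)
async function evaluatePromptSketch(options: EvaluatePromptOptions) {
  const items = await options.dataset.getItems(); // assumed Dataset accessor
  const limit = options.nbSamples ?? items.length;
  const testResults = [];

  for (const item of items.slice(0, limit)) {
    // 1. Template formatting: fill {{placeholders}} with the item's variables
    const messages = options.messages.map((m) => ({
      ...m,
      content: formatTemplate(m.content, item, options.templateType ?? "mustache"),
    }));

    // 2. Model invocation: send the formatted messages to the model (default "gpt-4o")
    const output = await generateResponse(options.model ?? "gpt-4o", messages);

    // 4. Metric scoring: score the output with each configured metric
    const scores = (options.scoringMetrics ?? []).map((metric) =>
      scoreOutput(metric, item, output, options.scoringKeyMapping)
    );

    testResults.push({ item, output, scores });
  }

  // 3 & 5. In the real function, results are attached to the created experiment
  // and returned as an EvaluationResult.
  return testResults;
}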

Experiment Configuration

The function automatically enriches experiment configuration with:

  • prompt_template: The message templates used
  • model: The model identifier (name or type)

You can add additional metadata via experimentConfig:

experimentConfig: {
  // Auto-added by evaluatePrompt:
  // prompt_template: [{ role: 'user', content: '...' }]
  // model: 'gpt-4o'

  // Your custom metadata:
  temperature: 0.7,
  version: "v2.0",
  author: "team-ai",
  description: "Testing improved prompt structure",
};

Best Practices

1. Start Simple, Then Add Metrics

Begin with basic prompt evaluation, then add metrics as needed:

// Step 1: Basic evaluation to see outputs
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
});

// Step 2: Add metrics after reviewing outputs
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  scoringMetrics: [new Hallucination()],
});

2. Use Descriptive Experiment Names

Make it easy to find and compare experiments:

experimentName: "Translation - GPT-4o - v2.3 - 2025-01-15";

3. Version Your Prompts

Link evaluations to prompt versions for better tracking:

const prompt = await client.createPrompt({
  name: "qa-prompt",
  prompt: "Answer: {{question}}",
  version: "v2.3",
});

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Answer: {{question}}" }],
  prompts: [prompt],
});

4. Start with Small Samples

Use nbSamples for quick iteration before full evaluation:

// Quick test with 10 samples
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  nbSamples: 10,
});

// Full evaluation once satisfied
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  // Evaluate the entire dataset
});

5. Include Context in System Messages

Structure your prompts with clear system messages:

messages: [
  {
    role: "system",
    content:
      "You are an expert {{domain}} assistant. Provide accurate, concise answers.",
  },
  {
    role: "user",
    content: "{{question}}",
  },
];

Error Handling

The function validates inputs and throws errors for common issues:

try {
  await evaluatePrompt({
    dataset,
    messages: [],
  });
} catch (error) {
  console.error(error.message);
  // Error: Messages array is required and cannot be empty
}

Common validation errors:

  • Missing required dataset parameter
  • Empty messages array
  • Invalid experimentConfig (must be a plain object)
  • Invalid templateType (must be "mustache" or "jinja2")

See Also