evaluatePrompt Function

The evaluatePrompt function provides a streamlined way to evaluate prompt templates against a dataset. It automatically formats message templates with dataset variables, generates LLM responses, and evaluates the results using specified metrics.

Overview

evaluatePrompt is a convenience wrapper around the evaluate function that handles:

  • Template formatting: Automatically formats message templates with dataset item variables
  • Model invocation: Generates LLM responses using your specified model
  • Experiment tracking: Creates experiments linked to specific prompt versions
  • Metric evaluation: Scores outputs using the specified metrics

This is particularly useful for prompt engineering workflows where you want to quickly test different prompt templates against a dataset.
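
At a glance, a call looks like the minimal sketch below (the dataset name and the question placeholder are illustrative only; each parameter is described in detail in the sections that follow):

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset({ name: "my-dataset" });

// Each dataset item's variables (e.g. `question`) fill the {{placeholders}}
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  model: "gpt-4o",
});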

Function Signature

function evaluatePrompt(
  options: EvaluatePromptOptions
): Promise<EvaluationResult>;

EvaluatePromptOptions

interface EvaluatePromptOptions extends Omit<EvaluateOptions, "task"> {
  // Required parameters
  dataset: Dataset;
  messages: OpikMessage[];

  // Optional parameters
  model?: SupportedModelId | LanguageModel | OpikBaseModel;
  templateType?: "mustache" | "jinja2";
  scoringMetrics?: BaseMetric[];
  experimentName?: string;
  experimentConfig?: Record<string, unknown>;
  prompts?: Prompt[];
  projectName?: string;
  nbSamples?: number;
  scoringKeyMapping?: Record<string, string>;
}

Parameters

Required Parameters

dataset

  • Type: Dataset
  • Description: The dataset to evaluate prompts against. Each dataset item will be used to format the message templates and generate responses.
const dataset = await client.getOrCreateDataset({
  name: "my-dataset",
});

messages

  • Type: OpikMessage[]
  • Description: Array of message templates with {{placeholders}} that will be formatted with dataset variables.
messages: [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Translate to {{language}}: {{text}}" },
];

Optional Parameters

model

  • Type: SupportedModelId | LanguageModel | OpikBaseModel
  • Default: "gpt-4o"
  • Description: The language model to use for generation. Can be:
    • Model ID string (e.g., "gpt-4o", "claude-3-5-sonnet-latest", "gemini-2.0-flash")
    • Pre-configured LanguageModel instance from Vercel AI SDK
    • Custom OpikBaseModel implementation
// Using a model ID string
model: "gpt-4o";

// Using a LanguageModel instance
import { openai } from "@ai-sdk/openai";
const customModel = openai("gpt-4o");
model: customModel;

templateType

  • Type: "mustache" | "jinja2"
  • Default: "mustache"
  • Description: Template engine to use for variable substitution in message content.
// Mustache syntax (default)
templateType: "mustache";
messages: [{ role: "user", content: "Hello {{name}}" }];

// Jinja2 syntax
templateType: "jinja2";
messages: [{ role: "user", content: "Hello {{ name }}" }];

scoringMetrics

  • Type: BaseMetric[]
  • Description: Array of metrics to evaluate the generated outputs. Can include both heuristic and LLM Judge metrics.
import { ExactMatch, Hallucination } from "opik";

scoringMetrics: [new ExactMatch(), new Hallucination()];

experimentName

  • Type: string
  • Description: Name for the experiment. If not provided, a name will be auto-generated.
experimentName: "Prompt Evaluation - Translation Task";

experimentConfig

  • Type: Record<string, unknown>
  • Description: Additional metadata to store with the experiment. The function automatically adds prompt_template and model to this configuration.
experimentConfig: {
  temperature: 0.7,
  max_tokens: 1000,
  version: "v2",
};

prompts

  • Type: Prompt[]
  • Description: Array of Opik Prompt objects to link to this experiment. Useful for tracking which prompt versions were used.
const prompt = await client.createPrompt({
  name: "translation-prompt",
  prompt: "Translate to {{language}}: {{text}}",
});

prompts: [prompt];

projectName

  • Type: string
  • Description: Name of the Opik project to log traces to.
projectName: "prompt-engineering";

nbSamples

  • Type: number
  • Description: Maximum number of dataset items to evaluate. Useful for quick testing.
nbSamples: 10; // Only evaluate the first 10 items

scoringKeyMapping

  • Type: Record<string, string>
  • Description: Maps metric parameter names to dataset/output field names when they don’t match.
scoringKeyMapping: {
  input: "question", // Map the 'input' param to the 'question' field
  expected: "reference_answer", // Map the 'expected' param to the 'reference_answer' field
};

Return Value

Returns a Promise<EvaluationResult> containing:

interface EvaluationResult {
  experimentId: string; // ID of the created experiment
  experimentName: string; // Name of the experiment
  testResults: TestResult[]; // Results for each dataset item
}
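
For example, you might loop over testResults to print each item's scores. This is a sketch only: it assumes dataset is in scope from the earlier examples, and that each TestResult exposes its metric scores via a scoreResults array with name and value fields; check the SDK's exported types for the exact shape.

import { evaluatePrompt, ExactMatch } from "opik";

const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  scoringMetrics: [new ExactMatch()],
});

// Assumed shape: each TestResult exposes its metric scores via `scoreResults`
for (const testResult of result.testResults) {
  for (const score of testResult.scoreResults ?? []) {
    console.log(`${score.name}: ${score.value}`);
  }
}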

Examples

Basic Usage

Simple prompt evaluation with default settings:

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset({ name: "qa-dataset" });

await dataset.insert([
  {
    question: "What is the capital of France?",
    expected_answer: "Paris",
  },
  {
    question: "How do you calculate the area of a circle?",
    expected_answer: "π × radius²",
  },
]);

const result = await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "system",
      content:
        "You are a helpful assistant. Answer questions accurately and concisely.",
    },
    { role: "user", content: "{{question}}" },
  ],
  model: "gpt-4o",
});

console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Evaluated ${result.testResults.length} items`);

With Scoring Metrics

Evaluate prompts with automatic scoring:

import { evaluatePrompt, Hallucination, ExactMatch } from "opik";

// Create a dataset with expected answers
const dataset = await client.getOrCreateDataset({ name: "geography-qa" });
await dataset.insert([
  {
    country: "France",
    expected_answer: "Paris",
  },
  {
    country: "Japan",
    expected_answer: "Tokyo",
  },
]);

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "What is the capital of {{country}}?",
    },
  ],
  model: "gpt-4o",
  scoringMetrics: [
    new ExactMatch(), // Check exact match with expected output
    new Hallucination(), // Check for hallucinations
  ],
  experimentName: "Geography Quiz Evaluation",
});

Using LanguageModel Instances

Use LanguageModel instances for provider-specific features:

import { openai } from "@ai-sdk/openai";
import { evaluatePrompt } from "opik";

// Create a model instance
const customModel = openai("gpt-4o");

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Summarize: {{text}}" }],
  model: customModel,
  experimentConfig: {
    model_provider: "openai",
    model_name: "gpt-4o",
  },
});

Multi-Provider Model Support

The function supports models from multiple providers:

// OpenAI
model: "gpt-4o";

// Anthropic
model: "claude-3-5-sonnet-latest";

// Google Gemini
model: "gemini-2.0-flash";

// Or use provider-specific LanguageModel instances
import { anthropic } from "@ai-sdk/anthropic";
const claude = anthropic("claude-3-5-sonnet-latest");
model: claude;

Linking to Prompt Versions

Track which prompt versions are used in evaluations:

import { Opik, evaluatePrompt } from "opik";

const client = new Opik();

// Create or get a prompt
const prompt = await client.createPrompt({
  name: "customer-support-prompt",
  prompt: "{{system_message}}\n\nUser: {{user_query}}",
});

// Link the prompt to the evaluation
await evaluatePrompt({
  dataset,
  messages: [
    { role: "system", content: "{{system_message}}" },
    { role: "user", content: "{{user_query}}" },
  ],
  model: "gpt-4o",
  prompts: [prompt], // Link to prompt
  experimentName: "Customer Support - v2.1",
});

Template Types

Mustache Templates (Default)

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "Hello {{name}}, your order #{{order_id}} is ready.",
    },
  ],
  templateType: "mustache", // This is the default
});

Jinja2 Templates

await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content: "Hello {{ name }}, your order #{{ order_id }} is ready.",
    },
  ],
  templateType: "jinja2",
});

Scoring Key Mapping

Map metric parameter names to dataset field names when they don't match:

// Dataset has: { question: "...", reference_answer: "..." }
// Metric expects: { input: "...", expected: "..." }

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  scoringMetrics: [new ExactMatch()],
  scoringKeyMapping: {
    input: "question",
    expected: "reference_answer",
  },
});

Subset Evaluation

Evaluate only a subset of the dataset for quick iteration:

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{prompt}}" }],
  nbSamples: 5, // Only evaluate the first 5 items
  experimentName: "Quick Test",
});

How It Works

When you call evaluatePrompt, the following happens (a simplified sketch follows this list):

  1. Template Formatting: For each dataset item, message templates are formatted with item variables
  2. Model Invocation: The formatted messages are sent to the specified model to generate a response
  3. Experiment Creation: An experiment is created (or updated) with metadata
  4. Metric Scoring: If metrics are provided, each output is scored
  5. Result Aggregation: Results are collected and returned
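
Conceptually, the per-item flow looks roughly like the sketch below. This is an illustration only, not the actual implementation: formatTemplate, generateResponse, and scoreOutput are hypothetical stand-ins for SDK internals, and getItems is an assumed Dataset accessor.

// Hypothetical stand-ins for SDK internals (declared so the sketch type-checks):
declare function formatTemplate(template: string, vars: Record<string, unknown>, engine: "mustache" | "jinja2"): string;
declare function generateResponse(model: unknown, messages: { role: string; content: string }[]): Promise<string>;
declare function scoreOutput(metric: unknown, item: Record<string, unknown>, output: string, mapping?: Record<string, string>): { name: string; value: number };

// Simplified conceptual sketch of what evaluatePrompt does (illustration only)
async function evaluatePromptSketch(options: EvaluatePromptOptions) {
  const items = await options.dataset.getItems(); // assumed Dataset accessor
  const limit = options.nbSamples ?? items.length;
  const testResults = [];

  for (const item of items.slice(0, limit)) {
    // 1. Template formatting: fill {{placeholders}} with the item's variables
    const messages = options.messages.map((m) => ({
      ...m,
      content: formatTemplate(m.content, item, options.templateType ?? "mustache"),
    }));

    // 2. Model invocation: send the formatted messages to the model (default "gpt-4o")
    const output = await generateResponse(options.model ?? "gpt-4o", messages);

    // 4. Metric scoring: score the output with each configured metric
    const scores = (options.scoringMetrics ?? []).map((metric) =>
      scoreOutput(metric, item, output, options.scoringKeyMapping)
    );

    testResults.push({ item, output, scores });
  }

  // 3 & 5. In the real function, results are attached to the created experiment
  // and returned as an EvaluationResult.
  return testResults;
}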

Experiment Configuration

The function automatically enriches experiment configuration with:

  • prompt_template: The message templates used
  • model: The model identifier (name or type)

You can add additional metadata via experimentConfig:

experimentConfig: {
  // Auto-added by evaluatePrompt:
  // prompt_template: [{ role: 'user', content: '...' }]
  // model: 'gpt-4o'

  // Your custom metadata:
  temperature: 0.7,
  version: "v2.0",
  author: "team-ai",
  description: "Testing improved prompt structure",
};

Best Practices

1. Start Simple, Then Add Metrics

Begin with basic prompt evaluation, then add metrics as needed:

// Step 1: Basic evaluation to see outputs
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
});

// Step 2: Add metrics after reviewing outputs
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  scoringMetrics: [new Hallucination()],
});

2. Use Descriptive Experiment Names

Make it easy to find and compare experiments:

experimentName: "Translation - GPT-4o - v2.3 - 2025-01-15";

3. Version Your Prompts

Link evaluations to prompt versions for better tracking:

const prompt = await client.createPrompt({
  name: "qa-prompt",
  prompt: "Answer: {{question}}",
  version: "v2.3",
});

await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Answer: {{question}}" }],
  prompts: [prompt],
});

4. Start with Small Samples

Use nbSamples for quick iteration before full evaluation:

// Quick test with 10 samples
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  nbSamples: 10,
});

// Full evaluation once satisfied
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{input}}" }],
  // Evaluate the entire dataset
});

5. Include Context in System Messages

Structure your prompts with clear system messages:

messages: [
  {
    role: "system",
    content:
      "You are an expert {{domain}} assistant. Provide accurate, concise answers.",
  },
  {
    role: "user",
    content: "{{question}}",
  },
];

Error Handling

The function validates inputs and throws errors for common issues:

try {
  await evaluatePrompt({
    dataset,
    messages: [],
  });
} catch (error) {
  console.error(error.message);
  // Error: Messages array is required and cannot be empty
}

Common validation errors:

  • Missing required dataset parameter
  • Empty messages array
  • Invalid experimentConfig (must be a plain object)
  • Invalid templateType (must be "mustache" or "jinja2")

See Also