evaluatePrompt Function
The evaluatePrompt function provides a streamlined way to evaluate prompt templates against a dataset. It automatically formats message templates with dataset variables, generates LLM responses, and evaluates the results using specified metrics.
Overview
evaluatePrompt is a convenience wrapper around the evaluate function that handles:
- Template formatting: Automatically formats message templates with dataset item variables
- Model invocation: Generates LLM responses using your specified model
- Experiment tracking: Creates experiments linked to specific prompt versions
- Metric evaluation: Scores outputs using the specified metrics
This is particularly useful for prompt engineering workflows where you want to quickly test different prompt templates against a dataset.
Function Signature
EvaluatePromptOptions
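A sketch of the signature, reconstructed from the parameters documented below; the SDK's actual typings may differ slightly. Type names such as Dataset, OpikMessage, BaseMetric, Prompt, OpikBaseModel, and EvaluationResult refer to the corresponding Opik SDK exports, and LanguageModel comes from the Vercel AI SDK.

```typescript
// Illustrative sketch reconstructed from the parameter list below.
interface EvaluatePromptOptions {
  dataset: Dataset;                                          // required
  messages: OpikMessage[];                                   // required
  model?: SupportedModelId | LanguageModel | OpikBaseModel;  // default: "gpt-4o"
  templateType?: "mustache" | "jinja2";                      // default: "mustache"
  scoringMetrics?: BaseMetric[];
  experimentName?: string;
  experimentConfig?: Record<string, unknown>;
  prompts?: Prompt[];
  projectName?: string;
  nbSamples?: number;
  scoringKeyMapping?: Record<string, string>;
}

declare function evaluatePrompt(
  options: EvaluatePromptOptions
): Promise<EvaluationResult>;
```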
Parameters
Required Parameters
dataset
- Type: Dataset
- Description: The dataset to evaluate prompts against. Each dataset item will be used to format the message templates and generate responses.
messages
- Type: OpikMessage[]
- Description: Array of message templates with {{placeholders}} that will be formatted with dataset variables.
Optional Parameters
model
- Type: SupportedModelId | LanguageModel | OpikBaseModel
- Default: "gpt-4o"
- Description: The language model to use for generation. Can be:
  - A model ID string (e.g., "gpt-4o", "claude-3-5-sonnet-latest", "gemini-2.0-flash")
  - A pre-configured LanguageModel instance from the Vercel AI SDK
  - A custom OpikBaseModel implementation
templateType
- Type: "mustache" | "jinja2"
- Default: "mustache"
- Description: Template engine to use for variable substitution in message content.
scoringMetrics
- Type: BaseMetric[]
- Description: Array of metrics to evaluate the generated outputs. Can include both heuristic and LLM Judge metrics.
experimentName
- Type: string
- Description: Name for the experiment. If not provided, a name will be auto-generated.
experimentConfig
- Type: Record<string, unknown>
- Description: Additional metadata to store with the experiment. The function automatically adds prompt_template and model to this configuration.
prompts
- Type: Prompt[]
- Description: Array of Opik Prompt objects to link to this experiment. Useful for tracking which prompt versions were used.
projectName
- Type: string
- Description: Name of the Opik project to log traces to.
nbSamples
- Type: number
- Description: Maximum number of dataset items to evaluate. Useful for quick testing.
scoringKeyMapping
- Type: Record<string, string>
- Description: Maps metric parameter names to dataset/output field names when they don't match.
Return Value
Returns a Promise that resolves to an EvaluationResult with the results of the evaluation run (the same result type returned by the evaluate function).
Examples
Basic Usage
Simple prompt evaluation with default settings:
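A minimal sketch of the default flow. The opik import path and the getOrCreateDataset client method are assumptions here; adjust them to match your SDK version.

```typescript
import { Opik, evaluatePrompt } from "opik"; // import path assumed

const client = new Opik();
// getOrCreateDataset is assumed; see the Datasets doc for the exact API
const dataset = await client.getOrCreateDataset("customer-questions");

// Uses the defaults: model "gpt-4o" and mustache templating
const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Answer the question: {{question}}" }],
});

console.log(result);
```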
With Scoring Metrics
Evaluate prompts with automatic scoring:
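A sketch that adds automatic scoring, reusing the dataset from the Basic Usage example. The metric class names (ExactMatch, Contains) are assumptions; substitute any metrics from the Metrics reference.

```typescript
import { evaluatePrompt, ExactMatch, Contains } from "opik"; // metric exports assumed

const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Answer concisely: {{question}}" }],
  scoringMetrics: [new ExactMatch(), new Contains()],
  experimentName: "qa-with-heuristic-metrics",
});
```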
Using LanguageModel Instances
Use LanguageModel instances for provider-specific features:
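A sketch that passes a pre-configured LanguageModel instead of a model ID string. The @ai-sdk/openai provider package is an assumption about which Vercel AI SDK provider you use.

```typescript
import { openai } from "@ai-sdk/openai"; // Vercel AI SDK provider package
import { evaluatePrompt } from "opik";

const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Summarize: {{text}}" }],
  model: openai("gpt-4o"), // pre-configured LanguageModel instance
});
```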
Multi-Provider Model Support
The function supports models from multiple providers:
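A sketch using the model ID strings mentioned above, reusing the dataset from Basic Usage; availability depends on which provider credentials you have configured (see the Models doc).

```typescript
const messages = [{ role: "user", content: "{{question}}" }];

// OpenAI
await evaluatePrompt({ dataset, messages, model: "gpt-4o" });
// Anthropic
await evaluatePrompt({ dataset, messages, model: "claude-3-5-sonnet-latest" });
// Google
await evaluatePrompt({ dataset, messages, model: "gemini-2.0-flash" });
```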
Linking to Prompt Versions
Track which prompt versions are used in evaluations:
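A sketch linking the experiment to a stored prompt version, reusing the client and dataset from Basic Usage. The createPrompt method and its argument shape are assumptions; see the Prompts doc for the exact API.

```typescript
const template = "Answer the question: {{question}}";

// createPrompt is assumed; see the Prompts doc for the exact API
const prompt = await client.createPrompt({ name: "qa-prompt", prompt: template });

const result = await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: template }],
  prompts: [prompt], // links this experiment to the prompt version used
});
```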
Template Types
Mustache Templates (Default)
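A sketch of mustache-style substitution, where each {{variable}} is replaced with the matching dataset item field (field names here are placeholders for your own dataset).

```typescript
await evaluatePrompt({
  dataset,
  messages: [
    { role: "system", content: "You are a helpful assistant for {{product}}." },
    { role: "user", content: "{{question}}" },
  ],
  templateType: "mustache", // optional; mustache is the default
});
```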
Jinja2 Templates
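A sketch using jinja2 templating, which also supports control flow such as conditionals; the exact subset of jinja2 syntax supported may vary by SDK version.

```typescript
await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "user",
      content:
        "{% if context %}Context: {{ context }}\n{% endif %}Question: {{ question }}",
    },
  ],
  templateType: "jinja2",
});
```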
Scoring Key Mapping
Map dataset fields to metric parameter names:
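A sketch of remapping field names, reusing the dataset from Basic Usage. The ExactMatch metric and the parameter names it expects are assumptions; check the Metrics reference for your metric's inputs.

```typescript
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "Answer: {{question}}" }],
  scoringMetrics: [new ExactMatch()], // metric and its parameter names assumed
  scoringKeyMapping: {
    // metric parameter name -> dataset/output field name
    expected: "reference_answer",
  },
});
```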
Subset Evaluation
Evaluate only a subset of the dataset for quick iteration:
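A sketch limiting the run with nbSamples, reusing the dataset from Basic Usage.

```typescript
await evaluatePrompt({
  dataset,
  messages: [{ role: "user", content: "{{question}}" }],
  nbSamples: 10, // evaluate at most 10 dataset items
});
```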
How It Works
When you call evaluatePrompt, the following happens:
1. Template Formatting: For each dataset item, message templates are formatted with item variables
2. Model Invocation: The formatted messages are sent to the specified model to generate a response
3. Experiment Creation: An experiment is created (or updated) with metadata
4. Metric Scoring: If metrics are provided, each output is scored
5. Result Aggregation: Results are collected and returned
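Conceptually, this is similar to calling evaluate with a task that performs the formatting and generation. The sketch below is illustrative only, not the actual implementation; it assumes the evaluate API described in the evaluate Function doc, and renderTemplate and generateWithModel are hypothetical helpers.

```typescript
import { evaluate } from "opik"; // import path assumed

const result = await evaluate({
  dataset,
  task: async (item) => {
    // 1. Template formatting: substitute item variables into each message
    const formatted = messages.map((m) => ({
      ...m,
      content: renderTemplate(m.content, item), // hypothetical helper
    }));
    // 2. Model invocation: generate a response for the formatted messages
    const output = await generateWithModel(model, formatted); // hypothetical helper
    return { input: formatted, output };
  },
  scoringMetrics, // 4. Metric scoring
  experimentName, // 3. Experiment creation with enriched config
});
// 5. Result aggregation: `result` collects the per-item outputs and scores
```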
Experiment Configuration
The function automatically enriches experiment configuration with:
- prompt_template: The message templates used
- model: The model identifier (name or type)
You can add additional metadata via experimentConfig:
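For example (the metadata keys here are arbitrary, and dataset and messages are reused from the earlier examples):

```typescript
await evaluatePrompt({
  dataset,
  messages,
  experimentConfig: {
    // your own metadata; prompt_template and model are added automatically
    temperature: 0.2,
    team: "search-quality",
    change: "shortened system prompt",
  },
});
```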
Best Practices
1. Start Simple, Then Add Metrics
Begin with basic prompt evaluation (as in the Basic Usage example above), then add scoringMetrics once the template is stable.
2. Use Descriptive Experiment Names
Make it easy to find and compare experiments:
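For example (the naming scheme is just a suggestion):

```typescript
await evaluatePrompt({
  dataset,
  messages,
  // encode what changed and which model was used so runs are easy to compare
  experimentName: "support-bot-gpt-4o-shorter-system-prompt",
  model: "gpt-4o",
});
```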
3. Version Your Prompts
Link evaluations to prompt versions via the prompts parameter for better tracking, as shown in Linking to Prompt Versions above.
4. Start with Small Samples
Use nbSamples for quick iteration before running the full dataset, as shown in Subset Evaluation above.
5. Include Context in System Messages
Structure your prompts with clear system messages:
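For example (field names such as product, context, and question are placeholders for your own dataset fields):

```typescript
await evaluatePrompt({
  dataset,
  messages: [
    {
      role: "system",
      content:
        "You are a support agent for {{product}}. Answer only from the provided context.",
    },
    { role: "user", content: "Context: {{context}}\n\nQuestion: {{question}}" },
  ],
});
```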
Error Handling
The function validates inputs and throws errors for common issues:
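A sketch of catching a validation error; the exact error types and messages are not specified here.

```typescript
try {
  await evaluatePrompt({
    dataset,
    messages: [], // invalid: empty messages array
  });
} catch (err) {
  console.error("evaluatePrompt failed:", err);
}
```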
Common validation errors:
- Missing required dataset parameter
- Empty messages array
- Invalid experimentConfig (must be a plain object)
- Invalid templateType (must be "mustache" or "jinja2")
See Also
- evaluate Function - For evaluating custom tasks
- Datasets - Working with evaluation datasets
- Metrics - Available evaluation metrics
- Models - Model configuration and usage
- Prompts - Managing prompt templates