Evaluate Function

The `evaluate` function runs evaluations of LLM tasks against datasets, scoring the results with customizable metrics.

```typescript
async function evaluate(options: EvaluateOptions): Promise<EvaluationResult>;
```

Parameters

The function accepts a single `options` parameter of type `EvaluateOptions`, which contains the following properties:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dataset` | `Dataset \| DatasetVersion` | Yes | The dataset or dataset version to evaluate against. Use `DatasetVersion` for reproducible evaluations pinned to a specific snapshot. |
| `task` | `EvaluationTask` | Yes | The LLM task to perform |
| `scoringMetrics` | `BaseMetric[]` | No | Array of metrics to evaluate model performance (e.g., accuracy, F1 score) |
| `experimentName` | `string` | No | Name for this evaluation experiment, used for tracking and reporting |
| `projectName` | `string` | No | Project identifier to associate this experiment with |
| `experimentConfig` | `Record<string, unknown>` | No | Configuration settings for the experiment as key-value pairs |
| `nbSamples` | `number` | No | Number of samples to evaluate from the dataset (defaults to all if not specified) |
| `client` | `OpikClient` | No | Opik client instance to use for tracking |
| `scoringKeyMapping` | `ScoringKeyMappingType` | No | Mapping between dataset keys and scoring metric inputs |
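
The `scoringKeyMapping` option renames fields before they reach a metric: each key is an input name the metric expects, and each value names the field in the combined dataset-item/task-output record to read it from. Below is a minimal, SDK-independent sketch of that remapping idea; the `applyMapping` helper and the local `ScoringKeyMapping` alias are illustrative, not part of the `opik` package.

```typescript
// Illustrative alias for a mapping of metric input name -> source field name.
type ScoringKeyMapping = Record<string, string>;

// Copy each mapped source field onto the key the metric expects,
// leaving all existing fields in place.
function applyMapping(
  data: Record<string, unknown>,
  mapping: ScoringKeyMapping
): Record<string, unknown> {
  const remapped: Record<string, unknown> = { ...data };
  for (const [metricKey, sourceKey] of Object.entries(mapping)) {
    remapped[metricKey] = data[sourceKey];
  }
  return remapped;
}

// ExactMatch compares `output` against `expected`; the mapping supplies
// `expected` from the dataset's `expected_output` field.
const scoringInputs = applyMapping(
  { output: "4", expected_output: "4" },
  { expected: "expected_output" }
);
```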

Returns

The function returns a `Promise` that resolves to an `EvaluationResult` object containing:

  • Aggregated scores across all evaluated samples
  • Individual sample results
  • Execution metadata

Example Usage

```typescript
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Initialize clients
const openai = new OpenAI();
const opik = new Opik();

// Define dataset item type
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Define the LLM task
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await openai.responses.create({
    model: "gpt-5-nano",
    instructions: "You are a coding assistant",
    input,
  });

  return { output: response.output_text };
};

async function runEvaluation() {
  // Get or create the dataset
  const dataset = await opik.getOrCreateDataset<DatasetItem>("example-dataset");

  // Run the evaluation
  const result = await evaluate({
    dataset,
    task: llmTask,
    scoringMetrics: [new ExactMatch()],
    experimentName: "Example Evaluation",

    // Map the dataset item data to the inputs expected by the metric
    scoringKeyMapping: {
      expected: "expected_output",
    },
  });

  console.log(result);
}

runEvaluation();
```

Evaluating a Specific Version

For reproducible evaluations, pass a `DatasetVersion` instead of a `Dataset`:

```typescript
// Get a specific version for a reproducible evaluation
const dataset = await opik.getOrCreateDataset<DatasetItem>("example-dataset");
const v2 = await dataset.getVersionView("v2");

const result = await evaluate({
  dataset: v2,
  task: llmTask,
  scoringMetrics: [new ExactMatch()],
  experimentName: "Pinned to v2",
});
// The experiment is linked to version v2, not to the latest version
```

Notes

  • The function automatically creates an experiment in Opik for tracking and analysis
  • If no `client` is provided, the global Opik client instance is used
  • Provide type parameters to properly type your dataset and task inputs/outputs
  • Errors during evaluation are logged and re-thrown
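
Beyond the built-in metrics such as `ExactMatch`, custom metrics can be defined by extending `BaseMetric` and returning an `EvaluationScoreResult` from a `score` method. The sketch below illustrates that shape; to keep the snippet self-contained, the `BaseMetric` class and `EvaluationScoreResult` type here are simplified local stand-ins for the real `opik` exports, and the `ContainsExpected` metric itself is hypothetical.

```typescript
// Simplified stand-ins for the `opik` exports; in application code,
// import BaseMetric and EvaluationScoreResult from "opik" instead.
type EvaluationScoreResult = {
  name: string;
  value: number;
  reason?: string;
};

abstract class BaseMetric {
  constructor(public readonly name: string) {}
  abstract score(input: Record<string, unknown>): EvaluationScoreResult;
}

// A hypothetical metric that passes when the task output contains
// the expected string, rather than matching it exactly.
class ContainsExpected extends BaseMetric {
  constructor() {
    super("contains_expected");
  }

  score(input: Record<string, unknown>): EvaluationScoreResult {
    const output = String(input["output"] ?? "");
    const expected = String(input["expected"] ?? "");
    const passed = expected.length > 0 && output.includes(expected);
    return {
      name: this.name,
      value: passed ? 1.0 : 0.0,
      reason: passed
        ? "Output contains the expected string"
        : "Output does not contain the expected string",
    };
  }
}
```

With the real imports in place, an instance of such a metric would be passed in `scoringMetrics` alongside the built-in metrics.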