Evaluate Function

The `evaluate` function runs evaluations of LLM tasks against datasets, scoring the results with customizable metrics.

```typescript
async function evaluate(options: EvaluateOptions): Promise<EvaluationResult>;
```

Parameters

The function accepts a single `options` parameter of type `EvaluateOptions`, which contains the following properties:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dataset` | `Dataset \| DatasetVersion` | Yes | The dataset or dataset version to evaluate against. Use `DatasetVersion` for reproducible evaluations pinned to a specific snapshot. |
| `task` | `EvaluationTask` | Yes | The LLM task to perform |
| `scoringMetrics` | `BaseMetric[]` | No | Array of metrics to evaluate model performance (e.g., accuracy, F1 score) |
| `experimentName` | `string` | No | Name for this evaluation experiment, used for tracking and reporting |
| `projectName` | `string` | No | Project identifier to associate this experiment with |
| `experimentConfig` | `Record<string, unknown>` | No | Configuration settings for the experiment as key-value pairs |
| `nbSamples` | `number` | No | Number of samples to evaluate from the dataset (defaults to all if not specified) |
| `client` | `OpikClient` | No | Opik client instance to use for tracking |
| `scoringKeyMapping` | `ScoringKeyMappingType` | No | Mapping between dataset keys and scoring metric inputs |
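
The `scoringKeyMapping` option renames fields before they reach a metric: each key is an input name the metric expects, and each value names the field in the combined dataset-item/task-output record to read it from. Below is a minimal, SDK-independent sketch of that remapping idea; the `applyMapping` helper and the local `ScoringKeyMapping` alias are illustrative, not part of the `opik` package.

```typescript
// Illustrative alias for a mapping of metric input name -> source field name.
type ScoringKeyMapping = Record<string, string>;

// Copy each mapped source field onto the key the metric expects,
// leaving all existing fields in place.
function applyMapping(
  data: Record<string, unknown>,
  mapping: ScoringKeyMapping
): Record<string, unknown> {
  const remapped: Record<string, unknown> = { ...data };
  for (const [metricKey, sourceKey] of Object.entries(mapping)) {
    remapped[metricKey] = data[sourceKey];
  }
  return remapped;
}

// ExactMatch compares `output` against `expected`; the mapping supplies
// `expected` from the dataset's `expected_output` field.
const scoringInputs = applyMapping(
  { output: "4", expected_output: "4" },
  { expected: "expected_output" }
);
```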

Returns

The function returns a `Promise` that resolves to an `EvaluationResult` object containing:

  • Aggregated scores across all evaluated samples
  • Individual sample results
  • Execution metadata

Example Usage

```typescript
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Initialize clients
const openai = new OpenAI();
const opik = new Opik();

// Define dataset item type
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Define the LLM task
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await openai.responses.create({
    model: "gpt-5-nano",
    instructions: "You are a coding assistant",
    input,
  });

  return { output: response.output_text };
};

async function runEvaluation() {
  // Get or create the dataset
  const dataset = await opik.getOrCreateDataset<DatasetItem>("example-dataset");

  // Run the evaluation
  const result = await evaluate({
    dataset,
    task: llmTask,
    scoringMetrics: [new ExactMatch()],
    experimentName: "Example Evaluation",

    // Map the dataset item data to the inputs expected by the metric
    scoringKeyMapping: {
      expected: "expected_output",
    },
  });

  console.log(result);
}

runEvaluation();
```

Evaluating a Specific Version

For reproducible evaluations, pass a `DatasetVersion` instead of a `Dataset`:

```typescript
// Get a specific version for a reproducible evaluation
const dataset = await opik.getOrCreateDataset<DatasetItem>("example-dataset");
const v2 = await dataset.getVersionView("v2");

const result = await evaluate({
  dataset: v2,
  task: llmTask,
  scoringMetrics: [new ExactMatch()],
  experimentName: "Pinned to v2",
});
// The experiment is linked to version v2, not to the latest version
```

Notes

  • The function automatically creates an experiment in Opik for tracking and analysis
  • If no `client` is provided, the global Opik client instance is used
  • Provide type parameters to properly type your dataset and task inputs/outputs
  • Errors during evaluation are logged and re-thrown
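
Beyond the built-in metrics such as `ExactMatch`, custom metrics can be defined by extending `BaseMetric` and returning an `EvaluationScoreResult` from a `score` method. The sketch below illustrates that shape; to keep the snippet self-contained, the `BaseMetric` class and `EvaluationScoreResult` type here are simplified local stand-ins for the real `opik` exports, and the `ContainsExpected` metric itself is hypothetical.

```typescript
// Simplified stand-ins for the `opik` exports; in application code,
// import BaseMetric and EvaluationScoreResult from "opik" instead.
type EvaluationScoreResult = {
  name: string;
  value: number;
  reason?: string;
};

abstract class BaseMetric {
  constructor(public readonly name: string) {}
  abstract score(input: Record<string, unknown>): EvaluationScoreResult;
}

// A hypothetical metric that passes when the task output contains
// the expected string, rather than matching it exactly.
class ContainsExpected extends BaseMetric {
  constructor() {
    super("contains_expected");
  }

  score(input: Record<string, unknown>): EvaluationScoreResult {
    const output = String(input["output"] ?? "");
    const expected = String(input["expected"] ?? "");
    const passed = expected.length > 0 && output.includes(expected);
    return {
      name: this.name,
      value: passed ? 1.0 : 0.0,
      reason: passed
        ? "Output contains the expected string"
        : "Output does not contain the expected string",
    };
  }
}
```

With the real imports in place, an instance of such a metric would be passed in `scoringMetrics` alongside the built-in metrics.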