Evaluate Function

The evaluate function allows you to run comprehensive evaluations of LLM tasks against datasets using customizable metrics.

async function evaluate(options: EvaluateOptions): Promise<EvaluationResult>;

Parameters

The function accepts a single options parameter of type EvaluateOptions, which contains the following properties:

  • dataset (Dataset, required): The dataset to evaluate against, containing inputs and expected outputs
  • task (EvaluationTask, required): The specific LLM task to perform
  • scoringMetrics (BaseMetric[], optional): Array of metrics to evaluate model performance (e.g., accuracy, F1 score)
  • experimentName (string, optional): Name for this evaluation experiment, used for tracking and reporting
  • projectName (string, optional): Project identifier to associate this experiment with
  • experimentConfig (Record<string, unknown>, optional): Configuration settings for the experiment as key-value pairs
  • nbSamples (number, optional): Number of samples to evaluate from the dataset (defaults to all if not specified)
  • client (OpikClient, optional): Opik client instance to use for tracking
  • scoringKeyMapping (ScoringKeyMappingType, optional): Mapping between dataset keys and the inputs expected by the scoring metrics
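
Only dataset and task are required. The following is a minimal sketch; the dataset name, the Item type, and the echo task are placeholders for illustration, not part of the SDK:

import { evaluate, EvaluationTask, Opik } from "opik";

const opik = new Opik();

// Placeholder item type for illustration
type Item = { input: string };

// Trivial task that echoes the dataset input back as the task output
const echoTask: EvaluationTask<Item> = async (item) => ({ output: item.input });

async function runMinimalEvaluation() {
  const dataset = await opik.getOrCreateDataset<Item>("echo-dataset");

  // Only `dataset` and `task` are passed; all optional parameters fall back to their defaults
  return evaluate({ dataset, task: echoTask });
}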

Returns

The function returns a Promise that resolves to an EvaluationResult object containing:

  • Aggregated scores across all evaluated samples
  • Individual sample results
  • Execution metadata
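
The exact shape of EvaluationResult is not documented here. The sketch below shows one way the result might be inspected, reusing runMinimalEvaluation from the sketch above; the testResults and scoreResults field names are assumptions for illustration, so check the SDK's type definitions for the actual shape:

// Field names below are assumptions, not documented API; verify them
// against the EvaluationResult type exported by the SDK.
runMinimalEvaluation().then((result) => {
  for (const testResult of result.testResults ?? []) {
    for (const score of testResult.scoreResults ?? []) {
      console.log(`${score.name}: ${score.value}`);
    }
  }
});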

Example Usage

import {
  evaluate,
  EvaluationTask,
  Opik,
  BaseMetric,
  EvaluationScoreResult,
  ExactMatch,
} from "opik";
import OpenAI from "openai";

// Initialize clients
const openai = new OpenAI();
const opik = new Opik();

// Define dataset item type
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Define LLM task
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await openai.responses.create({
    model: "gpt-4o",
    instructions: "You are a coding assistant",
    input,
  });

  return { output: response.output_text };
};

async function runEvaluation() {
  // Get or create dataset
  const dataset = await opik.getOrCreateDataset<DatasetItem>("example-dataset");

  // Run evaluation
  const result = await evaluate({
    dataset,
    task: llmTask,
    scoringMetrics: [new ExactMatch()],
    experimentName: "Example Evaluation",

    // Map the output of the task and dataset item data to the expected metric inputs
    scoringKeyMapping: {
      expected: "expected_output",
    },
  });
}
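
The example imports BaseMetric and EvaluationScoreResult without using them; their usual role is defining custom scoring metrics. The following is a sketch of what such a metric could look like, assuming BaseMetric is subclassed with a score method that returns an EvaluationScoreResult and a zod validationSchema describing its inputs; treat the exact shape as an assumption and check the SDK's metric types:

import { BaseMetric, EvaluationScoreResult } from "opik";
import { z } from "zod";

// Assumed input shape: the task output plus the value mapped to `expected`
const validationSchema = z.object({
  output: z.string(),
  expected: z.string(),
});

type ContainsInput = z.infer<typeof validationSchema>;

// Hypothetical metric: scores 1 when the output contains the expected string
class ContainsExpected extends BaseMetric {
  public validationSchema = validationSchema;

  constructor() {
    super("contains_expected");
  }

  async score(input: ContainsInput): Promise<EvaluationScoreResult> {
    const contained = input.output.includes(input.expected);
    return {
      name: this.name,
      value: contained ? 1.0 : 0.0,
      reason: contained
        ? "Output contains the expected string"
        : "Output does not contain the expected string",
    };
  }
}

A metric like this could then be passed alongside the built-in ones, e.g. scoringMetrics: [new ExactMatch(), new ContainsExpected()].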

Notes

  • The function automatically creates an experiment in Opik for tracking and analysis
  • If no client is provided, it uses the global Opik client instance (see the sketch after these notes)
  • You can provide type parameters to properly type your dataset and task inputs/outputs
  • Errors during evaluation will be properly logged and re-thrown
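
For instance, an explicit client and a sample limit can be passed directly. The sketch below reuses DatasetItem, llmTask, and ExactMatch from the example above; the client configuration values are placeholders, and it is assumed here that the Opik constructor accepts an options object with apiKey and projectName:

import { evaluate, ExactMatch, Opik } from "opik";

// Placeholder configuration; substitute your own credentials and project
const customClient = new Opik({
  apiKey: "your-api-key",
  projectName: "evaluation-experiments",
});

async function runSampledEvaluation() {
  const dataset =
    await customClient.getOrCreateDataset<DatasetItem>("example-dataset");

  return evaluate({
    dataset,
    task: llmTask,
    scoringMetrics: [new ExactMatch()],
    experimentName: "Sampled Evaluation",
    client: customClient, // use this client instead of the global instance
    nbSamples: 25, // evaluate at most 25 dataset items
    scoringKeyMapping: { expected: "expected_output" },
  });
}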