Quick Start

In just 15 minutes, learn how to evaluate your AI models with Opik’s TypeScript SDK. This guide will walk you through creating a dataset, defining an evaluation task, and analyzing results with built-in metrics – everything you need to start making data-driven decisions about your AI systems.

Complete Working Example

💡 Copy, paste, and run this complete example that:

  1. Creates a structured dataset for AI evaluation
  2. Defines an evaluation task using OpenAI’s latest models
  3. Runs an evaluation with built-in metrics and analyzes the results
import { config } from "dotenv";
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Load environment variables from .env file
config();

// Initialize the OpenAI client
const client = new OpenAI();

// Create an Opik client
const opik = new Opik();

// Define the type for DatasetItem
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Retrieve the dataset by name, creating it if it doesn't exist
const retrievedDataset = await opik.getOrCreateDataset<DatasetItem>(
  "testDataset"
);

// Add items to the dataset
const itemsToAdd = [
  {
    input: "What is machine learning?",
    expected_output:
      "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
];
await retrievedDataset.insert(itemsToAdd);

// Define a task that takes a dataset item and returns a result
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await client.responses.create({
    model: "gpt-4o",
    instructions: "You are a coding assistant that talks like a pirate",
    input,
  });

  return { output: response.output_text };
};

// Run the evaluation
const result = await evaluate({
  dataset: retrievedDataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch()],
  experimentName: "My First Evaluation",

  // Map the task output and dataset item fields to the inputs the metric expects
  scoringKeyMapping: {
    expected: "expected_output",
  },
});

console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Experiment Name: ${result.experimentName}`);
console.log(`Total test cases: ${result.testResults.length}`);

Step-by-Step Walkthrough

1. Setting up environment

import { config } from "dotenv";
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Load environment variables from .env file
config();

// Initialize the OpenAI client
const client = new OpenAI();

// Create an Opik client
const opik = new Opik();

This section imports the necessary dependencies and configures your evaluation environment. The dotenv package securely loads your API keys from a .env file:

OPENAI_API_KEY=your_openai_api_key
OPIK_API_KEY=your_opik_api_key
OPIK_PROJECT_NAME=your_opik_project_name
OPIK_WORKSPACE=your_opik_workspace
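Because a missing key only surfaces later as an authorization error, it can help to fail fast at startup. The sketch below uses a hypothetical `missingEnvVars` helper (not part of the Opik SDK); in a real script you would pass `process.env`, but here a literal map demonstrates the behavior:

```typescript
// Hypothetical helper: list the required variables that are absent or empty.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined>
): string[] {
  return required.filter((name) => !env[name]);
}

// In your script, pass process.env instead of this literal map.
const check = missingEnvVars(
  ["OPENAI_API_KEY", "OPIK_API_KEY", "OPIK_PROJECT_NAME", "OPIK_WORKSPACE"],
  { OPENAI_API_KEY: "sk-...", OPIK_API_KEY: "opik-..." }
);
console.log(check); // ["OPIK_PROJECT_NAME", "OPIK_WORKSPACE"]
```

Throwing an error when the returned array is non-empty stops the script before any API call is made.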

2. Building a structured evaluation dataset

// Create an Opik client
const opik = new Opik();

// Define the type for DatasetItem
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Retrieve the dataset by name, creating it if it doesn't exist
const retrievedDataset = await opik.getOrCreateDataset<DatasetItem>(
  "testDataset"
);

// Add items to the dataset
const itemsToAdd = [
  {
    input: "What is machine learning?",
    expected_output:
      "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
];
await retrievedDataset.insert(itemsToAdd);

This section creates your evaluation dataset with full TypeScript support:

  • Initialize the client: Connect to Opik’s evaluation platform
  • Define your schema: Use TypeScript types for dataset items with full IDE autocompletion
  • Retrieve or create: Use getOrCreateDataset to seamlessly work with existing or new datasets
  • Add evaluation items: Structure your test cases with inputs, expected outputs, and rich metadata for filtering and analysis

📌 Best practice: Add descriptive metadata to each item for powerful filtering and analysis in the Opik UI.
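Structured metadata also pays off in plain TypeScript, before the items ever reach Opik. The sketch below (the second item is an invented example) filters a local item list by difficulty, e.g. to insert a beginner-only subset into its own dataset:

```typescript
// Shape mirroring the DatasetItem type from this guide.
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: { category: string; difficulty: string; version: number };
};

const items: DatasetItem[] = [
  {
    input: "What is machine learning?",
    expected_output: "Machine learning is a type of artificial intelligence ...",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
  {
    // Invented second item, for illustration only.
    input: "Explain backpropagation.",
    expected_output: "Backpropagation is an algorithm for training neural networks ...",
    metadata: { category: "AI basics", difficulty: "advanced", version: 1 },
  },
];

// Slice the dataset by metadata before inserting.
const beginnerItems = items.filter(
  (item) => item.metadata.difficulty === "beginner"
);
console.log(beginnerItems.length); // 1
```

The same `category` / `difficulty` / `version` fields then drive filtering and grouping in the Opik UI after insertion.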

3. Defining your evaluation task

// Define a task that takes a dataset item and returns a result
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await client.responses.create({
    model: "gpt-4o", // Use any model you need to evaluate
    instructions: "You are a coding assistant that talks like a pirate",
    input,
  });

  return { output: response.output_text };
};

Your evaluation task:

  • Receives dataset items: Automatically processes each item in your dataset
  • Integrates with any API: Works with OpenAI, Anthropic, your own models, or any API
  • Returns structured output: Package results in a format ready for evaluation
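The task contract is simply an async function from a dataset item to a record of outputs. A minimal sketch, using a local type alias in place of the SDK's `EvaluationTask` type: a deterministic stub like this is handy for wiring up an evaluation before spending tokens on real model calls.

```typescript
// Illustrative alias matching the task shape: dataset item in, outputs out.
type Task<T> = (item: T) => Promise<{ output: string }>;

type Item = { input: string; expected_output: string };

// Deterministic stub task — no API call, just echoes the input.
const echoTask: Task<Item> = async (item) => {
  return { output: `echo: ${item.input}` };
};

const out = await echoTask({ input: "ping", expected_output: "pong" });
console.log(out.output); // "echo: ping"
```

Once the pipeline runs end to end with the stub, swap in the real model call.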

4. Running your evaluation

// Run evaluation with a built-in ExactMatch metric
const result = await evaluate({
  dataset: retrievedDataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch()], // Use multiple metrics for comprehensive evaluation
  experimentName: "My First Evaluation",

  // Map the task output and dataset item fields to the inputs the metric expects
  scoringKeyMapping: {
    expected: "expected_output",
  },
});

console.log(`Experiment URL: ${result.resultUrl}`); // Direct link to view results

This single function call brings together:

  • The dataset we created
  • Our defined LLM task
  • The built-in ExactMatch metric that compares outputs exactly
  • A name for the experiment
  • Key mapping to connect dataset fields with metric inputs
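For a quick console summary alongside the full results in the Opik UI, you can aggregate `scoreResults` across test cases yourself. A sketch, with the result type trimmed to only the fields used here:

```typescript
// Result shape trimmed to the fields used below (mirrors the output in this guide).
type ScoreResult = { name: string; value: number };
type TestResult = { scoreResults: ScoreResult[] };

// Average each metric's value across all test cases.
function averageScores(testResults: TestResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const tr of testResults) {
    for (const score of tr.scoreResults) {
      const entry = (sums[score.name] ??= { total: 0, count: 0 });
      entry.total += score.value;
      entry.count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, { total, count }]) => [name, total / count])
  );
}

const summary = averageScores([
  { scoreResults: [{ name: "exact_match", value: 0 }] },
  { scoreResults: [{ name: "exact_match", value: 1 }] },
]);
console.log(summary); // { exact_match: 0.5 }
```

With the real result object you would call `averageScores(result.testResults)`.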

Expected Output

When you run this code, you’ll receive an evaluation result object containing:

  • experimentId: Unique identifier for your evaluation experiment
  • experimentName: The name you provided
  • testResults: Array of results for each dataset item
    • testCase: Contains the input data and outputs
    • scoreResults: Array of scores from each metric
  • resultUrl: Link to view detailed results in the Opik platform
{
  "experimentId": "01975908-818f-765a-abv6-08d179c15610",
  "experimentName": "My First Evaluation",
  "testResults": [
    {
      "testCase": {
        "traceId": "01975908-82dc-73fd-862d-dd51152ddse1",
        "datasetItemId": "01975908-810c-7663-b7a3-e3ae94484ca9",
        "scoringInputs": {
          "input": "What is machine learning?",
          "metadata": {
            "category": "AI basics",
            "difficulty": "beginner",
            "version": 1
          },
          "expected_output": "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
          "id": "01975908-810c-7663-b7a3-e3ae43884ca9",
          "output": "Arrr, machine learnin' be a way for computers to learn from data, akin to teachin' a parrot new tricks! Instead of givin' exact instructions, ye feed the machine lots o' examples, and it figures out how to make decisions on its own. It be useful for predictin' things, findin' patterns, and even speakin' like a fine pirate! 🏴‍☠️",
          "expected": "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed."
        },
        "taskOutput": {
          "output": "Arrr, machine learnin' be a way for computers to learn from data, akin to teachin' a parrot new tricks! Instead of givin' exact instructions, ye feed the machine lots o' examples, and it figures out how to make decisions on its own. It be useful for predictin' things, findin' patterns, and even speakin' like a fine pirate! 🏴‍☠️"
        }
      },
      "scoreResults": [
        {
          "name": "exact_match",
          "value": 0,
          "reason": "Exact match: No match"
        }
      ]
    }
  ],
  "resultUrl": "https://comet.com/opik/api/v1/session/redirect/experiments/?experiment_id=01975908-818f-765a-abv6-08d179c15610&dataset_id=01975908-810c-7663-b7a3-e3ae94484ca9&path=aHR0cHM6Ly9kZXYuY29tZXQuY29tL29dfWsvYXBp"
}

Troubleshooting & Best Practices

API Key Issues

Error: Unauthorized: Invalid API key
  • Make sure you’ve set up your .env file correctly
  • Verify your API keys are valid and have the correct permissions

Metric Input Mapping

Error: Metric 'ExactMatch' is skipped, missing required arguments: expected. Available arguments: output.
  • Review your scoringKeyMapping to ensure it maps correctly to your dataset structure
  • Check that all metric required inputs are provided either in task output or via mapping