Quick Start

In just 15 minutes, learn how to evaluate your AI models with Opik’s TypeScript SDK. This guide will walk you through creating a dataset, defining an evaluation task, and analyzing results with built-in metrics – everything you need to start making data-driven decisions about your AI systems.

Complete Working Example

💡 Copy, paste, and run this complete example that:

  1. Creates a structured dataset for AI evaluation
  2. Defines an evaluation task using OpenAI’s latest models
  3. Runs an evaluation with built-in metrics and analyzes the results
import { config } from "dotenv";
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Load environment variables from .env file
config();

// Initialize the OpenAI client
const client = new OpenAI();

// Create an Opik client
const opik = new Opik();

// Define the type for DatasetItem
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Retrieve the dataset by name, creating it if it doesn't exist
const retrievedDataset = await opik.getOrCreateDataset<DatasetItem>(
  "testDataset"
);

// Add items to the dataset
const itemsToAdd = [
  {
    input: "What is machine learning?",
    expected_output:
      "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
];
await retrievedDataset.insert(itemsToAdd);

// Define a task that takes a dataset item and returns a result
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await client.responses.create({
    model: "gpt-4o",
    instructions: "You are a coding assistant that talks like a pirate",
    input,
  });

  return { output: response.output_text };
};

// Run the evaluation
const result = await evaluate({
  dataset: retrievedDataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch()],
  experimentName: "My First Evaluation",

  // Map the task output and dataset item fields to the inputs the metric expects
  scoringKeyMapping: {
    expected: "expected_output",
  },
});

console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Experiment Name: ${result.experimentName}`);
console.log(`Total test cases: ${result.testResults.length}`);

Step-by-Step Walkthrough

1. Setting up environment

import { config } from "dotenv";
import { evaluate, EvaluationTask, Opik, ExactMatch } from "opik";
import OpenAI from "openai";

// Load environment variables from .env file
config();

// Initialize the OpenAI client
const client = new OpenAI();

// Create an Opik client
const opik = new Opik();

This section imports the necessary dependencies and configures your evaluation environment. The dotenv package securely loads your API keys from a .env file:

OPENAI_API_KEY=your_openai_api_key
OPIK_API_KEY=your_opik_api_key
OPIK_PROJECT_NAME=your_opik_project_name
OPIK_WORKSPACE=your_opik_workspace
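Because a missing key only surfaces later as an authorization error, it can help to fail fast at startup. The sketch below uses a hypothetical `missingEnvVars` helper (not part of the Opik SDK); in a real script you would pass `process.env`, but here a literal map demonstrates the behavior:

```typescript
// Hypothetical helper: list the required variables that are absent or empty.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined>
): string[] {
  return required.filter((name) => !env[name]);
}

// In your script, pass process.env instead of this literal map.
const check = missingEnvVars(
  ["OPENAI_API_KEY", "OPIK_API_KEY", "OPIK_PROJECT_NAME", "OPIK_WORKSPACE"],
  { OPENAI_API_KEY: "sk-...", OPIK_API_KEY: "opik-..." }
);
console.log(check); // ["OPIK_PROJECT_NAME", "OPIK_WORKSPACE"]
```

Throwing an error when the returned array is non-empty stops the script before any API call is made.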

2. Building a structured evaluation dataset

// Create an Opik client
const opik = new Opik();

// Define the type for DatasetItem
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: {
    category: string;
    difficulty: string;
    version: number;
  };
};

// Retrieve the dataset by name, creating it if it doesn't exist
const retrievedDataset = await opik.getOrCreateDataset<DatasetItem>(
  "testDataset"
);

// Add items to the dataset
const itemsToAdd = [
  {
    input: "What is machine learning?",
    expected_output:
      "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
];
await retrievedDataset.insert(itemsToAdd);

This section creates your evaluation dataset with full TypeScript support:

  • Initialize the client: Connect to Opik’s evaluation platform
  • Define your schema: Use TypeScript types for dataset items with full IDE autocompletion
  • Retrieve or create: Use getOrCreateDataset to seamlessly work with existing or new datasets
  • Add evaluation items: Structure your test cases with inputs, expected outputs, and rich metadata for filtering and analysis

📌 Best practice: Add descriptive metadata to each item for powerful filtering and analysis in the Opik UI.
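Structured metadata also pays off in plain TypeScript, before the items ever reach Opik. The sketch below (the second item is an invented example) filters a local item list by difficulty, e.g. to insert a beginner-only subset into its own dataset:

```typescript
// Shape mirroring the DatasetItem type from this guide.
type DatasetItem = {
  input: string;
  expected_output: string;
  metadata: { category: string; difficulty: string; version: number };
};

const items: DatasetItem[] = [
  {
    input: "What is machine learning?",
    expected_output: "Machine learning is a type of artificial intelligence ...",
    metadata: { category: "AI basics", difficulty: "beginner", version: 1 },
  },
  {
    // Invented second item, for illustration only.
    input: "Explain backpropagation.",
    expected_output: "Backpropagation is an algorithm for training neural networks ...",
    metadata: { category: "AI basics", difficulty: "advanced", version: 1 },
  },
];

// Slice the dataset by metadata before inserting.
const beginnerItems = items.filter(
  (item) => item.metadata.difficulty === "beginner"
);
console.log(beginnerItems.length); // 1
```

The same `category` / `difficulty` / `version` fields then drive filtering and grouping in the Opik UI after insertion.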

3. Defining your evaluation task

// Define a task that takes a dataset item and returns a result
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const response = await client.responses.create({
    model: "gpt-4o", // Use any model you need to evaluate
    instructions: "You are a coding assistant that talks like a pirate",
    input,
  });

  return { output: response.output_text };
};

Your evaluation task:

  • Receives dataset items: Automatically processes each item in your dataset
  • Integrates with any API: Works with OpenAI, Anthropic, your own models, or any API
  • Returns structured output: Package results in a format ready for evaluation
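The task contract is simply an async function from a dataset item to a record of outputs. A minimal sketch, using a local type alias in place of the SDK's `EvaluationTask` type: a deterministic stub like this is handy for wiring up an evaluation before spending tokens on real model calls.

```typescript
// Illustrative alias matching the task shape: dataset item in, outputs out.
type Task<T> = (item: T) => Promise<{ output: string }>;

type Item = { input: string; expected_output: string };

// Deterministic stub task — no API call, just echoes the input.
const echoTask: Task<Item> = async (item) => {
  return { output: `echo: ${item.input}` };
};

const out = await echoTask({ input: "ping", expected_output: "pong" });
console.log(out.output); // "echo: ping"
```

Once the pipeline runs end to end with the stub, swap in the real model call.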

4. Running your evaluation

// Run evaluation with a built-in ExactMatch metric
const result = await evaluate({
  dataset: retrievedDataset,
  task: llmTask,
  scoringMetrics: [new ExactMatch()], // Use multiple metrics for comprehensive evaluation
  experimentName: "My First Evaluation",

  // Map the task output and dataset item fields to the inputs the metric expects
  scoringKeyMapping: {
    expected: "expected_output",
  },
});

console.log(`Experiment URL: ${result.resultUrl}`); // Direct link to view results

This single function call brings together:

  • The dataset we created
  • Our defined LLM task
  • The built-in ExactMatch metric that compares outputs exactly
  • A name for the experiment
  • Key mapping to connect dataset fields with metric inputs
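For a quick console summary alongside the full results in the Opik UI, you can aggregate `scoreResults` across test cases yourself. A sketch, with the result type trimmed to only the fields used here:

```typescript
// Result shape trimmed to the fields used below (mirrors the output in this guide).
type ScoreResult = { name: string; value: number };
type TestResult = { scoreResults: ScoreResult[] };

// Average each metric's value across all test cases.
function averageScores(testResults: TestResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const tr of testResults) {
    for (const score of tr.scoreResults) {
      const entry = (sums[score.name] ??= { total: 0, count: 0 });
      entry.total += score.value;
      entry.count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, { total, count }]) => [name, total / count])
  );
}

const summary = averageScores([
  { scoreResults: [{ name: "exact_match", value: 0 }] },
  { scoreResults: [{ name: "exact_match", value: 1 }] },
]);
console.log(summary); // { exact_match: 0.5 }
```

With the real result object you would call `averageScores(result.testResults)`.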

Expected Output

When you run this code, you’ll receive an evaluation result object containing:

  • experimentId: Unique identifier for your evaluation experiment
  • experimentName: The name you provided
  • testResults: Array of results for each dataset item
    • testCase: Contains the input data and outputs
    • scoreResults: Array of scores from each metric
  • resultUrl: Link to view detailed results in the Opik platform
{
  "experimentId": "01975908-818f-765a-abv6-08d179c15610",
  "experimentName": "My First Evaluation",
  "testResults": [
    {
      "testCase": {
        "traceId": "01975908-82dc-73fd-862d-dd51152ddse1",
        "datasetItemId": "01975908-810c-7663-b7a3-e3ae94484ca9",
        "scoringInputs": {
          "input": "What is machine learning?",
          "metadata": {
            "category": "AI basics",
            "difficulty": "beginner",
            "version": 1
          },
          "expected_output": "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
          "id": "01975908-810c-7663-b7a3-e3ae43884ca9",
          "output": "Arrr, machine learnin' be a way for computers to learn from data, akin to teachin' a parrot new tricks! Instead of givin' exact instructions, ye feed the machine lots o' examples, and it figures out how to make decisions on its own. It be useful for predictin' things, findin' patterns, and even speakin' like a fine pirate! 🏴‍☠️",
          "expected": "Machine learning is a type of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed."
        },
        "taskOutput": {
          "output": "Arrr, machine learnin' be a way for computers to learn from data, akin to teachin' a parrot new tricks! Instead of givin' exact instructions, ye feed the machine lots o' examples, and it figures out how to make decisions on its own. It be useful for predictin' things, findin' patterns, and even speakin' like a fine pirate! 🏴‍☠️"
        }
      },
      "scoreResults": [
        {
          "name": "exact_match",
          "value": 0,
          "reason": "Exact match: No match"
        }
      ]
    }
  ],
  "resultUrl": "https://comet.com/opik/api/v1/session/redirect/experiments/?experiment_id=01975908-818f-765a-abv6-08d179c15610&dataset_id=01975908-810c-7663-b7a3-e3ae94484ca9&path=aHR0cHM6Ly9kZXYuY29tZXQuY29tL29dfWsvYXBp"
}

Troubleshooting & Best Practices

API Key Issues

Error: Unauthorized: Invalid API key
  • Make sure you’ve set up your .env file correctly
  • Verify your API keys are valid and have the correct permissions

Metric Input Mapping

Error: Metric 'ExactMatch' is skipped, missing required arguments: expected. Available arguments: output.
  • Review your scoringKeyMapping to ensure it maps correctly to your dataset structure
  • Check that all metric required inputs are provided either in task output or via mapping