Manually logging experiments

Step-by-step guide to logging evaluation results using the TypeScript SDK and REST API

Evaluating your LLM application gives you confidence in its performance. In this guide, we walk through manually creating experiments from evaluation results you have already computed.

This guide focuses on logging pre-computed evaluation results. If you’re looking to run evaluations with Opik computing the metrics, refer to the Evaluate your agent and Evaluate single prompts guides.

The process involves these key steps:

  1. Create a dataset with your test cases
  2. Prepare your evaluation results
  3. Log experiment items in bulk

1. Create a Dataset

First, you’ll need to create a dataset containing your test cases. This dataset will be linked to your experiments.

import { Opik } from "opik";

const client = new Opik({
  apiKey: "your-api-key",
  apiUrl: "https://www.comet.com/opik/api",
  projectName: "your-project-name",
  workspaceName: "your-workspace-name",
});

const dataset = await client.getOrCreateDataset("geography-questions");

await dataset.insert([
  {
    user_question: "What is the capital of France?",
    expected_output: "Paris"
  },
  {
    user_question: "What is the capital of Japan?",
    expected_output: "Tokyo"
  },
  {
    user_question: "What is the capital of Brazil?",
    expected_output: "Brasília"
  }
]);

Dataset item IDs will be automatically generated if not provided. If you do provide your own IDs, ensure they are in UUIDv7 format.
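If you want to control the IDs yourself, you can generate UUIDv7 values and pass them in the id field when inserting items. The sketch below assumes the uuid npm package (version 10 or later, which ships a v7 generator); any UUIDv7 generator would work the same way.

import { v7 as uuidv7 } from "uuid";

// Supplying an explicit UUIDv7 id; Opik will use it instead of generating one
await dataset.insert([
  {
    id: uuidv7(),
    user_question: "What is the capital of Germany?",
    expected_output: "Berlin"
  }
]);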

2. Prepare Evaluation Results

Structure your evaluation results with the necessary fields. Each experiment item should include:

  • dataset_item_id: The ID of the dataset item being evaluated
  • evaluate_task_result: The output from your LLM application
  • feedback_scores: Array of evaluation metrics (optional)

const datasetItems = await dataset.getItems();

const mockResponses = {
  "What is the capital of France?": "The capital of France is Paris.",
  "What is the capital of Japan?": "Japan's capital is Tokyo.",
  "What is the capital of Brazil?": "The capital of Brazil is Rio de Janeiro."
};

// This would be replaced by your specific logic, the goal is simply to have an array of
// evaluation items with a dataset_item_id, evaluate_task_result and feedback_scores
const evaluationItems = datasetItems.map(item => {
  const response = mockResponses[item.user_question] || "I don't know";
  return {
    dataset_item_id: item.id,
    evaluate_task_result: { prediction: response },
    feedback_scores: [{ name: "accuracy", value: response.includes(item.expected_output) ? 1.0 : 0.0, source: "sdk" }]
  };
});

3. Log Experiment Items in Bulk

Use the bulk endpoint to efficiently log multiple evaluation results at once.

import { Opik } from "opik";

const client = new Opik({
  apiKey: "your-api-key",
  apiUrl: "https://www.comet.com/opik/api",
  projectName: "your-project-name",
  workspaceName: "your-workspace-name",
});

const experimentName = "Bulk experiment upload";
const datasetName = "geography-questions";
const items = [
  {
    datasetItemId: "dataset-item-id-1",
    evaluateTaskResult: { prediction: "The capital of France is Paris." },
    feedbackScores: [{ name: "accuracy", value: 1.0, source: "sdk" }]
  }
];

await client.api.experiments.experimentItemsBulk({ experimentName, datasetName, items });

Request Size Limit: The maximum allowed payload size is 4MB. For larger submissions, divide the data into smaller batches.

If you divide the data into smaller batches, add the experiment_id to each payload so that every batch appends its experiment items to the same existing experiment.

Below is an example of splitting evaluationItems into two batches that are both added to the same experiment:

import { generateId } from "opik";

const experimentId = generateId();
const experimentName = "Bulk experiment upload";

// Split evaluationItems into two batches
const mid = Math.floor(evaluationItems.length / 2);

const halves = [
  evaluationItems.slice(0, mid),
  evaluationItems.slice(mid)
];

for (const half of halves) {
  await client.api.experiments.experimentItemsBulk({
    experimentId: experimentId,
    experimentName: experimentName,
    datasetName: "geography-questions",
    items: half.map(item => ({
      datasetItemId: item.dataset_item_id,
      evaluateTaskResult: item.evaluate_task_result,
      feedbackScores: item.feedback_scores.map(score => ({
        ...score,
        source: "sdk"
      }))
    }))
  });
}
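
If your items vary widely in size, splitting into a fixed number of batches can still overshoot the 4MB limit. The sketch below is one way to chunk by approximate serialized size instead; the 3MB threshold is an arbitrary safety margin, and it reuses the client, experimentId, experimentName and evaluationItems from the examples above.

// Group items into chunks whose serialized size stays safely below the 4MB limit
const MAX_CHUNK_BYTES = 3 * 1024 * 1024; // headroom below the 4MB cap

const chunks: (typeof evaluationItems)[] = [[]];
let currentSize = 0;

for (const item of evaluationItems) {
  const itemSize = Buffer.byteLength(JSON.stringify(item), "utf8");
  if (currentSize + itemSize > MAX_CHUNK_BYTES && chunks[chunks.length - 1].length > 0) {
    chunks.push([]);
    currentSize = 0;
  }
  chunks[chunks.length - 1].push(item);
  currentSize += itemSize;
}

for (const chunk of chunks) {
  await client.api.experiments.experimentItemsBulk({
    experimentId,
    experimentName,
    datasetName: "geography-questions",
    items: chunk.map(item => ({
      datasetItemId: item.dataset_item_id,
      evaluateTaskResult: item.evaluate_task_result,
      feedbackScores: item.feedback_scores
    }))
  });
}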

4. Analyze the Results

Once you have logged your experiment items, you can analyze the results in the Opik UI and even compare different experiments to one another.

Complete Example

Here’s a complete example that puts all the steps together:

import { Opik } from "opik";

// Configure Opik
const client = new Opik({
  apiKey: "your-api-key",
  apiUrl: "https://www.comet.com/opik/api",
  projectName: "your-project-name",
  workspaceName: "your-workspace-name",
});

// Step 1: Create dataset
const dataset = await client.getOrCreateDataset("geography-questions");

const localDatasetItems = [
  {
    user_question: "What is the capital of France?",
    expected_output: "Paris"
  },
  {
    user_question: "What is the capital of Japan?",
    expected_output: "Tokyo"
  }
];

await dataset.insert(localDatasetItems);

// Step 2: Get dataset items and prepare evaluation results
const datasetItems = await dataset.getItems();

// Helper function to get dataset item ID
const getDatasetItem = (country: string) => {
  return datasetItems.find(item =>
    item.user_question.toLowerCase().includes(country.toLowerCase())
  );
};

// Prepare evaluation results
const evaluationItems = [
  {
    dataset_item_id: getDatasetItem("France")?.id,
    evaluate_task_result: { prediction: "The capital of France is Paris." },
    feedback_scores: [{ name: "accuracy", value: 1.0 }]
  },
  {
    dataset_item_id: getDatasetItem("Japan")?.id,
    evaluate_task_result: { prediction: "Japan's capital is Tokyo." },
    feedback_scores: [{ name: "accuracy", value: 1.0 }]
  }
];

// Step 3: Log experiment results
const experimentName = `geography-bot-${Math.random().toString(36).substr(2, 4)}`;

await client.api.experiments.experimentItemsBulk({
  experimentName,
  datasetName: "geography-questions",
  items: evaluationItems.map(item => ({
    datasetItemId: item.dataset_item_id,
    evaluateTaskResult: item.evaluate_task_result,
    feedbackScores: item.feedback_scores.map(score => ({
      ...score,
      source: "sdk"
    }))
  }))
});

console.log(`Experiment '${experimentName}' created successfully!`);

Advanced Usage

Including Traces and Spans

You can include full execution traces with your experiment items for complete observability. To achieve this, add trace and spans fields to your experiment items:

[
  {
    "dataset_item_id": "your-dataset-item-id",
    "trace": {
      "name": "geography_query",
      "input": { "question": "What is the capital of France?" },
      "output": { "answer": "Paris" },
      "metadata": { "model": "gpt-3.5-turbo" },
      "start_time": "2024-01-01T00:00:00Z",
      "end_time": "2024-01-01T00:00:01Z"
    },
    "spans": [
      {
        "name": "llm_call",
        "type": "llm",
        "start_time": "2024-01-01T00:00:00Z",
        "end_time": "2024-01-01T00:00:01Z",
        "input": { "prompt": "What is the capital of France?" },
        "output": { "response": "Paris" }
      }
    ],
    "feedback_scores": [{ "name": "accuracy", "value": 1.0, "source": "sdk" }]
  }
]

Important: You may supply either evaluate_task_result or trace — not both.
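
If your items are assembled from mixed sources, it can help to normalize them before the bulk call so this rule is never violated. A small hypothetical helper, keeping the trace and dropping evaluate_task_result when both are present:

// Hypothetical shape of a bulk item, reduced to the fields used in this guide
type BulkItem = {
  dataset_item_id: string;
  evaluate_task_result?: Record<string, unknown>;
  trace?: Record<string, unknown>;
  spans?: Record<string, unknown>[];
  feedback_scores?: { name: string; value: number; source?: string }[];
};

// If both fields are set, keep the trace and drop evaluate_task_result so only one is sent
const normalizeItem = (item: BulkItem): BulkItem => {
  if (item.trace && item.evaluate_task_result) {
    const { evaluate_task_result, ...rest } = item;
    return rest;
  }
  return item;
};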

Java Example

For Java developers, here’s how to integrate with Opik using Jackson and HttpClient:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class OpikExperimentLogger {

    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();

        String baseURI = System.getenv("OPIK_URL_OVERRIDE");
        String workspaceName = System.getenv("OPIK_WORKSPACE");
        String apiKey = System.getenv("OPIK_API_KEY");

        String datasetName = "geography-questions";
        String experimentName = "geography-bot-v1";

        try (var client = HttpClient.newHttpClient()) {
            // Stream dataset items
            var streamRequest = HttpRequest.newBuilder()
                    .uri(URI.create(baseURI).resolve("/v1/private/datasets/items/stream"))
                    .header("Content-Type", "application/json")
                    .header("Accept", "application/octet-stream")
                    .header("Authorization", apiKey)
                    .header("Comet-Workspace", workspaceName)
                    .POST(HttpRequest.BodyPublishers.ofString(
                            mapper.writeValueAsString(Map.of("dataset_name", datasetName))
                    ))
                    .build();

            HttpResponse<InputStream> streamResponse = client.send(
                    streamRequest,
                    HttpResponse.BodyHandlers.ofInputStream()
            );

            List<JsonNode> experimentItems = new ArrayList<>();

            try (var reader = new BufferedReader(new InputStreamReader(streamResponse.body()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    JsonNode datasetItem = mapper.readTree(line);
                    String question = datasetItem.get("data").get("user_question").asText();
                    UUID datasetItemId = UUID.fromString(datasetItem.get("id").asText());

                    // Call your LLM application
                    JsonNode llmOutput = callYourLLM(question);

                    // Calculate metrics
                    List<JsonNode> scores = calculateMetrics(llmOutput);

                    // Build experiment item
                    ArrayNode scoresArray = JsonNodeFactory.instance.arrayNode().addAll(scores);
                    JsonNode experimentItem = JsonNodeFactory.instance.objectNode()
                            .put("dataset_item_id", datasetItemId.toString())
                            .setAll(Map.of(
                                    "evaluate_task_result", llmOutput,
                                    "feedback_scores", scoresArray
                            ));

                    experimentItems.add(experimentItem);
                }
            }

            // Send experiment results in bulk
            var bulkBody = JsonNodeFactory.instance.objectNode()
                    .put("dataset_name", datasetName)
                    .put("experiment_name", experimentName)
                    .setAll(Map.of("items",
                            JsonNodeFactory.instance.arrayNode().addAll(experimentItems)
                    ));

            var bulkRequest = HttpRequest.newBuilder()
                    .uri(URI.create(baseURI).resolve("/v1/private/experiments/items/bulk"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", apiKey)
                    .header("Comet-Workspace", workspaceName)
                    .PUT(HttpRequest.BodyPublishers.ofString(bulkBody.toString()))
                    .build();

            HttpResponse<String> bulkResponse = client.send(
                    bulkRequest,
                    HttpResponse.BodyHandlers.ofString()
            );

            if (bulkResponse.statusCode() == 204) {
                System.out.println("Experiment items successfully created.");
            } else {
                System.err.printf("Failed to create experiment items: %s %s",
                        bulkResponse.statusCode(), bulkResponse.body());
            }

        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Placeholder: replace with a call to your LLM application
    private static JsonNode callYourLLM(String question) {
        return JsonNodeFactory.instance.objectNode()
                .put("prediction", "Replace with your model's answer to: " + question);
    }

    // Placeholder: replace with your own metric calculation
    private static List<JsonNode> calculateMetrics(JsonNode llmOutput) {
        JsonNode score = JsonNodeFactory.instance.objectNode()
                .put("name", "accuracy")
                .put("value", 1.0)
                .put("source", "sdk");
        return List.of(score);
    }
}

Using the REST API with local deployments

If you are using the REST API with a local deployment, you can call the endpoints without any authentication headers:

# No authentication headers required for local deployments
curl -X PUT 'http://localhost:5173/api/v1/private/experiments/items/bulk' \
  -H 'Content-Type: application/json' \
  -d '{ ... }'
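
The request body for this endpoint uses the same snake_case fields shown in the Java example above. A minimal example payload (the dataset item ID is a placeholder):

{
  "experiment_name": "geography-bot-v1",
  "dataset_name": "geography-questions",
  "items": [
    {
      "dataset_item_id": "your-dataset-item-id",
      "evaluate_task_result": { "prediction": "The capital of France is Paris." },
      "feedback_scores": [{ "name": "accuracy", "value": 1.0, "source": "sdk" }]
    }
  ]
}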

Reference