Log experiments with REST API

Step-by-step guide to logging evaluation results using the Python SDK and REST API

Evaluating your LLM application gives you confidence in its performance. In this guide, we walk through logging pre-computed evaluation results to Opik using both the Python SDK and the REST API.

This guide focuses on logging pre-computed evaluation results. If you’re looking to run evaluations with Opik computing the metrics, refer to the Evaluate Your LLM guide.

The process involves these key steps:

  1. Create a dataset with your test cases
  2. Prepare your evaluation results
  3. Log experiment items in bulk

1. Create a Dataset

First, you’ll need to create a dataset containing your test cases. This dataset will be linked to your experiments.

```python
from opik import Opik
import opik

# Configure Opik
opik.configure()

# Create dataset items
dataset_items = [
    {
        "user_question": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "user_question": "What is the capital of Japan?",
        "expected_output": "Tokyo"
    },
    {
        "user_question": "What is the capital of Brazil?",
        "expected_output": "Brasília"
    }
]

# Get or create a dataset
client = Opik()
dataset = client.get_or_create_dataset(name="geography-questions")

# Add dataset items
dataset.insert(dataset_items)
```

Dataset item IDs will be automatically generated if not provided. If you do provide your own IDs, ensure they are in UUID7 format.
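If you do want to supply explicit IDs, they need to be UUIDv7 values (time-ordered UUIDs). As a minimal illustrative sketch, a UUIDv7 can be assembled by hand from a 48-bit millisecond timestamp plus random bits; in production, prefer a maintained library (for example, the third-party `uuid6` package on PyPI). The `id` field shown below is an assumption about how an explicit ID would be attached to a dataset item:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7 sketch: 48-bit unix-ms timestamp + random bits.

    Illustrative only; use a maintained library in production code.
    """
    unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80      # timestamp, bits 80-127
    value |= 0x7 << 76                            # version 7 nibble
    value |= ((rand >> 62) & 0xFFF) << 64         # rand_a, 12 bits
    value |= 0b10 << 62                           # RFC 4122 variant
    value |= rand & 0x3FFFFFFFFFFFFFFF            # rand_b, 62 bits
    return uuid.UUID(int=value)

# Hypothetical dataset item carrying an explicit ID:
dataset_items_with_ids = [
    {
        "id": str(uuid7()),
        "user_question": "What is the capital of Italy?",
        "expected_output": "Rome"
    }
]
```

Because the high bits encode the creation time, IDs generated this way sort chronologically, which is the property UUIDv7 is designed for.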

2. Prepare Evaluation Results

Structure your evaluation results with the necessary fields. Each experiment item should include:

  • dataset_item_id: The ID of the dataset item being evaluated
  • evaluate_task_result: The output from your LLM application
  • feedback_scores: Array of evaluation metrics (optional)

```python
# Get dataset items from the dataset object
dataset_items = list(dataset.get_items())

# Mock LLM responses for this example
# In a real scenario, you would call your actual LLM here
mock_responses = {
    "France": "The capital of France is Paris.",
    "Japan": "Japan's capital is Tokyo.",
    "Brazil": "The capital of Brazil is Rio de Janeiro."  # Incorrect
}

# Prepare evaluation results
evaluation_items = []

for item in dataset_items[:3]:  # Process first 3 items for this example
    # Determine which mock response to use
    question = item['user_question']
    response = "I don't know"

    for country, mock_response in mock_responses.items():
        if country.lower() in question.lower():
            response = mock_response
            break

    # Calculate accuracy (1.0 if expected answer is in response)
    accuracy = 1.0 if item['expected_output'].lower() in response.lower() else 0.0

    evaluation_items.append({
        "dataset_item_id": item['id'],
        "evaluate_task_result": {
            "prediction": response
        },
        "feedback_scores": [
            {
                "name": "accuracy",
                "value": accuracy,
                "source": "sdk"
            }
        ]
    })

print(f"Prepared {len(evaluation_items)} evaluation items")
```

3. Log Experiment Items in Bulk

Use the bulk endpoint to efficiently log multiple evaluation results at once.

```python
experiment_name = "Bulk experiment upload"

# Log experiment results using the bulk method
client.rest_client.experiments.experiment_items_bulk(
    experiment_name=experiment_name,
    dataset_name="geography-questions",
    items=[
        {
            "dataset_item_id": item["dataset_item_id"],
            "evaluate_task_result": item["evaluate_task_result"],
            "feedback_scores": [
                {**score, "source": "sdk"}
                for score in item["feedback_scores"]
            ]
        }
        for item in evaluation_items
    ]
)
```

Request Size Limit: The maximum allowed payload size is 4MB. For larger submissions, divide the data into smaller batches.
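One way to stay under that limit is to split the items into fixed-size batches before calling the bulk endpoint. A minimal sketch (the batch size of 500 is an illustrative assumption; tune it so each request stays below 4MB for your item sizes):

```python
def chunked(items, batch_size=500):
    """Yield successive fixed-size batches from a list of experiment items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Each batch is then logged with a separate bulk call, e.g.:
# for batch in chunked(evaluation_items):
#     client.rest_client.experiments.experiment_items_bulk(
#         experiment_name=experiment_name,
#         dataset_name="geography-questions",
#         items=batch,
#     )
```

All batches logged under the same `experiment_name` are grouped into a single experiment.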

Complete Example

Here’s a complete example that puts all the steps together:

```python
from opik import Opik
import opik
import uuid

# Configure Opik
opik.configure()

# Step 1: Create dataset
client = Opik()
dataset = client.get_or_create_dataset(name="geography-questions")

dataset_items = [
    {
        "user_question": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "user_question": "What is the capital of Japan?",
        "expected_output": "Tokyo"
    }
]

dataset.insert(dataset_items)

# Step 2: Run your LLM application and collect results
# (In a real scenario, you would call your LLM here)

# Helper function to find a dataset item by question
def get_dataset_item(country):
    items = dataset.get_items()
    for item in items:
        if country.lower() in item['user_question'].lower():
            return item
    return None

# Prepare evaluation results
evaluation_items = [
    {
        "dataset_item_id": get_dataset_item("France")['id'],
        "evaluate_task_result": {"prediction": "The capital of France is Paris."},
        "feedback_scores": [{"name": "accuracy", "value": 1.0}]
    },
    {
        "dataset_item_id": get_dataset_item("Japan")['id'],
        "evaluate_task_result": {"prediction": "Japan's capital is Tokyo."},
        "feedback_scores": [{"name": "accuracy", "value": 1.0}]
    }
]

# Step 3: Log experiment results
rest_client = client.rest_client
experiment_name = f"geography-bot-{str(uuid.uuid4())[0:4]}"
rest_client.experiments.experiment_items_bulk(
    experiment_name=experiment_name,
    dataset_name="geography-questions",
    items=[
        {
            "dataset_item_id": item["dataset_item_id"],
            "evaluate_task_result": item["evaluate_task_result"],
            "feedback_scores": [
                {**score, "source": "sdk"}
                for score in item["feedback_scores"]
            ]
        }
        for item in evaluation_items
    ]
)

print(f"Experiment '{experiment_name}' created successfully!")
```

Advanced Usage

Including Traces and Spans

You can include full execution traces with your experiment items for complete observability:

```python
# Include trace information
items_with_traces = [
    {
        "dataset_item_id": "your-dataset-item-id",
        "trace": {
            "name": "geography_query",
            "input": {"question": "What is the capital of France?"},
            "output": {"answer": "Paris"},
            "metadata": {"model": "gpt-3.5-turbo"},
            "start_time": "2024-01-01T00:00:00Z",
            "end_time": "2024-01-01T00:00:01Z"
        },
        "spans": [
            {
                "name": "llm_call",
                "type": "llm",
                "start_time": "2024-01-01T00:00:00Z",
                "end_time": "2024-01-01T00:00:01Z",
                "input": {"prompt": "What is the capital of France?"},
                "output": {"response": "Paris"}
            }
        ],
        "feedback_scores": [
            {"name": "accuracy", "value": 1.0, "source": "sdk"}
        ]
    }
]
```

Important: You may supply either evaluate_task_result or trace — not both.

Java Example

For Java developers, here’s how to integrate with Opik using Jackson and HttpClient (callYourLLM and calculateMetrics are placeholders for your own application logic):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class OpikExperimentLogger {

    public static void main(String[] args) {
        ObjectMapper mapper = new ObjectMapper();

        String baseURI = System.getenv("OPIK_URL_OVERRIDE");
        String workspaceName = System.getenv("OPIK_WORKSPACE");
        String apiKey = System.getenv("OPIK_API_KEY");

        String datasetName = "geography-questions";
        String experimentName = "geography-bot-v1";

        try (var client = HttpClient.newHttpClient()) {
            // Stream dataset items
            var streamRequest = HttpRequest.newBuilder()
                    .uri(URI.create(baseURI).resolve("/v1/private/datasets/items/stream"))
                    .header("Content-Type", "application/json")
                    .header("Accept", "application/octet-stream")
                    .header("Authorization", apiKey)
                    .header("Comet-Workspace", workspaceName)
                    .POST(HttpRequest.BodyPublishers.ofString(
                            mapper.writeValueAsString(Map.of("dataset_name", datasetName))
                    ))
                    .build();

            HttpResponse<InputStream> streamResponse = client.send(
                    streamRequest,
                    HttpResponse.BodyHandlers.ofInputStream()
            );

            List<JsonNode> experimentItems = new ArrayList<>();

            try (var reader = new BufferedReader(new InputStreamReader(streamResponse.body()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    JsonNode datasetItem = mapper.readTree(line);
                    String question = datasetItem.get("data").get("user_question").asText();
                    UUID datasetItemId = UUID.fromString(datasetItem.get("id").asText());

                    // Call your LLM application (placeholder)
                    JsonNode llmOutput = callYourLLM(question);

                    // Calculate metrics (placeholder)
                    List<JsonNode> scores = calculateMetrics(llmOutput);

                    // Build experiment item
                    ArrayNode scoresArray = JsonNodeFactory.instance.arrayNode().addAll(scores);
                    JsonNode experimentItem = JsonNodeFactory.instance.objectNode()
                            .put("dataset_item_id", datasetItemId.toString())
                            .setAll(Map.of(
                                    "evaluate_task_result", llmOutput,
                                    "feedback_scores", scoresArray
                            ));

                    experimentItems.add(experimentItem);
                }
            }

            // Send experiment results in bulk
            var bulkBody = JsonNodeFactory.instance.objectNode()
                    .put("dataset_name", datasetName)
                    .put("experiment_name", experimentName)
                    .setAll(Map.of("items",
                            JsonNodeFactory.instance.arrayNode().addAll(experimentItems)
                    ));

            var bulkRequest = HttpRequest.newBuilder()
                    .uri(URI.create(baseURI).resolve("/v1/private/experiments/items/bulk"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", apiKey)
                    .header("Comet-Workspace", workspaceName)
                    .PUT(HttpRequest.BodyPublishers.ofString(bulkBody.toString()))
                    .build();

            HttpResponse<String> bulkResponse = client.send(
                    bulkRequest,
                    HttpResponse.BodyHandlers.ofString()
            );

            if (bulkResponse.statusCode() == 204) {
                System.out.println("Experiment items successfully created.");
            } else {
                System.err.printf("Failed to create experiment items: %s %s",
                        bulkResponse.statusCode(), bulkResponse.body());
            }

        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Authentication

Configure authentication based on your deployment:

```shell
# No authentication headers required for local deployments
curl -X PUT 'http://localhost:5173/api/v1/private/experiments/items/bulk' \
  -H 'Content-Type: application/json' \
  -d '{ ... }'
```

Environment Variables

For security and flexibility, use environment variables for credentials:

```shell
export OPIK_API_KEY="your_api_key"
export OPIK_WORKSPACE="your_workspace_name"
export OPIK_URL_OVERRIDE="https://www.comet.com/opik/api"
```

Then use them in your code:

```python
import os
from opik import Opik

# Opik SDK will automatically use these environment variables
client = Opik()

# Or for direct REST API calls
headers = {
    "Authorization": os.getenv('OPIK_API_KEY'),
    "Comet-Workspace": os.getenv('OPIK_WORKSPACE')
}
```
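Putting those pieces together, a direct REST call can be assembled from the environment variables. This is a sketch, not the SDK's own implementation: the experiment name is hypothetical, the endpoint path mirrors the curl example above, and sending the request assumes an HTTP client such as the `requests` package is installed.

```python
import json
import os

# Build the bulk-upload request from the environment variables above.
base_url = os.getenv("OPIK_URL_OVERRIDE", "https://www.comet.com/opik/api")
headers = {
    "Authorization": os.getenv("OPIK_API_KEY", ""),
    "Comet-Workspace": os.getenv("OPIK_WORKSPACE", ""),
    "Content-Type": "application/json",
}
payload = {
    "experiment_name": "geography-bot-rest",  # hypothetical experiment name
    "dataset_name": "geography-questions",
    "items": [],  # fill with experiment items as shown in earlier sections
}
body = json.dumps(payload)

# To send, use any HTTP client, e.g. with the `requests` package:
# requests.put(f"{base_url}/v1/private/experiments/items/bulk",
#              headers=headers, data=body)
```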
