Manually logging experiments
Evaluating your LLM application gives you confidence in its performance. In this guide, we will walk through manually creating experiments using data you have already computed.
This guide focuses on logging pre-computed evaluation results. If you’re looking to run evaluations with Opik computing the metrics, refer to the Evaluate your agent and Evaluate single prompts guides.
The process involves these key steps:
- Create a dataset with your test cases
- Prepare your evaluation results
- Log experiment items in bulk
1. Create a Dataset
First, you’ll need to create a dataset containing your test cases. This dataset will be linked to your experiments.
Dataset item IDs will be automatically generated if not provided. If you do provide your own IDs, ensure they are in UUIDv7 format.
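A minimal sketch using the Opik Python SDK; the dataset name and item fields are illustrative placeholders:

```python
from opik import Opik

client = Opik()

# Returns the existing dataset if one with this name already exists,
# so the snippet is safe to re-run.
dataset = client.get_or_create_dataset(name="my-eval-dataset")

# Each item is a plain dictionary. An "id" field is optional and will
# be generated automatically (in UUIDv7 format) if omitted.
dataset.insert([
    {"user_question": "What is the capital of France?", "expected_output": "Paris"},
    {"user_question": "What is the capital of Japan?", "expected_output": "Tokyo"},
])
```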
2. Prepare Evaluation Results
Structure your evaluation results with the necessary fields. Each experiment item should include:
- dataset_item_id: The ID of the dataset item being evaluated
- evaluate_task_result: The output from your LLM application
- feedback_scores: Array of evaluation metrics (optional)
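For example, the prepared results might look like the following; the IDs, outputs, and scores are hypothetical placeholders produced by your own evaluation pipeline:

```python
evaluation_items = [
    {
        # ID of the dataset item this result corresponds to (see step 1)
        "dataset_item_id": "<dataset-item-id>",
        # Output of your LLM application for this test case
        "evaluate_task_result": {"output": "Paris"},
        # Optional list of evaluation metrics
        "feedback_scores": [
            {"name": "accuracy", "value": 1.0, "source": "sdk"},
        ],
    },
]
```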
3. Log Experiment Items in Bulk
Use the bulk endpoint to efficiently log multiple evaluation results at once.
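A sketch using the low-level REST client bundled with the Opik Python SDK (see the SDK reference at the end of this page); the items are shown as plain dictionaries, and the experiment and dataset names are placeholders:

```python
import opik

client = opik.Opik()

# Logs all experiment items in a single request. The experiment is
# created if it does not already exist.
client.rest_client.experiments.experiment_items_bulk(
    experiment_name="my-experiment",
    dataset_name="my-eval-dataset",
    items=evaluation_items,  # the list prepared in step 2
)
```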
Request Size Limit: The maximum allowed payload size is 4MB. For larger submissions, divide the data into smaller batches.
If you wish to divide the data into smaller batches, add the experiment_id to each payload so that every batch of experiment items is added to the same existing experiment. Below is an example of splitting the evaluation_items into two batches, both of which are added to the same experiment:
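This sketch reuses the client from the previous step; how the experiment ID is generated is illustrative, so check the ID format the API expects (it may require UUIDv7, as for dataset items):

```python
import uuid

# Generate one ID up front and reuse it for every batch. uuid4 is used
# here only for illustration; verify the required ID format.
experiment_id = str(uuid.uuid4())

midpoint = len(evaluation_items) // 2
batches = [evaluation_items[:midpoint], evaluation_items[midpoint:]]

for batch in batches:
    client.rest_client.experiments.experiment_items_bulk(
        experiment_id=experiment_id,  # ties both batches to one experiment
        experiment_name="my-experiment",
        dataset_name="my-eval-dataset",
        items=batch,
    )
```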
4. Analyze the Results
Once you have logged your experiment items, you can analyze the results in the Opik UI and even compare different experiments to one another.
Complete Example
Here’s a complete example that puts all the steps together:
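The sketch below combines the steps end to end. Names, fields, and scores are illustrative, run_my_application is a hypothetical stand-in for your own pipeline, and the get_items() call assumes the dataset helper shown in step 1:

```python
import opik

client = opik.Opik()

# 1. Create (or fetch) the dataset and add test cases.
dataset = client.get_or_create_dataset(name="my-eval-dataset")
dataset.insert([
    {"user_question": "What is the capital of France?", "expected_output": "Paris"},
    {"user_question": "What is the capital of Japan?", "expected_output": "Tokyo"},
])

# Placeholder for your own application; replace with your real outputs.
def run_my_application(question: str) -> str:
    return "Paris"

# 2. Build one experiment item per dataset item from your results.
evaluation_items = []
for item in dataset.get_items():
    output = run_my_application(item["user_question"])
    evaluation_items.append({
        "dataset_item_id": item["id"],
        "evaluate_task_result": {"output": output},
        "feedback_scores": [
            {
                "name": "accuracy",
                "value": 1.0 if output == item["expected_output"] else 0.0,
                "source": "sdk",
            },
        ],
    })

# 3. Log everything in one bulk request.
client.rest_client.experiments.experiment_items_bulk(
    experiment_name="my-experiment",
    dataset_name="my-eval-dataset",
    items=evaluation_items,
)
```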
Advanced Usage
Including Traces and Spans
You can include full execution traces with your experiment items for complete observability. To achieve this, add a trace and spans field to your experiment items:
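A sketch of a single experiment item carrying a trace and its spans; the field names shown here are assumptions to verify against the REST API reference, and all IDs, timestamps, and contents are placeholders:

```python
experiment_item = {
    "dataset_item_id": "<dataset-item-id>",
    # Full execution trace (replaces evaluate_task_result; see note below)
    "trace": {
        "name": "qa-pipeline",
        "input": {"question": "What is the capital of France?"},
        "output": {"answer": "Paris"},
        "start_time": "2024-01-01T00:00:00Z",
        "end_time": "2024-01-01T00:00:02Z",
    },
    # Individual steps within the trace
    "spans": [
        {
            "name": "llm-call",
            "type": "llm",
            "input": {"prompt": "What is the capital of France?"},
            "output": {"response": "Paris"},
            "start_time": "2024-01-01T00:00:00Z",
            "end_time": "2024-01-01T00:00:01Z",
        },
    ],
    "feedback_scores": [
        {"name": "accuracy", "value": 1.0, "source": "sdk"},
    ],
}
```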
Each experiment item should include either evaluate_task_result or trace, not both.
Java Example
For Java developers, here’s how to integrate with Opik using Jackson and HttpClient:
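The sketch below shows one possible shape of that integration: it serializes the bulk payload with Jackson and sends it with java.net.http.HttpClient. The base URL, header names, and payload values mirror the REST reference at the end of this page and should be verified against your deployment:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;

public class OpikBulkUpload {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Bulk payload with a single illustrative experiment item.
        Map<String, Object> payload = Map.of(
            "experiment_name", "my-experiment",
            "dataset_name", "my-eval-dataset",
            "items", List.of(
                Map.of(
                    "dataset_item_id", "<dataset-item-id>",
                    "evaluate_task_result", Map.of("output", "Paris"),
                    "feedback_scores", List.of(
                        Map.of("name", "accuracy", "value", 1.0, "source", "sdk")
                    )
                )
            )
        );

        // PUT the JSON payload to the bulk endpoint; credentials are read
        // from environment variables.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://www.comet.com/opik/api/v1/private/experiments/items/bulk"))
            .header("Content-Type", "application/json")
            .header("Authorization", System.getenv("OPIK_API_KEY"))
            .header("Comet-Workspace", System.getenv("OPIK_WORKSPACE"))
            .PUT(HttpRequest.BodyPublishers.ofString(mapper.writeValueAsString(payload)))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```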
Using the REST API with local deployments
If you are using the REST API with a local deployment, you can call all the endpoints using your local instance's base URL:
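For example, assuming the default local instance serves the API at http://localhost:5173/api (adjust the URL if your deployment differs) and that no authentication headers are required:

```python
import requests

# Base URL assumed for a default local deployment; no API key or
# workspace headers are needed in that setup.
response = requests.put(
    "http://localhost:5173/api/v1/private/experiments/items/bulk",
    json={
        "experiment_name": "my-experiment",
        "dataset_name": "my-eval-dataset",
        "items": evaluation_items,
    },
)
response.raise_for_status()
```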
Reference
- Endpoint: PUT /api/v1/private/experiments/items/bulk
- Max Payload Size: 4MB
- Required Fields: experiment_name, dataset_name, items (with dataset_item_id)
- SDK Reference: ExperimentsClient.experiment_items_bulk
- REST API Reference: Experiments API