Dataset Versioning Migration
Starting with Opik version 1.9.92, the platform includes an automated migration process to support enhanced dataset versioning capabilities. This migration runs automatically when you first start Opik 1.9.92 or higher.
Migration Duration: Unlike typical Liquibase migrations, the data migration can take 2-10 minutes depending on the size of your datasets. This is normal and expected.
Overview
The dataset versioning migration is a three-part automated process:
1. Liquibase Schema Migration (Automated)
Automatically runs during application startup and migrates existing datasets to the new versioning schema in ClickHouse.
2. MySQL Counter Migration (Automated, Recommended to Disable After Completion)
Migrates dataset version counters in MySQL. This is controlled by DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED (enabled by default).
Recommendation: After the upgrade completes successfully, you can optionally disable this flag to prevent the migration job from running on subsequent restarts. However, this is not required; after the first successful run, the migration detects that all data has been migrated and becomes a no-op.
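How the flag is set depends on your deployment (Helm values, Docker Compose, or a plain environment file). As a minimal sketch, assuming an environment-variable based setup:

```shell
# Optional: skip the MySQL counter migration job on future restarts.
# The variable name comes from this guide; the export form is an
# assumption about how your deployment passes configuration.
export DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED=false
```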
3. Lazy Migration (Optional Data Reconciliation)
An optional safety mechanism controlled by DATASET_VERSIONING_MIGRATION_LAZY_ENABLED (disabled by default).
When to enable: Only if data was left behind due to concurrent writes during the migration. This handles the edge case where datasets were created during the migration process itself and were not migrated by the Liquibase migration.
How it works: When enabled, every dataset CRUD operation checks whether the dataset was migrated. If not migrated, it automatically migrates the dataset during that operation. This means you can simply access an unmigrated dataset through the UI or API, and it will be migrated on-the-fly.
Usage: If you know a specific dataset was not migrated, enable this flag temporarily and access that dataset. It’s recommended to disable the flag once all datasets are aligned (have a first version), but this is not mandatory: the check only runs for the dataset being accessed, and it is a no-op for datasets that are already migrated.
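The temporary enable/disable workflow can be sketched as shell environment settings (how these reach the backend depends on your deployment):

```shell
# 1. Enable lazy migration so unmigrated datasets are migrated on access.
export DATASET_VERSIONING_MIGRATION_LAZY_ENABLED=true
# 2. Restart the backend, then access the affected dataset(s) via UI or API.
# 3. Optionally disable the flag again once every dataset has a first version.
export DATASET_VERSIONING_MIGRATION_LAZY_ENABLED=false
```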
Expected Behavior During Migration
First Startup with Version 1.9.92+
Expected Timeline: The migration typically takes 2-10 minutes, longer than standard Liquibase migrations due to data processing requirements.
During this time, you may see transient errors in the logs, typically timeout or connection errors from ClickHouse while the migration writes data.
These errors are expected and transient. The system automatically recovers once the migration process completes. No manual intervention is required.
Migration Process Timeline
The automated migration follows this sequence:
- Application starts (t=0s)
- Liquibase migration begins immediately
  - Migrates the dataset schema in ClickHouse
  - Duration: 2-10 minutes, depending on data volume
- Startup delay elapses (default: 30 seconds after application start)
- MySQL counter migration begins after the startup delay
  - Processes dataset versions in batches
  - Updates the items_total field in MySQL
- Migration completes automatically, without downtime
Configuration Options
All configuration is done through environment variables. The default values are appropriate for most deployments and require no changes.
MySQL Counter Migration (Recommended to Disable After Completion)
Controls the MySQL dataset version counter migration. You may optionally disable it after the upgrade completes successfully; this is not required, as the migration becomes a no-op after the first successful run.
Lazy Migration (Optional Data Reconciliation)
Controls optional data reconciliation for datasets that may have been missed during concurrent writes. Only enable if you suspect data was left behind during migration.
Post-Migration Steps
Step 1: Verify Migration Completion
Monitor the backend logs to confirm successful completion, and look for the migration completion message.
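A sketch of how to watch the logs; the deployment and container names are assumptions, and the sample log lines below are placeholders, not Opik's actual log format:

```shell
# In Kubernetes or Docker, something like:
#   kubectl logs -f deployment/opik-backend | grep -i "migration"
#   docker logs -f opik-backend 2>&1 | grep -i "migration"
# The same filter applied to a saved log file (placeholder content):
printf '%s\n' \
  "INFO  dataset versioning migration started" \
  "INFO  dataset versioning migration completed" > backend.log
grep -i "completed" backend.log
```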
Step 2: Optionally Disable MySQL Counter Migration
Once you see the completion message, you can optionally disable the MySQL counter migration flag. This is not required, since the migration automatically becomes a no-op after the first successful run, but you may choose to disable it for clarity.
If you choose to disable it, restart the backend service to apply the change.
Step 3: Verify Data Integrity
After migration, verify that:
- Your datasets are accessible and contain all expected data
- Each dataset has a first version (version 1) visible in the UI or API
Additional Validation Queries
You can run validation queries to confirm the migration completed successfully:
- Verify that item counts match between the ClickHouse tables
- Verify that all MySQL dataset version counters are updated
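As a sketch of the MySQL check (the table name dataset_versions and the connection details are assumptions; per this guide, a counter of items_total = -1 indicates an unmigrated row):

```shell
# Count MySQL counters that were never updated; table name `dataset_versions`
# is an assumption, adapt it to your schema. Expected result after a
# successful migration: 0.
MYSQL_CHECK='SELECT COUNT(*) AS unmigrated FROM dataset_versions WHERE items_total = -1;'
# Run it against your MySQL instance, e.g.:
#   mysql -h <mysql-host> -u <user> -p opik -e "$MYSQL_CHECK"
echo "$MYSQL_CHECK"
```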
For comprehensive validation, refer to the test_dataset_versioning_migration.sql script which includes additional checks for:
- Verifying version 1 created for all datasets
- Validating ‘latest’ tags are properly assigned
- Checking dataset metadata integrity
- Confirming ClickHouse data consistency
Troubleshooting and Recovery
Scenario 1: Migration Takes Longer Than Expected
Symptoms:
- Migration runs for more than 10 minutes
- No completion message in logs after extended period
Root Causes:
- Very large number of dataset items
- ClickHouse or MySQL performance issues
- Resource constraints (CPU, memory, disk I/O)
Recovery Steps:
1. Check whether the migration is still progressing by watching the backend logs.
2. If the migration appears stuck, increase the relevant database timeouts.
3. Check ClickHouse and MySQL health.
4. Restart the backend service after adjusting the configuration.
Scenario 2: Transient Timeout Errors During Migration
Symptoms:
- Transient timeout errors appear in the backend logs during the first startup after upgrading
What’s Happening:
These timeout errors are expected and transient during the migration process: the Liquibase migration is actively writing large amounts of data to ClickHouse, which causes temporary resource contention. This is normal behavior for the 2-10 minute migration window.
Recovery Steps:
Simply wait for the migration to complete. These errors are temporary, and the system recovers automatically; no manual intervention is required.
If errors persist beyond 15-20 minutes:
1. Check migration progress in the logs.
2. Verify that ClickHouse is responding.
3. If ClickHouse is unresponsive, check its CPU, memory, and disk usage.
4. Only if ClickHouse is severely resource-constrained, consider temporarily increasing its resources for the migration window.
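One way to script the responsiveness check is a small polling helper. ClickHouse's HTTP interface (default port 8123) exposes a /ping endpoint; the host name below is an assumption about your deployment:

```shell
# Poll a health URL until it responds, up to ~5 minutes by default.
wait_for_ok() {
  url=$1
  attempts=${2:-30}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "responding"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "not responding"
  return 1
}
# Against a real deployment:
#   wait_for_ok "http://<clickhouse-host>:8123/ping"
```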
Scenario 3: Data Missing After Migration
Symptoms:
- Some datasets were not migrated
- Users report incomplete data
Root Causes:
- Concurrent writes during migration
- Migration job interrupted before completion
Recovery Steps:
1. Enable lazy migration (set DATASET_VERSIONING_MIGRATION_LAZY_ENABLED=true) to reconcile the missed data.
2. Restart the backend service.
3. Access the affected datasets through the UI or API to trigger lazy migration for those specific datasets.
4. Monitor the logs for lazy migration activity.
5. After reconciliation is complete, disable lazy migration.
6. Restart the backend to apply the change.
Scenario 4: Cannot Disable Migration Flag
Symptoms:
- Migration continues to run after disabling the flag
- Completion message appears on every restart
Root Causes:
- Configuration not properly applied
- Using wrong environment variable name
- Backend not restarted after configuration change
Recovery Steps:
1. Verify the exact environment variable name: DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED.
2. Check that the configuration is actually applied to the backend's environment.
3. Ensure the backend was restarted after the configuration change.
4. Monitor the logs to confirm the migration is skipped.
You should see logs indicating the migration is disabled or skipped.
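A sketch of the environment check; the deployment and container names are assumptions:

```shell
# Inside the running backend, confirm the variable is present and false:
#   kubectl exec deploy/opik-backend -- env | grep DATASET_VERSION
#   docker exec opik-backend env | grep DATASET_VERSION
# The same check in a local shell:
export DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED=false
env | grep DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED
```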
Best Practices
Before Migration
- Backup your databases before upgrading to version 1.9.92+
  - See the Advanced ClickHouse Backup guide
  - Backup MySQL data as well
- Review resource allocation:
  - Ensure adequate CPU, memory, and disk space
  - Consider the 2-10 minute migration window
- Plan for the migration window:
  - Schedule the upgrade during low-traffic periods
  - Inform users of the expected 2-10 minute migration time
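The MySQL backup step can be sketched as follows; the host, credentials, and the database name opik are assumptions, and the linked guide covers ClickHouse:

```shell
# Logical MySQL backup before upgrading:
#   mysqldump -h <mysql-host> -u <user> -p --single-transaction opik > opik-mysql-backup.sql
# Sanity-check that the dump file is non-empty before proceeding:
check_backup() {
  [ -s "$1" ] && echo "backup looks non-empty" || echo "backup is missing or empty"
}
# check_backup opik-mysql-backup.sql
```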
During Migration
- Monitor logs actively during the first startup
- Don’t restart the backend while migration is in progress
- Verify database connectivity if issues occur
After Migration
- Verify completion by checking for the success message in logs
- Optionally disable the MySQL items_total migration flag (not required - it becomes a no-op after first run)
- Validate data integrity using the verification queries:
- Compare item counts between ClickHouse tables
- Verify that all MySQL counters are updated (no items_total = -1)
- Test dataset access through the UI and API
- Keep lazy migration disabled unless data reconciliation is needed
Monitoring Migration Progress
Check Migration Status
Monitor the backend logs to track migration progress.
Expected Log Messages
During migration, you should see log messages indicating that the migration has started, batch-by-batch progress, and a final completion message.
Additional Resources
- Troubleshooting - General troubleshooting guide
- Scaling Opik - Performance optimization guidelines
- Kubernetes Deployment - Helm chart configuration
- Advanced ClickHouse Backup - Database backup procedures