Dataset versioning migration

Understanding the automated dataset versioning migration process

Starting with Opik version 1.9.92, the platform includes an automated migration process to support enhanced dataset versioning capabilities. This migration runs automatically when you first start Opik 1.9.92 or higher.

Migration Duration: Unlike typical Liquibase migrations, the data migration can take 2-10 minutes depending on the size of your datasets. This is normal and expected.

Overview

The dataset versioning migration is a three-part automated process:

1. Liquibase Schema Migration (Automated)

Runs automatically during application startup and migrates existing datasets to the new versioning schema in ClickHouse.

2. MySQL Counter Migration (Automated)

Migrates the dataset version counters in MySQL. This step is controlled by DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED (enabled by default).

Recommendation: After the upgrade completes successfully, you can optionally disable this flag to prevent the migration job from running on subsequent restarts. This is not required: after the first successful run, the migration detects that all data has been migrated and becomes a no-op.

# Disable after successful migration
DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED: false

3. Lazy Migration (Optional Data Reconciliation)

An optional safety mechanism controlled by DATASET_VERSIONING_MIGRATION_LAZY_ENABLED (disabled by default).

When to enable: Only if data was left behind due to concurrent writes during the migration. This handles the edge case where datasets were created during the migration process itself and were not migrated by the Liquibase migration.

How it works: When enabled, every dataset CRUD operation checks whether the dataset was migrated. If not migrated, it automatically migrates the dataset during that operation. This means you can simply access an unmigrated dataset through the UI or API, and it will be migrated on-the-fly.

Usage: If you know a specific dataset was not migrated, enable this flag temporarily and access that dataset. It is recommended to disable the flag once all datasets are aligned (each has a first version), but this is not mandatory: the check only runs for the dataset being accessed, and it is a no-op for datasets that are already migrated.
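As a sketch of what "access that dataset" can look like outside the UI, the helper below builds the URL for a plain GET against the dataset endpoint. The `/v1/private/datasets` path and the default base URL are assumptions; adjust them to your deployment. Any dataset CRUD operation triggers the lazy check, so a simple read is enough:

```shell
# Sketch: build the URL used to "touch" a dataset so the lazy check runs.
# The endpoint path and default base URL are assumptions; adjust as needed.
touch_dataset_url() {
  echo "${OPIK_URL:-http://localhost:5173/api}/v1/private/datasets/$1"
}

# Example: pass the result to curl to trigger lazy migration for one dataset,
# e.g. curl -s "$(touch_dataset_url <dataset-id>)"
touch_dataset_url "0195a3c2-example-id"
```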

Expected Behavior During Migration

First Startup with Version 1.9.92+

Expected Timeline: The migration typically takes 2-10 minutes, longer than standard Liquibase migrations due to data processing requirements.

During this time, you may see transient errors in the logs similar to:

Causing: liquibase.exception.DatabaseException: ClickHouse exception, code: 159, host: clickhouse-opik-clickhouse, port: 8123; Read timed out [Failed SQL: (159) INSERT INTO opik_prod.dataset_item_versions

These errors are expected and transient. The system automatically recovers once the migration process completes. No manual intervention is required.

Migration Process Timeline

The automated migration follows this sequence:

  1. Application starts (t=0s)
  2. Liquibase migration begins (immediately)
    • Migrates dataset schema in ClickHouse
    • Duration: 2-10 minutes (depends on data volume)
  3. Startup delay (default: 30 seconds after app start)
  4. MySQL counter migration begins (after startup delay)
    • Processes dataset versions in batches
    • Updates items_total field in MySQL
  5. Migration completes automatically without downtime
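For planning purposes, the number of batches the MySQL counter migration runs is just the version count divided by the batch size, rounded up. A minimal sketch of that arithmetic (the version counts below are illustrative):

```shell
# Ceiling division: dataset versions / batch size = number of batches.
# Batch size defaults to 100 (DATASET_VERSION_ITEMS_TOTAL_MIGRATION_BATCH_SIZE).
estimate_batches() {
  echo $(( ($1 + ${2:-100} - 1) / ${2:-100} ))
}

estimate_batches 1250        # 13 batches at the default batch size of 100
estimate_batches 1250 500    # 3 batches if the batch size is raised to 500
```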

Configuration Options

All configuration is done through environment variables. The default values are appropriate for most deployments and require no changes.

MySQL Counter Migration

Controls the MySQL dataset version counter migration. This flag can optionally be disabled after the upgrade completes successfully.

| Environment Variable | Default | Description |
| --- | --- | --- |
| DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED | true | Enable the MySQL counter migration on startup. Optionally disable after a successful upgrade. |
| DATASET_VERSION_ITEMS_TOTAL_MIGRATION_BATCH_SIZE | 100 | Number of dataset versions to process per batch |
| DATASET_VERSION_ITEMS_TOTAL_MIGRATION_LOCK_TIMEOUT_SECONDS | 3600 | Lock timeout in seconds (1 hour) |
| DATASET_VERSION_ITEMS_TOTAL_MIGRATION_STARTUP_DELAY_SECONDS | 30 | Delay in seconds before starting the migration after application startup |
| DATASET_VERSION_ITEMS_TOTAL_MIGRATION_JOB_TIMEOUT_SECONDS | 3600 | Maximum time for the entire migration job (1 hour) |

Lazy Migration (Optional Data Reconciliation)

Controls optional data reconciliation for datasets that may have been missed during concurrent writes. Only enable if you suspect data was left behind during migration.

| Environment Variable | Default | Description |
| --- | --- | --- |
| DATASET_VERSIONING_MIGRATION_LAZY_ENABLED | false | Enable lazy migration for reconciling missed datasets (migrate on first access) |

Post-Migration Steps

Step 1: Verify Migration Completion

Monitor the backend logs to confirm successful completion:

# Kubernetes
kubectl logs -n opik deployment/opik-backend -f | grep -i "migration"

# Docker Compose
docker-compose logs -f backend | grep -i "migration"

Look for the completion message:

Dataset version items_total migration completed successfully
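If you prefer a scriptable check over eyeballing the logs, a small sketch like the one below greps a captured log stream for that exact message; pipe in the output of the kubectl or docker-compose log commands above:

```shell
# Sketch: check a captured log stream (stdin) for the completion message.
migration_done() {
  grep -q "Dataset version items_total migration completed successfully"
}

# Example against a captured snippet; prints "migration finished"
printf '%s\n' \
  "Starting dataset versioning items_total migration..." \
  "Dataset version items_total migration completed successfully" \
  | migration_done && echo "migration finished"
```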

Step 2: Optionally Disable MySQL Counter Migration

Once you see the completion message, you can optionally disable the MySQL counter migration flag. This is not required since the migration automatically becomes a no-op after the first successful run, but you may choose to disable it for clarity:

# Docker Compose
services:
  backend:
    environment:
      - DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED=false

# Kubernetes/Helm
component:
  backend:
    env:
      DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED: "false"

If you choose to disable it, restart the backend service to apply the change.

Step 3: Verify Data Integrity

After migration, verify that:

  1. Your datasets are accessible and contain all expected data
  2. Each dataset has a first version (version 1) visible in the UI or API

Additional Validation Queries

You can run these queries to validate the migration completed successfully:

Verify item counts match between ClickHouse tables:

-- Both queries should return the same count
SELECT count(DISTINCT id) FROM dataset_items;
SELECT count(DISTINCT id) FROM dataset_item_versions;
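If you capture the two counts from a script (for example via `clickhouse-client -q`), a small comparison helper keeps the check mechanical. This is an illustrative sketch, not part of Opik:

```shell
# Sketch: compare the distinct-id counts from the two ClickHouse tables.
counts_match() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: both tables report $1 distinct ids"
  else
    echo "MISMATCH: dataset_items=$1 dataset_item_versions=$2"
    return 1
  fi
}

counts_match 42 42   # OK: both tables report 42 distinct ids
```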

Verify all MySQL dataset version counters are updated:

-- Should return 0
-- A non-zero count indicates versions with an incomplete items_total migration
SELECT count(id)
FROM dataset_versions
WHERE items_total = -1;

For comprehensive validation, refer to the test_dataset_versioning_migration.sql script which includes additional checks for:

  • Verifying version 1 created for all datasets
  • Validating ‘latest’ tags are properly assigned
  • Checking dataset metadata integrity
  • Confirming ClickHouse data consistency

Troubleshooting and Recovery

Scenario 1: Migration Takes Longer Than Expected

Symptoms:

  • Migration runs for more than 10 minutes
  • No completion message in logs after extended period

Root Causes:

  • Very large number of dataset items
  • ClickHouse or MySQL performance issues
  • Resource constraints (CPU, memory, disk I/O)

Recovery Steps:

  1. Check if migration is still progressing:

    # Monitor backend logs for progress
    kubectl logs -n opik deployment/opik-backend -f | grep -i "migration"
  2. If migration is stuck, increase timeouts:

    # Extend timeouts for large deployments
    DATASET_VERSION_ITEMS_TOTAL_MIGRATION_LOCK_TIMEOUT_SECONDS: "7200" # 2 hours
    DATASET_VERSION_ITEMS_TOTAL_MIGRATION_JOB_TIMEOUT_SECONDS: "7200" # 2 hours
  3. Check database health:

    # Verify ClickHouse is responsive
    kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client -q "SELECT 1"

    # Check MySQL connectivity
    kubectl exec -it mysql-pod -- mysql -u opik -p -e "SELECT 1"
  4. Restart the backend service after adjusting configuration:

    # Kubernetes
    kubectl rollout restart deployment/opik-backend -n opik

    # Docker Compose
    docker-compose restart backend

Scenario 2: Transient Timeout Errors During Migration

Symptoms:

liquibase.exception.DatabaseException: ClickHouse exception, code: 159
Read timed out

What’s Happening:

These timeout errors are expected and transient during the migration process. They occur because:

  • The Liquibase migration is actively writing large amounts of data to ClickHouse
  • Temporary resource contention during data migration
  • Normal behavior for the 2-10 minute migration window

Recovery Steps:

Simply wait for the migration to complete. These errors are temporary and the system automatically recovers. No manual intervention is required.

If errors persist beyond 15-20 minutes:

  1. Check migration progress in logs:

    # Kubernetes
    kubectl logs -n opik deployment/opik-backend -f | grep -i "migration"

    # Docker Compose
    docker-compose logs -f backend | grep -i "migration"
  2. Verify ClickHouse is responding:

    # Check ClickHouse is healthy
    kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client -q "SELECT 1"
  3. If ClickHouse is unresponsive, check resources:

    # Check ClickHouse CPU and memory usage
    kubectl top pod -n opik | grep clickhouse
  4. Only if ClickHouse is severely resource-constrained, consider increasing resources temporarily during migration.
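To judge whether the errors are subsiding rather than persisting, you can count them in successive log captures; a falling count between captures suggests the migration window is passing. A minimal sketch, run against the output of the log commands above:

```shell
# Sketch: count ClickHouse code-159 timeouts in a captured log window.
count_timeouts() {
  grep -c "ClickHouse exception, code: 159" || true
}

# Example on a captured snippet; compare counts between two captures. Prints 1.
printf '%s\n' \
  "Causing: liquibase.exception.DatabaseException: ClickHouse exception, code: 159" \
  "INFO  normal application log line" \
  | count_timeouts
```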

Scenario 3: Data Missing After Migration

Symptoms:

  • Some datasets were not migrated
  • Users report incomplete data

Root Causes:

  • Concurrent writes during migration
  • Migration job interrupted before completion

Recovery Steps:

  1. Enable lazy migration to reconcile missed data:

    # Docker Compose
    services:
      backend:
        environment:
          - DATASET_VERSIONING_MIGRATION_LAZY_ENABLED=true

    # Kubernetes/Helm
    component:
      backend:
        env:
          DATASET_VERSIONING_MIGRATION_LAZY_ENABLED: "true"
  2. Restart the backend service:

    # Kubernetes
    kubectl rollout restart deployment/opik-backend -n opik

    # Docker Compose
    docker-compose restart backend
  3. Access affected datasets through the UI or API to trigger lazy migration for those specific datasets.

  4. Monitor logs for lazy migration activity:

    kubectl logs -n opik deployment/opik-backend -f | grep -i "lazy.*migration"
  5. After reconciliation is complete, disable lazy migration:

    DATASET_VERSIONING_MIGRATION_LAZY_ENABLED: "false"
  6. Restart the backend to apply the change.

Scenario 4: Cannot Disable Migration Flag

Symptoms:

  • Migration continues to run after disabling the flag
  • Completion message appears on every restart

Root Causes:

  • Configuration not properly applied
  • Using wrong environment variable name
  • Backend not restarted after configuration change

Recovery Steps:

  1. Verify the exact environment variable name:

    # Correct variable name
    DATASET_VERSION_ITEMS_TOTAL_MIGRATION_ENABLED: "false"
  2. Check that configuration is applied:

    # Kubernetes - check environment variables
    kubectl get pod -n opik -l app=opik-backend -o jsonpath='{.items[0].spec.containers[0].env[*]}'

    # Docker Compose - check container environment
    docker-compose exec backend env | grep DATASET_VERSION
  3. Ensure backend was restarted after configuration change:

    # Kubernetes
    kubectl rollout restart deployment/opik-backend -n opik
    kubectl rollout status deployment/opik-backend -n opik

    # Docker Compose
    docker-compose restart backend
  4. Monitor logs to confirm migration is skipped:

    kubectl logs -n opik deployment/opik-backend -f | grep -i "migration"

    You should see logs indicating the migration is disabled or skipped.

Best Practices

Before Migration

  1. Back up your databases before upgrading to version 1.9.92+

  2. Review resource allocation:

    • Ensure adequate CPU, memory, and disk space
    • Consider the 2-10 minute migration window
  3. Plan for the migration window:

    • Schedule the upgrade during low-traffic periods
    • Inform users of the expected 2-10 minute migration time

During Migration

  1. Monitor logs actively during the first startup
  2. Don’t restart the backend while migration is in progress
  3. Verify database connectivity if issues occur

After Migration

  1. Verify completion by checking for the success message in logs
  2. Optionally disable the MySQL items_total migration flag (not required - it becomes a no-op after first run)
  3. Validate data integrity using the verification queries:
    • Compare item counts between ClickHouse tables
    • Verify all MySQL counters are updated (no items_total = -1)
    • Test dataset access through UI and API
  4. Keep lazy migration disabled unless data reconciliation is needed

Monitoring Migration Progress

Check Migration Status

Monitor the backend logs to track migration progress:

# Kubernetes
kubectl logs -n opik deployment/opik-backend -f | grep -i "migration"

# Docker Compose
docker-compose logs -f backend | grep -i "migration"

Expected Log Messages

During migration, you should see logs indicating:

Starting dataset versioning items_total migration...
Processing dataset versions batch...
Migration completed successfully, processed X dataset versions
Dataset version items_total migration completed successfully
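The "processed X dataset versions" line is the one worth recording in an upgrade runbook; a small sketch for extracting the count from a captured log line:

```shell
# Sketch: extract the processed-version count from the completion log line.
processed_count() {
  sed -n 's/.*processed \([0-9][0-9]*\) dataset versions.*/\1/p'
}

echo "Migration completed successfully, processed 512 dataset versions" \
  | processed_count    # prints 512
```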

Additional Resources