Troubleshooting
This guide covers common troubleshooting scenarios for self-hosted Opik deployments.
Common Issues
ClickHouse Zookeeper Metadata Loss
Problem Description
If Zookeeper loses the metadata paths for ClickHouse tables, you will see coordination exceptions in the ClickHouse logs and potentially in the opik-backend service logs. These errors indicate that Zookeeper cannot find table metadata paths.
Symptoms:
Error messages appear in the ClickHouse logs and propagate to the opik-backend service:
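The exact wording depends on your ClickHouse version, but the errors typically reference `Coordination::Exception` and a missing Zookeeper node. An illustrative (not verbatim) excerpt, with the actual table path replaced by placeholders:

```
Code: 999. Coordination::Exception: No node, path: /clickhouse/tables/<shard>/<database>/<table>/replicas/<replica> (KEEPER_EXCEPTION)
```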
This indicates that Zookeeper has lost the metadata paths for one or more ClickHouse tables.
Resolution Steps
Follow these steps to restore ClickHouse table metadata in Zookeeper:
1. Clean Zookeeper Paths (If Needed)
If only some table paths are missing in Zookeeper, delete the remaining paths manually so that every table can be restored from a clean state. Connect to the Zookeeper pod and use the Zookeeper CLI:
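A minimal sketch of the cleanup, assuming a Kubernetes deployment where the Zookeeper pod is named `opik-zookeeper-0`, the namespace is `opik`, and ClickHouse keeps its table metadata under `/clickhouse/tables` (all three are assumptions; adjust them to your release, namespace, and ClickHouse configuration):

```bash
# Inspect the table metadata paths that ClickHouse registered in Zookeeper
kubectl exec opik-zookeeper-0 -n opik -- zkCli.sh -server localhost:2181 ls /clickhouse/tables

# Delete the path and all of its children ("deleteall" requires Zookeeper 3.5+)
kubectl exec opik-zookeeper-0 -n opik -- zkCli.sh -server localhost:2181 deleteall /clickhouse/tables
```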
Warning: This operation removes all table metadata from Zookeeper. Proceed with caution.
2. Restart ClickHouse
Restart the ClickHouse pods so they become aware that Zookeeper no longer has the metadata:
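For example, if ClickHouse runs as a StatefulSet (the resource name and namespace below are assumptions; if an operator manages your ClickHouse installation, restart it the way that operator documents):

```bash
# Rolling restart of the ClickHouse pods, then wait for them to come back
kubectl rollout restart statefulset/opik-clickhouse -n opik
kubectl rollout status statefulset/opik-clickhouse -n opik
```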
3. Restore Replica Definitions
Connect to the first ClickHouse replica and restore the replica definitions for each table:
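A sketch of connecting to the first replica and confirming which database holds the Opik tables (pod name and namespace are assumptions):

```bash
# Connect to the first ClickHouse replica and list the available databases
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "SHOW DATABASES"
```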
Important: The Opik schema name is typically opik but may vary depending on your installation. Before proceeding, verify your schema name by running SHOW DATABASES; in ClickHouse and identifying the Opik database. Use that database name in all subsequent commands.
Run the SYSTEM RESTORE REPLICA command for each table:
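A sketch assuming the database is named `opik`; the table names below are placeholders, so substitute every replicated table that exists in your installation (see the note that follows):

```bash
# Restore the replica metadata for each replicated table (placeholder table names)
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "SYSTEM RESTORE REPLICA opik.traces"
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "SYSTEM RESTORE REPLICA opik.spans"

# List the replicated tables if you are unsure which ones need restoring
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query \
  "SELECT name FROM system.tables WHERE database = 'opik' AND engine LIKE 'Replicated%'"
```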
Note: The exact list of tables may vary depending on your Opik version. Use SHOW TABLES; (or the \d shortcut in clickhouse-client) to list all tables in your database and restore each one.
4. Restart ClickHouse Again
Restart ClickHouse again, using the same command as in step 2 (see the sketch after this list), to ensure it:
- Re-establishes connections to Zookeeper
- Verifies and synchronizes the newly restored metadata
- Automatically resumes normal replication operations
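The restart itself is the same command as in step 2 (names are assumptions):

```bash
kubectl rollout restart statefulset/opik-clickhouse -n opik
kubectl rollout status statefulset/opik-clickhouse -n opik
```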
5. Validate the Recovery
After the restart completes, verify that the replica status is healthy:
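A sketch of the validation query (pod name, namespace, and database name are assumptions; the exact set of columns in `system.replicas` varies slightly across ClickHouse versions):

```bash
# Check replica health for every replicated table in the Opik database
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "
  SELECT database, table, is_readonly, replica_is_active, zookeeper_exception
  FROM system.replicas
  WHERE database = 'opik'
  FORMAT Vertical"
```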
Expected Results:
- `is_readonly = 0` (table is writable)
- `replica_is_active = 1` (replica is active)
- `zookeeper_exception = ''` (no exceptions)
You can also verify from the Zookeeper side:
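For example, by listing the restored table paths from the Zookeeper pod (pod name, namespace, and path are assumptions):

```bash
# The table metadata paths should exist again after the restore
kubectl exec opik-zookeeper-0 -n opik -- zkCli.sh -server localhost:2181 ls /clickhouse/tables
```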
Diagnostic Commands
Connecting to ClickHouse
Connect directly to ClickHouse pods for diagnostics:
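For example (pod name and namespace are assumptions):

```bash
# Open an interactive clickhouse-client session on a ClickHouse pod
kubectl exec -it opik-clickhouse-0 -n opik -- clickhouse-client

# Or run a single query without an interactive session
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "SELECT version()"
```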
Connecting to Zookeeper
Connect directly to Zookeeper pods:
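For example (pod name and namespace are assumptions):

```bash
# Open an interactive Zookeeper CLI session on a Zookeeper pod
kubectl exec -it opik-zookeeper-0 -n opik -- zkCli.sh -server localhost:2181
```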
Common Zookeeper commands:
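A few commands that are useful once connected (shown here as one-shot invocations; the `/clickhouse` paths are the usual defaults but may differ in your configuration):

```bash
zkCli.sh -server localhost:2181 ls /clickhouse           # list top-level ClickHouse znodes
zkCli.sh -server localhost:2181 ls /clickhouse/tables    # list table metadata paths
zkCli.sh -server localhost:2181 stat /clickhouse/tables  # show znode status (children, versions)
zkCli.sh -server localhost:2181 get /clickhouse/tables   # print the data stored at a znode
```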
Prevention and Best Practices
To avoid Zookeeper metadata loss issues:
- Regular Backups: Implement regular backups of ClickHouse data. See the Advanced ClickHouse Backup guide for details.
- Monitoring: Set up monitoring for Zookeeper health and ClickHouse replica status. Alert on `zookeeper_exception` in `system.replicas` (a sample query is shown after this list).
- Resource Allocation: Ensure Zookeeper has adequate resources (CPU, memory, disk) to maintain metadata reliably.
- Persistent Storage: Use persistent volumes for Zookeeper to prevent data loss during pod restarts.
- Replica Validation: Regularly check replica status with the diagnostic queries above.
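As referenced in the Monitoring item above, a sketch of a query that could back such an alert (pod, namespace, and database names are assumptions):

```bash
# Alert when any replica is read-only or reports a Zookeeper exception
kubectl exec opik-clickhouse-0 -n opik -- clickhouse-client --query "
  SELECT database, table, is_readonly, zookeeper_exception
  FROM system.replicas
  WHERE is_readonly = 1 OR zookeeper_exception != ''"
```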
Getting Help
If you continue to experience issues after following this guide:
- Check the Opik GitHub Issues for similar problems
- Review ClickHouse and Zookeeper logs for additional error details
- Open a new issue on GitHub with:
  - Opik versions:
    - Backend version (opik-backend)
    - Frontend version (opik-frontend)
    - Helm chart version (if deployed via Helm)
  - ClickHouse version
  - Zookeeper version
  - Error logs from all services (ClickHouse, Zookeeper, opik-backend)
  - Steps taken to reproduce the issue