Troubleshooting

This guide covers common troubleshooting scenarios for self-hosted Opik deployments.

Common Issues

ClickHouse Zookeeper Metadata Loss

Problem Description

If Zookeeper loses the metadata paths for ClickHouse tables, you will see coordination exceptions in the ClickHouse logs and potentially in the opik-backend service logs. These errors indicate that Zookeeper cannot find table metadata paths.

Symptoms:

Error messages appear in the ClickHouse logs and propagate to the opik-backend service:

Code: 999. Coordination::Exception: Coordination error: No node, path /clickhouse/tables/0/default/DATABASECHANGELOG/log. (KEEPER_EXCEPTION)

This indicates that Zookeeper has lost the metadata paths for one or more ClickHouse tables.
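
To confirm the symptom quickly, you can search the pod logs for the Keeper exception. A minimal sketch using example pod and deployment names (adjust them to your deployment, and add -c <container> if a pod runs more than one container):

# Search recent ClickHouse and opik-backend logs for Keeper coordination errors
kubectl logs chi-opik-clickhouse-cluster-0-0-0 --tail=1000 | grep -i "KEEPER_EXCEPTION"
kubectl logs deployment/opik-backend --tail=1000 | grep -i "Coordination::Exception"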

Resolution Steps

Follow these steps to restore ClickHouse table metadata in Zookeeper:

1. Clean Zookeeper Paths (If Needed)

If only some table paths are missing in Zookeeper, delete the remaining paths manually so that all table metadata can be recreated consistently. Connect to the Zookeeper pod and use the Zookeeper CLI:

# Connect to the Zookeeper pod and start the Zookeeper CLI
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkCli.sh -server localhost:2181

# Inside the Zookeeper CLI session, delete all ClickHouse table paths
deleteall /clickhouse/tables

Warning: This operation removes all table metadata from Zookeeper. Proceed with caution.
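
If only a subset of paths is affected and you prefer not to wipe the whole subtree, you can inspect what remains and delete just the affected table paths. A sketch of the same Zookeeper CLI session (paths are examples - adjust database and table names):

# Inside the Zookeeper CLI: inspect which table paths still exist
ls /clickhouse/tables/0
ls /clickhouse/tables/0/default

# Delete only a single table's path instead of the whole subtree
deleteall /clickhouse/tables/0/default/DATABASECHANGELOG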

2. Restart ClickHouse

Restart the ClickHouse pods so they become aware that Zookeeper no longer has the metadata:

kubectl rollout restart statefulset/chi-opik-clickhouse-cluster-0-0
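
Before moving on to the next step, it is worth confirming that the restart has actually completed. A minimal check, using the same StatefulSet name as above:

# Wait for the rollout to finish and confirm the pods are Running
kubectl rollout status statefulset/chi-opik-clickhouse-cluster-0-0
kubectl get pods | grep chi-opik-clickhouse
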
3. Restore Replica Definitions

Connect to the first ClickHouse replica and restore the replica definitions for each table:

# Connect to the first ClickHouse replica
kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client

Important: The Opik schema name is typically opik but may vary depending on your installation. Before proceeding, verify your schema name by running SHOW DATABASES; in ClickHouse and identifying the Opik database. Use that database name in all subsequent commands.

Run the SYSTEM RESTORE REPLICA command for each table:

-- Restore system tables
SYSTEM RESTORE REPLICA default.DATABASECHANGELOG;
SYSTEM RESTORE REPLICA default.DATABASECHANGELOGLOCK;

-- Verify your Opik database name
SHOW DATABASES;

-- List all Opik tables (replace 'opik' with your actual schema name if different)
USE opik;
SHOW TABLES;

-- Restore each Opik table
SYSTEM RESTORE REPLICA opik.attachments;
SYSTEM RESTORE REPLICA opik.automation_rule_evaluator_logs;
SYSTEM RESTORE REPLICA opik.comments;
SYSTEM RESTORE REPLICA opik.dataset_items;
SYSTEM RESTORE REPLICA opik.experiment_items;
SYSTEM RESTORE REPLICA opik.experiments;
SYSTEM RESTORE REPLICA opik.feedback_scores;
SYSTEM RESTORE REPLICA opik.guardrails;
SYSTEM RESTORE REPLICA opik.optimizations;
SYSTEM RESTORE REPLICA opik.project_configurations;
SYSTEM RESTORE REPLICA opik.spans;
SYSTEM RESTORE REPLICA opik.traces;
SYSTEM RESTORE REPLICA opik.trace_threads;
SYSTEM RESTORE REPLICA opik.workspace_configurations;

Note: The exact list of tables may vary depending on your Opik version. Run SHOW TABLES; in your Opik database to list all tables and restore each one.
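
If your deployment has many tables, restoring them one by one is tedious. The loop below is a possible shortcut, not part of the official procedure: it restores every Replicated*MergeTree table in a given database, assuming the pod and database names used in this guide. RESTORE REPLICA refuses to run for tables whose Zookeeper metadata is still intact, so an error for a given table usually just means it did not need restoring.

# Restore every Replicated*MergeTree table in one pass
# (pod and database names are examples - adjust to your installation)
POD=chi-opik-clickhouse-cluster-0-0-0
DB=opik
for t in $(kubectl exec "$POD" -- clickhouse-client --query \
    "SELECT name FROM system.tables WHERE database = '$DB' AND engine LIKE 'Replicated%'"); do
  echo "Restoring $DB.$t"
  kubectl exec "$POD" -- clickhouse-client --query "SYSTEM RESTORE REPLICA $DB.$t"
done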

4. Restart ClickHouse Again

Restart ClickHouse again to ensure it:

  • Re-establishes connections to Zookeeper
  • Verifies and synchronizes the newly restored metadata
  • Automatically resumes normal replication operations

kubectl rollout restart statefulset/chi-opik-clickhouse-cluster-0-0

5. Validate the Recovery

After the restart completes, verify that the replica status is healthy:

-- Check table creation
SHOW CREATE TABLE opik.attachments;

-- Verify replica status
SELECT table, is_readonly, replica_is_active, zookeeper_exception
FROM system.replicas;

Expected Results:

  • is_readonly = 0 (table is writable)
  • replica_is_active = 1 (replica is active)
  • zookeeper_exception = '' (no exceptions)
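
To surface only replicas that are still unhealthy instead of reading the full output, a variation of the query above can filter on the failure conditions; an empty result means everything recovered:

# List only replicas that still report problems (pod name is an example)
kubectl exec chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client --query \
  "SELECT database, table, is_readonly, zookeeper_exception
   FROM system.replicas
   WHERE is_readonly = 1 OR zookeeper_exception != ''"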

You can also verify from the Zookeeper side:

# Connect to the Zookeeper CLI
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkCli.sh -server localhost:2181

# Inside the CLI, list a table's metadata path (example path - adjust database and table names)
ls /clickhouse/tables/0/<database_name>/<table_name>

Diagnostic Commands

Connecting to ClickHouse

Connect directly to ClickHouse pods for diagnostics:

# Connect to first replica
kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client

# Connect to second replica (if running multiple replicas)
kubectl exec -it chi-opik-clickhouse-cluster-0-1-0 -- clickhouse-client
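
A quick way to check that ClickHouse can reach Zookeeper at all is to read from the system.zookeeper table, which proxies the request through the Keeper connection. A sketch using the same pod name:

# Fails with a Coordination error if ClickHouse cannot reach Zookeeper
kubectl exec chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client --query \
  "SELECT name FROM system.zookeeper WHERE path = '/clickhouse'"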

Connecting to Zookeeper

Connect directly to Zookeeper pods:

# Connect to Zookeeper pod
kubectl exec -it cometml-production-opik-zookeeper-0 -- bash

# Run Zookeeper client commands (inside the pod)
zkCli.sh -server localhost:2181

Common Zookeeper commands:

# List tables in Zookeeper
kubectl exec -it cometml-production-opik-zookeeper-0 -- \
  zkCli.sh -server localhost:2181 ls /clickhouse/tables/0/opik

# Remove a specific table from Zookeeper
kubectl exec -it cometml-production-opik-zookeeper-0 -- \
  zkCli.sh -server localhost:2181 \
  deleteall /clickhouse/tables/0/opik/optimizations
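
To check the health of the Zookeeper server itself, rather than the ClickHouse paths it stores, you can ask it for its status. A sketch, assuming zkServer.sh is on the PATH inside the pod (it is in the standard Zookeeper images):

# Reports whether the node is running and whether it is leader, follower, or standalone
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkServer.sh status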

Prevention and Best Practices

To avoid Zookeeper metadata loss issues:

  1. Regular Backups: Implement regular backups of ClickHouse data. See the Advanced ClickHouse Backup guide for details.

  2. Monitoring: Set up monitoring for Zookeeper health and ClickHouse replica status. Alert on zookeeper_exception in system.replicas (a sample check is sketched after this list).

  3. Resource Allocation: Ensure Zookeeper has adequate resources (CPU, memory, disk) to maintain metadata reliably.

  4. Persistent Storage: Use persistent volumes for Zookeeper to prevent data loss during pod restarts.

  5. Replica Validation: Regularly check replica status with the diagnostic queries above.
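
For the monitoring recommendation above, one possible building block is a scheduled check that exits non-zero when any replica is unhealthy, which most alerting setups can act on. A sketch under the same pod-name assumption:

# Example scheduled health check: exits non-zero if any replica is unhealthy
# (pod name is an example - run from a CronJob or your monitoring agent)
UNHEALTHY=$(kubectl exec chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client --query \
  "SELECT count() FROM system.replicas
   WHERE is_readonly = 1 OR is_session_expired = 1 OR zookeeper_exception != ''")
if [ "$UNHEALTHY" -gt 0 ]; then
  echo "ALERT: $UNHEALTHY unhealthy ClickHouse replica(s) detected"
  exit 1
fi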

Getting Help

If you continue to experience issues after following this guide:

  1. Check the Opik GitHub Issues for similar problems
  2. Review ClickHouse and Zookeeper logs for additional error details
  3. Open a new issue on GitHub with:
    • Opik versions:
      • Backend version (opik-backend)
      • Frontend version (opik-frontend)
      • Helm chart version (if deployed via Helm)
    • ClickHouse version
    • Zookeeper version
    • Error logs from all services (ClickHouse, Zookeeper, opik-backend)
    • Steps taken to reproduce the issue