Scaling Opik

Comprehensive guide for scaling Opik in production environments

Opik is built to power mission-critical workloads at scale. Whether you’re running a small proof of concept or a high-volume enterprise deployment, Opik adapts to your needs. Its stateless services and ClickHouse-backed storage make it resilient and horizontally scalable as your data grows.

This guide outlines recommended configurations and best practices for running Opik in production.

Proven at Scale

Opik is engineered to handle demanding, production-grade workloads. The following example demonstrates the robustness of a typical deployment:

| Metric | Value |
| --- | --- |
| Select queries per second | ~80 |
| Insert queries per second | ~20 |
| Rows inserted per minute | Up to 75K |
| Traces (count) | 40 million |
| Traces (size) | 400 GB |
| Spans (count) | 250 million |
| Spans (size) | 3.1 TB |
| Total data on disk | 5 TB |
| Weekly data ingestion | 100 GB |
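To make these figures easier to reason about, the short sketch below derives a few per-item averages from them. It assumes decimal units (1 GB = 10^9 bytes) and that the reported sizes are the stored sizes; the source does not say whether they are measured before or after compression, so treat the results as rough orders of magnitude.

```python
# Figures copied from the table above. Decimal units are an assumption.
GB = 10**9

traces_count = 40_000_000
traces_bytes = 400 * GB
spans_count = 250_000_000
spans_bytes = 3_100 * GB          # 3.1 TB
peak_rows_per_minute = 75_000

# Average stored size per trace and per span, in kilobytes.
avg_trace_kb = traces_bytes / traces_count / 1000
avg_span_kb = spans_bytes / spans_count / 1000

# Peak insert rate expressed per second, and the average fan-out of spans per trace.
peak_rows_per_second = peak_rows_per_minute / 60
spans_per_trace = spans_count / traces_count

print(f"average trace size: {avg_trace_kb:.1f} KB")
print(f"average span size:  {avg_span_kb:.1f} KB")
print(f"peak insert rate:   {peak_rows_per_second:.0f} rows/s")
print(f"spans per trace:    {spans_per_trace:.2f}")
```

Under these assumptions a trace averages about 10 KB, a span about 12.4 KB, and the peak insert rate is 1,250 rows per second.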

A deployment of this scale is fully supported using:

  • Opik Backend and Opik Frontend - running on r7i.2xlarge instances with 2 replicas
  • Opik Python Backend - running on c7i.2xlarge instances with 3 replicas
  • ClickHouse - running on m7i.8xlarge instances with 2 replicas

This configuration provides both performance and reliability while leaving room for seamless expansion.

Built for Growth

Opik is designed with flexibility at its core. As your data grows and query volumes increase, Opik grows with you.

  • Horizontal scaling - add more replicas of services to instantly handle more traffic
  • Vertical scaling - increase CPU, memory, or storage to handle denser workloads
  • Seamless elasticity - scale out during peak usage and scale back during quieter periods

For larger workloads, ClickHouse can be scaled to support enterprise-level deployments. A common configuration includes:

  • 62 CPU cores
  • 256 GB RAM
  • 25 TB disk space

ClickHouse’s read path can also scale horizontally by increasing replicas, ensuring Opik continues to deliver high performance as usage grows.

Resilient Services Cluster

Opik services are stateless and fault-tolerant, ensuring high availability across environments. Recommended resources:

| Environment | CPU (vCPU) | RAM (GB) |
| --- | --- | --- |
| Development | 4 | 8 |
| Production | 13 | 32 |

Instance Guidance

| Deployment | Instance | vCPUs | Memory (GiB) |
| --- | --- | --- | --- |
| Dev (small) | c7i.large | 2 | 4 |
| Dev | c7i.xlarge | 4 | 8 |
| Prod (small) | c7i.2xlarge | 8 | 16 |
| Prod | c7i.4xlarge | 16 | 32 |

Backend Service (Scales to Demand)

| Metric | Dev | Prod Small | Prod Large |
| --- | --- | --- | --- |
| Replicas | 2 | 5 | 7 |
| CPU cores | 1 | 2 | 2 |
| Memory (GiB) | 2 | 9 | 12 |

Frontend Service (Always Responsive)

| Metric | Dev | Prod Small | Prod Large |
| --- | --- | --- | --- |
| Replicas | 2 | 3 | 5 |
| CPU (millicores) | 5 | 50 | 50 |
| Memory (MiB) | 16 | 32 | 64 |
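As a rough illustration of what the Backend and Frontend tables imply for cluster capacity, the sketch below totals CPU and memory across replicas for the Prod Large column. It assumes the CPU and memory values in the tables are per replica, which the tables do not state explicitly; `ServiceSizing` is a hypothetical helper, not part of Opik.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSizing:
    """Hypothetical helper: per-replica resources for one service."""
    replicas: int
    cpu_millicores: int   # per replica (assumed)
    memory_mib: int       # per replica (assumed)

    def total_cpu_cores(self) -> float:
        return self.replicas * self.cpu_millicores / 1000

    def total_memory_gib(self) -> float:
        return self.replicas * self.memory_mib / 1024


# Prod Large column: Backend = 7 replicas x 2 cores x 12 GiB,
# Frontend = 5 replicas x 50 millicores x 64 MiB.
backend = ServiceSizing(replicas=7, cpu_millicores=2000, memory_mib=12 * 1024)
frontend = ServiceSizing(replicas=5, cpu_millicores=50, memory_mib=64)

total_cpu = backend.total_cpu_cores() + frontend.total_cpu_cores()
total_mem = backend.total_memory_gib() + frontend.total_memory_gib()

print(f"total CPU:    {total_cpu} cores")
print(f"total memory: {total_mem} GiB")
```

Under the per-replica assumption, the Prod Large tier needs about 14.25 cores and 84.3 GiB across both services, almost all of it for the Backend.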

ClickHouse: High-Performance Storage

At the heart of Opik’s scalability is ClickHouse, a proven, high-performance analytical database designed for large-scale workloads. Opik leverages ClickHouse for storing traces and spans, ensuring fast queries, robust ingestion, and uncompromising reliability.

Instance Types

Memory-optimized instances are recommended, with a memory-to-CPU ratio of at least 4:1 (4 GiB of memory per vCPU):

| Deployment | Instance |
| --- | --- |
| Small | m7i.2xlarge |
| Medium | m7i.4xlarge |
| Large | m7i.8xlarge |

Replication Strategy

  • Development: 1 replica
  • Production: 2 replicas

For efficiency, scale vertically before adding more replicas.
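One way to read the vertical-first guidance is as a simple decision rule: move up the instance ladder until the largest listed size is reached, and only then add a replica. The sketch below encodes that reading. The ladder comes from the Instance Types list, while the function name and the exact policy are illustrative assumptions rather than Opik behaviour.

```python
# Instance sizes from the Instance Types list, smallest to largest.
INSTANCE_LADDER = ["m7i.2xlarge", "m7i.4xlarge", "m7i.8xlarge"]


def next_scaling_step(instance: str, replicas: int) -> tuple[str, int]:
    """Return the (instance, replicas) pair to try next when more capacity is needed.

    Hypothetical policy: grow the instance first, add a replica only at the top size.
    """
    position = INSTANCE_LADDER.index(instance)  # raises ValueError for unknown sizes
    if position < len(INSTANCE_LADDER) - 1:
        return INSTANCE_LADDER[position + 1], replicas
    return instance, replicas + 1


print(next_scaling_step("m7i.2xlarge", 2))
print(next_scaling_step("m7i.8xlarge", 2))
```

Starting from two m7i.2xlarge replicas, the rule first suggests two m7i.4xlarge replicas; at m7i.8xlarge it suggests a third replica instead.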

CPU & Memory Guidance

Target 10–20% CPU utilization, with safe spikes up to 40–50%.
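The utilization targets can be turned into a small check for dashboards or alerts. The sketch below is one possible interpretation: it treats 20% as the ceiling for sustained usage and 50% as the ceiling for spikes, and the three labels it returns are illustrative names, not Opik or ClickHouse terminology.

```python
def assess_cpu(sustained_pct: float, peak_pct: float) -> str:
    """Classify ClickHouse CPU usage against the targets stated above.

    Assumed thresholds: sustained usage of 10-20% is on target,
    and spikes are tolerated up to 50%.
    """
    if sustained_pct > 20 or peak_pct > 50:
        return "add capacity"
    if sustained_pct < 10:
        return "headroom available"
    return "on target"


print(assess_cpu(sustained_pct=15, peak_pct=45))
print(assess_cpu(sustained_pct=25, peak_pct=45))
print(assess_cpu(sustained_pct=5, peak_pct=30))
```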

Maintain a memory-to-CPU ratio of at least 4:1 (GiB of memory per CPU core), extending to 8:1 for very large environments.

| Deployment | CPU cores | Memory (GiB) |
| --- | --- | --- |
| Minimum | 2 | 8 |
| Development | 4 | 16 |
| Production (small) | 6 | 24 |
| Production | 32 | 128 |
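The ratio rule is easy to verify mechanically when planning a custom size. The following sketch checks each tier from the table above against the 4:1 minimum; the helper name `meets_ratio` is hypothetical.

```python
MIN_GIB_PER_CORE = 4  # minimum memory-to-CPU ratio stated above

# (CPU cores, memory GiB) per deployment tier, copied from the table above.
SIZINGS = {
    "Minimum": (2, 8),
    "Development": (4, 16),
    "Production (small)": (6, 24),
    "Production": (32, 128),
}


def meets_ratio(cpu_cores: int, memory_gib: int,
                min_ratio: float = MIN_GIB_PER_CORE) -> bool:
    """Return True if the sizing has at least min_ratio GiB of memory per core."""
    return memory_gib / cpu_cores >= min_ratio


for name, (cpu, mem) in SIZINGS.items():
    print(f"{name}: {mem / cpu:.1f} GiB per core, ok={meets_ratio(cpu, mem)}")
```

Every tier in the table sits exactly at 4 GiB per core. A compute-optimized shape such as 16 cores with 32 GiB would fail the check.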

Disk Recommendations

To ensure reliable performance under heavy load:

| Volume | Value |
| --- | --- |
| Family | SSD |
| Type | gp3 |
| Size | 8–16 TiB (workload dependent) |
| IOPS | 3000 |
| Throughput | 250 MiB/s |

Opik’s ClickHouse layer is resilient even under sustained, large-scale ingestion, ensuring queries stay fast.
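To pick a volume size within the recommended range, it can help to estimate how long a volume lasts at the current ingestion rate. The sketch below uses the example deployment's figures from earlier in this guide (5 TB on disk, 100 GB ingested per week) and assumes linear growth, decimal GB for those figures, and a planning ceiling of 80% disk usage. The 80% ceiling is an assumption for illustration, not an Opik recommendation.

```python
def weeks_until_full(capacity_gb: float, used_gb: float,
                     weekly_growth_gb: float, max_fill: float = 0.8) -> float:
    """Weeks until usage reaches max_fill of capacity, assuming linear growth."""
    usable_gb = capacity_gb * max_fill
    return max(0.0, (usable_gb - used_gb) / weekly_growth_gb)


# Convert TiB volume sizes to decimal GB (1 TiB is about 1099.5 GB).
TIB_IN_GB = 2**40 / 10**9

small_volume_weeks = weeks_until_full(8 * TIB_IN_GB, used_gb=5000, weekly_growth_gb=100)
large_volume_weeks = weeks_until_full(16 * TIB_IN_GB, used_gb=5000, weekly_growth_gb=100)

print(f"8 TiB volume:  about {small_volume_weeks:.0f} weeks of runway")
print(f"16 TiB volume: about {large_volume_weeks:.0f} weeks of runway")
```

With these inputs an 8 TiB volume leaves roughly 20 weeks of runway and a 16 TiB volume roughly 91 weeks, before any retention policy is applied.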

Managing System Tables

System tables (e.g., system.opentelemetry_span_log) can grow quickly. To keep storage lean:

  • Configure TTL settings in ClickHouse, or
  • Perform periodic manual pruning
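Both options can be expressed as ClickHouse statements. The sketch below only builds the SQL strings, so it can be reviewed before being run through whichever ClickHouse client you use. It relies on the standard `ALTER TABLE ... MODIFY TTL` and `ALTER TABLE ... DELETE WHERE` forms. The `finish_date` column and the 7-day retention are assumptions: confirm the date column with `DESCRIBE TABLE` on your server and choose a retention period that fits your needs. Retention for system log tables can also be set in the ClickHouse server configuration, which is not shown here.

```python
def _check_days(retention_days: int) -> None:
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")


def ttl_statement(table: str, date_column: str, retention_days: int) -> str:
    """Build a statement that lets ClickHouse expire old rows automatically."""
    _check_days(retention_days)
    return (f"ALTER TABLE {table} "
            f"MODIFY TTL {date_column} + INTERVAL {retention_days} DAY")


def prune_statement(table: str, date_column: str, retention_days: int) -> str:
    """Build a one-off mutation that deletes rows older than the retention window."""
    _check_days(retention_days)
    return (f"ALTER TABLE {table} "
            f"DELETE WHERE {date_column} < today() - INTERVAL {retention_days} DAY")


# Assumed column name and retention period; verify both before running.
table = "system.opentelemetry_span_log"
print(ttl_statement(table, "finish_date", 7))
print(prune_statement(table, "finish_date", 7))
```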

Why Opik Scales with Confidence

  • Enterprise-ready - built to support multi-terabyte data volumes
  • Elastic & flexible - easily adjust resources to match workload demands
  • Robust & reliable - designed for high availability and long-term stability
  • Future-proof - proven to support growing usage without redesign

With Opik, you can start small and scale confidently, knowing your observability platform won’t hold you back.
