Enabling Large CSV Uploads

Configure Opik to support large CSV file uploads for datasets

By default, Opik supports CSV file uploads up to 20MB for dataset creation. For self-hosted deployments that need to process larger CSV files (up to 2GB), you can enable the large CSV upload feature with additional configuration.

Overview

When enabled, this feature allows:

  • CSV files up to 2GB in size
  • Asynchronous processing - files are processed in the background after upload

Configuration Steps

1. Enable the Feature Toggle

Set the following environment variable for the Opik backend service:

TOGGLE_CSV_UPLOAD_ENABLED: true

2. Increase Idle Timeout

Large file uploads require more time to transfer. Increase the server idle timeout:

SERVER_IDLE_TIMEOUT: 10m

The default timeout is 30 seconds, which is insufficient for large file uploads. We recommend setting it to 10 minutes for files up to 2GB.
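
For example, a 2GB file is roughly 16,000 megabits of data; over a 50 Mbit/s connection that is about 320 seconds, or just over 5 minutes, of transfer time alone, so a 10 minute timeout leaves reasonable headroom.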

3. Configure Nginx (Kubernetes/Helm Deployments)

If you’re using the Helm chart deployment, add the following configuration to your values.yaml:

component:
  frontend:
    # Increase client body size limit to 2GB
    clientMaxBodySize: "2g"

    # Increase proxy timeouts for large file uploads
    upstreamConfig:
      proxy_read_timeout: 600s
      proxy_connect_timeout: 600s
      proxy_send_timeout: 600s
      client_max_body_size: 2g

4. Ensure Adequate Disk Space

The backend service temporarily buffers uploaded CSV files to disk before processing them. Ensure your backend pods/containers have:

  • Minimum 50GB of disk space available
  • Sufficient IOPS for concurrent file operations
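
How pod resources are exposed varies between chart versions, so the keys below are an assumption rather than a confirmed part of the Opik chart. If your chart passes standard Kubernetes resource settings through to the backend pod, reserving space for the temporary buffer could look like this sketch:

component:
  backend:
    # Assumed pass-through of standard Kubernetes resources; check your chart's values schema
    resources:
      requests:
        ephemeral-storage: "50Gi"    # reserve space for temporarily buffered CSV files
      limits:
        ephemeral-storage: "100Gi"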

5. Optional: Adjust Batch Size

You can optionally configure the batch size for CSV processing:

BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE: 1000

The default batch size is 1000 rows per batch. Adjust this based on your:

  • Available memory
  • Row complexity (number of columns, data size)
  • Desired processing speed
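
As a rough rule of thumb (assumed figures, not measurements), the memory held per in-flight batch is approximately batch size × average row size × a parsing overhead factor, so wide rows call for smaller batches. A sketch for a memory-constrained backend:

# Hypothetical sizing: 1000 rows × ~10 KB per row × ~3x parsing overhead ≈ 30 MB per batch.
# Halving or quartering the batch size trades throughput for memory headroom.
BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE: 250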

Docker Compose Deployments

For Docker Compose deployments, the configuration is slightly different:

1. Update docker-compose.yml

Add the environment variables to the backend service:

services:
  backend:
    environment:
      - TOGGLE_CSV_UPLOAD_ENABLED=true
      - SERVER_IDLE_TIMEOUT=10m
      - BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE=1000
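
If you prefer to keep the stock docker-compose.yml untouched, the same variables can be layered in through an override file, which Docker Compose merges automatically. This is a sketch and assumes the backend service is named backend, as above:

# docker-compose.override.yml
services:
  backend:
    environment:
      - TOGGLE_CSV_UPLOAD_ENABLED=true
      - SERVER_IDLE_TIMEOUT=10m
      - BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE=1000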

2. Update Nginx Configuration

The nginx configuration files already include the 2GB limit for local deployments. No additional changes are needed for nginx_default_local.conf or nginx_local_be_local.conf.

Kubernetes/Helm Deployment Example

Here’s a complete example for Helm chart deployments:

# values.yaml
component:
  backend:
    env:
      TOGGLE_CSV_UPLOAD_ENABLED: "true"
      SERVER_IDLE_TIMEOUT: "10m"
      BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE: "1000"

    # Ensure adequate disk space
    persistence:
      enabled: true
      size: 100Gi # Adjust based on your needs

  frontend:
    clientMaxBodySize: "2g"

    upstreamConfig:
      proxy_read_timeout: 600s
      proxy_connect_timeout: 600s
      proxy_send_timeout: 600s
      client_max_body_size: 2g

Then upgrade your Helm release:

helm upgrade opik opik/opik -n opik -f values.yaml
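
To confirm the backend picked up the new settings after the rollout, you can check the rendered environment. The deployment name matches the one used in the log command further below; adjust it if your release is named differently:

kubectl -n opik rollout status deployment/opik-backend
kubectl -n opik exec deploy/opik-backend -- env | grep -E 'TOGGLE_CSV_UPLOAD_ENABLED|SERVER_IDLE_TIMEOUT|CSV_BATCH_SIZE'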

Verification

After applying the configuration:

  1. Restart services to apply the changes
  2. Test with a small CSV first (< 100MB) to verify the feature works (a script for generating a synthetic test file follows the sample log output below)
  3. Monitor logs during upload to ensure proper processing:
# Kubernetes
kubectl logs -n opik deployment/opik-backend -f | grep CSV

# Docker Compose
docker-compose logs -f backend | grep CSV

You should see log messages like:

CSV upload request for dataset 'xxx' on workspaceId 'xxx'
CSV upload accepted for dataset 'xxx' on workspaceId 'xxx', processing asynchronously
Starting asynchronous CSV processing for dataset 'xxx' on workspaceId 'xxx'
CSV processing completed for dataset 'xxx', total items: 'xxx'
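
If you don't have a suitable test file at hand, a synthetic CSV of roughly 50MB can be generated locally. The column names and row contents below are placeholders, so match them to whatever columns your dataset expects:

# Roughly 250,000 rows at ~200 bytes each ≈ 50 MB
awk 'BEGIN {
  print "input,expected_output"                      # placeholder header
  pad = sprintf("%180s", ""); gsub(/ /, "x", pad)    # ~180-byte filler to control row size
  for (i = 1; i <= 250000; i++) print "question " i " " pad ",answer " i
}' > test-upload.csv
ls -lh test-upload.csv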

Troubleshooting

Upload Fails with 413 Error

Problem: HTTP 413 Request Entity Too Large

Solution: Verify that the nginx configuration sets client_max_body_size 2g at the server level, not only inside individual location blocks.
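
In raw nginx terms, the difference looks like the sketch below; the port and upstream address are placeholders for your own configuration:

server {
    listen 80;
    client_max_body_size 2g;        # server-level: applies to every location below

    location /api/ {
        proxy_pass http://backend:8080;
        # setting client_max_body_size only here would leave other locations at the 1m default
    }
}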

Upload Succeeds but Processing Fails

Problem: The file uploads successfully, but items don’t appear in the dataset

Solution:

  1. Check backend logs for processing errors
  2. Verify adequate disk space is available
  3. Check memory limits - large CSV files require sufficient memory for processing

Timeout Errors

Problem: Upload times out before completing

Solution:

  1. Increase SERVER_IDLE_TIMEOUT further (e.g., to 15m or 20m)
  2. Increase nginx proxy timeouts in upstreamConfig
  3. Check network bandwidth between client and server

Out of Memory Errors

Problem: Backend service crashes or restarts during processing

Solution:

  1. Reduce BATCH_OPERATIONS_DATASETS_CSV_BATCH_SIZE to process smaller batches
  2. Increase backend service memory limits
  3. Process smaller CSV files or split large files into multiple uploads (one splitting approach is sketched below)
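
For the last option, one portable way to split a file while repeating the header row in every part (big.csv and the 2,000,000-row chunk size are placeholders):

header=$(head -n 1 big.csv)
tail -n +2 big.csv | split -l 2000000 - chunk_            # split the data rows into fixed-size chunks
for f in chunk_*; do
  { echo "$header"; cat "$f"; } > "${f}.csv" && rm "$f"   # prepend the header to each chunk
done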

Additional Resources