Offline fallback and message replay
The Opik Python SDK includes a built-in offline fallback mechanism that protects your tracing data during network outages. When the SDK cannot reach the Opik server, messages are automatically persisted to a local SQLite database. Once connectivity is restored, all stored messages are replayed to the server transparently, with no changes required in your application code.
The offline fallback feature is available in the Python SDK only and is enabled by default with no configuration required.
How it works
The feature operates entirely in the background across three phases:
1. Detection: A lightweight background thread (OpikConnectionMonitor) periodically pings the /is-alive/ping endpoint on the Opik server. When a ping fails, or when sending a message fails with a connection error, the SDK marks the connection as unavailable.
2. Storage: While the connection is unavailable, every new message is immediately written to a local SQLite database (stored in a system temporary directory) instead of being sent over the network. If a message was in flight when the connection dropped, it is re-marked as failed and added to the same store.
3. Replay: When the OpikConnectionMonitor detects that the server is reachable again, a ReplayManager thread reads all stored messages in configurable batches and reinjects them into the SDK's normal processing pipeline, which then delivers them to the server just like any other message.
The SQLite database is cleaned up automatically when the SDK shuts down. Delivered messages are deleted from the database as soon as the server confirms receipt.
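The store-and-replay pattern described above can be illustrated with a minimal sketch. This is not the SDK's implementation: OfflineStore, its table layout, and the send callback are hypothetical names chosen for illustration, and only the standard library is used. Messages are saved to SQLite while offline, then read back in batches and deleted once the send callback returns without error.

```python
import json
import sqlite3
import time
from typing import Callable, Dict


class OfflineStore:
    """Hypothetical stand-in for a local message store (not the Opik API)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )

    def save(self, message: Dict) -> None:
        # Persist one message while the connection is unavailable.
        with self._db:
            self._db.execute(
                "INSERT INTO messages (payload) VALUES (?)", (json.dumps(message),)
            )

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]

    def replay(
        self, send: Callable[[Dict], None], batch_size: int = 50, delay: float = 0.5
    ) -> int:
        """Send stored messages oldest first, in batches; return the count delivered."""
        delivered = 0
        while True:
            rows = self._db.execute(
                "SELECT id, payload FROM messages ORDER BY id LIMIT ?", (batch_size,)
            ).fetchall()
            if not rows:
                return delivered
            for row_id, payload in rows:
                send(json.loads(payload))  # an exception here leaves the row stored
                with self._db:
                    self._db.execute("DELETE FROM messages WHERE id = ?", (row_id,))
                delivered += 1
            if self.pending():
                time.sleep(delay)  # pause between batches, not after the last one


store = OfflineStore()
for i in range(3):
    store.save({"trace": i})

sent = []
store.replay(sent.append, batch_size=2, delay=0)
```

Deleting each row only after send returns mirrors the behaviour described above, where delivered messages are removed once receipt is confirmed.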
Supported message types
All SDK operations that produce the following message types are protected by the offline fallback:
Configuration
The offline fallback works out of the box with sensible defaults. You can tune its behaviour using
environment variables or the ~/.opik.config file.
Environment variables
Set these before starting your application:
Configuration file
Add the parameters to your ~/.opik.config file under the [opik] section:
Configuration reference
Tuning for your environment
High-volume applications
If your application logs many traces per second, a large backlog may accumulate during an outage. To replay it quickly after recovery, increase the batch size and reduce the inter-batch delay:
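As a sketch, the two replay variables named in the troubleshooting section of this page could be set from Python before the SDK is loaded. The values 200 and 0.1 are illustrative assumptions, not recommendations, and setting them in-process assumes the SDK reads its environment at import or client creation; exporting them in the shell before starting the application is the approach this page describes.

```python
import os

# Illustrative values only (assumptions): larger batches, shorter pause between them.
os.environ["OPIK_REPLAY_BATCH_SIZE"] = "200"
os.environ["OPIK_REPLAY_BATCH_REPLAY_DELAY"] = "0.1"

# Import and configure the SDK only after this point so it sees the new values.
```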
Memory-constrained environments
To limit the amount of memory used when reading messages from the database during replay:
Slow or unreliable networks
If connectivity is intermittent, reduce the ping interval so the SDK stops trying to send messages sooner after an outage begins:
Fast recovery detection
To minimise the delay between the server becoming available again and replay starting:
Recovery time estimate
The approximate time to replay a backlog after connectivity is restored is:
Example: 500 stored messages with default settings (batch_size=50, delay=0.5 s):
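A rough way to reason about this estimate, assuming replay time is dominated by the pause between batches, is to multiply the number of batches by the inter-batch delay. The helper below is a hypothetical illustration of that assumption, not an SDK function, and it ignores network and server processing time.

```python
import math


def estimate_replay_seconds(
    stored_messages: int, batch_size: int = 50, delay: float = 0.5
) -> float:
    """Approximate replay time as (number of batches) x (inter-batch delay)."""
    batches = math.ceil(stored_messages / batch_size)
    return batches * delay


# 500 messages / 50 per batch = 10 batches; 10 x 0.5 s = 5.0 s
print(estimate_replay_seconds(500))
```

Under this assumption, raising the batch size or lowering the delay shortens recovery roughly in proportion.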
Graceful degradation
If the local SQLite database itself becomes unavailable (for example, the temporary directory is not writable), the SDK logs a warning and continues operating without the offline fallback. Tracing data will be lost during any later outage, but the application will not crash.
Ensure the system temporary directory is writable by the process running the SDK. On most systems
this is /tmp or the path returned by tempfile.gettempdir().
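A quick way to check this from Python is to try creating a file in the temporary directory. The helper name dir_is_writable is hypothetical; it uses only the standard tempfile module and does not touch the SDK.

```python
import tempfile


def dir_is_writable(path: str) -> bool:
    """Return True if a temporary file can be created and written in `path`."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"ok")
        return True
    except OSError:
        return False


# Check the directory that tempfile would use for this process.
print(tempfile.gettempdir(), dir_is_writable(tempfile.gettempdir()))
```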
Troubleshooting
Messages are not replayed after recovery
- Verify connectivity: Run opik healthcheck to confirm the SDK can reach the server.
- Check the ping interval: The SDK may take up to connection_monitor_ping_interval seconds to detect that the server is back. With the default of 10 seconds, wait at least 10–15 seconds after the server recovers before concluding that replay is not happening.
- Call client.flush(): Explicitly flushing the client triggers an immediate replay attempt and waits for all pending messages to be delivered.
Large backlog taking too long to replay
Increase OPIK_REPLAY_BATCH_SIZE and decrease OPIK_REPLAY_BATCH_REPLAY_DELAY as shown in the
high-volume tuning section above.
Database error in logs
If you see a log message such as "Some network resiliency features were disabled", the SQLite
database could not be initialised. Check that the temporary directory is writable and that there is
sufficient disk space.
Enable debug logging
To see detailed replay activity, enable debug logging before importing opik:
Then inspect /tmp/opik-debug.log for entries from replay_manager and db_manager.
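To filter the log for those entries, a small helper like the following can be used. It is a hypothetical convenience function, not part of the SDK, and it assumes the logger names appear verbatim in each log line.

```python
from pathlib import Path
from typing import List, Sequence


def replay_log_lines(
    log_path: str = "/tmp/opik-debug.log",
    needles: Sequence[str] = ("replay_manager", "db_manager"),
) -> List[str]:
    """Return log lines that mention any of the given component names."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if any(needle in line for needle in needles)
        ]


for entry in replay_log_lines():
    print(entry)
```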