The Opik Python SDK includes a built-in offline fallback mechanism that protects your tracing data during network outages. When the SDK cannot reach the Opik server, messages are automatically persisted to a local SQLite database. Once connectivity is restored, all stored messages are replayed to the server transparently, with no changes required in your application code.
The offline fallback feature is available in the Python SDK only and is enabled by default with no configuration required.
The feature operates entirely in the background across three phases:
1. Detection: A lightweight background thread, OpikConnectionMonitor, periodically pings the /is-alive/ping endpoint on the Opik server. When a ping fails, or when sending a message fails with a connection error, the SDK marks the connection as unavailable.
2. Storage: While the connection is unavailable, every new message is immediately written to a local SQLite database (stored in a system temporary directory) instead of being sent over the network. If a message was in flight when the connection dropped, it is re-marked as failed and added to the same store.
3. Replay: When the OpikConnectionMonitor detects that the server is reachable again, a ReplayManager thread reads all stored messages in configurable batches and reinjects them into the SDK's normal processing pipeline. From there, they are delivered to the server just like any other message.
The SQLite database is cleaned up automatically when the SDK shuts down. Delivered messages are deleted from the database as soon as the server confirms receipt.
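The store-and-replay cycle described above can be illustrated with a small, self-contained sketch. This is not the SDK's implementation: the OfflineStore class, its table layout, and the send callback are hypothetical names chosen for illustration, and the sketch omits the connection monitor and the delay between batches.

```python
import json
import sqlite3
from typing import Callable


class OfflineStore:
    """Hypothetical stand-in for a local message store (not the Opik classes)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )

    def save(self, message: dict) -> None:
        # Storage phase: persist the message instead of sending it.
        with self._db:
            self._db.execute(
                "INSERT INTO messages (payload) VALUES (?)", (json.dumps(message),)
            )

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]

    def replay(self, send: Callable[[dict], bool], batch_size: int = 50) -> int:
        # Replay phase: read stored messages in batches, oldest first, and
        # delete each one only after `send` reports successful delivery.
        delivered = 0
        last_id = 0
        while True:
            rows = self._db.execute(
                "SELECT id, payload FROM messages WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, batch_size),
            ).fetchall()
            if not rows:
                return delivered
            for row_id, payload in rows:
                last_id = row_id
                if send(json.loads(payload)):
                    with self._db:
                        self._db.execute(
                            "DELETE FROM messages WHERE id = ?", (row_id,)
                        )
                    delivered += 1


received = []


def fake_send(message: dict) -> bool:
    # Pretend the server accepted the message.
    received.append(message)
    return True


store = OfflineStore()
for i in range(120):
    store.save({"trace_id": i})

delivered = store.replay(fake_send, batch_size=50)
print(delivered, store.pending())
```

Deleting a row only after the send callback succeeds means a message that fails again stays in the store for a later replay attempt.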
All SDK operations that produce the following message types are protected by the offline fallback:
The offline fallback works out of the box with sensible defaults. You can tune its behaviour using
environment variables or the ~/.opik.config file.
Set these before starting your application:
Add the parameters to your ~/.opik.config file under the [opik] section:
If your application logs many traces per second, a large backlog may accumulate during an outage. To replay it quickly after recovery, increase the batch size and reduce the inter-batch delay:
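As a minimal sketch, the two replay variables named in the troubleshooting notes on this page can be set from Python before the SDK is imported. The values 200 and 0.1 are illustrative only, not recommendations from the Opik documentation, and the sketch assumes the SDK reads these variables from the process environment when it is imported or when a client is created.

```python
import os

# Illustrative values only: a larger batch and a shorter pause between batches.
os.environ["OPIK_REPLAY_BATCH_SIZE"] = "200"
os.environ["OPIK_REPLAY_BATCH_REPLAY_DELAY"] = "0.1"

# Import and configure opik only after the variables are in place.
```

Exporting the same variables in the shell that launches the application has the same effect and avoids depending on import order.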
To limit the amount of memory used when reading messages from the database during replay:
If connectivity is intermittent, reduce the ping interval so the SDK stops trying to send messages sooner after an outage begins:
To minimise the delay between the server becoming available again and replay starting:
The approximate time to replay a backlog after connectivity is restored is:
Example: 500 stored messages with default settings (batch_size=50, delay=0.5 s):
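The arithmetic behind such an estimate can be sketched as follows. The function name is hypothetical, and the model is an assumption: it counts one inter-batch delay per batch and ignores the time spent reading from the database and sending over the network, so it gives a rough figure rather than a guaranteed duration.

```python
import math


def estimate_replay_seconds(
    stored_messages: int, batch_size: int = 50, batch_delay: float = 0.5
) -> float:
    """Rough replay time: number of batches times the delay per batch."""
    if stored_messages <= 0:
        return 0.0
    batches = math.ceil(stored_messages / batch_size)
    return batches * batch_delay


estimate = estimate_replay_seconds(500)
print(estimate)  # 10 batches x 0.5 s = 5.0
```

Under these assumptions, 500 messages at the default settings replay in about 5 seconds, and doubling the batch size roughly halves that figure.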
If the local SQLite database itself becomes unavailable (for example, the temporary directory is not writable), the SDK logs a warning and continues operating without the offline fallback. Tracing data will be lost during any later outage, but the application will not crash.
Ensure the system temporary directory is writable by the process running the SDK. On most systems
this is /tmp or the path returned by tempfile.gettempdir().
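One way to verify this from the same process and user that runs the SDK is to try creating a file in the temporary directory. The helper name below is hypothetical, and the check uses only the Python standard library.

```python
import tempfile


def temp_dir_is_writable() -> bool:
    """Return True if a file can be created in the system temporary directory."""
    try:
        # gettempdir() itself raises an OSError subclass if no usable directory exists.
        with tempfile.NamedTemporaryFile(dir=tempfile.gettempdir()) as handle:
            handle.write(b"opik-fallback-check")
        return True
    except OSError:
        return False


writable = temp_dir_is_writable()
print(writable)
```

A False result means the offline fallback is unlikely to be able to create its database in that directory.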
- Run opik healthcheck to confirm the SDK can reach the server.
- Allow up to connection_monitor_ping_interval seconds for the SDK to detect that the server is back. With the default of 10 seconds, wait at least 10–15 seconds after the server recovers before concluding that replay is not happening.
- Call client.flush(): explicitly flushing the client triggers an immediate replay attempt and waits for all pending messages to be delivered.
- To speed up replay, increase OPIK_REPLAY_BATCH_SIZE and decrease OPIK_REPLAY_BATCH_REPLAY_DELAY as shown in the high-volume tuning section above.
If you see a log message such as "Some network resiliency features were disabled", the SQLite
database could not be initialised. Check that the temporary directory is writable and that there is
sufficient disk space.
To see detailed replay activity, enable debug logging before importing opik:
Then inspect /tmp/opik-debug.log for entries from replay_manager and db_manager.