Here are the most relevant improvements we’ve made since the last release:
🦞 Native OpenClaw Observability with Opik
We’ve released opik-openclaw, a native OpenClaw plugin that gives you full-stack observability for your agents, powered by Opik. This brings enterprise-grade tracing, evaluation, and monitoring to the fastest-growing open-source agent framework.
What you get:
- Full Trace Capture - Every LLM call, tool execution, memory recall, context assembly, and agent delegation is logged with complete input/output pairs, token counts, latency, and cost
- End-to-End Conversation Threading - Trace a request from the initial message through multi-step reasoning, tool calls, and the final response, even when the agent chains across sub-agents or scheduled heartbeats
- Real Cost Visibility - Per-request, per-model cost breakdowns so you can see exactly where tokens are going and optimize accordingly
- Automated Evaluation with LLM-as-a-Judge - Set up hallucination detection, answer relevance, and context precision metrics that run automatically on your traces
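The per-request, per-model cost breakdowns above boil down to grouping token usage by model and pricing each bucket. A minimal sketch of that aggregation (the price table and span records here are illustrative placeholders, not Opik's actual data model or real provider pricing):

```python
from collections import defaultdict

# Illustrative per-1M-token prices in USD; NOT actual provider pricing.
PRICE_PER_M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}

def cost_breakdown(spans):
    """Group LLM spans by model and sum their token costs."""
    totals = defaultdict(float)
    for span in spans:
        price = PRICE_PER_M[span["model"]]
        totals[span["model"]] += (
            span["input_tokens"] / 1e6 * price["input"]
            + span["output_tokens"] / 1e6 * price["output"]
        )
    return dict(totals)

spans = [
    {"model": "gpt-4o", "input_tokens": 1200, "output_tokens": 300},
    {"model": "gpt-4o", "input_tokens": 800, "output_tokens": 200},
    {"model": "claude-sonnet-4-6", "input_tokens": 2000, "output_tokens": 500},
]
breakdown = cost_breakdown(spans)
```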
Get started in two minutes: install the plugin with `openclaw plugins install @opik/opik-openclaw`, configure your API key, and traces start flowing immediately. Works with both Opik Cloud and self-hosted instances.
👉 Visit the GitHub repository here
🤖 Expanded Model & Provider Support
We’ve broadened the range of models and providers you can use across the platform, giving you more flexibility in how you build and evaluate your LLM applications.
What’s new:
- Gemini 3.1 Support - Google’s Gemini 3.1 is now available as a supported model across the platform
- Claude Sonnet 4.6 as Default - Claude Sonnet 4.6 is now the default Anthropic model, bringing improved performance out of the box
- OpenRouter Native UX - OpenRouter now has a much more native out-of-the-box experience in the Opik UI. `openrouter/free` is directly selectable, and `openrouter/*` route models, including `/auto`, are supported and prioritized in model selection
- Updated Default Models - The Python SDK has been updated to retire legacy gpt-4* defaults in favor of more current models
- OpenAI TTS Tracking - You can now track OpenAI text-to-speech model calls (audio.speech) with full tracing support
- OpenAI-Compatible Providers for LLM-as-a-Judge - Use any OpenAI-compatible provider when running LLM-as-a-Judge evaluation metrics, giving you more flexibility in choosing your evaluation model
📦 SDK Improvements
We’ve continued to expand the capabilities of both the TypeScript and Python SDKs, making it easier to integrate Opik into your workflows programmatically.
What’s new:
- G-Eval Metric (TypeScript) - The G-Eval evaluation metric is now available in the TypeScript SDK, enabling structured LLM-based evaluation directly from your TypeScript projects
- Annotation Queue Support (TypeScript) - Manage and interact with annotation queues programmatically from the TypeScript SDK
- Thread Search (TypeScript) - Search through conversation threads programmatically with the new `searchThreads` functionality in the TypeScript SDK
- Offline Message Persistence (Python) - When the Python SDK loses connectivity to the Opik server, telemetry messages are now persisted locally in a SQLite database and automatically replayed once the connection is restored, ensuring no data is lost during network outages
- OTEL Integration Docs Expansion - We’ve shipped a major expansion of our OpenTelemetry integration documentation, including new pages and updated guidance for multiple frameworks and providers with emphasis on TypeScript
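The offline message persistence described above follows a standard persist-and-replay pattern: buffer messages in SQLite while the server is unreachable, then drain the buffer in order once sending succeeds. A minimal sketch of that pattern (an illustration only, not the SDK's actual implementation):

```python
import json
import sqlite3

class OfflineBuffer:
    """Persist messages locally while offline; replay them in order on reconnect."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT)"
        )

    def enqueue(self, message):
        self.db.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(message),)
        )
        self.db.commit()

    def replay(self, send):
        """Call send() for each buffered message; delete only after a successful send."""
        rows = self.db.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
        for row_id, payload in rows:
            send(json.loads(payload))  # raises on failure, leaving the row intact
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        self.db.commit()
        return len(rows)

buf = OfflineBuffer()
buf.enqueue({"trace_id": "t1", "event": "llm_call"})
buf.enqueue({"trace_id": "t2", "event": "tool_call"})
sent = []
count = buf.replay(sent.append)
```

Deleting each row only after its send succeeds is what makes the pattern lossless: a crash mid-replay leaves undelivered messages in the table for the next attempt.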
🚀 Optimization Studio & Optimizer SDK
We’ve made the Optimization Studio more powerful and flexible, with new metrics, persistence, and a major Optimizer SDK update.
What’s new:
- JSONPath Support & Numerical Similarity Metric - The Optimization Studio now supports JSONPath expressions for extracting values from complex outputs, along with a new Numerical Similarity metric for comparing numeric results
- Native MCP/Tool Optimization - The v3.x Optimizer SDK now includes fully native MCP and tool optimization support, including support for remote MCP and improved tool-signature handling
- Multi-Metric Optimization - Multi-metric optimization now runs across span data, with examples covering cost, speed, and quality tradeoff scenarios
- Stronger Sampling & Agent Optimization - Since the initial v3 SDK launch, we’ve added stronger sampling controls, full agent optimization including multi-prompt support, and finer prompt-control inside optimizer loops
- Optimizer SDK 3.1.0 - The Optimizer SDK has been updated to version 3.1.0 with all of the above improvements and the retirement of legacy gpt-4* model references
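The JSONPath and Numerical Similarity additions above compose naturally: a path expression pulls a number out of a structured model output, and the similarity metric scores it against a reference. The sketch below is illustrative only; it uses a tiny dot-path extractor rather than a full JSONPath engine, and the similarity formula (one minus normalized absolute error) is one common choice, not necessarily the Studio's exact definition.

```python
def extract(obj, path):
    """Resolve a simplified dot path like '$.result.score' against nested dicts."""
    node = obj
    for key in path.lstrip("$.").split("."):
        node = node[key]
    return node

def numerical_similarity(predicted, reference):
    """Score in [0, 1]: 1.0 for an exact match, decaying with relative error."""
    if predicted == reference:
        return 1.0
    denom = max(abs(predicted), abs(reference))
    return max(0.0, 1.0 - abs(predicted - reference) / denom)

output = {"result": {"score": 42.0, "label": "ok"}}
value = extract(output, "$.result.score")
sim = numerical_similarity(value, 40.0)
```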
✨ Platform Features & UX Improvements
We’ve made several improvements to make your day-to-day workflow smoother and more intuitive.
What’s improved:
- Updated Default Columns - Default columns across all tables have been refreshed to surface the most relevant information by default
- Relative Time Format - Time columns now display relative timestamps (e.g., “2 hours ago”) for quicker at-a-glance understanding
- Smart Threads Tab Default - Projects with threads now automatically default to the Threads tab, getting you to the right view faster
- Consistent Destructive Actions - Destructive menu options are now visually unified with red text and separators for clearer intent
- Feedback Score Precision - Feedback scores are now rounded to 2 decimal places with full precision available on hover
- Workspace Color Maps - Configure workspace-level color maps for consistent visual styling across your projects
- Image Attachments in Threads - View image attachments directly within the thread view for better context when reviewing conversations
- Bulk Tag Operations - Add or remove tags in bulk across traces, spans, and other entities for faster organization
- Inline Feedback Definition Creation - Create new feedback definitions directly from the annotation queue form without leaving your workflow
- Dataset Item Descriptions - Dataset items now support a description field, making it easier to document and annotate your evaluation data
- Revamped MCP Server - The Opik MCP server has been revamped to align with current MCP standards, with added support for remote MCP, improved auth behavior, and expanded native features including prompt and dataset workflows
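The relative timestamps mentioned above ("2 hours ago") follow the familiar bucketed-humanization approach: pick the largest whole time unit that fits and render it. A minimal sketch, with bucket thresholds chosen here purely for illustration:

```python
from datetime import datetime, timedelta, timezone

def relative_time(then, now):
    """Render a past timestamp as a coarse 'N units ago' string."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    # Largest unit first, so 90 minutes reads as "1 hour ago", not "90 minutes ago".
    for divisor, unit in [(86400, "day"), (3600, "hour"), (60, "minute")]:
        if seconds >= divisor:
            n = seconds // divisor
            return f"{n} {unit}{'s' if n != 1 else ''} ago"

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
label = relative_time(now - timedelta(hours=2), now)
```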
🏷️ Prompt Version Tags
We’ve introduced prompt version tags, giving you a lightweight way to label and organize your prompt versions across the platform.
What’s new:
- Version Tags in Comparison View - Easily see and compare tagged prompt versions side by side in the prompt comparison view
- Python SDK Support - Create, manage, and retrieve prompt version tags programmatically from the Python SDK
- Retrieve Prompts by Commits - A new API endpoint lets you retrieve prompts by their commit references, enabling tighter integration with your version control workflow
👉 Prompt Version Tags Documentation
And much more! 👉 See full commit log on GitHub
Releases: 1.10.11, 1.10.12, 1.10.13, 1.10.14, 1.10.15, 1.10.16, 1.10.17, 1.10.18, 1.10.19, 1.10.20, 1.10.21, 1.10.22, 1.10.23