evaluate_resume
Long-running evaluation jobs that get cut short (by Ctrl-C, an OOM error, a failed scoring metric, or a network blip) can now be continued from where they stopped instead of restarting from scratch. opik.evaluate_resume(experiment_id, task, scoring_metrics=[...]) replays only the trials that did not complete, merges them with the ones that did, and returns a single EvaluationResult covering the whole experiment.
A trial counts as complete only when trace.output is set, which happens after the task, scoring, and score-logging all succeed. Any failure mode that prevents reaching that point — a metric raising an exception, a KeyboardInterrupt between task and scoring — leaves the trial replayable.
The original evaluate(...) call writes a resume snapshot into experiment_config so the exact iteration (pinned dataset version, sample count, per-item trial counts) can be reconstructed server-side. When the original call used a custom dataset_sampler or explicit dataset_item_ids, the SDK also writes a local checkpoint next to the experiment ID for those cases.
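The completion rule above can be illustrated with a small, self-contained sketch. It does not call the Opik SDK: the Trial class and resume_trials function are hypothetical stand-ins, and the output field plays the role of trace.output. It only shows the idea of replaying trials whose output was never set and merging them with the finished ones.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Trial:
    """Hypothetical stand-in for one evaluation trial."""
    item_id: str
    output: Optional[Any] = None  # plays the role of trace.output


def resume_trials(trials: List[Trial], task: Callable[[str], Any]) -> Tuple[List[Trial], int]:
    """Re-run only trials without an output and merge them with completed ones."""
    merged: List[Trial] = []
    replayed = 0
    for trial in trials:
        if trial.output is None:
            # Incomplete trial: run the task again for this item.
            merged.append(Trial(trial.item_id, task(trial.item_id)))
            replayed += 1
        else:
            # Completed trial: keep the stored result untouched.
            merged.append(trial)
    return merged, replayed


if __name__ == "__main__":
    previous = [Trial("a", "done"), Trial("b"), Trial("c")]
    merged, replayed = resume_trials(previous, task=lambda item_id: f"answer for {item_id}")
    print(replayed, [t.output for t in merged])
```

In the real SDK the equivalent bookkeeping is reconstructed from the resume snapshot, so none of this needs to be written by hand.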
👉 Resume evaluations documentation
The Playground and LLM-as-a-Judge now support OpenAI’s /v1/responses API, making it possible to use o-series reasoning models (o1, o3, o3-mini, o4-mini) and other deployments that are only available on the newer API path. Previously, sending these models through the Chat Completions path returned “This is not a chat model and thus not supported in the v1/chat/completions endpoint.”
To opt in, open the Manage AI Providers dialog, select your OpenAI key, and set Pipeline mode to Responses API. The Chat Completions path remains the default and is unchanged for all other models.
The Playground’s Top P slider is now also hidden for OpenAI reasoning models (gpt-5.x, o1*, o3*, o4*). Those models reject top_p outright; the slider was causing 400 errors when it appeared.
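As a rough illustration of why hiding the slider avoids the 400 errors, the sketch below drops top_p from a request payload when the model name looks like a reasoning model. The glob patterns and the build_params helper are assumptions made for this example; they are not Opik's actual matching rules.

```python
from fnmatch import fnmatchcase

# Assumed patterns, loosely based on the model families named above.
REASONING_MODEL_PATTERNS = ("gpt-5*", "o1*", "o3*", "o4*")


def is_reasoning_model(model: str) -> bool:
    """Return True when the model name matches one of the assumed patterns."""
    return any(fnmatchcase(model, pattern) for pattern in REASONING_MODEL_PATTERNS)


def build_params(model: str, top_p=None) -> dict:
    """Build request parameters, omitting top_p for reasoning models."""
    params = {"model": model}
    if top_p is not None and not is_reasoning_model(model):
        params["top_p"] = top_p
    return params


if __name__ == "__main__":
    print(build_params("o3-mini", top_p=0.9))
    print(build_params("gpt-4o", top_p=0.9))
```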
Annotation queues: claim mechanism for parallel annotation — Multiple annotators working the same queue simultaneously now see each item locked while another reviewer is looking at it, preventing duplicate work. Items show an “In review” indicator (orange) when all annotator slots are occupied by a combination of active locks and existing scores. Locks are kept alive by a heartbeat and expire via TTL when the reviewer navigates away. The sidebar also gains “To review” / “Processed” filter tabs, and each annotator sees items in a distinct shuffled order to reduce contention.
Collapsible JSON/YAML in trace and span detail view — JSON objects, arrays, and YAML blocks in the trace and span detail view can now be folded and unfolded with an inline chevron at the end of each foldable line. Collapsed blocks render as a clickable gray placeholder. This makes it easier to navigate large payloads without scrolling past content you don’t need.
Redesigned dataset and test suite creation flow — The creation dialog now presents two explicit paths: Upload a file (CSV or JSON dropzone with auto-naming and optional evaluation criteria for test suites) and Use SDK (name + code snippet). Both options are accessible from the header button and from the list empty state. On success the panel closes and a “Go to …” toast appears.
Evaluate experiment traces directly from the UI — The Compare Experiments page has a new Evaluate button (brain icon) in the action bar. It opens the online evaluation dialog scoped to all traces in the current experiment, so you can score an experiment’s output without leaving the page.
Span filtering by created_at and last_updated_at — The span search API now accepts created_at and last_updated_at as filter fields with all comparison operators (=, !=, >, >=, <, <=). These fields were already supported on traces; span support was missing.
OpenTelemetry: in-process spans now linked to the active @opik.track trace — When an OTel-instrumented library (such as logfire or PydanticAI) emits spans from inside an @opik.track-decorated function, those spans are now nested under the active tracked trace rather than starting a separate trace. Distributed flows where parent spans or W3C baggage carry Opik IDs continue to take precedence over the in-process context.
Optimization: best trial configuration now shows the optimized prompt — The Best Trial Configuration panel was displaying the baseline prompt instead of the prompt produced by the optimizer. It now shows the correct optimized result. The Trials table also gains a Prompt column with per-message formatting and a diff-vs-baseline popover.
Experiment views: prompt version labels instead of commit hashes — The Experiments table, the single-experiment Configuration tab, and the Dashboard Experiments leaderboard now display prompts as “name (v3)” instead of raw commit hashes, consistent with the display already used in the Prompt Library.
AI Spend dashboard: total tokens KPI card and onboarding empty state — The placeholder “Budget remaining” card is replaced by a Total tokens KPI showing the sum of all token tiers across models, with a period-over-period trend indicator. The dashboard also shows an onboarding empty state with setup instructions and a ready-to-copy configuration snippet when no trace data has been received yet.
Cost calculation: tiered pricing above 200k tokens now applied — For models such as gemini-2.5-pro and vertex_ai/claude-sonnet-4-5 that carry above_200k_tokens rate tiers, requests exceeding the 200k-token threshold were being billed at the base input rate. Opik now applies the tier rate when the threshold is crossed (the entire request is billed at the tier price, mirroring LiteLLM’s semantics).
Cost calculation: Claude on Vertex AI cached tokens now discounted — Claude models on Vertex AI (vertex_ai/claude-haiku-4-5, vertex_ai/claude-sonnet-4-5, vertex_ai/claude-opus-4-1) were having cache-read tokens billed at the full input rate. They now use the Anthropic cache calculator, correctly applying the discount on cached tokens.
Vertex AI: model selection preserved across provider switches — Switching away from Vertex AI and back in the Playground no longer resets the previously selected model.
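The tiered-pricing fix in the list above comes down to a small piece of arithmetic. The sketch below is an assumption-laden illustration, not Opik's cost calculator: the rates are made-up placeholders (USD per million tokens) and input_cost is a hypothetical helper. It shows the "whole request at the tier price" rule once the threshold is exceeded.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def input_cost(input_tokens: int,
               base_rate_per_million: Decimal,
               tier_rate_per_million: Decimal,
               threshold: int = 200_000) -> Decimal:
    """Bill the entire request at the tier rate once it exceeds the threshold."""
    if input_tokens > threshold:
        rate = tier_rate_per_million
    else:
        rate = base_rate_per_million
    return Decimal(input_tokens) * rate / TOKENS_PER_MILLION


if __name__ == "__main__":
    base = Decimal("1.25")  # placeholder base rate, not a real price
    tier = Decimal("2.50")  # placeholder above-200k rate, not a real price
    print(input_cost(100_000, base, tier))  # billed at the base rate
    print(input_cost(250_000, base, tier))  # whole request billed at the tier rate
```

Decimal is used here only to keep the arithmetic exact in the example.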
created_at and last_updated_at on the spans and traces tables now have minmax skip indexes. Range filters on these columns prune granules instead of scanning the full project partition, significantly reducing query time and ClickHouse CPU load on large tables.
And much more! 👉 See full commit log on GitHub
Releases: 2.0.53, 2.0.54, 2.0.55, 2.0.56, 2.0.57, 2.0.58, 2.0.59
The Prompt Library is now part of the Opik 2.0 UI, accessible from the project sidebar under Prompt library. Alongside that, prompt versions have gained first-class environment support — you can tag a version as production or staging and retrieve it by name from the SDK, without tracking version numbers in application code.
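One way to picture environment tags is as a mapping from environment name to a single prompt version. The toy model below is not the Opik client; PromptEnvironments and its methods are hypothetical names. It mimics the behaviors this release describes: setting environments replaces the full set on a version, an environment moves away from whichever version held it before, and version and environment cannot both be passed to a lookup.

```python
class PromptEnvironments:
    """Toy in-memory model of environment tags for a single prompt."""

    def __init__(self):
        self._versions = {}         # version label -> template text
        self._env_to_version = {}   # environment name -> version label

    def add_version(self, version, template, environments=()):
        self._versions[version] = template
        for env in environments:
            self._env_to_version[env] = version

    def set_environments(self, version, environments):
        """Replace the full environment set held by this version."""
        if version not in self._versions:
            raise KeyError(version)
        # Drop every environment currently pointing at this version.
        self._env_to_version = {
            env: held_by for env, held_by in self._env_to_version.items()
            if held_by != version
        }
        # Assigning an environment moves it away from its previous holder.
        for env in environments:
            self._env_to_version[env] = version

    def get(self, version=None, environment=None):
        if version is not None and environment is not None:
            raise ValueError("pass either version or environment, not both")
        if environment is not None:
            version = self._env_to_version[environment]
        elif version is None:
            version = next(reversed(self._versions))  # most recently added
        return self._versions[version]


if __name__ == "__main__":
    prompt = PromptEnvironments()
    prompt.add_version("v1", "Hello {name}")
    prompt.add_version("v2", "Hi {name}", environments=["staging"])
    prompt.set_environments("v1", ["production"])
    print(prompt.get(environment="production"))
```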
What’s new:
- client.get_prompt(name, environment="production") returns the version currently tagged as production; version and environment are mutually exclusive and passing both raises a clear error
- client.set_prompt_environments(name, ["production", "staging"]) replaces the full environment set on a version; the same environment is automatically moved away from whatever version previously held it
- client.create_prompt(name, content="...", environments=["staging"]) and client.create_chat_prompt(...) accept environments directly
- setPromptEnvironments, getPrompt({ environment }), and createPrompt({ environments }) mirror the Python API
- v1, v2, v3 in the UI and API instead of raw commit hashes
The Traces, Spans, and Threads tabs now have a redesigned filter bar that makes it faster to narrow down what you’re looking at. Filters appear as chips directly in the toolbar: pick a field, set a value, and the table updates instantly. Frequently-used filters can be pinned to the bar so they’re always one click away, and filter state is preserved in the URL so you can share an exact filtered view with a teammate.
- get_trace_spans and read tool calls to inspect intermediate spans during evaluation, enabling correctness checks about tool usage, model selection, and per-span errors inside complex agents
- ClassCastException under certain configurations
- data:<type>;base64, prefix are now stripped correctly in both the SDK and the frontend
- opik migrate: skipped items reported clearly. The migration command now reports each skipped item with its reason, count, and sample source IDs, and exits with code 1 so CI pipelines detect incomplete migrations
And much more! 👉 See full commit log on GitHub
Releases: 2.0.48, 2.0.49, 2.0.50, 2.0.51, 2.0.52
Alert rules now support structured condition grouping: conditions within a group are evaluated with AND, while groups themselves are combined with OR. This makes it possible to express logic such as “flag a trace if (hallucination score > 0.8 AND relevance score < 0.3) OR (toxicity score > 0.5)”.
Existing single-condition alerts continue to work exactly as before — each legacy condition is automatically treated as its own group, so no migration is needed.
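The grouping logic is an OR of ANDs, which is easy to check by hand. The sketch below is illustrative only: the tuple format, metric names, and alert_fires function are assumptions for this example, not Opik's alert rule schema. It evaluates the example rule from the text against a set of scores.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def alert_fires(groups, scores):
    """Return True if any group has all of its conditions satisfied."""
    def holds(condition):
        metric, op, threshold = condition
        # A missing score never satisfies a condition.
        return metric in scores and OPERATORS[op](scores[metric], threshold)

    # AND inside each group, OR across groups; empty groups never fire.
    return any(bool(group) and all(holds(c) for c in group) for group in groups)


if __name__ == "__main__":
    rule = [
        [("hallucination", ">", 0.8), ("relevance", "<", 0.3)],
        [("toxicity", ">", 0.5)],
    ]
    print(alert_fires(rule, {"hallucination": 0.9, "relevance": 0.2, "toxicity": 0.1}))
    print(alert_fires(rule, {"hallucination": 0.9, "relevance": 0.6, "toxicity": 0.1}))
```

A legacy single-condition alert corresponds to one group holding one condition, which is why it behaves the same under this scheme.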
- prompt_mask_context(masks) / promptMaskContext(masks) lets you run agent code with specific prompt IDs silently redirected to a different version ID, non-destructively. The agent calls get_prompt() as usual and receives the overridden template without any permanent change to the prompt library. Designed for A/B testing and optimizer sweep scenarios.
- DatasetItem field (e.g. id, as in HotpotQA) previously raised TypeError: multiple values for keyword argument. The SDK now strips conflicting keys and emits a one-time warning so iteration completes.
- <0.8 and >=0.8: track_harbor() now patches whichever method name the installed version of harbor exposes (_setup_environment or _setup_agent_environment), so tracing works regardless of which version is installed.
- qwen/qwen3.7-max are now available in the model picker.
And much more! 👉 See full commit log on GitHub
Releases: 2.0.42, 2.0.43, 2.0.44, 2.0.45, 2.0.46, 2.0.47
client.get_prompt() and client.get_chat_prompt() now cache results in-process, so repeated calls inside a hot path skip the network round-trip entirely. Pinned commits are cached indefinitely; latest-version lookups use a 5-minute TTL that refreshes in the background so your code always gets a reasonably fresh value without blocking.
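The caching behavior described above resembles a stale-while-revalidate cache. The sketch below is a simplified model, not the SDK's implementation: PromptCache, its fetch callback, and the refresh_thread attribute are hypothetical names. Pinned commits never expire, while latest-version lookups return the cached value immediately and refresh on a background thread once the TTL has passed.

```python
import threading
import time


class PromptCache:
    """Toy stale-while-revalidate cache keyed by (name, commit)."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch            # callable(name, commit) -> prompt
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (value, fetched_at)
        self._refreshing = set()       # keys with a refresh in flight
        self._lock = threading.Lock()
        self.refresh_thread = None     # last background refresh, if any

    def get(self, name, commit=None):
        key = (name, commit)
        with self._lock:
            entry = self._entries.get(key)
        if entry is None:
            # First lookup has nothing to serve, so it fetches synchronously.
            return self._fetch_and_store(key)
        value, fetched_at = entry
        is_latest_lookup = commit is None  # pinned commits never expire
        if is_latest_lookup and self._clock() - fetched_at >= self._ttl:
            self._refresh_in_background(key)
        return value  # possibly stale, but returned without blocking

    def _fetch_and_store(self, key):
        value = self._fetch(*key)
        with self._lock:
            self._entries[key] = (value, self._clock())
        return value

    def _refresh_in_background(self, key):
        with self._lock:
            if key in self._refreshing:
                return  # a refresh for this key is already running
            self._refreshing.add(key)

        def work():
            try:
                self._fetch_and_store(key)
            finally:
                with self._lock:
                    self._refreshing.discard(key)

        self.refresh_thread = threading.Thread(target=work, daemon=True)
        self.refresh_thread.start()
```

The injectable clock exists only so the expiry behavior can be exercised without waiting for real time to pass.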
What’s new:
- OPIK_PROMPT_CACHE_TTL_SECONDS to adjust the freshness window (default: 300 s)
- no_cache=True / noCache: true to force a live fetch from the backend
- @track context, the prompt ID and commit are automatically recorded in the trace metadata so you know which version was used at inference time
opik connect CLI Improvements
The opik connect and opik endpoint CLI commands have been reorganized with a much better error experience:
- ~/.opik.config file exists, opik connect now offers to run opik configure automatically (skipped in non-interactive / headless environments)
- OpikADKOtelTracer was killing all active OpenTelemetry spans and re-patching the ADK exporter on every request; the patcher is now idempotent and preserves user-configured OTel pipelines
- span.end(), span.update(), trace.end(), or trace.update() no longer clears the environment field set at creation time
- /datasets/items/stream call; under high request volume this was pushing database CPU to 80–99%. It now uses a direct primary-key lookup instead
And much more! 👉 See full commit log on GitHub
Releases: 2.0.32, 2.0.33, 2.0.34, 2.0.35, 2.0.36, 2.0.37
Here are the most relevant improvements we’ve made since the last release:
You can now tag traces, spans, and threads with an environment field — production, staging, dev, or any label you define. This makes it easy to separate signal from noise: filter your project’s trace view to only production issues, or compare behavior between environments without spinning up separate projects.
What’s new:
- production vs staging in a single project
- environment to @track, opik.trace(), or opik.span(), and it’s preserved through .end() and .update() calls
- environment on trace and span creation
Test suite assertions can now look inside a trace, not just the top-level input/output, to reason about tool calls, intermediate LLM steps, and sub-agent behavior. The evaluator LLM gets access to two on-demand tools, get_trace_spans (lists all sub-spans for the trace) and read (fetches a specific span by ID), so it can drill into exactly what happened at each step.
Why it matters: Previously, an assertion could only see what went in and came out of the agent. Now it can check whether the right tool was called, which model was used in an intermediate step, or whether a specific span had an error — enabling far more meaningful correctness checks for complex agents.
Traces and spans tables no longer download attachment bytes (images, PDFs) when loading a list; attachments are lazy-loaded only when you open an individual trace. In our benchmarks with image and PDF attachments, this reduced the per-page payload from 85 MB to 0.13 MB and load time from 3.4 s to 0.1 s.
Why it matters: If any of your traces include file attachments, the table was silently fetching all that binary data on every page load. The experience is now fast regardless of attachment size or count.
The Playground’s reasoning_effort control now tracks OpenAI’s actual per-model capability matrix. Models like gpt-5.1 that support a "none" option show it; models that don’t support reasoning effort have the control hidden automatically. Previously, the UI could get out of sync with what the backend supported.
Several reliability fixes and small improvements to the Python (and TypeScript) SDKs:
- search_traces() and search_spans() now automatically wait and retry on 429 responses instead of raising an error, so large bulk searches complete reliably under API rate limits
And much more! 👉 See full commit log on GitHub
Releases: 2.0.25, 2.0.26, 2.0.27, 2.0.28, 2.0.29, 2.0.30, 2.0.31
This is our biggest release yet: a fundamental rethink of how you build, debug, and improve AI agents with Opik. Three major new feature groups (Ollie, Test Suites, and the Agent Playground) work together to close the loop from observing a problem to shipping a fix, all without leaving the platform. Alongside them, we’ve reorganized everything around projects, redesigned the core trace experience, and rebuilt the navigation to match. Here’s what’s new:
Ollie is a powerful coding agent built into the Opik UI. It has full access to your project’s traces and logs, and can analyze patterns across hundreds of interactions, diagnose issues, and take action to fix them, all without leaving the platform.

Highlights:
- opik connect, with support for --workspace and --api-key flags
Test Suites bring structured regression testing to agent development. Each suite has global rules that every test case must pass, plus item-level assertions for specific scenarios. Define rules in plain English for what your agent should and shouldn’t do, and get clear pass/fail results when you run them.

Highlights:
The Agent Playground connects to your agent so you can run it directly from the Opik UI. Experiment with different prompts, models, and parameters to see how your whole agent responds, without touching your code. Agent Configurations track and version the full set of prompts, models, and variables as a single unit, so you always know what combination worked.

Highlights:
- AgentConfigManager and TypeScript AgentConfig with Zod schema validation and blueprint caching
Projects now map directly to your agents. Test suites, experiments, optimizations, prompts, datasets, alerts, and dashboards are all scoped to the project, giving you a focused view of everything related to a single agent, paired with a redesigned navigation and trace experience.

What’s new:
- project_name scoping for datasets, experiments, optimizations, prompts, alerts, and dashboards
And much more! 👉 See full commit log on GitHub
Releases: 1.10.24 through 2.0.21
Here are the most relevant improvements we’ve made since the last release:
We’ve released opik-openclaw, a native OpenClaw plugin that gives you full-stack observability for your agents, powered by Opik. This brings enterprise-grade tracing, evaluation, and monitoring to the fastest-growing open-source agent framework.
What you get:
Get started in two minutes: install the plugin with openclaw plugins install @opik/opik-openclaw, configure your API key, and traces start flowing immediately. Works with both Opik Cloud and self-hosted instances.
👉 Visit the GitHub repository here
We’ve broadened the range of models and providers you can use across the platform, giving you more flexibility in how you build and evaluate your LLM applications.
What’s new:
- openrouter/free is directly selectable, and openrouter/* route models including /auto are supported and prioritized in model selection
We’ve continued to expand the capabilities of both the TypeScript and Python SDKs, making it easier to integrate Opik into your workflows programmatically.
What’s new:
- searchThreads functionality in the TypeScript SDK
We’ve made the Optimization Studio more powerful and flexible, with new metrics, persistence, and a major Optimizer SDK update.
What’s new:
We’ve made several improvements to make your day-to-day workflow smoother and more intuitive.
What’s improved:
We’ve introduced prompt version tags, giving you a lightweight way to label and organize your prompt versions across the platform.
What’s new:
👉 Prompt Version Tags Documentation
And much more! 👉 See full commit log on GitHub
Releases: 1.10.11, 1.10.12, 1.10.13, 1.10.14, 1.10.15, 1.10.16, 1.10.17, 1.10.18, 1.10.19, 1.10.20, 1.10.21, 1.10.22, 1.10.23
Here are the most relevant improvements we’ve made since the last release:
We’ve significantly expanded the capabilities of both our Python and TypeScript SDKs, making it easier to integrate Opik into your workflows programmatically.
What’s new:
- searchThreads functionality
👉 Annotation Queues | Opik Query Language | Dataset Versioning
We’ve expanded our LLM provider support and improved integrations to give you more flexibility in your AI workflows.
What’s new:
We’ve made several improvements to make your day-to-day workflow smoother and more intuitive.
What’s improved:
And much more! 👉 See full commit log on GitHub
Releases: 1.9.102, 1.9.103, 1.9.104, 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7, 1.10.8, 1.10.9, 1.10.10
Here are the most relevant improvements we’ve made since the last release:
We’re excited to introduce Optimization Studio — a powerful new way to improve your prompts without writing code. Bring a prompt, define what “good” looks like, and Opik tests variations to find a better version you can ship with confidence.
What’s new:
For teams that prefer a programmatic workflow, we’ve also released Opik Optimizer SDK v3 with improved algorithms, better performance, and more intuitive APIs.
👉 Optimization Studio Documentation
We’ve enhanced the dashboard with new widgets and visualization capabilities to help you track and compare experiments more effectively.

What’s new:

We’ve added support for the latest video generation models, enabling you to track and log video outputs from your AI applications.
What’s new:
We’ve made it easier to organize and navigate your experiments with new filtering and tagging capabilities.
What’s improved:
We’ve made several improvements to make your day-to-day workflow smoother.
What’s improved:
We’ve improved our SDK integrations with better tracing and performance metrics.
What’s improved:
And much more! 👉 See full commit log on GitHub
Releases: 1.9.79, 1.9.80, 1.9.81, 1.9.82, 1.9.83, 1.9.84, 1.9.85, 1.9.86, 1.9.87, 1.9.88, 1.9.89, 1.9.90, 1.9.91, 1.9.92, 1.9.95, 1.9.96, 1.9.97, 1.9.98, 1.9.99, 1.9.100, 1.9.101
Here are the most relevant improvements we’ve made since the last release:
We’ve expanded the Playground with new provider support and enhanced functionality to make prompt experimentation more powerful.
What’s new:



We’ve made online evaluation more flexible and easier to manage across your projects.
What’s improved:

We’ve refined the user experience across the platform with improved responsiveness and dashboard polish.
What’s improved:
We’ve updated our SDKs with new capabilities and modernized dependencies.
What’s new:
- evaluate() method
And much more! 👉 See full commit log on GitHub
Releases: 1.9.57, 1.9.58, 1.9.59, 1.9.60, 1.9.61, 1.9.62, 1.9.63, 1.9.64, 1.9.65, 1.9.66, 1.9.67, 1.9.68, 1.9.69, 1.9.70, 1.9.71, 1.9.72, 1.9.73, 1.9.74, 1.9.75, 1.9.76, 1.9.77, 1.9.78