{"id":20285,"date":"2026-07-02T22:21:56","date_gmt":"2026-07-02T22:21:56","guid":{"rendered":"https:\/\/www.comet.com\/site\/?p=20285"},"modified":"2026-07-02T22:21:57","modified_gmt":"2026-07-02T22:21:57","slug":"edd-opik-project-example","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/","title":{"rendered":"How Evaluation-Driven Development (EDD) Works"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance.<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"572\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card-1024x572.webp\" alt=\"\" class=\"wp-image-20302\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card-1024x572.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card-300x167.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card-768x429.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp 1376w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The scariest AI failures are the silent ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You ship a change to your agent on a branch \u2014 a new feature, a prompt fix, a quick refactor. No errors. No complaints. Everything <em>looks<\/em> fine. But does it still work, or did you quietly break something that worked yesterday?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As Alejandro Aboy puts it: <em>\u201cthe fact that they\u2019re not complaining doesn\u2019t mean there\u2019s no issue going on.\u201d<\/em> A quiet user is not a happy user. 
Usually, it\u2019s the opposite.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">The more Alejandro and I talked about AI evals and EDD, the more his struggles sounded like mine. A story from one builder to another.<\/p>\n<\/blockquote>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>You can break what already worked.<\/strong> Change a prompt, refactor a tool, and an old feature quietly regresses. Alejandro lived it: cleaning noisy instructions out of his agent\u2019s system prompt made it start fabricating IDs it used to get right. You only catch that by running the same tests before and after the change, and comparing.<\/li>\n\n\n\n<li><strong>The feature is brand new, so you have nothing to test it on.<\/strong> No dataset, no historical traces, no ground truth. Yet you still need to know whether it works, and how well. So how do you generate realistic test data fast, then feed it to evaluators that turn it into hard performance numbers?<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">This is the case study that gives you a plan of attack for both problems: <strong>Evaluation-Driven Development (EDD)<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">EDD is how you prove a new feature works, and that nothing regressed, before you merge. It comes from a recent conversation with <a href=\"https:\/\/substack.com\/@alejandroaboy\">Alejandro Aboy<\/a>, a senior data and AI engineer at Workpath who owns the entire data stack, built the Workpath AI Companion, and writes <em>The Pipe and The Line Substack<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And we won\u2019t keep it abstract. Every example comes from one real product: <strong>Workpath<\/strong>, a strategy-execution SaaS that keeps large companies\u2019 OKRs and initiatives aligned. 
(OKRs \u2014 Objectives and Key Results \u2014 are the goal-setting framework teams use to name what they want to achieve and the measurable results that prove they\u2019re getting there.) Its AI-native feature is the <strong>Workpath AI Companion<\/strong>: an agent that scans a company\u2019s strategy and OKR data end-to-end to keep enterprise teams aligned. It\u2019s the exact system Alejandro runs EDD on every day.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So when we say developing a new feature, picture shipping a change to that Companion and proving, before you merge, that it works and didn\u2019t regress.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"710\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/observability-platform-chart-1024x710.webp\" alt=\"\" class=\"wp-image-20304\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/observability-platform-chart-1024x710.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/observability-platform-chart-300x208.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/observability-platform-chart-768x532.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/observability-platform-chart.webp 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The moving parts<\/figcaption><\/figure>\n\n\n\n<h2 id=\"h-the-develop-a-feature-workflow\" class=\"wp-block-heading\">The Develop-a-Feature Workflow<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine. You start working on a new feature, branch, and develop the change. 
But before you merge, you have to answer 2 questions:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>What\u2019s the performance of my new feature?<\/li>\n\n\n\n<li>Did my change introduce any regressions into the existing codebase?<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Only when both look good do you accept the pull request. EDD helps you answer those 2 questions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"345\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-chart-1024x345.png\" alt=\"\" class=\"wp-image-20305\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-chart-1024x345.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-chart-300x101.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-chart-768x259.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-chart.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">EDD is the offline validation gate between developing a change and merging it<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Every feature is hypothesis-first. As Alejandro frames it, <em>\u201cI have a hypothesis\u2026 and everything should lie around that.\u201d<\/em> Every change starts as a stated hypothesis on a branch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Based on that hypothesis, EDD runs a simulation and scores the results to answer the 2 questions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Every feature ends in a PR, backed by an experiment, with clear traces and metrics. 
Framing the eval results as an experiment allows you to compare current results to previous ones, detecting regressions or tracking improvements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is how you can compare two experiments in Opik:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"418\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-1024x418.jpg\" alt=\"\" class=\"wp-image-20306\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-1024x418.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-300x122.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-768x314.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-1536x627.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/comparison-2048x836.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Comparing two experiments in Opik<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">What about the process that happens between starting a new feature and its experiment?<\/p>\n\n\n\n<h3 id=\"h-from-an-architectural-perspective-we-have\" class=\"wp-block-heading\">From an architectural perspective, we have:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>AI application<\/strong>, which can be an AI agent, workflow or a simple chatbot. In Alejandro\u2019s case, it\u2019s an AI agent built with Agno. More precisely, it\u2019s the Workpath AI Companion he is building. 
However, for data privacy reasons, he could share only mock data during the demo.<\/li>\n\n\n\n<li>A <strong>headless evaluation harness<\/strong>, powered by Claude Code.<\/li>\n\n\n\n<li>An <strong>AI observability and evaluation platform<\/strong> responsible for capturing traces, managing eval datasets and evaluators, running experiments and comparing results. Alejandro is using <a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik<\/a>. The tool is open-source, but for ease of use, you can also try its managed platform for free <a href=\"https:\/\/www.comet.com\/site\/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul\">here<\/a>, up to 25k spans\/month.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Now\u2026 how do we generate data for these experiments? How do we get the traces? How do we populate the evaluation harness with the right context? We will see how all of that falls into place, starting with two modes.<\/p>\n\n\n\n<h2 id=\"h-two-modes-manual-quick-check-vs-automated-experiments\" class=\"wp-block-heading\">Two Modes: Manual Quick Check vs. Automated Experiments<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The two modes are modeled by the \/edd skill, which has two inputs: Mode and Aggression.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Mode 1<\/strong> is a quick, manual check. You fire around 30 fresh traces, let Claude Code read them back from Opik one by one, and trigger a judge by hand only if you want a score. As Alejandro describes it, <em>\u201cit won\u2019t trigger automatic evaluations; you trigger them manually.\u201d<\/em> There is no dataset and no experiment; the check is ephemeral and takes minutes. His favorite example of a small change: his Substack Author Agent kept over-asking for the publication URL on every new trace. A tiny, targeted fix is exactly what Mode 1 is for.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Mode 2<\/strong> automates the judgment. 
When you touch a lot or ship new functionality, you turn the traces into a dataset and run an experiment, both Opik objects, where the judges score every item automatically and produce an experiment you can compare across runs. This is the only way to catch a subtle regression, because you compare 2 experiments.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Both modes start from a hypothesis on a branch, emit fresh simulated traces, and can use the same evaluators. The mode only changes whether the evaluation is done by hand or automatically.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>Aggression setting<\/strong> controls how adversarial the simulated traces get, from happy-path up to fully adversarial. As you turn up the knob, simulated traces get more aggressive, finding harder and harder corner cases to break the agent.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"658\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/router-diagram-1024x658.webp\" alt=\"\" class=\"wp-image-20308\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/router-diagram-1024x658.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/router-diagram-300x193.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/router-diagram-768x493.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/router-diagram.webp 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The skill\u2019s decision flow: A small change takes the quick Mode 1 path, while new functionality takes the Mode 2 dataset-and-experiment path.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><br>The secret sauce of Alejandro\u2019s EDD approach is in how he uses Claude Code to simulate fresh traces.<\/p>\n\n\n\n<h2 id=\"h-scope-the-change-and-simulate-its-traces\" 
class=\"wp-block-heading\">Scope the Change and Simulate Its Traces<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Remember: our goal is to simulate relevant traces to test the performance of our feature. To do that, we use Claude Code to read the agent\u2019s source code, especially the code around the new feature. After that, we retrieve old traces (stored in Opik) that are relevant to our current code.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Based on these two signals, we generate ~30 traces targeting the new feature\u2019s functionality.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"372\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/running-system-1024x372.webp\" alt=\"\" class=\"wp-image-20310\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/running-system-1024x372.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/running-system-300x109.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/running-system-768x279.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/running-system.webp 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The evals can only see what the trace carries. So the trace has to carry the whole harness, not just the answer.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The traces need to be high-signal and as diverse as possible. The goal is to find holes in our system and fix them, not to validate what currently works.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To achieve that, the traces are generated along 2 dimensions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Regression evals<\/em> (what worked still works) vs. <em>capability evals<\/em> (the agent can handle new things)<\/li>\n\n\n\n<li><em>Happy path<\/em> (easy: testing the core logic) vs. 
<em>adversarial<\/em> (hard: finding edge cases, such as missing data, faulty tool descriptions or guardrails)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">During generation, we can configure these parameters. For example, if we go full adversarial, the probability of finding errors increases. That isn&#8217;t necessarily a good thing, because you don\u2019t want to over-optimize in advance either. You want to make the system as good as possible on the hot path. You don\u2019t want to waste time on scenarios that might never happen. That\u2019s why anchoring your trace generation to existing traces is an essential step for properly understanding the user\u2019s behavior and which components to target when generating the traces.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u26a0\ufe0f Important! Even if we simulate the data, we still want REAL traces and outputs to evaluate on.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>This is what we have to do<\/strong>. The pipeline starts from the data, not from invented inputs. Claude Code analyzes the current traces to learn what inputs are worth generating, so we <strong>simulate only the inputs, NOT the outputs or internal state.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That\u2019s the whole point. Synthesize the outputs too and you hit Alejandro\u2019s problem: <em>\u201cevery time I try synthetic datasets, I was losing everything the agent was doing beyond the response.\u201d<\/em> Grade the final answer alone and a wrong tool call stays invisible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get there, we send each simulated input to a headless copy of the agent, which runs for real: selecting tools, calling the staging backend, handling whatever comes back. 
As it runs, Agno records the full tool-call history and outputs into its OpenTelemetry trace, and the agent emits it to Opik.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using this strategy, we simulate the inputs, run the agent, and record the trace with real values produced by the agent.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"716\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/single-trace-1024x716.webp\" alt=\"\" class=\"wp-image-20311\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/single-trace-1024x716.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/single-trace-300x210.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/single-trace-768x537.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/single-trace.webp 1456w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A single simulated trace in Opik<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A simulated trace is only as trustworthy as the state the agent was in when it ran, and recreating that state is the hardest part.<\/p>\n\n\n\n<h2 id=\"h-context-population-mocking-production-state\" class=\"wp-block-heading\">Context Population: Mocking Production State<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The hardest part of agentic evals is getting the agent into the right state, so it passes or fails for reasons that actually relate to your hypothesis. A trace generated from the wrong state is a useless trace.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Alejandro\u2019s use case, roughly 90% of the agent\u2019s tools are API calls, so Claude Code gets a token and hits the real internal backend through a staging mocked account that already holds data. For the happy path, it pulls real goals, OKRs, and teams. 
To go adversarial, it forces errors and asks for data that doesn\u2019t exist.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The reusable trick is where the context gets injected. Before the agent boots, Claude Code calls the API and injects the user\u2019s context into dedicated system-prompt sections. So the agent greets you with <em>\u201cHi Paul, want to check goals from the coding AI team?\u201d<\/em> It runs<em> \u201cas if for real.\u201d<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Alejandro is explicit that this is not pytest-style fixtures, but <em>\u201cthe prompt is the only thing the LLM sees.\u201d<\/em> A faithful prompt-level state is a faithful enough production proxy. You mock at the system-prompt layer and stop worrying about reproducing the whole backend.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So instead of using the standard way of using fixtures to populate the backend, you can bypass everything and directly inject the context into the system prompt. From the LLM\u2019s perspective, it\u2019s the same thing.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"533\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/system-prompt-1024x533.png\" alt=\"\" class=\"wp-image-20314\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/system-prompt-1024x533.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/system-prompt-300x156.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/system-prompt-768x400.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/system-prompt.png 1073w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Mock the state at the system-prompt level, hit a real staging backend, and the agent runs as if in production.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The last step is to transform the 
traces into an evals dataset.<\/p>\n\n\n\n<h2 id=\"h-on-demand-datasets\" class=\"wp-block-heading\">On-Demand Datasets<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">You want two types of eval datasets:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>A persistent, hand-built evaluation set that tests the core business logic. Useful for catching regressions.<\/li>\n\n\n\n<li>An on-demand, synthetic dataset used to evaluate the feature you are working on.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">We are interested here in the second option.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Via the \/edd skill, Claude Code assembles an Opik dataset on the fly from the branch-tagged simulated traces. The dataset is tagged so a later run can filter straight to it, and afterward it is either kept or thrown away.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before committing to a big sample, Alejandro fires a couple of runs as smoke checks, <em>\u201cto catch anything awful\u201d<\/em> before spending tokens. Then he checks that the dataset\u2019s coverage is good enough to be worth running an experiment against. The check is small and cheap, and it prevents an expensive mistake.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because the dataset is cheap to regenerate and scoped to one change, it\u2019s disposable. Optionally, you might promote a couple of high-signal traces into the persistent regression set.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A dataset is only useful once you\u2019ve decided which metrics to use, aka the judges.<\/p>\n\n\n\n<h2 id=\"h-define-the-judge\" class=\"wp-block-heading\">Define the Judge<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">You want to support 2 evaluator types.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code metrics<\/strong> score structural properties deterministically, server-side and for free, with no LLM involved, such as whether the agent called the tool or whether the format is right. 
Whenever a check can be expressed in code, prefer a code metric.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>LLM judges<\/strong> score the subjective things, like completeness, accuracy, and ranking quality.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Both evaluators are designed as binary classifiers: verified, or not. The urge to introduce 1-5 Likert scales is huge. But such scales are incredibly difficult to get right. What\u2019s the difference between 2 and 3 or 4 and 5? Even with multiple human annotators, the labels on such scales come out inconsistent. With binary labels, the decision is clear: it\u2019s correct or not. That makes it much easier for the LLM to get it right.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get some nuance, on top of the binary labels, you want to add a critique that explains in 2-3 sentences why the output is correct or not.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The judge model is deliberately a different model from the one behind the agent, so the two don\u2019t share blind spots and the judge can\u2019t rubber-stamp its own family\u2019s mistakes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But here is the trick! The evaluators are static, carefully defined and calibrated up front from the codebase. The loop regenerates traces and datasets, not the metrics. When implementing LLM judges, it\u2019s extremely important to align them with the domain expert. 
Once they are working well, you can use them for inference, which we are doing here on the dynamic datasets.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"447\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-example-1024x447.webp\" alt=\"\" class=\"wp-image-20317\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-example-1024x447.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-example-300x131.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-example-768x335.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-example.webp 1456w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Opik\u2019s Insights view turns each judge into a per-dimension score profile for the run.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The judges are configured within Opik, using their API to call the model to score each sample from the dataset.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the image below, you can see the evaluators Alejandro configured for each experiment:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Content Completeness<\/li>\n\n\n\n<li>Metric Accuracy<\/li>\n\n\n\n<li>Ranking Quality<\/li>\n\n\n\n<li>Relative Grounding<\/li>\n\n\n\n<li>Response Directness<\/li>\n\n\n\n<li>Semantic Search Accuracy<\/li>\n\n\n\n<li>Skill Selection<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"363\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/evaluator-suite-1024x363.webp\" alt=\"\" class=\"wp-image-20318\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/evaluator-suite-1024x363.webp 1024w, 
https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/evaluator-suite-300x106.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/evaluator-suite-768x272.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/evaluator-suite.webp 1456w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The evaluator suite Alejandro runs on every experiment. One judge per dimension, scoring the whole dataset.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">With the dataset built and the judges defined, you run the experiment. And you run it twice.<\/p>\n\n\n\n<h2 id=\"h-run-and-compare-experiments\" class=\"wp-block-heading\">Run and Compare Experiments<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The experiment runs in Opik, scoring the dataset against all the judges to produce a score distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In a feature Alejandro was working on, he cleaned all the noisy instructions out of his agent\u2019s system prompt, <em>\u201chygiene before\u201d<\/em> vs. 
<em>\u201chygiene after,\u201d<\/em> and ran 2 experiments over the same scope in Opik\u2019s comparison view.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"414\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/hygeine-chart-1024x414.webp\" alt=\"\" class=\"wp-image-20319\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/hygeine-chart-1024x414.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/hygeine-chart-300x121.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/hygeine-chart-768x311.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/hygeine-chart.webp 1456w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Run the same scope twice and the regression you introduced shows up as one short bar.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <em>after<\/em> had regressed on one judge: tool-call parameter inference. The agent should remember which ID to pass to a tool, but the cleanup made it<em> \u201cget lost and fabricate IDs.\u201d<\/em> EDD caught his own change before it shipped.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Comparison matters because failure hides where a single trace can\u2019t show it. Trace-level evals are usually fine, but problems surface across 5, 10, or 20-message conversations. Around the 10th message, the model slides into a <em>\u201ccontext rot zone\u201d<\/em>: a request that earlier earned a cooperative <em>\u201clet\u2019s work with this\u201d<\/em> now gets <em>\u201cwhat do you mean by that?\u201d<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A user asks the agent to <em>\u201cscan 50 teams, get me all the OKRs.\u201d <\/em>It pushes back and offers to go progressively, returning 17 copy-pasteable tables. 
But by trace 21 of a 20-message conversation, you\u2019re at 200k total tokens, paying heavily without caching.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"418\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-feedback-1024x418.webp\" alt=\"\" class=\"wp-image-20320\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-feedback-1024x418.webp 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-feedback-300x122.webp 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-feedback-768x313.webp 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/experiment-feedback.webp 1456w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The same scope, before vs after, overlaid in Opik. One regressed dimension can\u2019t hide across the judges.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">These are the kinds of errors proper evals protect you from! Not only performance, but also latency and cost issues that can blow up your infrastructure overnight.<\/p>\n\n\n\n<h2 id=\"h-don-t-run-online-evals\" class=\"wp-block-heading\">Don\u2019t Run Online Evals<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Everything so far ran offline, on a branch, before the merge. The expensive default everyone reaches for instead is always-on online evals. That\u2019s the trap.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Alejandro made the same mistake.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">He thought running online evaluations on all the traces was essential for production. Until the bill! 
It was on credits, not cash, but it would have been around $2k a month just from triggering a few evaluations: <em>\u201cthe bill just pops in \u2014 in one second you have thousands of dollars in debt.\u201d<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So you recalibrate. What\u2019s the actual risk of evaluating whether the agent leaked an ID to the user? Low. So you sample or look for a pattern, run the heavy judges offline, and cap spend first: <em>\u201cconsume an amount you know you can afford.\u201d<\/em><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>The good news?<\/strong> The whole \/edd skill and headless harness implemented by Alejandro via Opik is now an <a href=\"https:\/\/github.com\/aboyalejandro\/eval-driven-development\">installable open-source Claude Code plugin<\/a>. You can also create a free account on <a href=\"https:\/\/www.comet.com\/site\/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul\">Opik<\/a> with 25k spans\/month to try out this EDD strategy on your own project.<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"384\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-1024x384.jpg\" alt=\"\" class=\"wp-image-20321\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-1024x384.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-300x113.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-768x288.jpg 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-1536x576.jpg 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/07\/offline-dataset-2048x768.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The 
offline eval dataset<\/figcaption><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\ud83c\udfa5 Watch the full conversation between Alejandro Aboy and me<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"How Evaluation-Driven Development (EDD) Works \u2013 Alejandro Aboy\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/1zuGTgHQGcM?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 id=\"h-final-thoughts\" class=\"wp-block-heading\">Final Thoughts<\/h2>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">You\u2019re already driving one AI process with another. Would you hand the whole thing over to an agent that reads the traces, understands the agent\u2019s mistakes, gets the signal from the evaluators and writes the code changes itself?<br>\u2014 Paul<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Alejandro would want exactly that, on one condition. He\u2019s tried prompt-only optimizers and doesn\u2019t trust them, because they change the prompt but never test the agent\u2019s full harness. Until then, the human stays in the loop, and every change earns its pull request.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Opik has been shipping exactly that: Test Suites, Agent Configuration and Playground. 
Where does your hand-rolled loop go from here?<br>\u2014 Paul<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Because they tackle the prompt-only-optimization gap, Alejandro is bullish on adopting them \u2014 he expects they\u2019ll close the whole agentic loop: analyze the code &amp; failures \u2192 generate inputs \u2192 call the agent \u2192 build a dataset \u2192 evaluate each sample \u2192 fix the code \u2192 repeat, all in one place.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Try out Alejandro\u2019s EDD code on <a href=\"https:\/\/github.com\/aboyalejandro\/eval-driven-development\">GitHub<\/a>, while leveraging <a href=\"https:\/\/www.comet.com\/site\/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul\">Opik\u2019s free tier<\/a> as the observability platform.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance. The scariest AI failures are the silent ones. You ship a change to your agent on a branch \u2014 a new feature, a prompt fix, a quick refactor. No errors. No complaints. Everything looks fine. 
[&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":20302,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[363],"tags":[],"coauthors":[222],"class_list":["post-20285","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-opik-user-stories"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How Evaluation-Driven Development (EDD) Works<\/title>\n<meta name=\"description\" content=\"Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How Evaluation-Driven Development (EDD) Works\" \/>\n<meta property=\"og:description\" content=\"Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2026-07-02T22:21:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-07-02T22:21:57+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1376\" \/>\n\t<meta property=\"og:image:height\" content=\"768\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"Paul Iusztin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How Evaluation-Driven Development (EDD) Works","description":"Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/","og_locale":"en_US","og_type":"article","og_title":"How Evaluation-Driven Development (EDD) Works","og_description":"Turn every AI agent change into a measured experiment you compare before and after to detect regressions and measure performance.","og_url":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2026-07-02T22:21:56+00:00","article_modified_time":"2026-07-02T22:21:57+00:00","og_image":[{"width":1376,"height":768,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","type":"image\/webp"}],"author":"Paul 
Iusztin","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin","Est. reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"How Evaluation-Driven Development (EDD) Works","datePublished":"2026-07-02T22:21:56+00:00","dateModified":"2026-07-02T22:21:57+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/"},"wordCount":2929,"commentCount":0,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","articleSection":["Opik User Stories"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/","url":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/","name":"How Evaluation-Driven Development (EDD) Works","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","datePublished":"2026-07-02T22:21:56+00:00","dateModified":"2026-07-02T22:21:57+00:00","description":"Turn every AI agent change into 
a measured experiment you compare before and after to detect regressions and measure performance.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","width":1376,"height":768},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/edd-opik-project-example\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"How Evaluation-Driven Development (EDD) Works"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/82264b94fb97af87b79646edc7e4fd81","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","caption":"Paul 
Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/06\/edd-title-card.webp","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/20285","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=20285"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/20285\/revisions"}],"predecessor-version":[{"id":20322,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/20285\/revisions\/20322"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/20302"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=20285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=20285"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=20285"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=20285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}