Your new AI agent looks great in testing. It follows instructions, calls tools, and returns clean, structured outputs. Then it hits production and starts acting…strange. It nails some workflows and whiffs on others that look almost identical. Same user goal, same tools, same high-level prompt, but completely different behavior. Nothing “broke.” The model didn’t suddenly get worse. The difference is that your agent is now seeing real inputs with messy language, underspecified requests, and edge cases you never planned for.

So you try something different. Instead of only telling the model what to do, you show it. You add a few concrete examples—realistic inputs, the kind of step-by-step thinking you want it to follow, and the exact output format you expect. You didn’t change the model; you just added a few examples to each LLM call, leaving it less room to improvise and making its behavior more predictable.
That’s few-shot prompting in practice—not just describing what each step should do, but teaching it by example.
What Is Few-Shot Prompting?
Few-shot prompting is a method that gives an LLM two to five examples to use as a tiny custom dataset to learn from. Each example pairs an input with the desired output (and sometimes the reasoning) that the model should reproduce. Instead of telling the model what to do (“Extract the key fields from this message and return JSON”), you show it what to do (“Here are a few examples of messages and the JSON I want. Now do the same for this one.”). This setup is called in-context learning because the model learns from the context you provide.
Few-shot prompting differs from zero-shot (instructions only) and one-shot (a single example) prompting. Few-shot prompting uses several examples to define a pattern—what’s correct, how to reason, and how to format. In chats, few-shot prompting primarily guides tone and structure. In agentic systems, with their network of prompts and model calls, it becomes core infrastructure.
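To make this concrete, here is a minimal sketch of how a few-shot extraction prompt might be assembled as a chat-style message list. The support messages, field names, and system instruction are invented for illustration; the structure, not the content, is the point:

```python
import json

# Hypothetical worked examples: each pairs a messy input with the exact JSON we want back.
EXAMPLES = [
    {
        "input": "hey, order #4521 never arrived, I'm at 12 Elm St now",
        "output": {"order_id": "4521", "issue": "not_delivered", "address": "12 Elm St"},
    },
    {
        "input": "can I change the email on my account to pat@example.com?",
        "output": {"order_id": None, "issue": "account_update", "address": None},
    },
]

def build_messages(user_message: str) -> list:
    """Assemble a chat-style message list: instruction, worked examples, then the new input."""
    messages = [{
        "role": "system",
        "content": "Extract order_id, issue, and address as JSON. Use null when a field is absent.",
    }]
    for ex in EXAMPLES:
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["output"])})
    messages.append({"role": "user", "content": user_message})
    return messages
```

Because the model sees two worked input-to-output pairs before the real input, it imitates the demonstrated pattern instead of inventing its own format.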
Why Few-Shot Prompting Matters for Agentic Systems
AI agents aren’t powered by one giant prompt. Many smaller prompts power them, each attached to a specific step:
- Interpreting a messy user request.
- Deciding which tools to call and in what order.
- Mapping text into structured tool parameters.
- Summarizing intermediate results for the next step.
Each of these calls has its own job and its own failure modes. Many production issues trace back to a single brittle step somewhere in the middle.
Few-shot prompting helps you harden those steps. Instead of listing abstract rules (“always return valid JSON,” “never guess IDs,” “only call tools when necessary”), you give that step realistic examples:
- For routing, you show ambiguous messages and the correct route.
- For tool-calling, you show messy user inputs and the exact payloads you want.
- For summarization, you show long traces and how to condense them while preserving what the later steps actually need.
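For the routing case above, a sketch of what that looks like as a single prompt, assuming made-up route names and messages; note the first example deliberately mixes intents so the tie-breaking rule is demonstrated, not just stated:

```python
# Hypothetical routes and messages, chosen to show how ambiguity should resolve.
ROUTING_EXAMPLES = [
    ("I was charged twice but also my login stopped working", "billing"),
    ("thinking about upgrading, what do the bigger plans cost?", "sales"),
    ("app crashes when I export, invoice question can wait", "support"),
]

def routing_prompt(message: str) -> str:
    """Render instruction + worked routing examples + the new message as one prompt."""
    lines = [
        "Route each message to exactly one of: billing, sales, support.",
        "When intents are mixed, pick the one blocking the user right now.",
        "",
    ]
    for text, route in ROUTING_EXAMPLES:
        lines += [f"Message: {text}", f"Route: {route}", ""]
    lines += [f"Message: {message}", "Route:"]
    return "\n".join(lines)
```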
Language models are pattern-followers more than rule-followers. “Do it like this” lands better than “follow these 15 requirements.” In agentic workflows, that difference shows up in:
- Tool-calling precision: Correct arguments, types, and defaults instead of fragile guesses.
- Structured output enforcement: Valid schemas instead of “almost JSON” that breaks parsers.
- Consistency across branches: Similar inputs lead to similar choices, avoiding unpredictable divergence.
- Edge case handling: Unusual inputs are handled according to your policies, not the model’s mood.
Add good examples to a fragile step, and its behavior often becomes dramatically more reliable.
Core Benefits of Few-Shot Prompting in Agents
Once you start using few-shot prompting at the step level, the benefits pile up.
Accuracy and consistency — Examples narrow the space of plausible outputs. The model spends less time inventing interpretations and more time matching the pattern you’ve shown.
Better task understanding — Instructions like “extract what’s relevant” are vague on their own. Examples show what’s relevant, safe, urgent, or complete in your domain.
Structured outputs that actually work — If downstream code expects a particular schema, a few examples of valid responses—across normal, tricky, and odd cases—do far more than a bullet list of formatting rules.
No fine-tuning overhead — You get most of the domain-specific behavior you need without training infrastructure. You can ship changes by simply editing prompts and examples.
Faster iteration — When something misbehaves, you can often fix it by adding or adjusting an example rather than rewriting logic or swapping models. That tightens the loop between logs, debugging, and improvement.
Lower total cost of errors — A handful of examples is a small context cost compared to the expense of bad tool calls, misroutes, or wasted human time cleaning up wrong decisions.
Where Few-Shot Prompting Helps Agentic Systems Most
You could add examples everywhere. But in agentic systems, certain categories benefit the most.
Tool-calling — For steps that build function calls or API requests, examples show:
- How to map free text to arguments.
- How to treat missing or conflicting data.
- When it’s better not to call a tool at all (e.g., when the question is trivial or the tool’s inputs are clearly insufficient).
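A hedged sketch of what tool-calling examples can look like, including one case where the right answer is not to call the tool at all. The tool name, argument schema, and clarify convention are invented for illustration:

```python
import json

TOOL_CALL_EXAMPLES = [
    {"input": "refund order 8812, customer says item arrived broken",
     "call": {"tool": "create_refund", "args": {"order_id": "8812", "reason": "damaged"}}},
    # Missing order_id: the example teaches "ask, don't guess".
    {"input": "refund my order",
     "call": {"tool": None, "args": None, "clarify": "Which order number?"}},
]

def tool_prompt(user_input: str) -> str:
    """Render tool-call examples, then the new input, as one prompt."""
    parts = [
        "Map the request to a tool call. If required arguments are missing,",
        "set tool to null and ask for clarification.",
        "",
    ]
    for ex in TOOL_CALL_EXAMPLES:
        parts += [f"Input: {ex['input']}", f"Call: {json.dumps(ex['call'])}", ""]
    parts += [f"Input: {user_input}", "Call:"]
    return "\n".join(parts)
```

The negative example (null tool plus a clarifying question) does more to prevent hallucinated IDs than a rule like “never guess IDs” on its own.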
Multi-step reasoning — When a step requires reading something, deciding what’s going on, and planning a response, examples can demonstrate decomposition without changing the model itself. They can:
- Pull out the key facts.
- Decide which action pattern to follow.
- Represent uncertainty or ask for clarification.
Strict formatting — If the next step expects a specific schema, a few examples of correct outputs for different cases, including error or empty cases, are more effective than paragraphs explaining the schema.
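One way to think about this is schema-by-example: show the normal, empty, and error cases in the exact same shape, so the model sees the full contract rather than just the happy path. The field names below are assumed, not prescribed:

```python
# Three example outputs with identical top-level keys: normal, empty, and error.
FORMAT_EXAMPLES = [
    {"status": "ok", "items": [{"sku": "A-1", "qty": 2}], "error": None},
    {"status": "ok", "items": [], "error": None},  # empty result is still valid output
    {"status": "error", "items": [], "error": "store_unreachable"},
]

def schema_keys(example: dict) -> set:
    """Top-level keys of an example output."""
    return set(example)

# Because every example exposes the same keys, downstream parsers can rely on
# them no matter which case the model is imitating.
assert all(schema_keys(ex) == {"status", "items", "error"} for ex in FORMAT_EXAMPLES)
```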
Ambiguous or messy inputs — Real users paste entire email threads and half-formed ideas. Examples that show how to respond to underspecified, emotional, or multi-intent inputs give your agent a consistent playbook.
Tone and style — If your agent talks with customers, examples are the fastest way to teach voice for different channels, whether it’s supportive, concise, or whatever fits your brand.
Edge cases and tool selection — Once you’ve seen repeated failures in traces—overuse of a tool, missing a special case—you can turn those into examples so the agent learns, step-by-step, from its own history.
How to Design Effective Few-Shot Examples
Good examples teach the model something specific. Here are a few guidelines to help you get there.
Choose a small, strong set — Three to five examples are often enough. Each one should cover a different slice of the space: a simple case, a harder one, an ambiguous one, and at least one edge case.
Match production reality — Pull examples from real logs where you can. Synthetic examples are fine for bootstrapping, but they shouldn’t be the only thing the model sees. The closer examples are to real traffic, the more useful they are.
Cover variety on purpose — If all your examples look alike, the model may stumble on anything different. Add variation in length, phrasing, and difficulty so the pattern isn’t overly narrow.
Keep the structure consistent — Pick a template and stick to it—same sections, fields, and order. If one example shows input → reasoning → output, they all should. Consistency helps the model lock onto what matters.
Include reasoning when it adds value — If you care about how a step reasons, include succinct reasoning in your examples to teach chain-of-thought without bloating the prompt.
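A simple way to enforce both of these guidelines is to render every example through one template function, so the input → reasoning → output structure can never drift between examples; the labels here are placeholders:

```python
def render_example(inp: str, reasoning: str, output: str) -> str:
    """Render one few-shot example in a fixed input -> reasoning -> output shape."""
    return f"Input: {inp}\nReasoning: {reasoning}\nOutput: {output}"
```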
Use negative examples sparingly — A couple of realistic “don’t do this, do this instead” examples can clarify boundaries, like when to refuse or when to escalate.
Respect your context budget — Examples share the context window with instructions, history, tool outputs, and retrieved docs. Trim anything that doesn’t materially change the model’s behavior:
- Remove niceties and boilerplate.
- Shorten inputs to essentials.
- Cut redundant explanations.
- Favor one strong example over three similar ones.
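A rough sketch of budget-aware pruning: keep the highest-value examples that fit an estimated token budget. The four-characters-per-token rule is a common rough approximation, not an exact tokenizer, and the value scores are something you would assign from evaluation results:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_to_budget(examples: list, budget_tokens: int) -> list:
    """examples: (text, value_score) pairs; greedily keep the best-scoring ones that fit."""
    chosen, used = [], 0
    for text, _score in sorted(examples, key=lambda e: e[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen
```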
How Few-Shot Plays With Other Techniques
Few-shot prompting shines in a broader prompt-engineering toolkit. Paired with chain-of-thought prompting, a few well-chosen examples can demonstrate the stepwise reasoning you want, while the chain-of-thought guidance applies it to new and complex inputs.
Combined with meta prompting, examples can show what a useful self-critique or revision looks like, giving the model a concrete target for self-correction rather than vague instructions to “improve.”
In prompt chaining, few-shot prompts can be compact and stage-specific (e.g., routing, extraction, summarization) so each step stays precise, predictable, and easier to maintain over time. Once you start layering few-shot with these other techniques, the remaining question is not whether to use examples, but how much to use them at each step.
The Cost–Performance Trade-Off
Few-shot prompting isn’t free. Every example costs tokens that show up in latency and spend. The trade-off is simple in theory:
- More and richer examples usually improve quality but cost more.
- Fewer and smaller examples are cheaper but might be less robust.
In practice, you don’t need every step to be equally “heavy.” A sensible pattern is:
- Spend generously on steps where mistakes are expensive, like compliance checks, financial decisions, and safety-critical routing.
- Use lighter prompts or even zero-shot for simple, low-risk transforms.
- Periodically review token burn for each step and whether your examples justify the cost.
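Back-of-envelope math makes that review concrete. The prices and call volumes below are made-up placeholders, but the arithmetic shows why a few hundred extra tokens on a hot path deserves scrutiny:

```python
def monthly_example_cost(extra_tokens: int, calls_per_day: int,
                         usd_per_1k_tokens: float = 0.01) -> float:
    """Cost of carrying extra few-shot tokens on every call for ~30 days."""
    return extra_tokens / 1000 * usd_per_1k_tokens * calls_per_day * 30

# 800 extra prompt tokens on a step called 10,000 times a day at $0.01/1k tokens
# works out to about $2,400/month, enough to ask whether each example earns its keep.
```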
The right balance will be different for a low-volume internal tool than for a high-volume product pipeline, but the underlying question is the same: is this example still pulling its weight?
Common Pitfalls to Watch For
The more you rely on examples to steady your agents, the more it pays to notice where they can backfire. A few failure modes show up again and again.
Overly narrow examples — If all your examples look alike, the model may fail on anything outside that narrow lane. Add deliberate variety and check behavior on more than just the “happy path.”
Slow prompt creep — It’s easy to keep adding examples every time something breaks. Over time, prompts bloat until you hit context or cost limits. Make pruning part of your process: remove or shorten examples and check whether quality actually drops.
Inconsistent schemas — If examples don’t all use the same fields and formats, the model will mix patterns. Choose a schema, enforce it in examples, and update code or prompts if they drift apart.
Toy examples disconnected from reality — Overly neat, synthetic examples can yield brittle behavior in the wild. As soon as you have real traces, promote some of them into your example set.
Stale examples — Product rules and tools change. Last quarter’s examples might mislead now. Version, review, and clean up old prompts like code.
No feedback loop — Changing examples without testing impact leaves you flying blind. Even a small, fixed test set for each critical step helps you see whether you’re improving things or just shuffling them.
From Hand-Tuned Examples to Systematic Optimization
Manually crafting examples works well initially. But as your agent grows—more steps, more tools, more traffic—it’s harder to answer questions by intuition alone:
- Which examples matter most?
- How many examples does this step need?
- Can we cut tokens without hurting quality?
The number of possible example combinations grows quickly, and hand-testing them doesn’t scale. That’s when it makes sense to treat example selection like any other optimization problem: define what “good” means, explore the space in a structured way, and let data tell you which set of examples is best for a particular step.
Opik and the Few-Shot Bayesian Optimizer
Opik is built around a core idea: trace what your agents are doing, evaluate how well they’re doing it, and then systematically improve them. One of its key tools for applying that loop to few-shot prompting is the Few-Shot Bayesian Optimizer. Consider this scenario.
Your goal is straightforward: Given a pool of candidate examples, find the number and combination that produce the best performance for this task.
You bring three things:
- A step in your agent where examples matter.
- A pool of candidate examples, often drawn from real traces and SME input.
- An evaluation metric that reflects success for that step—accuracy, tool-call correctness, end-to-end task completion, or something similar.
The optimizer:
- Suggests a particular subset of examples.
- Evaluates that configuration using your metric.
- Uses Bayesian optimization to propose better candidates over time, without brute-forcing every subset.
In practice, the loop looks like this:
- Log traces from your agent to see how each step behaves.
- Identify a step where few-shot prompting clearly affects quality or cost.
- Build a candidate pool of examples based on real interactions and expert knowledge.
- Define an evaluation metric aligned with your goals.
- Run the optimizer to explore example combinations and converge on a high-performing set.
- Deploy that set, keep monitoring, and repeat when behavior or requirements change.
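To see the shape of the search problem, here is an illustrative subset search over a candidate pool—this is a generic exhaustive sketch, not Opik’s actual API. A Bayesian optimizer explores the same space far more efficiently by modeling which subsets are likely to score well, which matters once the pool is too large to enumerate:

```python
from itertools import combinations

def best_example_subset(pool: list, metric, max_k: int = 3):
    """Try every subset of the pool up to size max_k; return the highest-scoring one.

    `metric` is your evaluation function: it scores a candidate subset, e.g. by
    running the step on a test set and measuring tool-call correctness.
    """
    best, best_score = (), float("-inf")
    for k in range(1, max_k + 1):
        for subset in combinations(pool, k):
            score = metric(subset)
            if score > best_score:
                best, best_score = subset, score
    return list(best), best_score
```

With a pool of 20 candidates and subsets up to size 5, this loop would already need over 21,000 evaluations; Bayesian optimization converges on strong subsets with a small fraction of that.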
The Few-Shot Bayesian Optimizer is a strong fit when task performance hinges on example quality and relevance, when you have more good examples than you can fit in a single prompt, and when you care about both quality and cost and need a principled way to trade them off.
Opik’s free offerings give you enough to get started with that whole loop: logging, LLM evaluation, and optimization in one place. You move from “these examples seem okay” to “this set is measurably better for this step of this agent.”
Few-shot prompting is one of the simplest ways to make agentic systems more predictable. Pair it with real-world traces and a bit of automation, and you get agents that are not only clever but consistently do the work you actually need them to do.
