LLM Testing: A Complete Guide for Application Developers

In July 2025, tech founder Jason Lemkin watched in horror as an AI coding assistant deleted a live company database, despite being explicitly instructed to “freeze” and make no changes to the codebase. When questioned, the AI responded: “This was a catastrophic failure on my part. I destroyed months of work in seconds.”

This is the reality of shipping LLM applications.

Your chatbot might confidently tell users things that aren’t true. Your RAG system might cite sources that don’t exist. Your AI agents may ignore direct instructions and destroy critical data. And unlike traditional software bugs that fail predictably, LLM applications fail in ways that change each time you run them because they’re nondeterministic.

If you’re building with LLMs, you need to adapt your software testing strategy to handle this fundamental difference. This guide focuses specifically on testing your LLM application (i.e. the chatbot, RAG system, or agent you built) not the foundational model itself.

You’ll learn:

What LLM testing is and how it differs from model evaluation
The testing hierarchy: unit → functional → regression → production monitoring
How to build a test dataset when you don’t have one
Practical LLM testing methods that work with nondeterministic outputs
Common failure modes and how to catch them
Best practices for getting the most out of your LLM testing

Understanding LLM Testing For Applications

LLM testing is the process of verifying that your LLM-powered application produces reliable, safe, and appropriate outputs for your specific use case. The foundational model has already been tested by the model provider. Your job is to test what you built on top of it.

When you build an LLM application, you’re designing prompts, structuring context, retrieving relevant information, chaining multiple calls together, and managing conversation state. Each of these components can fail independently or in combination. Testing catches these failures before your users do.

Traditional software testing doesn’t work here because it assumes deterministic behavior. If you ask for 2+3, you expect 5 every single time. With LLMs, if you call with the same prompt twice, you’ll likely get two different responses. Both might be correct, one might be better, or both might be wrong in different ways. Exact string matching won’t work. You need semantic evaluation to check if the output matches the expected meaning, tone, structure, or safety criteria for your use case.

The LLM Testing Hierarchy

Just like traditional software, LLM applications benefit from testing at multiple levels:

Unit tests verify individual LLM calls work correctly. You test single prompt-response pairs to ensure your system prompt, context formatting, and basic instructions produce appropriate outputs. These are fast and catch obvious problems early.
Functional tests verify complete workflows end-to-end. You test multi-turn conversations, full RAG pipelines, or agent processes that make multiple LLM calls. These catch integration issues and emergent behaviors you won’t see in unit tests.
Regression tests run your full test suite after changes to catch degradation. You changed your prompt? Updated your RAG retrieval logic? Regression tests tell you if these changes improved, maintained, or hurt your application’s performance.
Production monitoring evaluates real outputs in real-time or in batches. This catches failures that slip through testing, either because your test dataset didn’t cover the scenario, or because user behavior evolved. Production monitoring feeds real failures back into your test suite to create a continuous improvement loop.

Your testing process will inevitably evolve over time. Start with unit tests on your critical prompts, add functional tests for your key workflows, then layer in regression testing and production monitoring as your application matures. You don’t need all four on day one. We’ll dig deeper into each of these testing methods later in this guide.

Building Your LLM Test Dataset

Before you can test anything, you need test cases. Many teams get stuck here because they know they should test, but they don’t know what to test against. Start with 25-30 test cases covering your core functionality and key edge cases, then grow your dataset organically as you find production failures. Twenty-five test cases you actually run regularly are better than 500 test cases that sit unused.

A good test case includes:

Input: The user query or conversation context
Expected behavior: What constitutes a “good” response (this might be semantic criteria, not an exact string)
Evaluation criteria: How you’ll measure the quality of the output
Metadata: Tags like “edge case,” “regression,” “adversarial” to organize your suite

Your test dataset should both reflect real usage and stress system boundaries. Perfect coverage is impossible, but having any structured test suite puts you ahead of most LLM applications in production. There are four primary sources for building your test dataset:

1. Production Data

If your application is already live, your best test cases come from production. Real user queries show you what people actually ask and how your system actually fails.

Look for:

Questions that generated user complaints or negative feedback
Queries where the LLM refused to answer but should have
Responses that were factually incorrect or hallucinated
Conversations where users had to ask the same question multiple ways

Production data gives you realistic test cases that matter to your actual users. The downside: you need to anonymize personally identifiable information (PII) and you might not have production data yet if you’re just starting.

2. Domain Expert Input

Your team knows the edge cases. Product managers know what users will ask. Engineers know where the system might break. Subject matter experts know what “correct” looks like in your domain.

Run a structured session where your team generates test cases:

“What’s the weirdest question someone might ask?”
“What would happen if someone tried to manipulate the system?”
“What domain-specific knowledge must the system get right?”

This works especially well for regulated industries (finance, healthcare, legal) where correctness is critical and domain expertise is concentrated in specific people. However, the manual nature of generating test cases this way makes it hard to scale.

3. Synthetic Generation

Use LLMs to generate test cases for you. This sounds circular, but it works. You can use a different model or a carefully crafted prompt to generate diverse inputs that stress-test your application.

Example approach: “Generate 50 customer support questions for a SaaS product, including 10 that are intentionally confusing, 10 that contain multiple sub-questions, and 10 that are outside our product scope.”

Synthetic generation scales well once you have the prompt working, but watch out for a lack of realism. Synthetic questions often don’t match how real users actually phrase things.

4. Adversarial Examples

Deliberately try to break your system. This surfaces vulnerabilities before malicious users (or just confused ones) find them.

Try:

Prompt injection attempts (“Ignore previous instructions and…”)
Questions designed to elicit prohibited information
Inputs that might cause the system to hallucinate
Extreme edge cases (very long inputs, unusual formatting, mixed languages)

Adversarial examples won’t represent typical usage, but they help you understand your system’s boundaries and failure modes.

LLM Testing Methods That Work

Now that you have test cases, how do you actually evaluate nondeterministic outputs? As discussed earlier, testing happens in layers. Unit tests catch basic failures fast. Functional tests catch integration issues. Regression tests catch degradation from changes. Production monitoring catches what everything else missed. You need all four methods for production-grade reliability.

Unit Testing: Test Single LLM Calls

Unit tests verify that individual LLM calls behave correctly. You’re testing a single prompt-response pair. Three evaluation approaches work well:

Semantic similarity measures if the output means roughly the same thing as an expected response, even if the wording differs. You can use BERTScore or other LLM evaluation metrics to score responses. This works when there’s a “right answer” but exact wording doesn’t matter.
- Example: Testing if a summarization prompt produces a summary that captures the key points, even if phrased differently.
LLM-as-a-judge uses another LLM to evaluate if the output meets your criteria. You give the judge LLM the input, output, and evaluation criteria, and it scores or classifies the response. This handles subjective criteria like tone, helpfulness, or appropriateness.
- Example: “On a scale of 1-5, how professional and helpful is this customer support response?”
Rule-based checks verify structural requirements or constraints. Does the output contain PII? Is it under 500 characters? Does it cite at least one source? Does it refuse to answer prohibited questions?
- Example: Checking that medical advice responses include a disclaimer to consult a doctor.

The best approach is to combine methods. Use rule-based checks for hard constraints (e.g. no PII leakage), semantic similarity for factual correctness, and LLM-as-a-judge for subjective quality. No single method catches everything.

Functional Testing: Test Complete Workflows

Functional tests verify end-to-end workflows that involve multiple steps or LLM calls. This is where you catch integration issues and emergent behaviors.

For a RAG system, functional testing means:

Running a query through your retrieval pipeline
Checking if relevant context was retrieved
Verifying the LLM generates an accurate response using that context
Confirming sources are cited correctly

For a chatbot, functional testing means:

Running a conversation sequence
Verifying the bot maintains context appropriately
Checking if it handles follow-up questions correctly
Confirming tone stays consistent

For an agent system, functional testing means:

Giving the agent a task
Verifying it selects the right tools
Checking if it completes the task successfully
Confirming it doesn’t get stuck in loops or make unnecessary calls

Bottom line: Complex behaviors emerge from sequences of LLM calls that look fine individually. Regardless of the final form your LLM app takes, you need to test the whole workflow to catch these issues.

Regression Testing: Ensure Changes Don’t Break Things

Regression testing runs your full test suite after any change to detect degradation. Did your new prompt improve accuracy but hurt response time? Did updating your retrieval logic fix one issue but introduce another?

The process:

Establish a baseline. Run your test suite on your current system and record the results.
Make your change. Update prompts, swap models, modify retrieval logic, whatever you’re changing.
Run tests again. Execute the same test suite on the updated system.
Compare results. Look for improvements, degradations, or unexpected changes.
Investigate discrepancies. Understand why results changed before deploying.

Regression testing is essential because LLM applications have subtle dependencies. Changing one prompt can affect downstream behavior in ways you might not notice manually. Automated regression tests catch these ripple effects, and ensure your application still runs as expected after changes.

Production Monitoring: Catch What Tests Miss

No test suite catches everything. Production monitoring evaluates real outputs in real-time or in batches to surface issues your tests didn’t anticipate. Regardless of what monitoring approach you take, you’ll want to feed failures back into your test suite as test cases. This creates a feedback loop that continuously improves coverage.

Two monitoring approaches:

Real-time evaluation scores outputs as they’re generated. This works for high-stakes applications where you need to catch problems immediately. You can use simple heuristics (e.g. response length, keyword presence) or faster LLM-as-a-judge models to evaluate outputs before returning them to users.
Batch evaluation analyzes outputs after the fact, usually on a sample of production traffic. This gives you time to use more sophisticated evaluation methods and helps you identify trends over time.

Common LLM Test Failure Modes

What actually breaks in LLM applications? Many different things depending on the context, so you’ll want to build a failure mode library specific to your application. Document each failure you find in production, how it manifested, and what test would have caught it. This becomes your team’s knowledge base for what to test. Here are the failure modes you’ll encounter most often and how to catch them:

Hallucinations

Hallucinations occur when the LLM generates plausible-sounding but factually incorrect information. This is especially dangerous in RAG systems where users assume responses are grounded in your data.

How to catch it: Use fact-checking against known ground truth, citation verification (does the cited source actually say that?), and consistency checking (does the model give the same answer when asked multiple ways?).

Prompt Injection

Prompt injection happens when users manipulate the LLM by including instructions in their input (“Ignore previous instructions and…”). This can bypass your safety guardrails or change system behavior.

How to catch it: Test with known injection patterns, check if the system performs prohibited actions, and verify that user input doesn’t override system instructions.

PII Leakage

PII leakage means the LLM exposes personally identifiable information it shouldn’t. This could be from training data, context you provided, or user inputs it shouldn’t repeat.

How to catch it: Scan outputs for structured PII patterns using regex (Social Security Numbers, credit cards, phone numbers, emails), and test with known PII examples to verify your filters work.

Tone Drift

Tone drift happens when the LLM’s personality or communication style changes unexpectedly, e.g. from professional to casual or from helpful to dismissive.

How to catch it: Use LLM-as-a-judge to score tone consistency across multiple responses, test with edge cases that might trigger tone changes, and monitor user feedback about communication style.

Refusal Errors

Refusal errors occur when the LLM refuses to answer questions it should answer (false negatives) or answers questions it should refuse (false positives). This often happens with overly aggressive safety filters.

How to catch it: Test with borderline cases near your safety boundaries, verify the system handles legitimate medical/legal/financial questions appropriately, and check that refusal messages are helpful (not just “I can’t help with that”).

Context Misuse

Context misuse in RAG systems means the LLM doesn’t properly use the retrieved context. It might ignore relevant information, cherry-pick supporting evidence while ignoring contradictions, or blend retrieved facts with its training data incorrectly.

How to catch it: Verify the response is actually grounded in retrieved context, check if contradictory context is handled appropriately, and test with queries where retrieved context should override the model’s prior knowledge.

LLM Testing Best Practices

Testing LLM applications effectively requires discipline just like traditional testing. Version control your test data, combine evaluation methods strategically, assign clear ownership, and integrate testing into your deployment process. Prioritize continuous improvement based on real failures over perfect testing coverage.

You’ll know your LLM testing is working when:

Tests run automatically on every change
Test failures block deployment (but are reviewed, not ignored)
Production failures consistently get added as test cases
You can explain why specific tests exist and what they protect against
Test results inform decisions about model selection, prompt changes, and deployment

Test Data Management

Version control your test data. Treat test cases like code. Store them in Git, review changes in pull requests, and track modifications over time. This prevents test data from becoming an untraceable mess and makes it easy to understand why tests pass or fail after changes.

Store test results over time. Record the actual outputs, scores, and evaluation results. This historical data helps you understand trends (is the system improving or degrading?), debug regressions (what changed between the version that worked and this one?), and build intuition about your system’s behavior.

Update your test dataset regularly. Add production failures as test cases. Remove or update test cases that no longer reflect real usage. Your test suite should evolve with your application, not stay static.

Evaluation Strategy

Test what matters for your use case. A customer support chatbot needs different evaluation criteria than a code generation tool. Focus on the qualities that define success for your specific application—don’t just test generic metrics because they’re easy to measure.

Layer multiple LLM evaluation methods. No single approach catches all failure types. Different methods surface different problems, so you’ll want to combine them based on what matters most for your use case. The riskier the application, the more evaluation layers you need.

Set realistic thresholds. Not every response needs to score 100%. Understand what “good enough” looks like for your use case and set thresholds accordingly. An 85% accuracy rate might be excellent for one application and unacceptable for another.

Team Process

Assign testing responsibility. Someone needs to own the test suite, from writing new tests, to reviewing results, and keeping the suite relevant. Without clear ownership, testing becomes something everyone assumes someone else is doing.

Integrate testing with your QA process. LLM testing is a key part of quality assurance. Run tests in CI/CD, review results before deployment, and make test failures block releases just like traditional tests.

Share test results visibly. Don’t hide test failures or degradation. Make results visible to the team through shared tools. Visibility creates accountability and helps everyone understand system quality.

Make LLM Testing Systematic, Not Perfect

LLM applications fail differently than traditional software, but the core principle is the same: Test → Deploy → Monitor → Improve. Build a test suite that catches obvious failures, deploy with confidence, monitor production for what tests missed, and feed those failures back into your test suite. Repeat continuously.

The difference between LLM applications that ship confidently and those that limp into production comes down to testing discipline. Perfect test coverage is impossible with nondeterministic systems and your test suite will never be “done.” The goal is to have a systematic process that catches problems before your users do.

Infrastructure matters too. You need visibility into what your LLM is actually doing (LLM tracing), the ability to measure quality systematically (evaluation metrics), and ways to catch failures in production before they become incidents (LLM monitoring). Building all of this from scratch is possible, but it’s time you’re not spending on your actual product.

Opik’s LLM evaluation framework handles the LLM testing infrastructure so you can focus on building.

Log traces to understand how your application behaves
Run experiments with built-in evaluation metrics for hallucination detection, factuality, and moderation
Test within your CI/CD pipeline using LLM unit tests
Monitor production data to catch issues as they happen
Screen for PII leakage and unwanted content with built-in guardrails
And more

It’s an open source LLM observability platform that runs locally or in the cloud, and integrates with your existing LLM workflow. Get started today for free and start shipping LLM applications with confidence.