
LLM Evaluation Plan Template

A structured template for planning and running LLM evaluations including test case design, metric selection, automated and human evaluation methods, pass/fail criteria, and continuous monitoring frameworks.

By Tim Adair • Last updated 2026-02-09

What This Template Does

Most teams building with LLMs have no formal evaluation process. They test a few examples manually, decide the model is "good enough," and ship. Then they discover in production that the model fails on edge cases, hallucinates in specific domains, or degrades when input patterns shift. By the time they notice, users have already lost trust.

LLM evaluation is uniquely challenging because there is no single metric that captures output quality. Accuracy, relevance, safety, tone, factual grounding, and consistency all matter -- and their relative importance varies by use case. This template provides a complete framework for designing, running, and maintaining LLM evaluations. It covers test case design, metric selection, automated and human evaluation methods, pass/fail criteria, and ongoing monitoring. Use it to build an evaluation practice that catches problems before users do.

Direct Answer

An LLM Evaluation Plan is a structured framework for testing and measuring LLM output quality across multiple dimensions. It defines the test dataset, evaluation metrics, pass/fail criteria, and ongoing monitoring cadence. This template provides the complete structure for planning, running, and maintaining LLM evaluations throughout the product lifecycle.


Template Structure

1. Evaluation Scope and Objectives

Purpose: Define what you are evaluating, why, and what decisions the evaluation will inform.

## Evaluation Scope

**Product/Feature**: [Name of the product or feature using the LLM]
**Model(s) Under Evaluation**: [Model name(s) and version(s)]
**Evaluation Owner**: [Name and role]
**Evaluation Date**: [Date or date range]

### Evaluation Objectives
- [ ] Initial model selection: Choosing between [Model A] and [Model B]
- [ ] Pre-launch quality gate: Verifying the model meets launch criteria
- [ ] Model update regression: Comparing [Old version] vs. [New version]
- [ ] Ongoing quality monitoring: Establishing baseline and tracking drift
- [ ] Prompt optimization: Measuring improvement from prompt changes

### Dimensions to Evaluate
Select which dimensions are relevant for your use case and assign priority:

| Dimension | Priority (1-5) | Why It Matters for This Use Case |
|-----------|---------------|----------------------------------|
| Factual accuracy | [1-5] | [Why] |
| Relevance to input | [1-5] | [Why] |
| Completeness | [1-5] | [Why] |
| Conciseness | [1-5] | [Why] |
| Tone and style | [1-5] | [Why] |
| Safety / harmlessness | [1-5] | [Why] |
| Instruction following | [1-5] | [Why] |
| Consistency | [1-5] | [Why] |
| Creativity | [1-5] | [Why] |
| Structured output compliance | [1-5] | [Why] |

2. Test Dataset Design

Purpose: Build a test dataset that represents your actual use case, including happy paths, edge cases, and adversarial inputs.

## Test Dataset

### Dataset Composition
| Category | Number of Cases | Description | Source |
|----------|----------------|-------------|--------|
| Happy path | [N] | Typical inputs that represent normal usage | [Production logs / Synthetic / Expert-created] |
| Edge cases | [N] | Unusual but valid inputs (very long, very short, ambiguous, multi-language) | [Expert-created / Production outliers] |
| Adversarial | [N] | Inputs designed to cause failures (injection, jailbreak, off-topic) | [Red team / Known attack patterns] |
| Domain-specific | [N] | Inputs from specific domains that require specialized knowledge | [Domain experts] |
| Regression | [N] | Cases that previously failed and were fixed | [Bug reports / Previous eval failures] |
| **Total** | **[N]** | | |

### Ground Truth Labeling
- **Labeling method**: [Expert annotation / Consensus of 3+ annotators / Programmatic rules]
- **Inter-annotator agreement**: [If using multiple annotators, what agreement threshold? e.g., Cohen's kappa > 0.8; see the sketch below]
- **Label categories**: [What labels are applied to each test case -- correct answer, quality rating, category tags]
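
If you use multiple annotators, it helps to check agreement numerically before treating the labels as ground truth. A minimal sketch, assuming scikit-learn is installed; the two annotator label lists are illustrative:

```python
# Check inter-annotator agreement before accepting labels as ground truth.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["correct", "incorrect", "correct", "correct", "incorrect"]
annotator_b = ["correct", "incorrect", "correct", "incorrect", "incorrect"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Rule of thumb: kappa above ~0.8 indicates strong agreement; below that,
# tighten the labeling rubric and re-label before running the evaluation.
if kappa < 0.8:
    print("Agreement below threshold -- refine the rubric and re-label.")
```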

### Test Case Format
Each test case should include:
- **ID**: [Unique identifier]
- **Category**: [Happy path / Edge case / Adversarial / Domain-specific / Regression]
- **Input**: [The exact input to send to the model]
- **System prompt**: [If applicable, the system prompt used]
- **Expected output**: [Ground truth -- the correct or ideal response]
- **Evaluation criteria**: [What makes a response pass or fail for this case]
- **Tags**: [Topic, difficulty, language, etc.]
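
As a concrete illustration, here is one way such a test case might be represented in code. The field names mirror the template, but the structure and example values are only a sketch -- adapt them to your own storage format:

```python
# Illustrative test case record; fields mirror the "Test Case Format" list above.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    id: str                    # Unique identifier
    category: str              # Happy path / Edge case / Adversarial / Domain-specific / Regression
    input: str                 # The exact input to send to the model
    system_prompt: str         # System prompt, if applicable (empty string if none)
    expected_output: str       # Ground truth -- the correct or ideal response
    evaluation_criteria: str   # What makes a response pass or fail for this case
    tags: list[str] = field(default_factory=list)  # Topic, difficulty, language, etc.

example = TestCase(
    id="edge-014",
    category="Edge case",
    input="Summarize this document: <very long multilingual text>",
    system_prompt="You are a concise summarization assistant.",
    expected_output="A 3-5 sentence English summary covering the main claims.",
    evaluation_criteria="Summary is in English, under 6 sentences, and covers all key claims.",
    tags=["summarization", "multilingual", "long-input"],
)
```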

3. Evaluation Metrics

Purpose: Define the specific metrics you will use to measure quality, how each is calculated, and what thresholds constitute pass/fail.

## Metrics

### Automated Metrics
| Metric | What It Measures | Calculation | Pass Threshold | Tool |
|--------|-----------------|-------------|---------------|------|
| Exact match | Output matches expected exactly | Binary comparison | [N/A -- used for structured outputs] | [Custom script] |
| ROUGE-L | Overlap with reference text | Longest common subsequence | [> X] | [rouge-score library] |
| Semantic similarity | Meaning similarity to reference | Cosine similarity of embeddings | [> X] | [Embedding model] |
| JSON validity | Output is valid structured data | Schema validation | [100%] | [JSON schema validator] |
| Toxicity score | Harmful content detection | Safety classifier | [< X] | [Perspective API / custom] |
| Factual grounding | Claims are supported by sources | Automated fact-check pipeline | [> X%] | [Custom RAG verification] |
| Latency | Time to generate response | End-to-end timing | [p95 < Xs] | [APM tool] |
| Cost | Dollar cost per evaluation | Token counting x pricing | [< $X per request] | [Provider billing] |
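
A minimal sketch of how a few of these automated checks could be wired together, assuming the rouge-score and sentence-transformers packages are available (the embedding model name is only an example):

```python
# Automated metric pass over a single (expected, actual) pair.
import json
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def automated_metrics(expected: str, actual: str) -> dict:
    """Exact match, ROUGE-L, and embedding similarity from the table above."""
    rouge_l = scorer.score(expected, actual)["rougeL"].fmeasure
    embeddings = embedder.encode([expected, actual])
    similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
    return {
        "exact_match": expected.strip() == actual.strip(),
        "rouge_l": rouge_l,
        "semantic_similarity": similarity,
    }

def is_valid_json(actual: str) -> bool:
    """JSON validity check; add schema validation (e.g., jsonschema) if needed."""
    try:
        json.loads(actual)
        return True
    except json.JSONDecodeError:
        return False
```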

### Human Evaluation Metrics
| Metric | What It Measures | Scale | Evaluator | Cases to Evaluate |
|--------|-----------------|-------|-----------|-------------------|
| Overall quality | Holistic output quality | 1-5 Likert | [Expert / Crowdsource] | [N cases] |
| Factual accuracy | Are facts correct? | Binary (correct/incorrect) | [Domain expert] | [N cases] |
| Helpfulness | Does it answer the user's question? | 1-5 Likert | [Target user proxy] | [N cases] |
| Harmlessness | Does it avoid causing harm? | Binary (safe/unsafe) | [Safety reviewer] | [N cases] |
| Tone match | Does tone match brand guidelines? | 1-5 Likert | [Content reviewer] | [N cases] |

### LLM-as-Judge Metrics
For cases where human evaluation is too expensive at scale, use a stronger LLM to evaluate a weaker LLM.

| Metric | Judge Model | Prompt Template | Calibration |
|--------|------------|-----------------|-------------|
| Relevance | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
| Completeness | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
| Accuracy | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
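
A minimal sketch of a judge prompt plus the calibration step from the table; `call_judge_model` is a hypothetical wrapper around whichever provider API you use, and the rubric wording is illustrative:

```python
# LLM-as-judge rubric prompt and a calibration check against human labels.
JUDGE_PROMPT = """You are grading an AI assistant's answer for RELEVANCE.

User input:
{user_input}

Assistant answer:
{answer}

Rate how relevant the answer is to the user input from 1 (off-topic) to 5
(fully relevant). Reply with only the integer score."""

def judge_relevance(user_input: str, answer: str) -> int:
    # call_judge_model is a placeholder for your provider's chat/completions call.
    response = call_judge_model(JUDGE_PROMPT.format(user_input=user_input, answer=answer))
    return int(response.strip())

def calibration_agreement(judge_scores: list[int], human_scores: list[int], tolerance: int = 1) -> float:
    """Share of calibration cases where the judge lands within `tolerance` of the human rating."""
    matches = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)
```

Only rely on a judge metric once agreement on the calibration set is acceptably high; otherwise revise the rubric prompt or fall back to human review.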

4. Pass/Fail Criteria

Purpose: Define the thresholds that determine whether the model passes evaluation and is ready for production.

## Pass/Fail Criteria

### Launch Gate Criteria (all must pass)
| Criterion | Threshold | Current Result | Status |
|-----------|-----------|---------------|--------|
| Overall accuracy | > [X%] | [Result] | [Pass/Fail] |
| Safety (zero critical failures) | 0 critical, < [N] minor | [Result] | [Pass/Fail] |
| Latency (p95) | < [X seconds] | [Result] | [Pass/Fail] |
| Human eval quality score | > [X/5 average] | [Result] | [Pass/Fail] |
| Edge case handling | > [X%] graceful | [Result] | [Pass/Fail] |
| Adversarial resistance | 0 jailbreaks on critical categories | [Result] | [Pass/Fail] |
| Cost per request | < $[X] | [Result] | [Pass/Fail] |
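
Once the thresholds are filled in, the launch gate can be encoded as a simple check that runs at the end of every evaluation. The numbers below are placeholders standing in for the bracketed values, not recommendations:

```python
# Launch gate check: every criterion must pass before shipping.
LAUNCH_GATE = {
    "overall_accuracy":      lambda r: r["overall_accuracy"] >= 0.90,      # > [X%]
    "critical_safety_fails": lambda r: r["critical_safety_fails"] == 0,    # zero critical failures
    "latency_p95_seconds":   lambda r: r["latency_p95_seconds"] < 3.0,     # < [X seconds]
    "human_quality_avg":     lambda r: r["human_quality_avg"] >= 4.0,      # > [X/5 average]
    "cost_per_request_usd":  lambda r: r["cost_per_request_usd"] < 0.02,   # < $[X]
}

def passes_launch_gate(results: dict) -> bool:
    failures = [name for name, check in LAUNCH_GATE.items() if not check(results)]
    for name in failures:
        print(f"FAIL: {name} = {results[name]}")
    return not failures
```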

### Regression Criteria (for model updates)
The new model version must:
- [ ] Score equal to or higher than the current model on all automated metrics
- [ ] Show no statistically significant quality drops on any test category (see the sketch after this list)
- [ ] Pass all regression test cases (previously fixed failures must not recur)
- [ ] Maintain latency within [X%] of current model
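
"No statistically significant quality drop" needs a concrete test. One lightweight option is a paired bootstrap over per-case scores from the same test set; this sketch assumes numeric scores (e.g., 0/1 correctness) and is not the only valid approach:

```python
# Paired bootstrap: estimate how often the new model scores below the old one
# when the same test cases are resampled with replacement.
import random

def paired_bootstrap_drop(old_scores: list[float], new_scores: list[float],
                          iters: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    n = len(old_scores)
    worse = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        old_mean = sum(old_scores[i] for i in idx) / n
        new_mean = sum(new_scores[i] for i in idx) / n
        if new_mean < old_mean:
            worse += 1
    return worse / iters

# Flag a regression if the new model is worse in, say, more than 95% of resamples.
# regression = paired_bootstrap_drop(old_scores, new_scores) > 0.95
```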

### Conditional Pass
If the model fails specific criteria but passes others, document:
- **Failed criterion**: [Which criterion]
- **Magnitude of failure**: [How far below threshold]
- **Mitigation plan**: [What will be done to address the gap]
- **Accepted by**: [Name and role of person accepting the risk]

5. Evaluation Execution Plan

Purpose: Define the logistics of running the evaluation -- who does what, when, and with what tools.

## Execution Plan

### Timeline
| Phase | Duration | Activities | Owner |
|-------|----------|-----------|-------|
| Dataset preparation | [N days] | Finalize test cases, label ground truth | [Name] |
| Automated evaluation | [N days] | Run automated metrics, compile results | [Name] |
| Human evaluation | [N days] | Distribute cases, collect ratings, reconcile | [Name] |
| Analysis and reporting | [N days] | Analyze results, write report, make recommendation | [Name] |
| **Total** | **[N days]** | | |

### Tools and Infrastructure
| Purpose | Tool | Access |
|---------|------|--------|
| Test case management | [Spreadsheet / Custom tool / Platform] | [Link] |
| Automated evaluation runner | [Script / Platform / CI pipeline] | [Link] |
| Human evaluation interface | [Labelbox / Scale AI / Custom / Google Form] | [Link] |
| Results dashboard | [Spreadsheet / Dashboard tool] | [Link] |
| Version control for eval code | [Git repo] | [Link] |

### Evaluation Protocol for Human Reviewers
1. [How cases are assigned to reviewers]
2. [Calibration exercise: all reviewers rate same 10 cases, discuss disagreements]
3. [Independent rating: each case rated by [N] reviewers]
4. [Disagreement resolution: cases with > 1 point spread are discussed]
5. [Final scores calculated as: [mean / median / majority vote]]
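
A small sketch of how steps 3-5 could be automated once ratings are collected; the 1-point spread and the median are just the placeholder choices from the protocol above:

```python
# Aggregate per-case reviewer ratings and flag cases that need discussion.
from statistics import median

def aggregate_case(ratings: list[int], max_spread: int = 1) -> dict:
    spread = max(ratings) - min(ratings)
    return {
        "final_score": median(ratings),   # or mean / majority vote, per your protocol
        "needs_discussion": spread > max_spread,
        "ratings": ratings,
    }

print(aggregate_case([4, 5, 4]))   # close agreement -> final score 4
print(aggregate_case([2, 5, 4]))   # spread of 3 -> flag for discussion
```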

6. Continuous Evaluation and Monitoring

Purpose: Define how evaluation continues after launch to catch quality degradation, model drift, and new failure modes.

## Continuous Evaluation

### Ongoing Monitoring
| What to Monitor | Method | Frequency | Alert Threshold |
|----------------|--------|-----------|----------------|
| Output quality (automated) | Run eval suite on production samples | [Daily / Weekly] | [Score drops > X% from baseline] |
| Output quality (human) | Review [N] sampled outputs | [Weekly] | [Average score < X/5] |
| User feedback | Aggregate thumbs up/down ratio | [Daily] | [Positive rate < X%] |
| Hallucination rate | Automated fact-checking on samples | [Daily] | [Rate > X%] |
| Latency | APM monitoring | [Real-time] | [p95 > Xs] |
| Cost | Billing tracking | [Daily] | [Daily cost > $X] |
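
The automated quality row in this table typically becomes a scheduled job. A minimal sketch, where `run_eval_suite` and `send_alert` are hypothetical hooks into your own evaluation runner and alerting tool:

```python
# Scheduled drift check: score sampled production outputs and alert on a drop
# of more than X% from the launch baseline.
BASELINE_SCORE = 0.87      # aggregate quality score from the launch evaluation
ALERT_DROP_PCT = 5.0       # placeholder for the "> X%" threshold in the table

def scheduled_quality_check(production_sample: list[dict]) -> None:
    score = run_eval_suite(production_sample)   # hypothetical: returns aggregate quality score
    drop_pct = (BASELINE_SCORE - score) / BASELINE_SCORE * 100
    if drop_pct > ALERT_DROP_PCT:
        send_alert(                              # hypothetical alerting hook
            f"LLM quality dropped {drop_pct:.1f}% below baseline "
            f"({score:.2f} vs {BASELINE_SCORE:.2f})"
        )
```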

### Evaluation Suite Maintenance
- **Test case refresh**: Add [N] new cases per [month/quarter] from production failures and user feedback
- **Ground truth update**: Re-label cases when domain knowledge changes
- **Metric review**: Quarterly review of whether metrics still capture what matters
- **Benchmark update**: Re-run full evaluation when model provider releases updates

### Model Update Process
When the model provider releases a new version:
1. Run full evaluation suite against new version
2. Compare results against current production model
3. Identify any regressions and investigate root cause
4. If regressions found, evaluate whether prompt changes can mitigate
5. Decision: adopt new version, stay on current, or create hybrid approach
6. If adopting, run A/B test with [X%] traffic for [N] days before full migration
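
For step 6, a deterministic split keeps each user on the same model version for the duration of the test. A minimal sketch (the percentage stands in for the [X%] placeholder above):

```python
# Route roughly new_model_pct% of users to the new model version.
# Hashing the user ID keeps assignment stable across requests.
import hashlib

def use_new_model(user_id: str, new_model_pct: float = 10.0) -> bool:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100          # stable bucket in [0, 100)
    return bucket < new_model_pct
```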

How to Use This Template

  • Start with Section 1 (Scope) to align on what you are evaluating and why. Different objectives require different evaluation designs. A model selection evaluation is broader than a regression test.
  • Invest heavily in Section 2 (Test Dataset). The quality of your evaluation is only as good as your test data. Spend time creating diverse, representative test cases with accurate ground truth labels.
  • Choose metrics (Section 3) that match your use case priorities. Do not measure everything -- focus on the dimensions that matter most. A chatbot needs different metrics than a document summarizer.
  • Set pass/fail thresholds (Section 4) before running the evaluation. If you set thresholds after seeing results, you will unconsciously adjust them to match what the model achieved.
  • Run the evaluation (Section 5) methodically. Calibrate human reviewers before they start rating. Use multiple reviewers per case. Document disagreements.
  • Plan for ongoing evaluation (Section 6) from the beginning. A one-time evaluation tells you the model was good on launch day. Continuous evaluation tells you whether it is still good three months later.

Tips for Best Results

  • Create test cases from production data, not imagination. Real user inputs are messier, more diverse, and more challenging than what you will come up with in a brainstorming session. Sample from production logs.
  • Use LLM-as-judge carefully. LLM judges are useful for scale but have biases -- they tend to prefer longer, more verbose responses. Always calibrate against human judgments and verify agreement rates before relying on them.
  • Version control your evaluation suite. Treat your test dataset and evaluation code like production code. Track changes, document why test cases were added or modified, and ensure reproducibility.
  • Do not optimize for your eval suite. If you tune prompts specifically to pass your test cases, you may be overfitting. Maintain a held-out test set that you never use for prompt development (see the sketch after this list).
  • Invest in fast evaluation infrastructure. If running your eval suite takes hours, you will run it less frequently. Optimize for speed so you can evaluate on every prompt change.
  • Track evaluation metrics over time, not just snapshots. A dashboard showing quality trends is more valuable than individual evaluation reports. It reveals drift before it becomes a crisis.
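
For the held-out set mentioned above, a fixed-seed split keeps the development and held-out cases stable across runs. A minimal sketch:

```python
# Split the eval suite into a development set (used while iterating on prompts)
# and a held-out set that is only scored at release gates.
import random

def split_eval_suite(test_cases: list, holdout_fraction: float = 0.2, seed: int = 42):
    shuffled = test_cases[:]
    random.Random(seed).shuffle(shuffled)     # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]     # (development set, held-out set)
```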

Key Takeaways

  • Build your test dataset from real production data, not synthetic examples
  • Set pass/fail thresholds before running evaluations to avoid post-hoc rationalization
  • Use both automated metrics (for speed) and human evaluation (for accuracy)
  • Version control your evaluation suite and treat it like production code
  • Plan for continuous evaluation from day one -- a one-time eval is not enough for production AI

About This Template

Created by: Tim Adair

Last Updated: 2026-02-09

Version: 1.0.0

License: Free for personal and commercial use

Frequently Asked Questions

**How many test cases do I need?**
For an initial evaluation, aim for 200-500 test cases across all categories. For ongoing monitoring, sample 50-100 production outputs per week for human review. The exact number depends on your use case complexity -- more categories and edge cases require more test cases.

**Should I use automated metrics or human evaluation?**
Both. Automated metrics give you speed and scale -- you can run them on every prompt change. Human evaluation gives you accuracy and nuance -- it catches quality issues that automated metrics miss. Use automated metrics for rapid iteration and human evaluation for final quality gates.

**How do I handle subjective quality dimensions like tone?**
Create a rubric with specific, observable criteria. Instead of "tone should be professional," define what professional means: "no slang, no exclamation marks, uses complete sentences, addresses the user formally." Calibrate raters by having them all score the same examples and discussing disagreements.

**How often should I re-run the full evaluation?**
Re-run the full evaluation suite when: the model provider releases a new version, you make significant prompt changes, you observe quality degradation in monitoring metrics, or at least once per quarter as a routine check.
