What This Template Does
Most teams building with LLMs have no formal evaluation process. They test a few examples manually, decide the model is "good enough," and ship. Then they discover in production that the model fails on edge cases, hallucinates in specific domains, or degrades when input patterns shift. By the time they notice, users have already lost trust.
LLM evaluation is uniquely challenging because there is no single metric that captures output quality. Accuracy, relevance, safety, tone, factual grounding, and consistency all matter -- and their relative importance varies by use case. This template provides a complete framework for designing, running, and maintaining LLM evaluations. It covers test case design, metric selection, automated and human evaluation methods, pass/fail criteria, and ongoing monitoring. Use it to build an evaluation practice that catches problems before users do.
Direct Answer
An LLM Evaluation Plan is a structured framework for testing and measuring LLM output quality across multiple dimensions. It defines the test dataset, evaluation metrics, pass/fail criteria, and ongoing monitoring cadence. This template provides the complete structure for planning, running, and maintaining LLM evaluations throughout the product lifecycle.
Template Structure
1. Evaluation Scope and Objectives
Purpose: Define what you are evaluating, why, and what decisions the evaluation will inform.
## Evaluation Scope
**Product/Feature**: [Name of the product or feature using the LLM]
**Model(s) Under Evaluation**: [Model name(s) and version(s)]
**Evaluation Owner**: [Name and role]
**Evaluation Date**: [Date or date range]
### Evaluation Objectives
- [ ] Initial model selection: Choosing between [Model A] and [Model B]
- [ ] Pre-launch quality gate: Verifying the model meets launch criteria
- [ ] Model update regression: Comparing [Old version] vs. [New version]
- [ ] Ongoing quality monitoring: Establishing baseline and tracking drift
- [ ] Prompt optimization: Measuring improvement from prompt changes
### Dimensions to Evaluate
Select which dimensions are relevant for your use case and assign priority:
| Dimension | Priority (1-5) | Why It Matters for This Use Case |
|-----------|---------------|----------------------------------|
| Factual accuracy | [1-5] | [Why] |
| Relevance to input | [1-5] | [Why] |
| Completeness | [1-5] | [Why] |
| Conciseness | [1-5] | [Why] |
| Tone and style | [1-5] | [Why] |
| Safety / harmlessness | [1-5] | [Why] |
| Instruction following | [1-5] | [Why] |
| Consistency | [1-5] | [Why] |
| Creativity | [1-5] | [Why] |
| Structured output compliance | [1-5] | [Why] |
2. Test Dataset Design
Purpose: Build a test dataset that represents your actual use case, including happy paths, edge cases, and adversarial inputs.
## Test Dataset
### Dataset Composition
| Category | Number of Cases | Description | Source |
|----------|----------------|-------------|--------|
| Happy path | [N] | Typical inputs that represent normal usage | [Production logs / Synthetic / Expert-created] |
| Edge cases | [N] | Unusual but valid inputs (very long, very short, ambiguous, multi-language) | [Expert-created / Production outliers] |
| Adversarial | [N] | Inputs designed to cause failures (injection, jailbreak, off-topic) | [Red team / Known attack patterns] |
| Domain-specific | [N] | Inputs from specific domains that require specialized knowledge | [Domain experts] |
| Regression | [N] | Cases that previously failed and were fixed | [Bug reports / Previous eval failures] |
| **Total** | **[N]** | | |
### Ground Truth Labeling
- **Labeling method**: [Expert annotation / Consensus of 3+ annotators / Programmatic rules]
- **Inter-annotator agreement**: [If using multiple annotators, what agreement threshold? e.g., Cohen's kappa > 0.8 -- see the agreement check sketched after this list]
- **Label categories**: [What labels are applied to each test case -- correct answer, quality rating, category tags]
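A minimal sketch of that agreement check, assuming two annotators have labeled the same cases and that scikit-learn is available for Cohen's kappa. The 0.8 threshold is the placeholder from above, not a universal rule:

```python
# Agreement check before trusting the ground-truth labels.
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same five cases (toy example).
annotator_a = ["correct", "incorrect", "correct", "correct", "incorrect"]
annotator_b = ["correct", "incorrect", "correct", "incorrect", "incorrect"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

AGREEMENT_THRESHOLD = 0.8  # placeholder from the template above; tune per use case
if kappa < AGREEMENT_THRESHOLD:
    print("Agreement too low -- run a calibration session before labeling more cases.")
```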
### Test Case Format
Each test case should include the following fields (a minimal example record follows the list):
- **ID**: [Unique identifier]
- **Category**: [Happy path / Edge case / Adversarial / Domain-specific / Regression]
- **Input**: [The exact input to send to the model]
- **System prompt**: [If applicable, the system prompt used]
- **Expected output**: [Ground truth -- the correct or ideal response]
- **Evaluation criteria**: [What makes a response pass or fail for this case]
- **Tags**: [Topic, difficulty, language, etc.]
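A minimal sketch of one test case record with these fields. The dataclass and the `test_cases.jsonl` file name are illustrative choices, not a required format; a spreadsheet or YAML file works just as well:

```python
# One test case record with the fields listed above.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TestCase:
    id: str
    category: str          # happy_path / edge_case / adversarial / domain_specific / regression
    input: str
    system_prompt: str
    expected_output: str
    evaluation_criteria: str
    tags: list[str] = field(default_factory=list)

case = TestCase(
    id="TC-042",
    category="edge_case",
    input="Summarize this 40,000-word document: ...",
    system_prompt="You are a concise summarization assistant.",
    expected_output="A summary under 200 words covering the three main sections.",
    evaluation_criteria="Pass if the summary is under 200 words and covers all three sections.",
    tags=["long-input", "summarization", "en"],
)

# One JSON object per line so the eval runner can stream cases (illustrative file name).
with open("test_cases.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(asdict(case)) + "\n")
```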
3. Evaluation Metrics
Purpose: Define the specific metrics you will use to measure quality, how each is calculated, and what thresholds constitute pass/fail.
## Metrics
### Automated Metrics
| Metric | What It Measures | Calculation | Pass Threshold | Tool |
|--------|-----------------|-------------|---------------|------|
| Exact match | Output matches expected exactly | Binary comparison | [N/A -- used for structured outputs] | [Custom script] |
| ROUGE-L | Overlap with reference text | Longest common subsequence | [> X] | [rouge-score library] |
| Semantic similarity | Meaning similarity to reference | Cosine similarity of embeddings | [> X] | [Embedding model] |
| JSON validity | Output is valid structured data | Schema validation | [100%] | [JSON schema validator] |
| Toxicity score | Harmful content detection | Safety classifier | [< X] | [Perspective API / custom] |
| Factual grounding | Claims are supported by sources | Automated fact-check pipeline | [> X%] | [Custom RAG verification] |
| Latency | Time to generate response | End-to-end timing | [p95 < Xs] | [APM tool] |
| Cost | Dollar cost per request | Token count × per-token price | [< $X per request] | [Provider billing] |
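A minimal sketch of three of the automated checks above: exact match, JSON validity, and ROUGE-L via the `rouge-score` package. The example strings are placeholders; semantic similarity, toxicity, and grounding checks would plug into the same per-case loop:

```python
# Three automated checks from the table: exact match, JSON validity, ROUGE-L.
import json
from rouge_score import rouge_scorer  # pip install rouge-score

def exact_match(output: str, expected: str) -> bool:
    """Binary comparison, typically for structured or short-form outputs."""
    return output.strip() == expected.strip()

def json_valid(output: str) -> bool:
    """Schema validation would go further; this only checks that the output parses."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(output: str, reference: str) -> float:
    """Longest-common-subsequence overlap with the reference text."""
    return _rouge.score(reference, output)["rougeL"].fmeasure

output = '{"sentiment": "positive", "confidence": 0.92}'
print(exact_match(output, '{"sentiment": "positive", "confidence": 0.92}'))  # True
print(json_valid(output))                                                    # True
print(rouge_l_f1("The cat sat on the mat.", "A cat was sitting on the mat."))
```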
### Human Evaluation Metrics
| Metric | What It Measures | Scale | Evaluator | Cases to Evaluate |
|--------|-----------------|-------|-----------|-------------------|
| Overall quality | Holistic output quality | 1-5 Likert | [Expert / Crowdsource] | [N cases] |
| Factual accuracy | Are facts correct? | Binary (correct/incorrect) | [Domain expert] | [N cases] |
| Helpfulness | Does it answer the user's question? | 1-5 Likert | [Target user proxy] | [N cases] |
| Harmlessness | Does it avoid causing harm? | Binary (safe/unsafe) | [Safety reviewer] | [N cases] |
| Tone match | Does tone match brand guidelines? | 1-5 Likert | [Content reviewer] | [N cases] |
### LLM-as-Judge Metrics
For cases where human evaluation is too expensive at scale, use a stronger model as an automated judge of the evaluated model's outputs, calibrated against human labels.
| Metric | Judge Model | Prompt Template | Calibration |
|--------|------------|-----------------|-------------|
| Relevance | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
| Completeness | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
| Accuracy | [Model name] | [Reference to prompt template] | [N cases verified against human labels] |
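A minimal LLM-as-judge sketch for the relevance metric. `call_judge_model` is a placeholder for whatever client your provider offers, and the 1-5 rubric prompt is illustrative; calibrate any judge prompt against human labels before trusting it:

```python
# LLM-as-judge for the relevance metric (judge model client is a placeholder).
import re

JUDGE_PROMPT = """You are grading an AI assistant's answer for RELEVANCE.

User input:
{input}

Assistant answer:
{output}

Rate relevance on a 1-5 scale (5 = fully relevant, 1 = off-topic).
Respond with only the number."""

def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to your judge model and return its text reply."""
    raise NotImplementedError("Wire this to your model provider's client.")

def judge_relevance(user_input: str, model_output: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(input=user_input, output=model_output))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge returned an unparseable score: {reply!r}")
    return int(match.group())
```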
4. Pass/Fail Criteria
Purpose: Define the thresholds that determine whether the model passes evaluation and is ready for production.
## Pass/Fail Criteria
### Launch Gate Criteria (all must pass)
| Criterion | Threshold | Current Result | Status |
|-----------|-----------|---------------|--------|
| Overall accuracy | > [X%] | [Result] | [Pass/Fail] |
| Safety (zero critical failures) | 0 critical, < [N] minor | [Result] | [Pass/Fail] |
| Latency (p95) | < [X seconds] | [Result] | [Pass/Fail] |
| Human eval quality score | > [X/5 average] | [Result] | [Pass/Fail] |
| Edge case handling | > [X%] graceful | [Result] | [Pass/Fail] |
| Adversarial resistance | 0 jailbreaks on critical categories | [Result] | [Pass/Fail] |
| Cost per request | < $[X] | [Result] | [Pass/Fail] |
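A minimal sketch of the launch gate as code, covering a subset of the criteria above with illustrative thresholds; a real gate should read its thresholds from the table you fill in:

```python
# Launch gate: every criterion must pass (illustrative thresholds).
GATE = {
    "overall_accuracy":    lambda r: r["accuracy"] >= 0.90,
    "critical_safety":     lambda r: r["critical_safety_failures"] == 0,
    "latency_p95_seconds": lambda r: r["latency_p95"] < 3.0,
    "human_quality_score": lambda r: r["human_quality_avg"] >= 4.0,
    "cost_per_request":    lambda r: r["cost_per_request"] < 0.02,
}

def launch_gate(results: dict) -> bool:
    failures = [name for name, check in GATE.items() if not check(results)]
    for name in failures:
        print(f"FAIL: {name}")
    return not failures

# Example results dict from one evaluation run.
print(launch_gate({
    "accuracy": 0.93, "critical_safety_failures": 0, "latency_p95": 2.1,
    "human_quality_avg": 4.2, "cost_per_request": 0.015,
}))  # True
```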
### Regression Criteria (for model updates)
The new model version must:
- [ ] Score equal to or higher than the current model on all automated metrics
- [ ] Show no statistically significant quality drops on any test category (see the significance-test sketch after this list)
- [ ] Pass all regression test cases (previously fixed failures must not recur)
- [ ] Maintain latency within [X%] of current model
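One way to operationalize the "no statistically significant drop" check, assuming per-case scores for both versions on the same test set and a reasonably recent SciPy. A paired t-test at alpha = 0.05 is an illustrative choice; a paired bootstrap works just as well:

```python
# Paired one-sided t-test: did the new version score significantly lower?
from scipy import stats

def significant_regression(old_scores: list[float], new_scores: list[float],
                           alpha: float = 0.05) -> bool:
    """True if the new model scores significantly lower on the same cases."""
    result = stats.ttest_rel(old_scores, new_scores, alternative="greater")
    return result.pvalue < alpha

old = [0.90, 0.80, 1.00, 0.70, 0.95, 0.85]
new = [0.85, 0.80, 0.90, 0.65, 0.90, 0.80]
print(significant_regression(old, new))  # True: the new model is consistently lower
```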
### Conditional Pass
If the model fails specific criteria but passes others, document:
- **Failed criterion**: [Which criterion]
- **Magnitude of failure**: [How far below threshold]
- **Mitigation plan**: [What will be done to address the gap]
- **Accepted by**: [Name and role of person accepting the risk]
5. Evaluation Execution Plan
Purpose: Define the logistics of running the evaluation -- who does what, when, and with what tools.
## Execution Plan
### Timeline
| Phase | Duration | Activities | Owner |
|-------|----------|-----------|-------|
| Dataset preparation | [N days] | Finalize test cases, label ground truth | [Name] |
| Automated evaluation | [N days] | Run automated metrics, compile results | [Name] |
| Human evaluation | [N days] | Distribute cases, collect ratings, reconcile | [Name] |
| Analysis and reporting | [N days] | Analyze results, write report, make recommendation | [Name] |
| **Total** | **[N days]** | | |
### Tools and Infrastructure
| Purpose | Tool | Access |
|---------|------|--------|
| Test case management | [Spreadsheet / Custom tool / Platform] | [Link] |
| Automated evaluation runner | [Script / Platform / CI pipeline] | [Link] |
| Human evaluation interface | [Labelbox / Scale AI / Custom / Google Form] | [Link] |
| Results dashboard | [Spreadsheet / Dashboard tool] | [Link] |
| Version control for eval code | [Git repo] | [Link] |
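A minimal sketch of the automated evaluation runner: stream test cases, generate an output for each, score it, and write a CSV the results dashboard can read. `generate` and `score` are placeholders to wire up to your model client and the metrics from section 3; file names are illustrative:

```python
# Evaluation runner skeleton: stream cases, generate, score, write results.
import csv
import json

def generate(system_prompt: str, user_input: str) -> str:
    """Placeholder: call the model under evaluation and return its output."""
    raise NotImplementedError

def score(case: dict, output: str) -> float:
    """Placeholder: apply the automated metrics defined in section 3."""
    raise NotImplementedError

with open("test_cases.jsonl", encoding="utf-8") as cases, \
     open("eval_results.csv", "w", newline="", encoding="utf-8") as results:
    writer = csv.writer(results)
    writer.writerow(["case_id", "category", "score"])
    for line in cases:
        case = json.loads(line)
        output = generate(case["system_prompt"], case["input"])
        writer.writerow([case["id"], case["category"], score(case, output)])
```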
### Evaluation Protocol for Human Reviewers
1. [How cases are assigned to reviewers]
2. [Calibration exercise: all reviewers rate same 10 cases, discuss disagreements]
3. [Independent rating: each case rated by [N] reviewers]
4. [Disagreement resolution: cases with > 1 point spread are discussed]
5. [Final scores calculated as: [mean / median / majority vote]] (a reconciliation sketch follows this list)
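A minimal sketch of steps 4 and 5, assuming median is the chosen aggregation and a one-point spread is the discussion trigger:

```python
# Flag wide-spread cases for discussion, then take the median as the final score.
from statistics import median

ratings_by_case = {
    "TC-001": [4, 5, 4],
    "TC-002": [2, 5, 3],   # spread of 3 -> needs discussion
    "TC-003": [5, 5, 4],
}

needs_discussion = {
    case_id: scores
    for case_id, scores in ratings_by_case.items()
    if max(scores) - min(scores) > 1
}
final_scores = {case_id: median(scores) for case_id, scores in ratings_by_case.items()}

print(needs_discussion)  # {'TC-002': [2, 5, 3]}
print(final_scores)      # {'TC-001': 4, 'TC-002': 3, 'TC-003': 5}
```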
6. Continuous Evaluation and Monitoring
Purpose: Define how evaluation continues after launch to catch quality degradation, model drift, and new failure modes.
## Continuous Evaluation
### Ongoing Monitoring
| What to Monitor | Method | Frequency | Alert Threshold |
|----------------|--------|-----------|----------------|
| Output quality (automated) | Run eval suite on production samples | [Daily / Weekly] | [Score drops > X% from baseline] |
| Output quality (human) | Review [N] sampled outputs | [Weekly] | [Average score < X/5] |
| User feedback | Aggregate thumbs up/down ratio | [Daily] | [Positive rate < X%] |
| Hallucination rate | Automated fact-checking on samples | [Daily] | [Rate > X%] |
| Latency | APM monitoring | [Real-time] | [p95 > Xs] |
| Cost | Billing tracking | [Daily] | [Daily cost > $X] |
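A minimal sketch of the automated-quality drift alert from the first row above, assuming a baseline score captured at launch and a relative-drop threshold. The `send_alert` function and the 5% figure are placeholders:

```python
# Drift alert: compare a sampled production score to the launch baseline.
BASELINE_SCORE = 0.91     # average eval-suite score captured at launch
MAX_RELATIVE_DROP = 0.05  # alert if quality falls more than 5% below baseline

def send_alert(message: str) -> None:
    """Placeholder: post to your paging or chat tool."""
    print(f"ALERT: {message}")

def check_drift(sample_scores: list[float]) -> None:
    current = sum(sample_scores) / len(sample_scores)
    drop = (BASELINE_SCORE - current) / BASELINE_SCORE
    if drop > MAX_RELATIVE_DROP:
        send_alert(f"Quality drift: {current:.2f} vs baseline {BASELINE_SCORE:.2f}")

check_drift([0.84, 0.86, 0.82, 0.85])  # triggers the alert in this example
```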
### Evaluation Suite Maintenance
- **Test case refresh**: Add [N] new cases per [month/quarter] from production failures and user feedback
- **Ground truth update**: Re-label cases when domain knowledge changes
- **Metric review**: Quarterly review of whether metrics still capture what matters
- **Benchmark update**: Re-run full evaluation when model provider releases updates
### Model Update Process
When the model provider releases a new version:
1. Run full evaluation suite against new version
2. Compare results against current production model
3. Identify any regressions and investigate root cause
4. If regressions found, evaluate whether prompt changes can mitigate
5. Decision: adopt new version, stay on current, or create hybrid approach
6. If adopting, run an A/B test with [X%] traffic for [N] days before full migration (a traffic-split sketch follows)
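A minimal sketch of a deterministic traffic split for step 6, assuming a stable user ID to hash; the 10% figure stands in for the [X%] placeholder:

```python
# Deterministic traffic split: hash a stable user ID into 100 buckets.
import hashlib

NEW_MODEL_TRAFFIC_PERCENT = 10  # stands in for the [X%] placeholder above

def use_new_model(user_id: str) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < NEW_MODEL_TRAFFIC_PERCENT

print(use_new_model("user-12345"))
```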
How to Use This Template
Tips for Best Results
Key Takeaways
About This Template
Created by: Tim Adair
Last Updated: 2/9/2026
Version: 1.0.0
License: Free for personal and commercial use