Product Management • 12 min read

How to Write a PRD for AI and LLM Features

A practical guide to writing product requirement documents for AI and LLM features, covering eval criteria, hallucination tolerance, model selection, fallback behaviors, and data requirements.

By Tim Adair • Published 2026-02-09

Why Standard PRDs Fall Apart for AI Features

Traditional PRDs are built around deterministic logic. You define inputs, expected outputs, and edge cases. Engineering builds it, QA verifies it, and the feature either works or it does not. AI features break this model completely.

When you ship an LLM-powered feature, the same input can produce different outputs every time. "Correct" is not binary; it is a spectrum. The feature might work brilliantly for 90% of queries and hallucinate dangerously for the remaining 10%. A standard PRD has no framework for expressing this reality, which means engineering builds to the wrong spec, QA tests the wrong things, and the feature launches with risks nobody documented.

This guide covers the sections you need to add to your PRD when the feature involves AI, with concrete examples you can adapt for your own products.


Start with the Problem, Not the Model

Before you write a single line about any particular model, your PRD needs to answer a fundamental question: what user problem are you solving, and why does AI solve it better than a deterministic approach?

Most AI feature PRDs skip this. They start with "we will use an LLM to..." instead of "users struggle with X because Y, and AI enables a solution that was previously impossible because Z."

What to include in the problem statement

  • The user pain point with evidence: Customer interview quotes, support ticket volumes, or usage data showing where users get stuck. Link to your discovery research if you have it documented.
  • Why deterministic solutions fall short: Be specific. "Rules-based categorization fails because our taxonomy has 200+ categories and users describe issues in unpredictable natural language" is useful. "AI would be cool here" is not.
  • The cost of the status quo: What happens if you do nothing? Quantify it in terms of user time wasted, support costs, conversion drop-off, or churn.
Example

    Users spend an average of 8 minutes manually categorizing each support ticket across our 247 category taxonomy. Miscategorization rate is 23%, causing tickets to be routed to the wrong team and increasing resolution time by 2.3x. Rules-based routing covers only 34% of ticket types accurately. An LLM-based classifier can handle the full natural-language variability of ticket descriptions and reduce categorization to under 2 seconds with a target accuracy above 92%.

Define Your Eval Criteria Before Anything Else

    The single biggest difference between a good AI PRD and a bad one is whether it defines evaluation criteria upfront. Without eval criteria, you have no way to know if the feature is working, no way to compare model options, and no way to make rational decisions about tradeoffs.

Types of eval criteria

    Accuracy metrics vary by feature type:

  • Classification tasks: Precision, recall, F1 score, with per-class breakdowns for important categories (see the sketch after these lists).
  • Generation tasks: Factual correctness rate (requires human evaluation or reference comparison), relevance score, completeness score.
  • Extraction tasks: Exact match rate, partial match rate, false positive rate.
Quality metrics capture what raw accuracy misses:

  • Coherence: Does the output make logical sense? Is it well-structured?
  • Tone alignment: Does the output match your brand voice and the user's context?
  • Actionability: Can the user actually use the output without significant editing?
Operational metrics keep the feature viable:

  • Latency: P50, P95, and P99 response times. Users have different tolerance for AI latency depending on the interaction pattern, so be specific about the acceptable range.
  • Cost per query: What does each API call cost at current token pricing? What is the projected cost at scale?
  • Throughput: Can the system handle your peak concurrent usage?
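
To make the accuracy and operational metrics concrete, here is a minimal sketch of how an eval harness might report them for a classification feature. It assumes scikit-learn and NumPy are available; the ticket categories, predictions, and latencies are made-up placeholders standing in for your own eval set.

```python
# Minimal eval reporting sketch: per-class accuracy plus latency percentiles.
# Labels, predictions, and latencies below are placeholders for a real eval set.
import numpy as np
from sklearn.metrics import classification_report

y_true = ["billing", "billing", "shipping", "returns", "shipping"]   # gold labels
y_pred = ["billing", "shipping", "shipping", "returns", "shipping"]  # model outputs
latencies_ms = [420, 380, 1150, 510, 2900]                           # per-query latency

# Per-class precision, recall, and F1 -- aggregate accuracy alone can hide a
# weak class that matters to the business.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
for label, scores in report.items():
    if isinstance(scores, dict):  # skips the scalar "accuracy" entry
        print(f"{label:>14}  precision={scores['precision']:.2f}  "
              f"recall={scores['recall']:.2f}  f1={scores['f1-score']:.2f}")

# Operational metrics: P50 / P95 / P99 latency.
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"latency (ms)  p50={p50:.0f}  p95={p95:.0f}  p99={p99:.0f}")
```
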
Setting thresholds

    For each metric, define three levels:

  • Launch threshold: The minimum quality level required to ship. Below this, the feature does not go live.
  • Target threshold: The quality level you are aiming for within the first quarter post-launch.
  • Aspirational threshold: The level that would make this feature a true differentiator.
Be honest about launch thresholds. An AI feature that launches at 85% accuracy and improves to 95% over three months is far better than one that stays in development for six months trying to hit 95% before any user sees it.
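
One way to make the three levels actionable is to write them down next to the measured eval results, so the launch decision becomes a mechanical check rather than a debate. A minimal sketch; the metric names and numbers are illustrative, not recommendations:

```python
# Launch / target / aspirational thresholds as data, plus a simple launch gate.
# Metrics and values are illustrative placeholders.
THRESHOLDS = {
    "classification_f1":  {"launch": 0.85, "target": 0.92, "aspirational": 0.97},
    "p95_latency_ms":     {"launch": 3000, "target": 1500, "aspirational": 800},
    "cost_per_query_usd": {"launch": 0.05, "target": 0.03, "aspirational": 0.01},
}

# Direction matters: quality should clear a floor, latency and cost a ceiling.
HIGHER_IS_BETTER = {"classification_f1": True, "p95_latency_ms": False, "cost_per_query_usd": False}

def launch_ready(measured: dict) -> bool:
    """Return True only if every metric clears its launch threshold."""
    for metric, levels in THRESHOLDS.items():
        value, bar = measured[metric], levels["launch"]
        ok = value >= bar if HIGHER_IS_BETTER[metric] else value <= bar
        if not ok:
            print(f"BLOCKED: {metric}={value} misses launch threshold {bar}")
            return False
    return True

print(launch_ready({"classification_f1": 0.88, "p95_latency_ms": 2100, "cost_per_query_usd": 0.04}))
```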


Hallucination Tolerance and Guardrails

    Every LLM hallucinates. Your PRD needs to define how much hallucination is acceptable and what happens when the model gets it wrong.

Categorize your hallucination risk

    Not all hallucinations carry equal risk. Map your feature's outputs to risk tiers:

  • Critical (zero tolerance): Medical advice, financial calculations, legal statements, security-relevant outputs. If the model gets these wrong, users could suffer real harm.
  • High (minimal tolerance): Product recommendations that influence purchasing decisions, data analysis summaries that inform business strategy, customer-facing communications sent on behalf of the user.
  • Medium (bounded tolerance): Search result rankings, content suggestions, draft copy that the user will review before using.
  • Low (acceptable tolerance): Internal categorization, ideation assistance, rough summarization where the user treats output as a starting point.
Guardrail specifications

    For each risk tier, define the guardrails:

  • Input validation: What inputs should be rejected or flagged before reaching the model? Prompt injection attempts, out-of-scope queries, personally identifiable information that should not be sent to a third-party API.
  • Output validation: What post-processing checks catch bad outputs? Regex patterns for known-bad formats, confidence score thresholds, fact-checking against your own database.
  • Fallback behavior: What happens when guardrails trigger? Does the user see a generic message, get routed to a human, or receive a simplified non-AI response?
  • Human-in-the-loop checkpoints: Where does a human need to review or approve the AI output before it reaches the end user?
Example guardrail spec

    Feature: AI-generated release notes from commit history

    Hallucination risk tier: Medium (user reviews before publishing)

    Input guardrails: Strip internal ticket references and employee names before sending to model. Reject if commit history exceeds 50,000 tokens (summarize first).

    Output guardrails: Flag any output containing URLs not found in the input commits. Flag any customer names or specific revenue figures. Reject outputs shorter than 100 words (likely model failure).

    Fallback: Display a bulleted list of commit messages grouped by date as the non-AI alternative. Show the message: "AI summary unavailable. Showing raw commit history instead."
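
Translated into code, a spec like this becomes a pair of checks wrapped around whatever model call you make, plus a deterministic fallback. A minimal sketch, assuming a hypothetical "PROJ-123" ticket format; the regexes and the 100-word floor are illustrative, not a complete policy:

```python
# Guardrail sketch for the release-notes example: input sanitization, output
# validation, and a non-AI fallback. Patterns and limits are illustrative.
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
TICKET_RE = re.compile(r"\b[A-Z]{2,10}-\d+\b")  # assumed internal ticket format, e.g. PROJ-123

def sanitize_input(commit_text: str, employee_names: list[str]) -> str:
    """Input guardrail: strip ticket references and employee names before the model call."""
    cleaned = TICKET_RE.sub("[ticket]", commit_text)
    for name in employee_names:
        cleaned = cleaned.replace(name, "[author]")
    return cleaned

def validate_output(draft: str, commit_text: str) -> list[str]:
    """Output guardrail: collect reasons to reject or flag the model's draft."""
    problems = []
    if len(draft.split()) < 100:
        problems.append("output under 100 words -- likely model failure")
    hallucinated_urls = set(URL_RE.findall(draft)) - set(URL_RE.findall(commit_text))
    if hallucinated_urls:
        problems.append(f"URLs not present in input commits: {sorted(hallucinated_urls)}")
    return problems

def fallback(commits: list[dict]) -> str:
    """Non-AI fallback: commit messages grouped by date, per the spec above."""
    by_date = defaultdict(list)
    for commit in commits:
        by_date[commit["date"]].append(commit["message"])
    lines = ["AI summary unavailable. Showing raw commit history instead.", ""]
    for date in sorted(by_date):
        lines.append(date)
        lines.extend(f"  - {message}" for message in by_date[date])
    return "\n".join(lines)
```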

Model Requirements and Selection Criteria

    Your PRD should specify model requirements in terms of capabilities needed, not specific model names. Models evolve too fast to pin your spec to a particular version. Instead, define the capability profile your feature requires.

Capability requirements

  • Context window: What is the maximum input size your feature needs to handle? If you are summarizing long documents, you need a large context window. Be specific: "Must handle inputs up to 100,000 tokens" is useful.
  • Output structure: Does the feature require structured JSON output, or is freeform text acceptable? Structured output significantly narrows your model options.
  • Reasoning complexity: Does the feature require multi-step reasoning, simple pattern matching, or creative generation? This affects both model selection and prompting strategy.
  • Language support: What languages must the model handle? Be explicit about tier 1 (full support) versus tier 2 (best effort) languages.
  • Multimodal needs: Does the feature need to process images, audio, or video alongside text?
Hosting and data constraints

  • Data residency: Can user data be sent to a third-party API, or does the model need to run in your infrastructure? This is often a dealbreaker that eliminates most hosted API options.
  • Fine-tuning requirements: Does the feature need a fine-tuned model, or will prompt engineering with a general-purpose model suffice? Fine-tuning adds significant complexity and cost.
  • Offline capability: Does the feature need to work without an internet connection?
Cost modeling

    Include a cost projection table in your PRD:

Scenario | Queries per day | Avg tokens per query | Monthly API cost
Launch (Month 1) | 500 | 2,000 | $450
Growth (Month 6) | 5,000 | 2,200 | $4,950
Scale (Month 12) | 25,000 | 2,500 | $28,125

    These numbers force a real conversation about unit economics. If your AI feature costs $1.12 per user per month and your average revenue per user is $15, the margins work. If the feature costs $8 per user per month, you need a different architecture or a different pricing model.
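
The table is simple arithmetic: queries per day, times days per month, times average tokens per query, times the price per token. The sketch below reproduces the figures above under an assumed blended rate of $15 per million tokens and, for the per-user figure, roughly one query per user per day; substitute your provider's actual input/output pricing and your own usage mix.

```python
# Cost projection sketch. The $15-per-million-token blended rate and the
# one-query-per-user-per-day assumption are placeholders, not real pricing.
PRICE_PER_MILLION_TOKENS = 15.00
DAYS_PER_MONTH = 30

scenarios = [
    ("Launch (Month 1)",    500, 2_000),
    ("Growth (Month 6)",  5_000, 2_200),
    ("Scale (Month 12)", 25_000, 2_500),
]

for name, queries_per_day, avg_tokens in scenarios:
    monthly_tokens = queries_per_day * DAYS_PER_MONTH * avg_tokens
    monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{name:<18} {monthly_tokens / 1e6:>7.0f}M tokens  ${monthly_cost:,.0f}/month")

# Unit economics: monthly cost divided by monthly active users of the feature,
# compared against revenue per user.
print(f"Cost per user at scale: ${28_125 / 25_000:.2f}/month (assuming ~25,000 users)")
```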


Data Requirements and Training Considerations

    AI features are only as good as the data behind them. Your PRD needs to specify what data the feature needs, where it comes from, and how it stays fresh.

Prompt context data

    Most LLM features use retrieval-augmented generation (RAG) or structured prompt context rather than fine-tuning. Define:

  • Knowledge sources: What databases, documents, or APIs provide the context the model needs? Be specific about freshness requirements. "Product catalog data must be no more than 1 hour stale" is a real requirement.
  • Embedding strategy: If using RAG, specify the chunking strategy, embedding model, and vector database requirements.
  • Context window budget: How do you allocate the model's context window across system prompt, retrieved context, user input, and conversation history? (A sketch follows this list.)
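
For the context window budget in particular, write the allocation down as numbers rather than prose so engineering can enforce it. A minimal sketch, assuming a hypothetical 128,000-token model and your own tokenizer for counting; the split itself is illustrative:

```python
# Context window budget sketch for a hypothetical 128k-token model.
# The allocation is illustrative; the point is that it is explicit and enforced.
MODEL_CONTEXT_WINDOW = 128_000

BUDGET = {
    "system_prompt":         2_000,
    "retrieved_context":    90_000,  # RAG chunks
    "conversation_history": 24_000,  # trimmed oldest-first when exceeded
    "user_input":            4_000,
    "reserved_for_output":   8_000,
}

assert sum(BUDGET.values()) <= MODEL_CONTEXT_WINDOW, "budget exceeds the context window"

def fit_retrieved_chunks(chunks: list[str], count_tokens) -> list[str]:
    """Keep the highest-ranked chunks that fit inside the retrieval budget."""
    kept, used = [], 0
    for chunk in chunks:  # assumed to be sorted by relevance already
        tokens = count_tokens(chunk)  # count_tokens is your tokenizer's counter
        if used + tokens > BUDGET["retrieved_context"]:
            break
        kept.append(chunk)
        used += tokens
    return kept
```
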
Training and fine-tuning data (if applicable)

  • Dataset size: How many examples do you need? Where will they come from?
  • Labeling requirements: Who labels the data? What are the labeling guidelines? How do you handle disagreements between labelers?
  • Data refresh cadence: How often does the model need retraining? What triggers a retrain?
Data privacy and compliance

  • PII handling: What user data enters the model pipeline? How is it anonymized or encrypted?
  • Consent requirements: Do your terms of service cover sending user data to a third-party model provider? If not, what needs to change?
  • Retention policies: How long are prompts and completions stored? By you? By the model provider?
  • Audit trail: Can you reconstruct what the model saw and produced for any given user interaction?

Feedback Loops and Continuous Improvement

    AI features are not "build and forget." Your PRD needs to define how the feature gets better over time.

Implicit feedback signals

  • Acceptance rate: What percentage of AI outputs does the user accept without modification?
  • Edit distance: When users modify the AI output, how much do they change?
  • Regeneration rate: How often do users ask for a new output?
  • Abandonment rate: How often do users start an AI interaction and leave without using the result?
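
These signals are cheap to compute if the product logs each AI interaction. A sketch under an assumed event schema; the field names here (ai_output, final_text, used, regenerated) are hypothetical and should be mapped onto whatever your product actually records:

```python
# Implicit feedback signals from interaction logs. The event schema (fields
# "ai_output", "final_text", "used", "regenerated") is assumed for illustration.
from difflib import SequenceMatcher

def implicit_signals(events: list[dict]) -> dict:
    total = len(events)
    accepted = sum(1 for e in events if e["used"] and e["final_text"] == e["ai_output"])
    regenerated = sum(1 for e in events if e["regenerated"])
    abandoned = sum(1 for e in events if not e["used"])
    # Edit distance proxy over outputs the user modified before using:
    # 0.0 means kept verbatim, 1.0 means replaced entirely.
    edit_distances = [
        1 - SequenceMatcher(None, e["ai_output"], e["final_text"]).ratio()
        for e in events
        if e["used"] and e["final_text"] != e["ai_output"]
    ]
    return {
        "acceptance_rate": accepted / total,
        "regeneration_rate": regenerated / total,
        "abandonment_rate": abandoned / total,
        "avg_edit_distance": sum(edit_distances) / len(edit_distances) if edit_distances else 0.0,
    }
```
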
Explicit feedback mechanisms

  • Thumbs up/down: Simple binary feedback on output quality.
  • Correction interface: Let users fix specific errors, which generates labeled training data.
  • Reporting flow: Users can flag harmful, inaccurate, or inappropriate outputs for review.
Improvement workflow

    Define who owns the feedback loop and what the review cadence looks like:

  • Weekly: Review flagged outputs and identify patterns.
  • Monthly: Analyze aggregate quality metrics, compare against thresholds, and decide whether prompt adjustments or model changes are needed.
  • Quarterly: Evaluate whether the eval criteria themselves need updating based on what you have learned about real-world usage.

The PRD Template Section Checklist

When writing your next AI feature PRD, make sure you cover each of these sections. Not every section applies to every feature, but skipping a section should be a conscious decision, not an oversight.

Standard PRD sections (still needed)

  • Problem statement with evidence
  • User stories and personas
  • Success metrics tied to north star metric
  • Scope (what is in, what is out)
  • UX requirements and wireframes
  • Launch plan and rollout strategy
AI-specific sections to add

  • Eval criteria with launch, target, and aspirational thresholds
  • Hallucination risk assessment and guardrail specifications
  • Model capability requirements (not model names)
  • Cost modeling at launch, growth, and scale
  • Data requirements (context, training, privacy)
  • Fallback behavior for every failure mode
  • Feedback loop design and improvement cadence
  • Monitoring and alerting plan
  • A/B testing approach for model and prompt changes
  • Ethical review checklist

Common Mistakes to Avoid

Across the dozens of AI feature PRDs I have reviewed at different companies, these are the mistakes that show up most frequently:

    Specifying a model instead of capabilities. "We will use GPT-4" is not a requirement. It is a premature implementation decision. Define what you need the model to do, and let engineering evaluate options against those requirements.

    Ignoring cost at scale. A prototype that costs $0.03 per query seems cheap until you multiply it by 50,000 daily active users. Always model costs at your 12-month usage projection.

    Treating accuracy as a single number. Aggregate accuracy hides critical failures. A model with 95% average accuracy might have 60% accuracy on your most important category. Break eval criteria down by segment, use case, and risk tier.

    No fallback behavior. If the model goes down or produces garbage, what does the user see? If your PRD does not answer this question, your users will find out the answer the hard way during an outage.

    Skipping the ethics review. AI features can discriminate, manipulate, or mislead at scale. Define your ethical boundaries in the PRD, not after launch when the press coverage forces you to.


Putting It into Practice

    The best way to adopt this framework is incrementally. You do not need to rewrite every PRD template overnight. Start with your next AI feature and add the eval criteria and hallucination tolerance sections. Those two alone will sharpen your engineering conversations and launch decisions.

    As your team ships more AI features, build a shared eval library that standardizes how you measure quality across the product. Over time, this library becomes one of your most valuable assets because it encodes what "good" means for your specific users and use cases, something no off-the-shelf framework can provide.

    The PRD is not just a document. It is a forcing function that makes your team think through the hard questions before code gets written. For AI features, those hard questions are different from traditional software, and your PRD needs to reflect that.
