Product Management · 10 min read

How to Run AI Product Evals Without Being an Engineer

A practical guide for product managers who want to evaluate AI feature quality without writing code, covering spreadsheet-based evals, prompt testing, and quality frameworks.

By Tim Adair · Published 2026-02-09

Why PMs Need to Run Evals Themselves

If you are a product manager working on AI features and you have never personally evaluated your model's outputs, you are making product decisions blind. You are relying entirely on aggregate accuracy numbers from your ML team, numbers that hide the specific failures your users will encounter.

This is not a criticism of ML engineers. They evaluate from a technical perspective: precision, recall, F1 scores, perplexity. You need to evaluate from a product perspective: does this output actually help the user accomplish their goal? Is it trustworthy? Would I be embarrassed if a customer screenshotted this and posted it on social media?

The good news is that you do not need to write Python or understand PyTorch to run meaningful evals. This guide covers practical techniques that any PM can use with tools you already have.


The Spreadsheet Eval Method

The simplest and most effective eval method for PMs is a structured spreadsheet. It sounds low-tech because it is, and that is its strength. No setup, no dependencies, no engineering support needed.

Setting up your eval spreadsheet

Create a spreadsheet with these columns:

  • Test input: The prompt or user query you are testing.
  • Expected behavior: What a good output looks like in plain language.
  • Actual output: Paste the model's real response.
  • Rating: Score on a scale that matches your use case.
  • Failure category: If the output is bad, why? Categorize it.
  • Notes: Any context that might explain the result.
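
If you want to hand teammates a ready-made template, a short script like the one below writes the same columns to a CSV you can import into Google Sheets or Excel. The file and column names are just illustrations of the list above.

```python
import csv

# Column headers mirroring the eval spreadsheet described above.
COLUMNS = [
    "test_input",         # the prompt or user query being tested
    "expected_behavior",  # what a good output looks like, in plain language
    "actual_output",      # the model's real response, pasted in
    "rating",             # e.g. Good / Acceptable / Bad
    "failure_category",   # why a bad output failed
    "notes",              # any context that might explain the result
]

# Write an empty template you can import into Google Sheets or Excel.
with open("eval_template.csv", "w", newline="") as f:
    csv.writer(f).writerow(COLUMNS)
```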

Building your test set

Your test set is the most important part of the eval. Start with 50-100 test cases from these sources:

Real user inputs: Pull actual queries from your product's logs. These represent how users actually use the feature, not how you imagine they will use it.

Edge cases from support tickets: When users report AI failures, add those inputs to your test set. Over time, your test set accumulates the hardest cases.

Adversarial inputs: Deliberately try to break the feature. What happens with empty inputs, extremely long inputs, inputs in unsupported languages, or inputs that contain contradictions?

Segment-specific inputs: If your product serves multiple personas, make sure your test set covers each one.
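
For the real-user-inputs source, if your logs can be exported to a file, a minimal sketch like this can draw the random sample for you. The file name user_queries.csv and its query column are assumptions; adjust them to match whatever your logging tool actually exports.

```python
import csv
import random

# Hypothetical export of user queries from your product's logs.
with open("user_queries.csv", newline="") as f:
    queries = [row["query"] for row in csv.DictReader(f)]

# Sample up to 50 inputs at random so the test set reflects real usage,
# not just the queries you happen to remember.
random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(queries, k=min(50, len(queries)))

with open("test_set_inputs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["test_input"])
    writer.writerows([q] for q in sample)
```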

Rating scales that work

Avoid 1-10 scales. People use them inconsistently. Use one of these instead:

Binary (Pass/Fail): Best for tasks with clear success criteria.

Three-point scale (Good/Acceptable/Bad): Best for generative tasks. Good means ship without changes. Acceptable means needs minor edits. Bad means wrong, misleading, or unhelpful.

Task completion scale: Did the user accomplish their goal? Complete success, partial success, or failure.


Running Evals in the AI Playground

Most model providers offer a playground or sandbox where you can test prompts interactively. This is your primary tool for understanding model behavior.

Structured playground testing

  • Baseline run: Test your current prompt against your full test set. Record every output.
  • Change one variable: Modify one thing and run the full test set again.
  • Compare: Look at what improved, what degraded, and what stayed the same.

Temperature testing

Temperature controls randomness. Run your test set at three temperature settings: 0, 0.3, and 0.7. Factual tasks perform better at temperature 0. Creative generation benefits from higher temperatures. Conversational features typically work best around 0.3-0.5.
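
If you would rather script the sweep than click through the playground three times, a minimal sketch using the OpenAI Python SDK might look like the one below; the model name, prompt, and test input are placeholders, and other providers' SDKs follow the same pattern.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Summarize the customer's billing question in two sentences."  # placeholder
TEST_INPUT = "Hi, I was charged twice last month and once this month."  # placeholder

# Run the same input at three temperatures and compare the outputs by eye.
for temperature in (0.0, 0.3, 0.7):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use the model behind your feature
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": TEST_INPUT},
        ],
        temperature=temperature,
    )
    print(f"--- temperature {temperature} ---")
    print(response.choices[0].message.content)
```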

Prompt versioning

Keep a log of every prompt version you test with its eval results. Track version number, change description, eval date, pass rate, failure breakdown, and decision (ship, iterate, or revert).
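
The log can live in another tab of your eval spreadsheet, or, if you prefer a file, a small sketch like this appends one row per version; the field names and example values are purely illustrative.

```python
import csv
import os
from datetime import date

LOG_PATH = "prompt_versions.csv"
LOG_FIELDS = ["version", "change_description", "eval_date",
              "pass_rate", "failure_breakdown", "decision"]

def log_prompt_version(row: dict) -> None:
    """Append one prompt version's eval results to the running log."""
    write_header = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Illustrative entry for one prompt iteration.
log_prompt_version({
    "version": "v4",
    "change_description": "Ask for clarification on ambiguous queries",
    "eval_date": date.today().isoformat(),
    "pass_rate": "82%",
    "failure_breakdown": "missed intent: 6, format: 3, tone: 1",
    "decision": "iterate",
})
```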


The Failure Taxonomy

When an AI output is bad, "it is wrong" is not useful feedback. You need a taxonomy of failure modes.

Factual errors

  • Hallucinated facts: The model invents information that sounds plausible but is fabricated.
  • Outdated information: The model provides information that was once correct but is no longer.
  • Confused entities: The model mixes up names, dates, or attributes between similar entities.

Instruction violations

  • Format violations: Asked for a bullet list, got a paragraph.
  • Scope violations: Instructed to only answer billing questions, but responded to a product question.
  • Tone violations: Instructed to be professional, but used casual language.

Helpfulness failures

  • Too vague: Generic advice with no specifics.
  • Too verbose: A 500-word answer to a question that needed two sentences.
  • Missed intent: Answered a different question than what the user asked.

Safety and trust failures

  • Harmful recommendations: Suggesting actions that could cause harm.
  • Privacy violations: Including personal information in outputs where it does not belong.
  • Bias: Outputs that treat different demographic groups differently in quality or tone.

Tracking failure patterns

After categorizing 50-100 failures, patterns emerge. Maybe 40% of your failures are missed intent because your prompt does not handle ambiguous queries well. These patterns tell you exactly where to focus your prompt engineering efforts.
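
Once the failure category column is filled in, the tally is a few lines if you (or an engineer) export the sheet to CSV; a sketch with pandas, assuming the column names from the template earlier:

```python
import pandas as pd

# Load the completed eval spreadsheet (exported as CSV).
df = pd.read_csv("eval_results.csv")

# Count failure categories among the outputs rated Bad.
failures = df[df["rating"] == "Bad"]
breakdown = failures["failure_category"].value_counts(normalize=True)

print(breakdown.round(2))  # e.g. missed intent 0.40, format violation 0.25, ...
```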


Comparing Model Versions

When your ML team proposes switching models or your provider releases a new version, you need to evaluate the impact on your product.

The side-by-side comparison method

Run your full test set against both models. Create a comparison spreadsheet with old output, new output, and a winner column. This gives you improvement rate, regression rate, and neutral rate.
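
All three rates fall straight out of the winner column. A sketch with pandas, assuming the comparison sheet is exported as model_comparison.csv and the winner column holds old, new, or tie:

```python
import pandas as pd

# columns: test_input, old_output, new_output, winner
comparison = pd.read_csv("model_comparison.csv")

rates = comparison["winner"].value_counts(normalize=True)
print(f"Improvement rate (new wins): {rates.get('new', 0):.0%}")
print(f"Regression rate  (old wins): {rates.get('old', 0):.0%}")
print(f"Neutral rate     (ties):     {rates.get('tie', 0):.0%}")
```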

A model upgrade is worth shipping if improvements significantly outweigh regressions and the regressions are not concentrated in high-risk areas.

Blind evaluation

When possible, evaluate outputs without knowing which model produced them. This keeps your expectations about the new model from biasing the ratings.
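
A short script can blind the comparison for you: it shuffles which model appears as output A or B on each row and writes the answer key to a separate file that you only open after rating. The file and column names below are assumptions that match the comparison sheet described above.

```python
import csv
import random

with open("model_comparison.csv", newline="") as f:
    rows = list(csv.DictReader(f))  # columns: test_input, old_output, new_output

blinded, answer_key = [], []
for i, row in enumerate(rows):
    flip = random.random() < 0.5  # randomly decide which model is shown as A
    a, b = (row["new_output"], row["old_output"]) if flip else (row["old_output"], row["new_output"])
    blinded.append({"id": i, "test_input": row["test_input"], "output_a": a, "output_b": b})
    answer_key.append({"id": i, "output_a_is": "new" if flip else "old"})

# Rate blinded_eval.csv first; open answer_key.csv only after all ratings are done.
for path, data, fields in [
    ("blinded_eval.csv", blinded, ["id", "test_input", "output_a", "output_b"]),
    ("answer_key.csv", answer_key, ["id", "output_a_is"]),
]:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(data)
```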


Building an Eval Cadence

Weekly spot checks: Spend 30 minutes reviewing 10-15 real user interactions from the past week.

Monthly full evals: Run your complete test set against the current production model. Compare results month over month.

Before every prompt or model change: Run the full test set before and after. No exceptions.

Quarterly test set review: Add new test cases, remove outdated ones, recalibrate rating criteria.

Sharing eval results

  • For ML engineers: Specific failure examples with categories. They need concrete cases to debug.
  • For designers: Patterns in helpfulness failures that suggest UX improvements.
  • For leadership: Trend lines showing quality improvement, cost per interaction, and business impact. Use metrics they care about.


Tools for Non-Technical Eval

Spreadsheets (Google Sheets or Excel): Still the most flexible eval tool. Use conditional formatting, pivot tables, and charts.

Model provider playgrounds: OpenAI Playground, Anthropic Console, Google AI Studio. Test prompts interactively with full parameter control.

Prompt management tools: Tools like PromptLayer, Humanloop, or Langfuse offer version control and evaluation features without requiring code.

Screen recording: Record yourself using the AI feature for 30 minutes and narrate your reactions. This qualitative eval reveals usability issues that metrics miss.


When to Escalate to Engineering

Spreadsheet evals have limits. Escalate when:

  • Scale: Your test set exceeds 500 cases and manual evaluation takes more than a full day.
  • Automated checks: Certain failures can be detected programmatically (see the sketch after this list).
  • Statistical rigor: You need confidence intervals or significance testing.
  • Regression testing in CI/CD: Prompt changes are frequent enough that manual testing becomes a bottleneck.

At that point, work with ML engineers to automate the pipeline, but keep yourself in the loop as the person who defines what "good" means.
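
When that handoff happens, rule-based checks are a natural first step because they encode your failure taxonomy directly. A sketch of the kind of programmatic check an engineer might wire into the pipeline, with purely illustrative rules:

```python
def check_output(output: str) -> list[str]:
    """Return a list of rule-based failure flags for one model output."""
    failures = []

    # Format violation: we asked for a bullet list.
    if not any(line.lstrip().startswith(("-", "*", "•")) for line in output.splitlines()):
        failures.append("format violation: no bullet list")

    # Verbosity: flag answers that run far past the expected length.
    if len(output.split()) > 300:
        failures.append("too verbose: over 300 words")

    # Scope violation: illustrative banned topics for a billing-only assistant.
    for banned in ("product roadmap", "legal advice"):
        if banned in output.lower():
            failures.append(f"scope violation: mentions '{banned}'")

    return failures

print(check_output("Here is a long paragraph with no bullets at all."))
```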


Getting Started This Week

Day 1: Pull 50 real user inputs from your product's logs.

Day 2: Set up your eval spreadsheet. Run all 50 inputs and record outputs.

Day 3: Rate every output using the three-point scale. Categorize every Bad output.

Day 4: Analyze results. What is your pass rate? What are the top three failure categories?

Day 5: Share the summary with your ML engineer and designer. Agree on top improvement priorities.

You will learn more about your AI feature's real-world quality in this one week than you would from months of looking at aggregate accuracy dashboards. And that learning is what makes you an effective AI product manager.
