Why PMs Need to Run Evals Themselves
If you are a product manager working on AI features and you have never personally evaluated your model's outputs, you are making product decisions blind. You are relying entirely on aggregate accuracy numbers from your ML team, numbers that hide the specific failures your users will encounter.
This is not a criticism of ML engineers. They evaluate from a technical perspective: precision, recall, F1 scores, perplexity. You need to evaluate from a product perspective: does this output actually help the user accomplish their goal? Is it trustworthy? Would I be embarrassed if a customer screenshotted this and posted it on social media?
The good news is that you do not need to write Python or understand PyTorch to run meaningful evals. This guide covers practical techniques that any PM can use with tools you already have.
The Spreadsheet Eval Method
The simplest and most effective eval method for PMs is a structured spreadsheet. It sounds low-tech because it is, and that is its strength. No setup, no dependencies, no engineering support needed.
Setting up your eval spreadsheet
Create a spreadsheet with these columns: test input, input source (real user, support ticket, adversarial, persona), model output, rating, failure category, and notes.
Building your test set
Your test set is the most important part of the eval. Start with 50-100 test cases from these sources:
Real user inputs: Pull actual queries from your product's logs. These represent how users actually use the feature, not how you imagine they will use it. A sampling sketch follows this list.
Edge cases from support tickets: When users report AI failures, add those inputs to your test set. Over time, your test set accumulates the hardest cases.
Adversarial inputs: Deliberately try to break the feature. What happens with empty inputs, extremely long inputs, inputs in unsupported languages, or inputs that contain contradictions?
Segment-specific inputs: If your product serves multiple personas, make sure your test set covers each one.
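If your product's logs export to CSV, a short script can pull the random sample for you. This is a minimal sketch in Python, not a required step: the file name user_queries.csv, the query column, and the sample size are assumptions to swap for whatever your export actually contains.

    # Sketch: sample 50 real user inputs from a log export to seed the eval test set.
    # Assumes a "user_queries.csv" file with a "query" column; adjust names to your export.
    import csv
    import random

    with open("user_queries.csv", newline="", encoding="utf-8") as f:
        queries = [row["query"].strip() for row in csv.DictReader(f) if row["query"].strip()]

    sample = random.sample(queries, k=min(50, len(queries)))

    with open("eval_test_set.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["test_input", "source", "model_output", "rating", "failure_category", "notes"])
        for query in sample:
            writer.writerow([query, "real user log", "", "", "", ""])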
Rating scales that work
Avoid 1-10 scales. People use them inconsistently: one rater's 6 is another rater's 8, and the same rater drifts over time. Use one of these instead:
Binary (Pass/Fail): Best for tasks with clear success criteria.
Three-point scale (Good/Acceptable/Bad): Best for generative tasks. Good means ship without changes. Acceptable means needs minor edits. Bad means wrong, misleading, or unhelpful.
Task completion scale: Did the user accomplish their goal? Complete success, partial success, or failure.
Running Evals in the AI Playground
Most model providers offer a playground or sandbox where you can test prompts interactively. This is your primary tool for understanding model behavior.
Structured playground testing
Do not just poke at the playground casually. Work through your test set one input at a time with the same prompt template, and paste each output back into your eval spreadsheet so the session produces data rather than impressions.
Temperature testing
Temperature controls randomness. Run your test set at three temperature settings: 0, 0.3, and 0.7. Factual tasks perform better at temperature 0. Creative generation benefits from higher temperatures. Conversational features typically work best around 0.3-0.5.
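If you want to repeat the same sweep outside the playground, a provider API makes it easy to rerun. The sketch below uses the openai Python package as one example; the model name and prompt are placeholders, and it assumes an API key is already configured in your environment.

    # Sketch: run one test input at three temperature settings and compare the outputs.
    # Assumes the openai package is installed and OPENAI_API_KEY is set;
    # the model name and prompt are placeholders to replace with your own.
    from openai import OpenAI

    client = OpenAI()
    test_input = "Summarize our refund policy for a frustrated customer."

    for temperature in (0, 0.3, 0.7):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": test_input}],
            temperature=temperature,
        )
        print(f"--- temperature {temperature} ---")
        print(response.choices[0].message.content)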
Prompt versioning
Keep a log of every prompt version you test with its eval results. Track version number, change description, eval date, pass rate, failure breakdown, and decision (ship, iterate, or revert).
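The log can be another tab in your eval spreadsheet, or a small CSV that a script appends to after every run. A minimal sketch of the latter, with illustrative example values:

    # Sketch: append one row to a prompt version log after each eval run.
    # The file name and the example values are illustrative only.
    import csv
    import os
    from datetime import date

    LOG_FILE = "prompt_version_log.csv"
    FIELDS = ["version", "change_description", "eval_date", "pass_rate", "failure_breakdown", "decision"]

    def log_prompt_version(row: dict) -> None:
        """Append one row, writing the header first if the log does not exist yet."""
        write_header = not os.path.exists(LOG_FILE)
        with open(LOG_FILE, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if write_header:
                writer.writeheader()
            writer.writerow(row)

    log_prompt_version({
        "version": "v14",
        "change_description": "Added explicit handling for ambiguous queries",
        "eval_date": date.today().isoformat(),
        "pass_rate": "78%",
        "failure_breakdown": "missed intent 6, factual 3, format 2",
        "decision": "iterate",
    })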
The Failure Taxonomy
When an AI output is bad, "it is wrong" is not useful feedback. You need a taxonomy of failure modes.
Factual errors: hallucinated facts, wrong numbers, outdated information, or invented sources.
Instruction violations: the output ignores constraints in the prompt, such as required format, length, tone, or fields.
Helpfulness failures: the output is not wrong, but it misses the user's intent, stays too vague, or buries the answer.
Safety and trust failures: harmful, biased, or overconfident outputs, or anything that would damage user trust if it shipped.
Tracking failure patterns
After categorizing 50-100 failures, patterns emerge. Maybe 40% of your failures are missed intent because your prompt does not handle ambiguous queries well. These patterns tell you exactly where to focus your prompt engineering efforts.
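Once the rated spreadsheet is exported to CSV, the breakdown takes a few lines to compute. The column names below assume the spreadsheet layout described earlier; adjust them to match yours.

    # Sketch: count failure categories among outputs rated Bad in an exported eval spreadsheet.
    # Assumes "rating" and "failure_category" columns, as in the layout described earlier.
    import csv
    from collections import Counter

    with open("eval_test_set.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    failures = [r["failure_category"] for r in rows if r["rating"].strip().lower() == "bad"]

    if failures:
        for category, count in Counter(failures).most_common():
            print(f"{category}: {count} ({count / len(failures):.0%} of failures)")
    else:
        print("No outputs rated Bad in this run.")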
Comparing Model Versions
When your ML team proposes switching models or your provider releases a new version, you need to evaluate the impact on your product.
The side-by-side comparison method
Run your full test set against both models. Create a comparison spreadsheet with the test input, the old model's output, the new model's output, and a winner column. Tallying the winner column gives you an improvement rate, a regression rate, and a neutral rate.
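If you export the comparison spreadsheet to CSV, the three rates fall out of a quick tally. The sketch below assumes the winner column uses the labels new, old, and tie; use whatever labels you like and adjust to match.

    # Sketch: compute improvement, regression, and neutral rates from a comparison export.
    # Assumes a "winner" column containing "new", "old", or "tie".
    import csv
    from collections import Counter

    with open("model_comparison.csv", newline="", encoding="utf-8") as f:
        winners = Counter(row["winner"].strip().lower() for row in csv.DictReader(f))

    total = sum(winners.values())
    print(f"Improvement rate: {winners['new'] / total:.0%}")  # new model judged better
    print(f"Regression rate:  {winners['old'] / total:.0%}")  # old model judged better
    print(f"Neutral rate:     {winners['tie'] / total:.0%}")  # no meaningful difference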
A model upgrade is worth shipping if improvements significantly outweigh regressions and the regressions are not concentrated in high-risk areas.
Blind evaluation
When possible, evaluate outputs without knowing which model produced them. Knowing which output came from the new model biases you toward rating it the way you expect it to perform; blinding removes that pull.
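One low-effort way to blind yourself is to have a colleague, or a short script, shuffle which model's output lands in column A versus column B and hold the answer key until rating is finished. A sketch, assuming the comparison file from above with input, old_output, and new_output columns:

    # Sketch: blind a model comparison so the rater cannot tell which output came from which model.
    # Assumes "input", "old_output", and "new_output" columns; writes a blinded file plus an answer key.
    import csv
    import random

    with open("model_comparison.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    with open("blinded_comparison.csv", "w", newline="", encoding="utf-8") as blinded, \
         open("blinding_key.csv", "w", newline="", encoding="utf-8") as key:
        blinded_writer = csv.writer(blinded)
        key_writer = csv.writer(key)
        blinded_writer.writerow(["input", "output_a", "output_b", "winner"])
        key_writer.writerow(["input", "output_a_is"])
        for row in rows:
            if random.random() < 0.5:
                a, b, a_is = row["old_output"], row["new_output"], "old"
            else:
                a, b, a_is = row["new_output"], row["old_output"], "new"
            blinded_writer.writerow([row["input"], a, b, ""])
            key_writer.writerow([row["input"], a_is])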
Building an Eval Cadence
Recommended cadence
Weekly spot checks: Spend 30 minutes reviewing 10-15 real user interactions from the past week.
Monthly full evals: Run your complete test set against the current production model. Compare results month over month.
Before every prompt or model change: Run the full test set before and after. No exceptions.
Quarterly test set review: Add new test cases, remove outdated ones, recalibrate rating criteria.
Sharing eval results
After each monthly eval, write a short summary: the pass rate, the trend versus last month, the top three failure categories with one real example each, and the changes you plan to make. Share it with your ML engineer, your designer, and your leadership so that quality debates happen over the same evidence.
Tools for Non-Technical Eval
Spreadsheets (Google Sheets or Excel): Still the most flexible eval tool. Use conditional formatting, pivot tables, and charts.
Model provider playgrounds: OpenAI Playground, Anthropic Console, Google AI Studio. Test prompts interactively with full parameter control.
Prompt management tools: Tools like PromptLayer, Humanloop, or Langfuse offer version control and evaluation features without requiring code.
Screen recording: Record yourself using the AI feature for 30 minutes and narrate your reactions. This qualitative eval reveals usability issues that metrics miss.
When to Escalate to Engineering
Spreadsheet evals have limits. Escalate when your test set grows beyond what you can run by hand, when evals need to run automatically on every prompt or model change, or when you need statistical confidence across thousands of production interactions.
At that point, work with ML engineers to automate the pipeline, but keep yourself in the loop as the person who defines what "good" means.
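The automated pipeline is still your spreadsheet loop, just run by a machine: call the model on every test input, rate the output, count the failures. A minimal sketch of that skeleton is below; run_model and rate_output are deliberately left as stubs, because the model call belongs to engineering and the rating criteria belong to you.

    # Sketch of an automated eval loop: the same structure as the manual spreadsheet process.
    # run_model() and rate_output() are hypothetical stubs to be filled in with engineering.
    import csv

    def run_model(test_input: str) -> str:
        """Stub: call your production prompt and model here."""
        raise NotImplementedError

    def rate_output(test_input: str, output: str) -> str:
        """Stub: return 'good', 'acceptable', or 'bad' using the criteria from your manual evals."""
        raise NotImplementedError

    def run_eval(test_set_path: str) -> None:
        ratings = []
        with open(test_set_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                output = run_model(row["test_input"])
                ratings.append(rate_output(row["test_input"], output))
        passed = sum(1 for r in ratings if r in ("good", "acceptable"))
        print(f"Pass rate: {passed / len(ratings):.0%} across {len(ratings)} test cases")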
Getting Started This Week
Day 1: Pull 50 real user inputs from your product's logs.
Day 2: Set up your eval spreadsheet. Run all 50 inputs and record outputs.
Day 3: Rate every output using the three-point scale. Categorize every Bad output.
Day 4: Analyze results. What is your pass rate? What are the top three failure categories?
Day 5: Share the summary with your ML engineer and designer. Agree on top improvement priorities.
You will learn more about your AI feature's real-world quality in this one week than you would from months of looking at aggregate accuracy dashboards. And that learning is what makes you an effective AI product manager.