Definition
A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines the expected level of service -- typically covering uptime (availability), performance (latency, throughput), and support responsiveness. SLAs include measurable targets and specify what happens when those targets are missed, usually in the form of financial credits or penalties.
The SLA sits at the top of a three-level hierarchy. Service Level Indicators (SLIs) are the raw metrics: request latency, error rate, uptime percentage. Service Level Objectives (SLOs) are the internal targets engineering teams aim for. SLAs are the external, contractual commitments -- set at or below the SLO (that is, never stricter than it) to provide a margin of safety. Google popularized this hierarchy in its Site Reliability Engineering (SRE) book.
Common SLA tiers in SaaS: 99.9% uptime ("three nines") allows roughly 8.76 hours of downtime per year. 99.99% ("four nines") allows about 52.6 minutes per year. 99.999% ("five nines") allows about 5.3 minutes per year. Each additional nine cuts the allowed downtime by a factor of ten and typically demands disproportionately more engineering investment in redundancy, failover, and monitoring.
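The downtime figures for each tier follow from simple arithmetic: allowed downtime is (1 - availability) multiplied by the length of the period. The sketch below is illustrative only. It assumes a 365-day year and that every minute of downtime counts against the target (real SLAs often exclude scheduled maintenance), and the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap, 365-day year


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    return (1 - availability_pct / 100) * period_minutes


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Passing a different period_minutes (for example 30 * 24 * 60 for a 30-day month) gives the monthly allowance, which is how many published SLAs are measured.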
Why It Matters for Product Managers
SLAs constrain what you can ship and how you ship it. If your product guarantees 99.99% uptime, you cannot deploy changes during business hours without zero-downtime deployment practices. Maintenance windows shrink. The bar for testing before production rises. Every architecture decision must consider failure modes.
For PMs at B2B SaaS companies, SLAs are also a competitive differentiator and a pricing lever. Enterprise customers evaluate SLAs during procurement. A startup offering 99.9% uptime can lose deals to a competitor offering 99.99% -- assuming both can actually deliver. Promising an SLA you cannot meet is worse than not offering one, because breaches erode trust and cost real money in service credits.
PMs also need to understand SLAs when their product depends on third-party services. If your payment processor has a 99.9% SLA and your notification provider has a 99.9% SLA, and both must be up for the flow to succeed, your combined checkout-plus-notification flow has a theoretical maximum of roughly 99.8% availability (0.999 x 0.999, assuming the two fail independently). Each dependency in your stack compounds the risk.
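The compounding effect can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions: every dependency must be up for the flow to succeed, and failures are independent of each other (an idealization; correlated outages change the result). The function name is hypothetical.

```python
from math import prod


def serial_availability(*availabilities_pct: float) -> float:
    """Combined availability (%) when every dependency must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(a / 100 for a in availabilities_pct) * 100


if __name__ == "__main__":
    # Payment processor and notification provider, each with a 99.9% SLA.
    print(f"{serial_availability(99.9, 99.9):.2f}%")
    # A third 99.9% dependency lowers the ceiling further.
    print(f"{serial_availability(99.9, 99.9, 99.9):.2f}%")
```

Adding dependencies to the call shows how quickly the ceiling drops, which is useful when deciding whether a new third-party service belongs on the critical path.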
How It Works in Practice
Define SLIs -- Work with engineering to identify the metrics that matter most to customers. For a web application: availability (percentage of successful responses), latency (95th percentile response time), and error rate. For an API: all of the above plus throughput and rate limit fairness.
Set SLOs -- Establish internal targets that are stricter than the customer-facing SLA. If the SLA promises 99.9% uptime, the SLO should target 99.95%. This gives the team an error budget -- a known amount of acceptable downtime that can be "spent" on risky deployments or experiments.
Formalize the SLA -- Legal and product teams draft the customer-facing agreement specifying the commitment, measurement methodology, exclusions (e.g., scheduled maintenance, customer-caused issues), and remedies (service credits). Salesforce, AWS, and Azure all publish their SLAs publicly.
Monitor continuously -- Automated dashboards track SLIs against SLOs in real time. When an SLI approaches the SLO threshold, alerts fire. Teams that follow Google's SRE practices manage this with error budgets: for example, if the monthly error budget is already 50% consumed by mid-month, the team may freeze risky deployments until reliability recovers.
Report and remediate -- Provide customers with regular uptime reports (monthly or quarterly). When the SLA is breached, issue service credits proactively rather than waiting for claims. This builds trust even when things go wrong.
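The first four steps can be sketched in code. The example below is a minimal illustration, not a production monitoring setup: it computes a request-based availability SLI, derives an error budget from the 99.95% SLO used above, and applies a freeze rule based on the 50% figure from the monitoring step. The function names, the request counts, and the expected monthly traffic are all hypothetical.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # no traffic in the window, so nothing failed
    return (total_requests - failed_requests) / total_requests * 100


def error_budget_requests(slo_pct: float, expected_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the whole period."""
    return (1 - slo_pct / 100) * expected_requests


def budget_consumed(failed_so_far: int, slo_pct: float,
                    expected_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget_requests(slo_pct, expected_requests)
    if budget <= 0:
        # A 100% SLO leaves no budget: any failure exhausts it.
        return float("inf") if failed_so_far else 0.0
    return failed_so_far / budget


def should_freeze_deploys(consumed: float, threshold: float = 0.5) -> bool:
    """Illustrative policy: freeze risky deployments past the threshold."""
    return consumed >= threshold


if __name__ == "__main__":
    # Hypothetical mid-month snapshot: 4M requests served, 3,000 failed,
    # with 10M requests expected over the full month.
    print(f"SLI so far: {availability_sli(4_000_000, 3_000):.3f}%")
    consumed = budget_consumed(3_000, slo_pct=99.95, expected_requests=10_000_000)
    print(f"Error budget consumed: {consumed:.0%}")
    print("Freeze risky deployments:", should_freeze_deploys(consumed))
```

A request-based budget is only one option; a time-based budget (minutes of downtime per month) works the same way with minutes in place of request counts. The freeze threshold is a team policy choice, not a fixed rule.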
Common Pitfalls
Promising more than you can deliver. An SLA of 99.99% uptime requires redundant infrastructure, automated failover, zero-downtime deployments, and 24/7 on-call coverage. If your engineering team does not have these capabilities, set a lower SLA and invest in the infrastructure to raise it over time.
Measuring the wrong SLIs. Server uptime is not the same as user-perceived availability. Your server might return 200 OK while serving a blank page due to a front-end bug. Measure what customers experience, not what your server reports.
Ignoring the cost of each nine. Going from 99.9% to 99.99% might double your infrastructure bill and require hiring an SRE team. PMs should model the ROI: does the additional reliability win enough enterprise deals to justify the cost?
SLAs without error budgets. An SLA without an internal error budget means the team either never takes risks (no deployments, no experiments) or constantly violates the SLA. Error budgets provide a structured way to balance reliability with velocity.
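The gap described under "Measuring the wrong SLIs" can be made concrete with a synthetic check. The sketch below compares server-reported success (any 2xx status) with a user-perceived check that also requires expected content in the response body. The Probe type, the marker string, and the sample data are hypothetical, and a real check would usually exercise the rendered page rather than a substring.

```python
from typing import Iterable, NamedTuple


class Probe(NamedTuple):
    """One synthetic check result (hypothetical shape)."""
    status: int
    body: str


def server_ok(probe: Probe) -> bool:
    """What the server reports: any 2xx status counts as success."""
    return 200 <= probe.status < 300


def user_ok(probe: Probe, expected_marker: str) -> bool:
    """What the customer experiences: 2xx AND the expected content is present."""
    return server_ok(probe) and expected_marker in probe.body


def percent(flags: Iterable[bool]) -> float:
    """Share of True values, as a percentage."""
    results = list(flags)
    return 100.0 * sum(results) / len(results) if results else 100.0


if __name__ == "__main__":
    probes = [
        Probe(200, "<h1>Dashboard</h1>"),
        Probe(200, ""),               # 200 OK but a blank page
        Probe(500, "error"),
        Probe(200, "<h1>Dashboard</h1>"),
    ]
    print("Server-reported:", percent(server_ok(p) for p in probes))
    print("User-perceived:", percent(user_ok(p, "Dashboard") for p in probes))
```

In this made-up sample the two numbers diverge because the blank 200 response passes the server check but fails the content check, which is exactly the case a status-only SLI would miss.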
Related Concepts
DevOps -- the practices and culture that enable teams to meet SLA commitments through automation and monitoring
Continuous Delivery -- the deployment practice that supports SLA compliance through safer, incremental releases
Dependency -- each external dependency in your stack affects your ability to meet SLA targets