SRE and Platform Engineering
In the first post of this series, we established what platform engineering is and how it differs from related disciplines. One of those disciplines - Site Reliability Engineering (SRE) - deserves deeper exploration because of how fundamentally intertwined it is with platform engineering success.
Understanding the relationship between SRE and platform engineering is crucial for both engineering leaders building these capabilities and engineers working in these roles. While distinct disciplines with different primary focuses, they share common goals and reinforce each other in powerful ways.
What Is Site Reliability Engineering?
Site Reliability Engineering originated at Google around 2003, born from a practical problem: the traditional operations model doesn’t scale. As systems grow in complexity and traffic, you can’t rely on manual intervention and tribal knowledge to keep things running. You need engineers who can both understand the systems deeply and write code to solve operational problems systematically.
The core principle: apply software engineering discipline to operations problems.
This isn’t just “ops people who can code” or “devs who got paged too much.” It’s a fundamental rethinking of how to approach reliability. SRE teams own the reliability of services, but they do it by building systems, automation, and processes rather than through manual heroics.
The SRE Framework
SRE is built on several key concepts that form a comprehensive approach to reliability:
Service Level Indicators, Objectives, and Agreements (SLIs, SLOs, SLAs)
These form SRE’s reliability measurement framework:
Service Level Indicators (SLIs) are the raw measurements that matter to users - request latency, error rate, system throughput, availability. These are concrete, measurable signals of service health from the user’s perspective.
Service Level Objectives (SLOs) are the targets you set for those measurements. “99.9% of requests complete in under 200ms” or “99.95% availability measured over a rolling 30-day window.” SLOs define what “good enough” reliability looks like.
Service Level Agreements (SLAs) are the external-facing promises you make to customers, often with financial consequences. These are business decisions that should be informed by, but less stringent than, your SLOs.
The relationship is hierarchical: SLIs are measurements, SLOs are internal targets based on those measurements, and SLAs are external commitments set looser than your SLOs so there is buffer to recover before breaching a contract.
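To make the hierarchy concrete, here's a minimal Python sketch of an availability SLI checked against an SLO and a looser SLA. The request counts, targets, and function name are illustrative, not tied to any particular monitoring system.

```python
# Minimal sketch: computing an availability SLI and comparing it to an SLO and SLA.
# The request counts and targets are illustrative values, not real production data.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded, from the user's perspective."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing to violate
    return good_requests / total_requests

SLO_TARGET = 0.999  # internal objective: 99.9% of requests succeed
SLA_TARGET = 0.995  # external promise, deliberately looser than the SLO

sli = availability_sli(good_requests=998_650, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}, SLA met: {sli >= SLA_TARGET}")
```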
Error Budgets
This is where SRE gets strategically interesting. If your SLO is 99.9% availability, you have a 0.1% “budget” to spend on unreliability - planned or unplanned downtime, degraded performance, whatever causes SLO violations.
The error budget becomes the data-driven mediator between velocity and stability:
When you’re within budget, you have room to take risks. Deploy that new feature. Try the experimental optimization. Move fast.
When you’ve burned through your error budget, you slow down. Focus on reliability improvements. Pay down technical debt. Strengthen your foundations.
This eliminates the traditional tension where ops says “no more changes, too risky” and dev says “we need to ship faster.” The error budget is an objective measure everyone can see and discuss.
The error budget also aligns incentives. Development teams are incentivized to ship reliable code because burning the budget means slower releases. Operations teams are incentivized to enable deployments because being too conservative wastes error budget that could be spent on features.
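As a rough sketch of the arithmetic and the policy it enables: a 99.9% SLO over a 30-day window leaves about 43 minutes of full downtime to "spend," and the remaining budget can drive a simple ship/slow-down decision. The thresholds below are illustrative, not a prescriptive policy.

```python
# Sketch of error-budget arithmetic for a 99.9% SLO over a rolling 30-day window.
# Numbers and the gating policy are illustrative.

SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Error budget: {budget_minutes:.1f} minutes of full downtime")  # ~43.2

def remaining_budget(downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent in the current window."""
    return 1 - downtime_minutes / budget_minutes

def release_gate(downtime_minutes: float) -> str:
    """Illustrative policy: ship while budget remains, shift to reliability when it's gone."""
    if remaining_budget(downtime_minutes) > 0:
        return "ship features"
    return "focus on reliability work"

print(release_gate(12.0))  # plenty of budget left -> keep shipping
print(release_gate(45.0))  # budget exhausted -> slow down
```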
Toil Reduction
SRE defines toil as operational work that is:
- Manual (requires human intervention)
- Repetitive (done over and over)
- Automatable (could be done by a machine)
- Tactical (interrupt-driven, reactive)
- Without enduring value (doesn’t permanently improve the system)
Classic examples: manually restarting services, responding to alerts that require the same action every time, scaling infrastructure up and down based on predictable patterns, rotating credentials manually.
SRE teams measure toil and actively work to eliminate it. Google’s benchmark is keeping toil below 50% of an SRE’s time - the other 50%+ should be spent on engineering work that reduces future toil.
This is crucial because toil grows linearly with service scale. If your service doubles in size and you handle scaling manually, you’ve doubled your toil. Automation scales differently - the investment is upfront, then it handles 10x or 100x scale with minimal additional effort.
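A toy sketch of what eliminating one piece of toil can look like: an alert whose fix is always the same gets an automated handler instead of a human following a runbook. The alert name and restart command are hypothetical placeholders.

```python
# Toy sketch: replacing a repetitive manual runbook step with automation.
# The alert name and restart command are hypothetical placeholders.
import subprocess

REMEDIATIONS = {
    # alert name -> the command a human used to run by hand, every single time
    "worker-queue-stuck": ["systemctl", "restart", "worker"],
}

def handle_alert(alert_name: str) -> None:
    """Apply the known fix automatically; escalate only when there isn't one."""
    command = REMEDIATIONS.get(alert_name)
    if command is None:
        print(f"{alert_name}: no known remediation, escalating to on-call")
        return
    subprocess.run(command, check=True)
    print(f"{alert_name}: remediation applied automatically")
```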
Incident Management and Learning
SREs own the response to production incidents:
During incidents - Clear roles (incident commander, communications lead, subject matter experts), structured communication, focus on restoration over root cause analysis.
After incidents - Blameless post-mortems that focus on system and process failures, not individual mistakes. The goal is learning and improvement, not punishment.
Preventing recurrence - Action items from post-mortems that address root causes and systemic issues, not just symptoms.
The incident management process itself becomes a system that can be improved. Over time, organizations get better at responding to incidents, which reduces their impact and duration.
How SRE and Platform Engineering Intersect
Now we get to the heart of it: these disciplines are deeply complementary. Neither replaces the other, but they work together in powerful ways.
Standards Definition vs. Scale Enforcement
SRE defines what good looks like. Based on experience running services and analyzing incidents, SREs establish reliability standards: all services need health checks, circuit breakers for external dependencies, structured logging for debugging, proper monitoring and alerting.
Platform engineering enforces these standards at scale. Instead of every team implementing health checks differently (or forgetting them entirely), the platform makes health checks mandatory and provides the scaffolding to implement them correctly. The platform bakes SRE’s hard-won lessons into the deployment process.
This division of labor is efficient. SREs can focus on high-leverage work - understanding system behavior, defining appropriate SLOs, improving incident response - rather than reviewing every team’s monitoring configuration. The platform ensures consistent implementation across hundreds of services.
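As an illustration of what "enforcing at scale" can look like, here's a minimal sketch of a platform-side check that rejects a deployment whose containers lack health checks. The manifest shape follows Kubernetes conventions, but the check itself is a simplified assumption, not a real admission controller.

```python
# Sketch of a platform-side check that enforces an SRE standard:
# every container must declare liveness and readiness probes before deploy.

def missing_probes(deployment: dict) -> list[str]:
    """Return the names of containers that lack required health checks."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    return [
        c.get("name", "<unnamed>")
        for c in containers
        if "livenessProbe" not in c or "readinessProbe" not in c
    ]

manifest = {
    "spec": {"template": {"spec": {"containers": [{"name": "api", "image": "api:1.2.3"}]}}}
}
problems = missing_probes(manifest)
if problems:
    print(f"Rejected: containers missing health checks: {problems}")
```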
Error Budgets Inform Platform Investment
The platform itself needs SLOs and error budgets. If your deployment pipeline has an SLO of “95% of deployments succeed without manual intervention,” you can track whether you’re meeting that target.
When a platform capability burns error budget - say, your database provisioning service is unreliable - that’s a clear signal about where to invest. Error budgets make platform prioritization more objective.
This also creates accountability. Platform teams can measure whether they’re providing a reliable foundation. If application teams are burning their error budgets due to platform issues rather than their own code, that’s a problem the platform team owns.
Toil Reduction as Shared Mission
Much of platform engineering is automating toil at scale. SREs identify the toil through operational experience; platform engineers build the systems that eliminate it for everyone.
Examples of this partnership:
Manual scaling - SREs notice they’re constantly adjusting resource allocation based on traffic patterns. Platform engineers build autoscaling into the platform with sensible defaults. Teams get automatic scaling without needing to understand Kubernetes HPA configuration.
Certificate rotation - SREs handle certificate expiration incidents repeatedly. Platform engineers build automated certificate management into the platform. Certificates renew automatically; teams never think about it.
Log aggregation - SREs waste time SSHing into individual instances to debug issues. Platform engineers ensure every deployed service automatically sends logs to centralized aggregation. Debugging becomes searching logs in one place, not hunting across dozens of machines.
The pattern: SRE identifies repetitive operational work → Platform team automates it → Toil is eliminated across the entire organization, not just for one team.
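For the autoscaling example, one plausible shape of the platform's contribution is a helper that fills in sensible defaults so teams never touch the HPA spec directly. The default values here (2-10 replicas, 70% CPU target) are illustrative, not recommendations.

```python
# Sketch: platform-provided autoscaling defaults. A developer asks for "autoscaling"
# and the platform generates a HorizontalPodAutoscaler spec with sane defaults.

def default_autoscaler(deployment_name: str, min_replicas: int = 2,
                       max_replicas: int = 10, cpu_target_percent: int = 70) -> dict:
    """Return an autoscaling/v2 HPA spec targeting the named Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment_name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": deployment_name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {"name": "cpu",
                             "target": {"type": "Utilization",
                                        "averageUtilization": cpu_target_percent}},
            }],
        },
    }
```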
Observability as the Critical Bridge
SREs need deep visibility into system behavior to:
- Understand if SLOs are being met
- Debug incidents effectively
- Identify performance bottlenecks
- Spot anomalies before they become incidents
Platforms need to provide that observability automatically. When a developer deploys a new service, the platform should:
- Instrument the service for metrics collection
- Set up log shipping to centralized storage
- Configure distributed tracing
- Create baseline dashboards
- Establish alerts for obvious problems
This is both a platform feature and an SRE requirement. The platform makes observability a default rather than something each team builds from scratch. SREs can then focus on interpreting the data and refining the observability rather than fighting to collect it in the first place.
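One way a platform library might make observability a default is a decorator that records request counts and latency for every handler. This sketch assumes the Python prometheus_client package; the metric and route names are illustrative conventions.

```python
# Sketch of a platform library that gives every service default SLI metrics "for free".
# Assumes the prometheus_client package; metric names are illustrative conventions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

def instrumented(route: str):
    """Decorator the platform applies to handlers so teams get metrics by default."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.monotonic()
            status = "200"
            try:
                return handler(*args, **kwargs)
            except Exception:
                status = "500"
                raise
            finally:
                LATENCY.labels(route).observe(time.monotonic() - start)
                REQUESTS.labels(route, status).inc()
        return inner
    return wrap

start_http_server(9100)  # expose /metrics for scraping
```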
Reliability by Default
This is the ultimate goal of the SRE-platform partnership: making reliability the path of least resistance.
Without a platform, reliability requires expertise and effort. Developers need to understand monitoring systems, learn alerting best practices, implement circuit breakers, configure proper health checks. Many teams don’t have that expertise or don’t prioritize it until after an incident.
With a platform informed by SRE practices:
- Services get monitoring and alerting automatically
- Reliability patterns (circuit breakers, retries with backoff, timeouts) are built into service frameworks
- Deployment strategies (canary deployments, gradual rollouts) are platform features, not custom implementations
- Load balancing, health checking, and graceful shutdown are handled by the platform
Developers who want to opt out or customize can do so, but the default is reliable. This dramatically reduces the cognitive load on developers while raising the baseline reliability across the organization.
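As a small example of a reliability pattern the platform can bake into its client libraries, here's a sketch of retries with exponential backoff and jitter. The defaults (three attempts, 0.2s base delay) are illustrative.

```python
# Sketch of a "reliable by default" client helper: retries with exponential backoff
# and jitter, so individual teams don't re-implement the pattern themselves.
import random
import time

def call_with_retries(operation, attempts: int = 3, base_delay: float = 0.2):
    """Retry a transient failure with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted, surface the failure
            # back off 0.2s, 0.4s, 0.8s... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```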
Organizational Models
How do organizations structure SRE and platform engineering teams? There’s no single right answer, but some common patterns:
Separate Teams with Strong Collaboration
SRE teams focus on running production services and defining reliability standards. Platform teams focus on building internal tooling and infrastructure. They collaborate closely:
- SREs participate in platform design reviews
- Platform teams have SLOs and error budgets owned by SRE
- Regular knowledge sharing and joint planning
This works well at larger scale where both functions justify dedicated teams.
Platform Team with SRE Responsibilities
The platform team owns both building the platform and ensuring its reliability. They apply SRE practices to the platform itself and encode SRE principles into platform capabilities.
This works well for mid-sized organizations where a separate SRE team would be too small to be effective.
Embedded SRE Model with Platform Foundation
SREs are embedded with product teams (the Google model), but they rely heavily on a shared platform. The platform provides the foundation; embedded SREs customize and operate on top of it for their specific services.
This works when you need SRE expertise close to specific services but want to avoid duplicating infrastructure work.
Platform Team Enables Self-Service SRE
The platform provides self-service capabilities for teams to implement SRE practices themselves: easy SLO definition, automated error budget tracking, incident response tools, post-mortem templates.
This works when you have high-maturity engineering teams who can handle SRE practices with good tooling support.
The right model depends on organization size, engineering maturity, service complexity, and available talent. What matters more than the specific structure is that SRE principles inform platform design and platform capabilities enable SRE practices.
Practical Examples of the Partnership
Let’s look at concrete examples of how SRE and platform engineering work together:
Deployment Reliability
SRE insight: Most incidents are caused by changes. Deployments are high-risk events that need careful management.
Platform implementation:
- Progressive delivery built into the deployment process
- Automatic canary analysis comparing new version metrics to baseline
- Automatic rollback if error rates spike
- Deployment gates that check for passing tests, security scans, and config validation
Result: Deployments become safer by default. Teams get sophisticated deployment strategies without needing to build them.
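A minimal sketch of the canary-analysis idea: compare the canary's error rate to the stable baseline and decide whether to promote or roll back. The tolerance is illustrative; production systems typically use statistical comparisons across multiple metrics.

```python
# Sketch of automatic canary analysis: compare the canary's error rate against the
# stable baseline and decide whether to continue the rollout or roll back.

def canary_decision(baseline_error_rate: float, canary_error_rate: float,
                    tolerance: float = 0.005) -> str:
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"  # canary is measurably worse, stop the rollout
    return "promote"       # within tolerance, continue progressive delivery

print(canary_decision(baseline_error_rate=0.002, canary_error_rate=0.011))  # rollback
```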
Service Discovery and Load Balancing
SRE insight: Services need to be able to find each other reliably, and traffic should avoid unhealthy instances.
Platform implementation:
- Service mesh or built-in service discovery
- Automatic health checking with configurable probes
- Intelligent load balancing that respects instance health
- Circuit breaking when downstream services are struggling
Result: Services are more resilient to partial failures. Teams don’t need to implement complex retry and circuit breaker logic.
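To illustrate the circuit-breaking behavior the platform can provide, here's a toy breaker that fails fast after repeated failures and retries after a cooldown. The thresholds are illustrative and the state machine is deliberately simplified.

```python
# Toy circuit breaker: after enough consecutive failures to a downstream dependency,
# stop sending traffic for a cooldown period instead of piling on.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # cooldown over, try again
            return True
        return False  # open: fail fast, protect the struggling dependency

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```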
Observability and Alerting
SRE insight: You can’t fix what you can’t see. Alerts should be actionable and signal real problems, not noise.
Platform implementation:
- Automatic instrumentation of all services for key metrics (latency, error rate, traffic)
- Structured logging with automatic correlation IDs for tracing requests
- Default dashboards for every service showing SLI-relevant metrics
- Alert templates based on SLO violations rather than arbitrary thresholds
Result: Every service has baseline observability. Teams can focus on service-specific metrics rather than fighting to get basic visibility.
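Here's a sketch of what an SLO-based alert can look like: page when the error budget is burning fast enough to exhaust the window early. The one-hour burn-rate threshold of 14.4 follows a common multi-window convention, but treat all the numbers as illustrative defaults.

```python
# Sketch of an SLO-based alert: fire when the error budget is burning fast enough
# to exhaust a 30-day budget well before the window ends.

def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    allowed_error_rate = 1 - slo
    return error_rate / allowed_error_rate

def should_page(error_rate_last_hour: float, threshold: float = 14.4) -> bool:
    return burn_rate(error_rate_last_hour) >= threshold

print(should_page(error_rate_last_hour=0.02))  # burn rate ~20x -> page someone
```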
Capacity Management
SRE insight: Services need enough capacity to handle load, but overprovisioning is wasteful.
Platform implementation:
- Autoscaling based on actual load metrics
- Resource quotas to prevent runaway resource consumption
- Automatic right-sizing recommendations based on actual usage
- Load testing tools integrated into the deployment pipeline
Result: Services scale appropriately without manual intervention. Cost is optimized without sacrificing reliability.
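A sketch of an automatic right-sizing recommendation: derive a CPU request from observed usage rather than guesses. The percentile, headroom factor, and sample values are illustrative.

```python
# Sketch of an automatic right-sizing recommendation based on observed usage.

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile, q in [0, 1]."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(q * (len(ordered) - 1)))
    return ordered[index]

def recommend_cpu_request(usage_millicores: list[float], headroom: float = 1.2) -> int:
    """Recommend a request near the 95th percentile of observed usage, plus headroom."""
    return int(percentile(usage_millicores, 0.95) * headroom)

samples = [120, 150, 160, 175, 180, 190, 200, 205, 210, 220]
print(f"Suggested CPU request: {recommend_cpu_request(samples)}m")
```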
Key Takeaways
For Engineering Leaders:
- SRE and platform engineering are complementary investments that reinforce each other
- Platform engineering scales SRE best practices across the organization without requiring every team to have SRE expertise
- Error budgets and SLOs are useful not just for services but for platform capabilities themselves
- The organizational structure matters less than ensuring SRE principles inform platform design
For Engineers:
- SRE skills (understanding failure modes, defining SLOs, analyzing incidents) are valuable even if you’re not on an SRE team
- Platform engineers benefit enormously from understanding SRE principles - you’re building the foundation for reliability
- The best platforms encode SRE learnings so reliability becomes automatic rather than requiring constant effort
- Toil reduction is one of the highest-leverage activities in infrastructure engineering
For Both:
- Reliability should be a default property of your infrastructure, not something each team builds from scratch
- Observability needs to be built into the platform from day one
- Error budgets provide an objective way to balance velocity and stability
- The goal is making the right thing the easy thing
Looking Ahead
SRE and platform engineering together represent a mature approach to building and operating software at scale. SRE provides the principles and practices for reliability; platform engineering provides the mechanisms to make those practices universal.
In future posts, we’ll explore the technical foundations that make platforms effective, including Infrastructure as Code patterns, internal developer portals, and the specific platform capabilities that matter most. We’ll also dive into the organizational and cultural aspects of building platform teams that truly serve their internal customers.
This is the second post in a series exploring platform engineering in depth. The first post covered the fundamentals of what platform engineering is and what platforms are.