A Grain of Salt

SRE and Platform Engineering

Teddy Aryono

In the first post of this series, we established what platform engineering is and how it differs from related disciplines. One of those disciplines - Site Reliability Engineering (SRE) - deserves deeper exploration because of how fundamentally intertwined it is with platform engineering success.

Understanding the relationship between SRE and platform engineering is crucial for both engineering leaders building these capabilities and engineers working in these roles. While distinct disciplines with different primary focuses, they share common goals and reinforce each other in powerful ways.

What Is Site Reliability Engineering?

Site Reliability Engineering originated at Google around 2003, born from a practical problem: the traditional operations model doesn’t scale. As systems grow in complexity and traffic, you can’t rely on manual intervention and tribal knowledge to keep things running. You need engineers who can both understand the systems deeply and write code to solve operational problems systematically.

The core principle: apply software engineering discipline to operations problems.

This isn’t just “ops people who can code” or “devs who got paged too much.” It’s a fundamental reconception of how to approach reliability. SRE teams own the reliability of services, but they do it by building systems, automation, and processes rather than through manual heroics.

The SRE Framework

SRE is built on several key concepts that form a comprehensive approach to reliability:

Service Level Indicators, Objectives, and Agreements (SLIs, SLOs, SLAs)

These form SRE’s reliability measurement framework:

Service Level Indicators (SLIs) are the raw measurements that matter to users - request latency, error rate, system throughput, availability. These are concrete, measurable signals of service health from the user’s perspective.

Service Level Objectives (SLOs) are the targets you set for those measurements. “99.9% of requests complete in under 200ms” or “99.95% availability measured over a rolling 30-day window.” SLOs define what “good enough” reliability looks like.

Service Level Agreements (SLAs) are the external-facing promises you make to customers, often with financial consequences. These are business decisions: they should be informed by your SLOs but set less stringently, so that missing an internal target does not automatically mean breaching a contract.

The relationship is hierarchical: SLIs are measurements, SLOs are internal targets based on those measurements, SLAs are external commitments that need buffer room below your SLOs.
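To make the hierarchy concrete, here is a minimal Python sketch. The `Request` record, the sample traffic, and the 99.5% SLA figure are illustrative assumptions, not values from any real system; the 200ms threshold and 99.9% target come from the SLO example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float


def latency_sli(requests, threshold_ms=200.0):
    """SLI: fraction of requests that finished under the threshold."""
    if not requests:
        raise ValueError("cannot compute an SLI from zero requests")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


SLO_TARGET = 0.999  # internal target, from the SLO example above
SLA_TARGET = 0.995  # external promise (assumed figure), looser than the SLO

# Sample traffic: 998 fast requests and 2 slow ones.
requests = [Request(120.0)] * 998 + [Request(450.0)] * 2
sli = latency_sli(requests)

slo_met = sli >= SLO_TARGET
sla_met = sli >= SLA_TARGET
print(f"SLI={sli:.4f} SLO met={slo_met} SLA met={sla_met}")
```

With this sample the SLI is 99.8%, so the internal SLO is missed while the looser SLA still holds. That gap is the buffer the SLA is supposed to have.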

Error Budgets

This is where SRE gets strategically interesting. If your SLO is 99.9% availability, you have a 0.1% “budget” to spend on unreliability - planned or unplanned downtime, degraded performance, whatever causes SLO violations.

The error budget becomes the data-driven mediator between velocity and stability:

When you’re within budget, you have room to take risks. Deploy that new feature. Try the experimental optimization. Move fast.

When you’ve burned through your error budget, you slow down. Focus on reliability improvements. Pay down technical debt. Strengthen your foundations.

This defuses the traditional standoff where ops says “no more changes, too risky” and dev says “we need to ship faster.” The error budget is an objective measure everyone can see and discuss.

The error budget also aligns incentives. Development teams are incentivized to ship reliable code because burning the budget means slower releases. Operations teams are incentivized to enable deployments because being too conservative wastes error budget that could be spent on features.
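The arithmetic behind an error budget is simple enough to sketch. The function names and the two-outcome release policy below are assumptions made for illustration; real policies are usually more graduated.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent.

    A negative value means the budget is overspent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining):
    """Hypothetical policy: ship while budget remains, otherwise stabilize."""
    return "ship features" if remaining > 0 else "focus on reliability"


# A 99.9% SLO over 30 days tolerates roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# 400 failures out of 1,000,000 requests leaves about 60% of the budget.
print(release_policy(budget_remaining(0.999, 1_000_000, 400)))

# 1,200 failures overspends it.
print(release_policy(budget_remaining(0.999, 1_000_000, 1_200)))
```

The point of writing the policy as a function of a number is that the conversation between dev and ops becomes a question of reading a number, not of whose judgment wins.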

Toil Reduction

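SRE defines toil as operational work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.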

Classic examples: manually restarting services, responding to alerts that require the same action every time, scaling infrastructure up and down based on predictable patterns, rotating credentials manually.

SRE teams measure toil and actively work to eliminate it. Google’s benchmark is keeping toil below 50% of an SRE’s time - the other 50%+ should be spent on engineering work that reduces future toil.

This is crucial because toil grows linearly with service scale. If your service doubles in size and you handle scaling manually, you’ve doubled your toil. Automation scales differently - the investment is upfront, then it handles 10x or 100x scale with minimal additional effort.

Incident Management and Learning

SREs own the response to production incidents:

During incidents - Clear roles (incident commander, communications lead, subject matter experts), structured communication, focus on restoration over root cause analysis.

After incidents - Blameless post-mortems that focus on system and process failures, not individual mistakes. The goal is learning and improvement, not punishment.

Preventing recurrence - Action items from post-mortems that address root causes and systemic issues, not just symptoms.

The incident management process itself becomes a system that can be improved. Over time, organizations get better at responding to incidents, which reduces their impact and duration.

How SRE and Platform Engineering Intersect

Now we get to the heart of it: these disciplines are deeply complementary. Neither replaces the other, but they work together in powerful ways.

Standards Definition vs. Scale Enforcement

SRE defines what good looks like. Based on experience running services and analyzing incidents, SREs establish reliability standards: all services need health checks, circuit breakers for external dependencies, structured logging for debugging, proper monitoring and alerting.

Platform engineering enforces these standards at scale. Instead of every team implementing health checks differently (or forgetting them entirely), the platform makes health checks mandatory and provides the scaffolding to implement them correctly. The platform bakes SRE’s hard-won lessons into the deployment process.

This division of labor is efficient. SREs can focus on high-leverage work - understanding system behavior, defining appropriate SLOs, improving incident response - rather than reviewing every team’s monitoring configuration. The platform ensures consistent implementation across hundreds of services.
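As a rough sketch of what "enforcing standards at scale" can look like, a platform might reject a deployment whose manifest omits the standards SRE has defined. The manifest shape and every field name below are hypothetical; a real platform would more likely use a schema or an admission policy than a hand-written function.

```python
def validate_manifest(manifest):
    """Return a list of violations; an empty list means the service may deploy.

    The manifest layout is an assumed, illustrative schema.
    """
    violations = []

    if not (manifest.get("health_check") or {}).get("path"):
        violations.append("missing health_check.path")

    if (manifest.get("logging") or {}).get("format") != "json":
        violations.append("logging.format must be 'json' (structured logs)")

    if not manifest.get("alerts"):
        violations.append("at least one alert rule is required")

    for dep in manifest.get("external_dependencies") or []:
        if not dep.get("circuit_breaker", False):
            name = dep.get("name", "<unnamed>")
            violations.append(f"dependency {name} lacks a circuit breaker")

    return violations


good = {
    "name": "checkout",
    "health_check": {"path": "/healthz"},
    "logging": {"format": "json"},
    "alerts": ["high_error_rate"],
    "external_dependencies": [{"name": "payments", "circuit_breaker": True}],
}
bad = {"name": "legacy", "external_dependencies": [{"name": "ldap"}]}

print(validate_manifest(good))
print(validate_manifest(bad))
```

The check runs once in the pipeline for every service, which is the leverage the paragraph above describes: SREs write the rule, and nobody reviews it by hand again.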

Error Budgets Inform Platform Investment

The platform itself needs SLOs and error budgets. If your deployment pipeline has an SLO of “95% of deployments succeed without manual intervention,” you can track whether you’re meeting that target.

When a platform capability burns error budget - say, your database provisioning service is unreliable - that’s a clear signal about where to invest. Error budgets make platform prioritization more objective.

This also creates accountability. Platform teams can measure whether they’re providing a reliable foundation. If application teams are burning their error budgets due to platform issues rather than their own code, that’s a problem the platform team owns.
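One way to see who is burning the budget is to tag each incident with an owner and sum the minutes. This is a minimal sketch: the `(owner, minutes)` tuple format and the sample incidents are invented for illustration.

```python
from collections import defaultdict


def budget_burn_by_owner(incidents, budget_minutes):
    """Share of the error budget consumed per owner.

    `incidents` is an iterable of (owner, downtime_minutes) pairs,
    a hypothetical format chosen for this example.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    burned = defaultdict(float)
    for owner, minutes in incidents:
        burned[owner] += minutes
    return {owner: minutes / budget_minutes for owner, minutes in burned.items()}


# Sample month for a service with a 43.2-minute budget (99.9% over 30 days).
incidents = [("platform", 20.0), ("application", 5.0), ("platform", 10.0)]
shares = budget_burn_by_owner(incidents, budget_minutes=43.2)

for owner, share in sorted(shares.items()):
    print(f"{owner}: {share:.0%} of budget")
```

In this made-up month the platform accounts for most of the burn, which is exactly the signal that should push provisioning or pipeline reliability up the platform team's backlog.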

Toil Reduction as Shared Mission

Much of platform engineering is automating toil at scale. SREs identify the toil through operational experience; platform engineers build the systems that eliminate it for everyone.

Examples of this partnership:

Manual scaling - SREs notice they’re constantly adjusting resource allocation based on traffic patterns. Platform engineers build autoscaling into the platform with sensible defaults. Teams get automatic scaling without needing to understand Kubernetes HPA configuration.

Certificate rotation - SREs handle certificate expiration incidents repeatedly. Platform engineers build automated certificate management into the platform. Certificates renew automatically; teams never think about it.

Log aggregation - SREs waste time SSHing into individual instances to debug issues. Platform engineers ensure every deployed service automatically sends logs to centralized aggregation. Debugging becomes searching logs in one place, not hunting across dozens of machines.

The pattern: SRE identifies repetitive operational work → Platform team automates it → Toil is eliminated across the entire organization, not just for one team.
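For the autoscaling example, the core calculation the platform hides from teams is small. The Kubernetes documentation describes the Horizontal Pod Autoscaler's basic rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that rule; the function name and the min/max defaults are assumptions, and the real controller also handles stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA-style replica calculation (illustrative only)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Close enough to target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the bounds the platform sets as sensible defaults.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))

# 4 replicas at 62% against a 60% target: within tolerance, no change.
print(desired_replicas(4, 62, 60))
```

A platform that ships this with reasonable bounds per service tier turns a recurring manual task into a default nobody has to think about.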

Observability as the Critical Bridge

SREs need deep visibility into system behavior to:

Platforms need to provide that observability automatically. When a developer deploys a new service, the platform should:

This is both a platform feature and an SRE requirement. The platform makes observability a default rather than something each team builds from scratch. SREs can then focus on interpreting the data and refining the observability rather than fighting to collect it in the first place.

Reliability by Default

This is the ultimate goal of the SRE-platform partnership: making reliability the path of least resistance.

Without a platform, reliability requires expertise and effort. Developers need to understand monitoring systems, learn alerting best practices, implement circuit breakers, configure proper health checks. Many teams don’t have that expertise or don’t prioritize it until after an incident.

With a platform informed by SRE practices:

Developers who want to opt out or customize can do so, but the default is reliable. This dramatically reduces the cognitive load on developers while raising the baseline reliability across the organization.

Organizational Models

How do organizations structure SRE and platform engineering teams? There’s no single right answer, but some common patterns:

Separate Teams with Strong Collaboration

SRE teams focus on running production services and defining reliability standards. Platform teams focus on building internal tooling and infrastructure. They collaborate closely:

This works well at larger scale where both functions justify dedicated teams.

Platform Team with SRE Responsibilities

The platform team owns both building the platform and ensuring its reliability. They apply SRE practices to the platform itself and encode SRE principles into platform capabilities.

This works well for mid-sized organizations where a separate SRE team would be too small to be effective.

Embedded SRE Model with Platform Foundation

SREs are embedded with product teams (a model often associated with Google), but they rely heavily on a shared platform. The platform provides the foundation; embedded SREs customize and operate on top of it for their specific services.

This works when you need SRE expertise close to specific services but want to avoid duplicating infrastructure work.

Platform Team Enables Self-Service SRE

The platform provides self-service capabilities for teams to implement SRE practices themselves: easy SLO definition, automated error budget tracking, incident response tools, post-mortem templates.

This works when you have high-maturity engineering teams who can handle SRE practices with good tooling support.

The right model depends on organization size, engineering maturity, service complexity, and available talent. What matters more than the specific structure is that SRE principles inform platform design and platform capabilities enable SRE practices.

Practical Examples of the Partnership

Let’s look at concrete examples of how SRE and platform engineering work together:

Deployment Reliability

SRE insight: Most incidents are caused by changes. Deployments are high-risk events that need careful management.

Platform implementation:

Result: Deployments become safer by default. Teams get sophisticated deployment strategies without needing to build them.

Service Discovery and Load Balancing

SRE insight: Services need to be able to find each other reliably, and traffic should avoid unhealthy instances.

Platform implementation:

Result: Services are more resilient to partial failures. Teams don’t need to implement complex retry and circuit breaker logic.
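To give a sense of the logic teams no longer have to write themselves, here is a minimal circuit breaker sketch. It is an illustration of the pattern under assumed defaults (three failures, a 30-second reset), not the implementation of any particular service mesh or library, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise

        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any remote call) means a failing dependency is skipped for a while instead of being hammered on every request.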

Observability and Alerting

SRE insight: You can’t fix what you can’t see. Alerts should be actionable and signal real problems, not noise.

Platform implementation:

Result: Every service has baseline observability. Teams can focus on service-specific metrics rather than fighting to get basic visibility.
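One common way to keep alerts actionable is to page on error budget burn rate rather than on raw error counts. The sketch below follows the multiwindow burn-rate idea described in Google's SRE Workbook, where a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget. The function names and the choice to require both windows are assumptions for this example, and the platform's actual rules may differ.

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / budget


def should_page(long_window_error_rate, short_window_error_rate, slo,
                threshold=14.4):
    """Page only if both a long and a short window are burning fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)


SLO = 0.999

# Sustained 2% errors: burn rate is about 20, so this pages.
print(should_page(0.02, 0.02, SLO))

# The hour looks bad but the last few minutes recovered: no page.
print(should_page(0.02, 0.0005, SLO))
```

Because the rule is expressed in terms of the SLO, the platform can generate it for every service from a single target value, with no per-team alert tuning.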

Capacity Management

SRE insight: Services need enough capacity to handle load, but overprovisioning is wasteful.

Platform implementation:

Result: Services scale appropriately without manual intervention. Cost is optimized without sacrificing reliability.

Key Takeaways

For Engineering Leaders:

For Engineers:

For Both:

Looking Ahead

SRE and platform engineering together represent a mature approach to building and operating software at scale. SRE provides the principles and practices for reliability; platform engineering provides the mechanisms to make those practices universal.

In future posts, we’ll explore the technical foundations that make platforms effective, including Infrastructure as Code patterns, internal developer portals, and the specific platform capabilities that matter most. We’ll also dive into the organizational and cultural aspects of building platform teams that truly serve their internal customers.


This is the second post in a series exploring platform engineering in depth. The first post covered the fundamentals of what platform engineering is and what platforms are.

#platform-engineering
