Platform Engineering Overview
What is Platform Engineering?
Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. A Platform Engineering team builds and maintains the Internal Developer Platform (IDP) — a layer of tools, services, and processes that application teams consume to build, deploy, and operate their software.
The core philosophy is building "paved roads": opinionated, well-maintained paths that developers can follow to get their work done quickly and safely. Paved roads do not eliminate escape hatches — teams can deviate when genuinely necessary — but they reduce cognitive load for the common case.
Internal Developer Platform (IDP)
An IDP typically encompasses:
- Self-service infrastructure provisioning — developers request environments, databases, queues without filing tickets
- Golden path templates — opinionated project scaffolding that encodes security, observability, and compliance by default
- Developer portal — a unified UI (e.g., Backstage) for discovering services, docs, and runbooks
- Integrated CI/CD pipelines — reusable pipeline templates that handle build, test, security scan, and deploy
- Secrets and config management — centralized, auditable secret delivery without manual distribution
- Observability defaults — logs, metrics, and traces wired automatically for every new service
Platform Engineering vs DevOps vs SRE
These roles are complementary, not competing. Understanding the distinction avoids organizational confusion:
| Dimension | DevOps | SRE | Platform Engineering |
|---|---|---|---|
| Primary focus | Culture & collaboration between dev and ops | Reliability, SLOs, incident response | Developer productivity via IDP |
| Customer | The organization as a whole | End users (reliability) | Internal developers |
| Output | Practices and culture | Runbooks, SLOs, on-call | Tools, APIs, self-service workflows |
| Success metric | Deployment frequency, lead time | Error budget, MTTR | Developer NPS, time-to-first-deploy |
In mature organizations, all three exist: SRE defines reliability standards, Platform Engineering implements the tooling that makes those standards easy to meet, and DevOps culture ensures teams actually collaborate around them.
Team Topologies
The Team Topologies framework by Skelton & Pais provides the language for structuring platform organizations. It defines four fundamental team types:
Stream-Aligned Teams
Aligned to a flow of work from a business domain (e.g., "Checkout", "Payments"). They own their service end-to-end. They are the primary consumers of the platform — everything the platform team builds must reduce cognitive load for stream-aligned teams.
Platform Teams
Provide a compelling internal product that stream-aligned teams can use self-service. They absorb accidental complexity (Kubernetes, Vault, observability stack) and expose simple, reliable APIs. They must treat internal developers as customers.
Enabling Teams
Help stream-aligned teams acquire missing capabilities (e.g., a security enabling team that helps teams adopt SAST tooling). Enabling teams work in a time-limited, collaborative mode — they upskill and then step back.
Complicated-Subsystem Teams
Own components requiring deep specialist knowledge (e.g., a video encoding pipeline, a trading risk engine). They expose their subsystem as a service to stream-aligned teams, reducing the cognitive load of maintaining specialized expertise broadly.
Interaction Modes
- Collaboration — two teams work closely for a defined period (high bandwidth, high cost, not sustainable long-term)
- X-as-a-Service — one team consumes another's output with minimal interaction (low bandwidth, scalable)
- Facilitating — an enabling team helps another team learn and grow, then steps back
DORA Metrics and Platform Engineering
The DORA (DevOps Research and Assessment) four key metrics measure software delivery performance. Platform Engineering directly influences all four:
Deployment Frequency
How often does your organization deploy to production? Elite performers deploy multiple times per day. Platform Engineering improves this by providing standardized CI/CD pipelines that remove manual gates and reduce friction. When deploying is easy, teams deploy more often.
Lead Time for Changes
How long does a change take to go from code commit to running in production? Golden path templates with pre-wired pipelines eliminate hours of setup per service. Automated security scanning integrated into the pipeline prevents late-stage rework.
Mean Time to Restore (MTTR)
How quickly can you recover from a failure? Platform-provided observability (centralized logs, distributed traces, dashboards) means engineers spend minutes diagnosing rather than hours instrumenting. Runbook automation and self-healing infrastructure reduce MTTR further.
Change Failure Rate (CFR)
What percentage of deployments cause a failure? Platform-enforced testing gates, progressive delivery (canary/blue-green), and automated rollback reduce both the frequency and the blast radius of bad deployments.
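To make the four definitions concrete, here is a minimal Python sketch that derives them from a list of deployment records. The `Deployment` record shape and the `dora_metrics` function are hypothetical names chosen for illustration; a real platform would pull this data from its CI/CD and incident tooling, and may define the reporting window differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a reporting window of window_days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        # Deployments per day across the window
        "deployment_frequency_per_day": len(deployments) / window_days,
        # Median is used here so one slow change does not dominate the figure
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Share of deployments that caused a failure
        "change_failure_rate": len(failures) / len(deployments),
        # Average time from failed deployment to restoration, or None if no data
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }
```

Measuring restore time from the moment of deployment is a simplifying assumption; teams that record incident start times separately would use those instead.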
The Golden Path
A Golden Path is an opinionated, supported software delivery path that balances speed and correctness. It is not mandatory — teams can diverge — but divergence means accepting higher cognitive load and losing platform support.
Anatomy of a Golden Path
- Project scaffolding — a Backstage Software Template that generates a repository with Dockerfile, CI pipeline, Helm chart, and catalog-info.yaml pre-configured
- Build defaults — pinned base images, mandatory SBOM generation, SAST/SCA scanning in CI
- Deploy defaults — GitOps via ArgoCD, rolling update strategy, resource requests/limits, PodDisruptionBudget
- Observability defaults — Prometheus scrape annotations, structured JSON logging, OpenTelemetry SDK wired
- Security defaults — non-root container, read-only filesystem, NetworkPolicy baseline, Vault sidecar for secrets
- Escape hatches — any default can be overridden with a documented, reviewable reason
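Some of the defaults listed above can be checked mechanically. The following Python sketch assumes a Kubernetes container spec has already been loaded into a plain dict and reports which security and deploy defaults it is missing. The function name `golden_path_violations` is hypothetical; in practice a check like this would more likely live in a policy engine than in a script.

```python
def golden_path_violations(container: dict) -> list[str]:
    """Return the golden path defaults that a container spec does not meet."""
    violations = []

    # Security defaults: non-root container and read-only root filesystem
    security = container.get("securityContext", {})
    if security.get("runAsNonRoot") is not True:
        violations.append("container must run as non-root")
    if security.get("readOnlyRootFilesystem") is not True:
        violations.append("root filesystem must be read-only")

    # Deploy defaults: both resource requests and limits must be present
    resources = container.get("resources", {})
    for key in ("requests", "limits"):
        if not resources.get(key):
            violations.append(f"resources.{key} must be set")

    return violations


compliant = {
    "name": "payment-service",
    "securityContext": {"runAsNonRoot": True, "readOnlyRootFilesystem": True},
    "resources": {
        "requests": {"cpu": "100m", "memory": "128Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"},
    },
}
print(golden_path_violations(compliant))  # an empty list means no violations
```

Returning a list of messages instead of raising on the first problem lets a reviewer see every deviation at once, which fits the idea of documented, reviewable escape hatches.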
Backstage — Developer Portal
Backstage is an open-source developer portal framework from Spotify, now a CNCF incubating project. It provides a unified frontend for the IDP.
Core Architecture
Software Catalog
A central registry of all software assets (services, libraries, websites, APIs, ML models). Each entity is described by a catalog-info.yaml file in its repository. Teams discover ownership, dependencies, and documentation here.
TechDocs
Documentation as code: Backstage renders MkDocs-based documentation from the same repository as the service. This keeps docs co-located with code and reduces the risk of stale wiki pages.
Scaffolder (Software Templates)
The golden path engine. Templates define a sequence of steps (fetch template, run scripts, create repository, register entity) that produce a fully configured new service in minutes.
Plugins
Backstage's extensibility model. Frontend and backend plugins surface data from external systems (ArgoCD, PagerDuty, Vault, GitHub Actions, SonarQube) directly in the portal, so developers switch contexts far less often.
Entity Descriptors — catalog-info.yaml
Every entity in the Backstage catalog is described by a YAML file committed to its source repository. The Backstage catalog continuously reconciles from these files.
Microservice entity:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing and refunds
  annotations:
    github.com/project-slug: acme/payment-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/scrape: "true"
    argocd/app-name: payment-service-prod
  tags:
    - go
    - payments
    - pci-in-scope
  links:
    - url: https://grafana.internal/d/payment-service
      title: Grafana Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payment-service
      title: Runbook
      icon: book
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: payment-platform
  dependsOn:
    - component:postgres-payments
    - component:kafka-cluster
  providesApis:
    - payment-api-v2
Library entity:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: acme-observability-sdk
  description: Shared Go library for OpenTelemetry instrumentation
  annotations:
    github.com/project-slug: acme/observability-sdk
  tags:
    - go
    - library
    - observability
spec:
  type: library
  lifecycle: production
  owner: group:platform-team
Website entity:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: customer-portal
  description: Customer-facing web application
  annotations:
    github.com/project-slug: acme/customer-portal
  tags:
    - react
    - frontend
spec:
  type: website
  lifecycle: production
  owner: group:frontend-team
  system: customer-experience
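Values such as `group:payments-team` and `component:postgres-payments` in these descriptors are entity references of the form `[kind:][namespace/]name`. The Python sketch below shows one way to split such a reference into its parts. The `EntityRef` type and `parse_entity_ref` function are hypothetical helpers written for this overview, not part of Backstage, and the default kind is passed in because it depends on which field the reference appears in.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(
    ref: str,
    default_kind: str = "component",
    default_namespace: str = "default",
) -> EntityRef:
    """Split '[kind:][namespace/]name' into its parts, filling in defaults."""
    kind, sep, rest = ref.partition(":")
    if not sep:  # no "kind:" prefix, so the whole string is the remainder
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:  # no "namespace/" prefix
        namespace, name = default_namespace, rest
    if not kind or not namespace or not name:
        raise ValueError(f"malformed entity reference: {ref!r}")
    return EntityRef(kind, namespace, name)


print(parse_entity_ref("group:payments-team"))
print(parse_entity_ref("payment-api-v2", default_kind="api"))
```

Resolving references this way makes it straightforward to check, for example, that every `owner` in the catalog points at an entity of kind `group`.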
Backstage Software Template
A template.yaml defines the golden path for creating a new Go microservice:
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Creates a production-ready Go microservice with CI/CD, Helm chart, and observability
  tags:
    - go
    - microservice
    - recommended
spec:
  owner: group:platform-team
  type: service
  parameters:
    - title: Service Information
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: Lowercase, hyphen-separated (e.g. payment-service)
        description:
          title: Description
          type: string
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            allowedKinds: [Group]
    - title: Infrastructure
      properties:
        database:
          title: Provision PostgreSQL database?
          type: boolean
          default: false
        queue:
          title: Provision Kafka topic?
          type: boolean
          default: false
  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
        defaultBranch: main
        repoVisibility: private
    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: Open in Catalog
        icon: catalog
        entityRef: ${{ steps['register'].output.entityRef }}
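The parameter rules in this template (three required fields and a name pattern) can be exercised outside Backstage, for example in a pre-flight script or a unit test for the template. The Python sketch below mirrors those rules; `validate_parameters` and `repo_url` are hypothetical helper names, and the `acme` organization is taken from the template above.

```python
import re

# Same pattern as the template's "name" property
SERVICE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")
REQUIRED_FIELDS = ("name", "description", "owner")


def validate_parameters(params: dict) -> list[str]:
    """Return a list of problems with the submitted template parameters."""
    errors = [
        f"missing required parameter: {field}"
        for field in REQUIRED_FIELDS
        if not params.get(field)
    ]
    name = params.get("name", "")
    if name and not SERVICE_NAME_PATTERN.fullmatch(name):
        errors.append(f"invalid service name: {name!r}")
    return errors


def repo_url(params: dict, org: str = "acme") -> str:
    """Build the repoUrl value that the create-repo step receives."""
    return f"github.com?owner={org}&repo={params['name']}"


params = {
    "name": "payment-service",
    "description": "Handles payment processing and refunds",
    "owner": "group:payments-team",
}
print(validate_parameters(params))
print(repo_url(params))
```

Running the same validation before the scaffolder does gives template authors a quick way to test naming rules without creating a repository.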
Platform as Product
The most common failure mode for platform teams is building what they think developers need rather than what developers actually need. Treating the platform as a product means applying product management discipline to internal tooling.
Product Thinking for Platforms
- Identify internal customers — stream-aligned developers are the users. Segment by team size, tech stack, maturity
- Conduct user research — developer interviews, friction logs, shadowing on-call rotations
- Define a product roadmap — prioritize by impact on DORA metrics and reduction in support tickets
- Measure adoption — track what percentage of teams use the golden path vs. rolling their own
- Collect NPS — quarterly developer satisfaction surveys surface pain points before they become exodus risk
- Deprecation as a product decision — old platform versions need sunset plans, migration guides, and communication campaigns
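For the NPS item above, the standard calculation is the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6) on a 0 to 10 "how likely are you to recommend the platform" question. A minimal Python sketch, with a hypothetical function name and made-up survey responses:

```python
def net_promoter_score(scores: list[int]) -> float:
    """Compute NPS (-100 to 100) from 0-10 survey responses."""
    if not scores:
        raise ValueError("no survey responses")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)   # scores 9 and 10
    detractors = sum(1 for s in scores if s <= 6)  # scores 0 through 6
    return 100 * (promoters - detractors) / len(scores)


# Illustrative responses only: 5 promoters, 2 passives, 3 detractors
responses = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
print(net_promoter_score(responses))  # 20.0
```

Passive responses (7 and 8) count toward the total but toward neither group, which is why a platform with many lukewarm users can score close to zero.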
Self-Service Capabilities
Infrastructure Provisioning — Terraform Modules
Platform teams build opinionated Terraform modules that encode best practices. Teams consume them without needing Terraform expertise:
# Teams consume platform modules; they don't write Terraform from scratch
module "service_database" {
  source = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=v2.3.0"

  service_name        = "payment-service"
  environment         = "production"
  instance_class      = "db.t3.medium"
  allocated_storage   = 100
  multi_az            = true
  deletion_protection = true
  backup_retention    = 7

  # Automatic: security group rules, parameter group, monitoring, tagging
}
CI/CD Pipeline Templates
Reusable GitHub Actions workflows that teams reference rather than copy-paste:
# .github/workflows/deploy.yaml in a service repo
name: Deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    uses: acme/platform-workflows/.github/workflows/go-service-deploy.yaml@v1
    with:
      service-name: payment-service
      environment: production
      helm-chart-path: ./charts/payment-service
    secrets: inherit
# Platform workflow handles: build, SAST, container scan, SBOM, push to ECR,
# ArgoCD sync, smoke test, rollback on failure
Observability Setup
Observability is automatic for golden path services. A service annotated with the platform label gets:
- Prometheus scraping configured via ServiceMonitor CRD
- A pre-built Grafana dashboard (request rate, error rate, latency — RED method)
- Log aggregation to the central Loki stack via the node log agent
- Distributed tracing via OpenTelemetry collector sidecar
- PagerDuty alert routing based on the catalog owner field
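Owner-based alert routing can be pictured as a lookup from the alerting service to its catalog owner, and from the owner to an escalation policy. The Python sketch below is only an illustration of that idea: `route_alert`, the alert shape, the policy names, and the `platform-oncall` fallback are all hypothetical, and no PagerDuty API is called.

```python
def route_alert(
    alert: dict,
    catalog: list[dict],
    escalation_policies: dict[str, str],
    fallback: str = "platform-oncall",
) -> str:
    """Pick an escalation policy for an alert from the owning team in the catalog."""
    # Index catalog entities by name, keeping only the owner field
    owners = {e["metadata"]["name"]: e["spec"]["owner"] for e in catalog}
    owner = owners.get(alert["service"])
    # Unknown services or owners without a policy go to the fallback rotation
    return escalation_policies.get(owner, fallback)


catalog = [
    {"metadata": {"name": "payment-service"}, "spec": {"owner": "group:payments-team"}},
    {"metadata": {"name": "customer-portal"}, "spec": {"owner": "group:frontend-team"}},
]
policies = {"group:payments-team": "payments-primary"}

print(route_alert({"service": "payment-service"}, catalog, policies))
print(route_alert({"service": "customer-portal"}, catalog, policies))
```

Routing unknown services to a fallback rotation, instead of dropping the alert, is a design choice in this sketch; it keeps gaps in catalog data visible to the platform team.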
Platform Engineering Metrics
Beyond DORA, platform teams should track metrics specific to platform health and adoption:
| Metric | Description | Target |
|---|---|---|
| Golden path adoption rate | % of services using the standard template | >80% |
| Time-to-first-deploy | Time from "new service created" to first production deploy | <1 day |
| Developer NPS | Net Promoter Score from quarterly survey | >40 |
| Platform ticket volume | Support tickets routed to platform team (lower = more self-service) | Declining trend |
| Cognitive load index | Survey: how many tools/systems must a developer understand to deploy? | Declining trend |
| Infrastructure provisioning time | Time from request to usable resource | <15 minutes |
| Pipeline success rate | % of CI runs that succeed (infra flakiness excluded) | >95% |
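The numeric targets in this table lend themselves to an automated check. The Python sketch below computes the golden path adoption rate from a list of services and compares a set of measurements against the thresholds above. The field name `uses_golden_path`, the measurement keys, and both function names are assumptions made for illustration; the two trend-based rows are left out because they need historical data.

```python
from datetime import timedelta

# Thresholds copied from the table; each entry returns True when the target is met
TARGETS = {
    "golden_path_adoption_rate": lambda v: v > 0.80,
    "time_to_first_deploy": lambda v: v < timedelta(days=1),
    "developer_nps": lambda v: v > 40,
    "provisioning_time": lambda v: v < timedelta(minutes=15),
    "pipeline_success_rate": lambda v: v > 0.95,
}


def adoption_rate(services: list[dict]) -> float:
    """Fraction of services built from the standard template."""
    if not services:
        raise ValueError("no services to measure")
    return sum(1 for s in services if s["uses_golden_path"]) / len(services)


def missed_targets(measurements: dict) -> list[str]:
    """Names of measured metrics that fall short of their target, sorted."""
    return sorted(
        name
        for name, is_met in TARGETS.items()
        if name in measurements and not is_met(measurements[name])
    )


services = [
    {"name": "payment-service", "uses_golden_path": True},
    {"name": "customer-portal", "uses_golden_path": True},
    {"name": "legacy-billing", "uses_golden_path": False},
    {"name": "search-api", "uses_golden_path": True},
]
measurements = {
    "golden_path_adoption_rate": adoption_rate(services),
    "time_to_first_deploy": timedelta(hours=6),
    "developer_nps": 42,
    "provisioning_time": timedelta(minutes=20),
    "pipeline_success_rate": 0.97,
}
print(missed_targets(measurements))
```

With these illustrative numbers the adoption rate (3 of 4 services) and the provisioning time are the two metrics that miss their targets.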
Implementation Roadmap
Building a platform is a multi-year journey. The following phased approach avoids the common pitfall of over-engineering before validating need:
Phase 1: Discover Pain Points (Month 1-2)
- Conduct developer interviews across all stream-aligned teams
- Audit current deployment times, on-call burden, and ticket sources
- Map the current state of infrastructure provisioning (how long does a new env take?)
- Identify the top 3 pain points by frequency and severity
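One simple way to pick the top 3 is to score each pain point as frequency multiplied by severity and sort. The Python sketch below assumes a 1 to 5 severity scale and uses made-up pain points; both the scoring rule and the `PainPoint` record are illustrative choices, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PainPoint:
    description: str
    frequency: int  # how many interviews or tickets mentioned it
    severity: int   # 1 (annoyance) to 5 (blocks delivery), assumed scale


def top_pain_points(points: list[PainPoint], n: int = 3) -> list[PainPoint]:
    """Rank pain points by frequency x severity and keep the top n."""
    return sorted(points, key=lambda p: p.frequency * p.severity, reverse=True)[:n]


# Hypothetical findings from developer interviews
findings = [
    PainPoint("Slow environment provisioning", frequency=12, severity=4),
    PainPoint("Flaky CI pipelines", frequency=20, severity=3),
    PainPoint("Manual secret rotation", frequency=5, severity=5),
    PainPoint("Unclear service ownership", frequency=8, severity=2),
]
for point in top_pain_points(findings):
    print(point.description)
```

A multiplicative score favors issues that are both common and painful; a team that wants rare but blocking problems to surface could weight severity more heavily.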
Phase 2: Foundational Platform (Month 3-6)
- Establish a Kubernetes platform (EKS/GKE/AKS) with GitOps (ArgoCD)
- Build initial Terraform modules for the 3 most-used infrastructure components
- Create a basic CI/CD pipeline template for the dominant language/framework
- Set up centralized logging and metrics (Loki, Prometheus, Grafana)
- Deploy Backstage with catalog populated from existing repositories
Phase 3: Golden Paths (Month 7-12)
- Build Backstage Software Templates for common service archetypes
- Integrate secrets management (Vault) with the platform
- Implement service mesh (Istio) for mTLS and traffic management
- Build the observability defaults into the golden path template
- Establish developer NPS baseline and begin quarterly surveys
Phase 4: Self-Service at Scale (Month 13+)
- Full self-service environment provisioning (no tickets, no wait)
- Policy-as-Code enforcement (OPA/Gatekeeper) across all clusters
- Cost allocation and showback dashboards per team
- Chaos engineering tooling integrated with platform
- Platform roadmap driven by NPS data and DORA trend analysis