Platform Engineering Overview

Platform Engineering builds Internal Developer Platforms (IDPs) that abstract infrastructure complexity and provide self-service capabilities to application developers — enabling them to move faster without needing deep ops expertise.

What is Platform Engineering?

Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. A Platform Engineering team builds and maintains the Internal Developer Platform (IDP) — a layer of tools, services, and processes that application teams consume to build, deploy, and operate their software.

The core philosophy is building "paved roads": opinionated, well-maintained paths that developers can follow to get their work done quickly and safely. Paved roads do not eliminate escape hatches — teams can deviate when genuinely necessary — but they reduce cognitive load for the common case.

Internal Developer Platform (IDP)

An IDP typically encompasses:

Self-service infrastructure provisioning — developers request environments, databases, queues without filing tickets
Golden path templates — opinionated project scaffolding that encodes security, observability, and compliance by default
Developer portal — a unified UI (e.g., Backstage) for discovering services, docs, and runbooks
Integrated CI/CD pipelines — reusable pipeline templates that handle build, test, security scan, and deploy
Secrets and config management — centralized, auditable secret delivery without manual distribution
Observability defaults — logs, metrics, and traces wired automatically for every new service

Key outcome: A developer creating a new microservice goes from zero to production-ready in hours, not weeks — because the platform handles the undifferentiated heavy lifting.

Platform Engineering vs DevOps vs SRE

These roles are complementary, not competing. Understanding the distinction avoids organizational confusion:

Dimension	DevOps	SRE	Platform Engineering
Primary focus	Culture & collaboration between dev and ops	Reliability, SLOs, incident response	Developer productivity via IDP
Customer	The organization as a whole	End users (reliability)	Internal developers
Output	Practices and culture	Runbooks, SLOs, on-call	Tools, APIs, self-service workflows
Success metric	Deployment frequency, lead time	Error budget, MTTR	Developer NPS, time-to-first-deploy

In mature organizations, all three exist: SRE defines reliability standards, Platform Engineering implements the tooling that makes those standards easy to meet, and DevOps culture ensures teams actually collaborate around them.

Team Topologies

The Team Topologies framework by Matthew Skelton and Manuel Pais provides a shared vocabulary for structuring platform organizations. It defines four fundamental team types:

Stream-Aligned Teams

Aligned to a flow of work from a business domain (e.g., "Checkout", "Payments"). They own their service end-to-end. They are the primary consumers of the platform — everything the platform team builds must reduce cognitive load for stream-aligned teams.

Platform Teams

Provide a compelling internal product that stream-aligned teams can use self-service. They absorb accidental complexity (Kubernetes, Vault, observability stack) and expose simple, reliable APIs. They must treat internal developers as customers.

Enabling Teams

Help stream-aligned teams acquire missing capabilities (e.g., a security enabling team that helps teams adopt SAST tooling). Enabling teams work in a time-limited, collaborative mode — they upskill and then step back.

Complicated-Subsystem Teams

Own components requiring deep specialist knowledge (e.g., a video encoding pipeline, a trading risk engine). They expose their subsystem as a service to stream-aligned teams, reducing the cognitive load of maintaining specialized expertise broadly.

Interaction Modes

Collaboration — two teams work closely for a defined period (high bandwidth, high cost, not sustainable long-term)
X-as-a-Service — one team consumes another's output with minimal interaction (low bandwidth, scalable)
Facilitating — an enabling team helps another team learn and grow, then steps back

Anti-pattern to avoid: A platform team that only ever collaborates becomes a bottleneck. The goal is to evolve toward X-as-a-Service relationships where stream-aligned teams consume the platform without needing platform team involvement.

DORA Metrics and Platform Engineering

The four key metrics from DORA (DevOps Research and Assessment) measure software delivery performance. Platform Engineering directly influences all four:

Deployment Frequency

How often does your organization deploy to production? Elite performers deploy multiple times per day. Platform Engineering improves this by providing standardized CI/CD pipelines that remove manual gates and reduce friction. When deploying is easy, teams deploy more often.

Lead Time for Changes

The time from code commit to that code running in production. Golden path templates with pre-wired pipelines eliminate hours of setup per service, and automated security scanning integrated into the pipeline catches problems early enough to prevent late-stage rework.

Mean Time to Restore (MTTR)

How quickly can you recover from a failure? Platform-provided observability (centralized logs, distributed traces, dashboards) means engineers spend minutes diagnosing rather than hours instrumenting. Runbook automation and self-healing infrastructure reduce MTTR further.

Change Failure Rate (CFR)

What percentage of deployments cause a failure in production? Platform-enforced testing gates reduce how often bad deployments ship, while progressive delivery (canary/blue-green) and automated rollback limit the blast radius of the ones that do.
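
The four metrics above can be derived from a simple log of deployments. The following is a minimal sketch, not a DORA-defined implementation: the `Deployment` record, its field names, and the choice of median for lead time and mean for restore time are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four key metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    # Only failures that have already been restored contribute to MTTR.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
    }
```

In practice the records would come from the CI/CD system and the incident tracker. Tracking the trend per team over time is usually more informative than any single snapshot.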

The Golden Path

A Golden Path is an opinionated, supported software delivery path that balances speed and correctness. It is not mandatory — teams can diverge — but divergence means accepting higher cognitive load and losing platform support.

Anatomy of a Golden Path

Project scaffolding — a Backstage Software Template that generates a repository with Dockerfile, CI pipeline, Helm chart, and catalog-info.yaml pre-configured
Build defaults — pinned base images, mandatory SBOM generation, SAST/SCA scanning in CI
Deploy defaults — GitOps via ArgoCD, rolling update strategy, resource requests/limits, PodDisruptionBudget
Observability defaults — Prometheus scrape annotations, structured JSON logging, OpenTelemetry SDK wired
Security defaults — non-root container, read-only filesystem, NetworkPolicy baseline, Vault sidecar for secrets
Escape hatches — any default can be overridden with a documented, reviewable reason

Escape hatch governance: Escape hatches should require a code comment or PR description explaining why the default is inappropriate. This preserves the audit trail without creating a bureaucratic approval process.
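
One lightweight way to enforce this is a CI check that fails when an override carries no justification. The sketch below assumes a hypothetical override format (a mapping from the name of the default to a dict with `value` and `reason` keys). Nothing in the golden path prescribes this shape.

```python
def undocumented_overrides(overrides: dict[str, dict]) -> list[str]:
    """Return the names of overridden defaults that lack a written reason.

    `overrides` maps the name of a golden path default to a dict holding the
    new `value` and a free-text `reason` (hypothetical format).
    """
    return sorted(
        name
        for name, override in overrides.items()
        if not str(override.get("reason", "")).strip()
    )
```

A pipeline step could call this on the service's override file and fail the build if the returned list is non-empty. That keeps the audit trail in version control without adding an approval queue.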

Backstage — Developer Portal

Backstage is an open-source developer portal framework from Spotify, now a CNCF incubating project. It provides a unified frontend for the IDP.

Core Architecture

Software Catalog

A central registry of all software assets (services, libraries, websites, APIs, ML models). Each entity is described by a catalog-info.yaml file in its repository. Teams discover ownership, dependencies, and documentation here.

TechDocs

Documentation as code: Backstage renders MkDocs-based documentation stored in the same repository as the service. This keeps docs next to the code they describe and reduces the drift that produces stale wiki pages.

Scaffolder (Software Templates)

The golden path engine. Templates define a sequence of steps (fetch template, run scripts, create repository, register entity) that produce a fully configured new service in minutes.

Plugins

Backstage's extensibility model. Frontend and backend plugins surface data from external systems (ArgoCD, PagerDuty, Vault, GitHub Actions, SonarQube) directly in the portal, so developers switch contexts far less often.

Entity Descriptors — catalog-info.yaml

Every entity in the Backstage catalog is described by a YAML file committed to its source repository. The catalog re-reads these files on a schedule and reconciles its entries against them, so the repository remains the source of truth.

Microservice entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing and refunds
  annotations:
    github.com/project-slug: acme/payment-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/scrape: "true"
    argocd/app-name: payment-service-prod
  tags:
    - go
    - payments
    - pci-in-scope
  links:
    - url: https://grafana.internal/d/payment-service
      title: Grafana Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payment-service
      title: Runbook
      icon: book
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: payment-platform
  dependsOn:
    - component:postgres-payments
    - component:kafka-cluster
  providesApis:
    - payment-api-v2

Library entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: acme-observability-sdk
  description: Shared Go library for OpenTelemetry instrumentation
  annotations:
    github.com/project-slug: acme/observability-sdk
  tags:
    - go
    - library
    - observability
spec:
  type: library
  lifecycle: production
  owner: group:platform-team

Website entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: customer-portal
  description: Customer-facing web application
  annotations:
    github.com/project-slug: acme/customer-portal
  tags:
    - react
    - frontend
spec:
  type: website
  lifecycle: production
  owner: group:frontend-team
  system: customer-experience
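
Because every descriptor shares the same shape, a platform team can lint them before merge. The following is a minimal sketch of such a check, assuming the YAML has already been parsed into a dict. It is not Backstage's own validation, and the `validate_component` helper is hypothetical. The rules mirror the descriptors above: a Component needs `spec.type`, `spec.lifecycle`, and `spec.owner`, and the entity name must fit the catalog's naming constraints.

```python
import re

# Entity names: 1-63 characters, alphanumeric runs separated by '-', '_' or '.'
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*$")
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: dict) -> list[str]:
    """Return a list of problems found in a parsed catalog-info.yaml Component."""
    problems = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion must be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind must be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (1 <= len(name) <= 63 and NAME_PATTERN.match(name)):
        problems.append(f"invalid metadata.name: {name!r}")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")
    return problems
```

An empty list means the descriptor passed. Returning every problem at once, rather than raising on the first, lets a developer fix the file in one pass.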

Backstage Software Template

A template.yaml defines the golden path for creating a new Go microservice:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Creates a production-ready Go microservice with CI/CD, Helm chart, and observability
  tags:
    - go
    - microservice
    - recommended
spec:
  owner: group:platform-team
  type: service

  parameters:
    - title: Service Information
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: Lowercase, hyphen-separated (e.g. payment-service)
        description:
          title: Description
          type: string
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            allowedKinds: [Group]
    - title: Infrastructure
      properties:
        database:
          title: Provision PostgreSQL database?
          type: boolean
          default: false
        queue:
          title: Provision Kafka topic?
          type: boolean
          default: false

  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
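        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          # Pass the infrastructure choices through so the skeleton can act on them
          database: ${{ parameters.database }}
          queue: ${{ parameters.queue }}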

    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
        defaultBranch: main
        repoVisibility: private

    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: Open in Catalog
        icon: catalog
        entityRef: ${{ steps['register'].output.entityRef }}
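
To make the flow from form input to step input concrete, the sketch below mimics in plain Python what happens to the `name` parameter: it is checked against the same pattern the template declares, then substituted into the `repoUrl` used by the publish step. This is an illustration only. The real validation and templating are performed by the Backstage scaffolder, and `render_repo_url` is a hypothetical helper.

```python
import re

# Same pattern as the template's `name` parameter
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")
REQUIRED = ("name", "description", "owner")


def render_repo_url(parameters: dict, org: str = "acme") -> str:
    """Validate the form input and build the repoUrl for the publish step."""
    missing = [key for key in REQUIRED if not parameters.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    if not NAME_PATTERN.match(parameters["name"]):
        raise ValueError(f"invalid service name: {parameters['name']!r}")
    return f"github.com?owner={org}&repo={parameters['name']}"
```

Rejecting a bad name at the form stage matters because the same value becomes the repository name and, through the skeleton, the catalog entity name.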

Platform as Product

The most common failure mode for platform teams is building what they think developers need rather than what developers actually need. Treating the platform as a product means applying product management discipline to internal tooling.

Product Thinking for Platforms

Identify internal customers — stream-aligned developers are the users. Segment by team size, tech stack, maturity
Conduct user research — developer interviews, friction logs, shadowing on-call rotations
Define a product roadmap — prioritize by impact on DORA metrics and reduction in support tickets
Measure adoption — track what percentage of teams use the golden path vs. rolling their own
Collect NPS — quarterly developer satisfaction surveys surface pain points before they become exodus risk
Deprecation as a product decision — old platform versions need sunset plans, migration guides, and communication campaigns
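
For the NPS item above, the standard calculation treats scores of 9 to 10 as promoters and 0 to 6 as detractors, and reports the difference between their percentages. A minimal sketch, assuming the survey yields one integer score from 0 to 10 per respondent:

```python
def net_promoter_score(scores: list[int]) -> float:
    """Compute NPS: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

The result ranges from -100 to 100. With small internal populations a handful of responses can swing the score, so reading the free-text comments alongside the number tends to be more useful than the number alone.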

The "thinnest viable platform" principle: Start with the smallest platform that removes the most pain. A well-documented shared Terraform module may deliver more value than a complex self-service portal that takes a year to build.

Self-Service Capabilities

Infrastructure Provisioning — Terraform Modules

Platform teams build opinionated Terraform modules that encode best practices. Teams consume them by setting a handful of inputs, without needing deep Terraform expertise:

# teams consume platform modules — they don't write Terraform from scratch
module "service_database" {
  source  = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=v2.3.0"

  service_name        = "payment-service"
  environment         = "production"
  instance_class      = "db.t3.medium"
  allocated_storage   = 100
  multi_az            = true
  deletion_protection = true
  backup_retention    = 7

  # Automatic: security group rules, parameter group, monitoring, tagging
}
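
The `?ref=v2.3.0` suffix pins the module to a released version, which keeps environments reproducible. A platform team might lint for that in CI. The sketch below is one possible check. It assumes release tags follow the `vMAJOR.MINOR.PATCH` convention seen in the example, and `is_pinned` is a hypothetical helper rather than part of Terraform.

```python
import re

# Matches a trailing "?ref=vMAJOR.MINOR.PATCH" on a git module source
PINNED_REF = re.compile(r"\?ref=v\d+\.\d+\.\d+$")


def is_pinned(source: str) -> bool:
    """Return True if a git module source is pinned to a version tag."""
    return source.startswith("git::") and PINNED_REF.search(source) is not None
```

A source that points at a branch such as `?ref=main`, or omits `ref` entirely, would be flagged because its contents can change between runs.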

CI/CD Pipeline Templates

Reusable GitHub Actions workflows that teams reference rather than copy-paste:

# .github/workflows/deploy.yaml in a service repo
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    uses: acme/platform-workflows/.github/workflows/go-service-deploy.yaml@v1
    with:
      service-name: payment-service
      environment: production
      helm-chart-path: ./charts/payment-service
    secrets: inherit
    # Platform workflow handles: build, SAST, container scan, SBOM, push to ECR,
    # ArgoCD sync, smoke test, rollback on failure
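
The control flow hidden behind that one `uses:` line can be pictured as an ordered list of stages with a rollback hook. The following sketch is an assumption about how such a workflow might be organized, not the contents of the actual platform workflow. It shows the key distinction: a failure before deployment simply stops the run, while a failure during or after deployment triggers a rollback.

```python
from typing import Callable

Stage = tuple[str, Callable[[], None]]  # (stage name, function that raises on failure)


def run_pipeline(
    pre_deploy: list[Stage],
    deploy_stages: list[Stage],
    rollback: Callable[[], None],
) -> list[str]:
    """Run stages in order and return a log of what happened."""
    log = []
    for name, stage in pre_deploy:
        try:
            stage()
        except Exception:
            # Nothing has reached production yet, so no rollback is needed.
            log.append(f"{name}: failed, nothing deployed")
            return log
        log.append(f"{name}: ok")
    for name, stage in deploy_stages:
        try:
            stage()
        except Exception:
            log.append(f"{name}: failed")
            rollback()
            log.append("rollback: ok")
            return log
        log.append(f"{name}: ok")
    return log
```

Here the build, SAST, container scan, and SBOM steps would be pre-deploy stages, and the ArgoCD sync and smoke test would be deploy stages.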

Observability Setup

Observability is automatic for golden path services. A service that carries the platform label gets:

Prometheus scraping configured via ServiceMonitor CRD
A pre-built Grafana dashboard (request rate, error rate, latency — RED method)
Log aggregation to the central Loki stack via the node log agent
Distributed tracing via OpenTelemetry collector sidecar
PagerDuty alert routing based on the catalog owner field
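
Routing on the catalog owner field amounts to a lookup from the alerting service to its `spec.owner`. The sketch below assumes the catalog is available as a dict keyed by service name and that unowned services fall back to the platform team. Both are illustrative choices, and `route_alert` is a hypothetical helper.

```python
def route_alert(alert: dict, catalog: dict[str, dict], fallback: str = "platform-team") -> str:
    """Return the team that should be paged for an alert."""
    entity = catalog.get(alert["service"])
    if entity is None:
        return fallback
    owner = entity.get("spec", {}).get("owner")
    if not owner:
        return fallback
    # Strip the kind and namespace: "group:default/payments-team" -> "payments-team"
    return owner.split(":", 1)[-1].split("/")[-1]
```

Because the owner comes from the same descriptor developers already maintain, on-call routing stays correct when a service changes hands, with no separate mapping to update.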

Platform Engineering Metrics

Beyond DORA, platform teams should track metrics specific to platform health and adoption:

Metric	Description	Target
Golden path adoption rate	% of services using the standard template	>80%
Time-to-first-deploy	Time from "new service created" to first production deploy	<1 day
Developer NPS	Net Promoter Score from quarterly survey	>40
Platform ticket volume	Support tickets routed to platform team (lower = more self-service)	Declining trend
Cognitive load index	Survey: how many tools/systems must a developer understand to deploy?	Declining trend
Infrastructure provisioning time	Time from request to usable resource	<15 minutes
Pipeline success rate	% of CI runs that succeed (infra flakiness excluded)	>95%
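
Two of these metrics, golden path adoption rate and time-to-first-deploy, can be computed from catalog data alone. The sketch below assumes each service is a dict with hypothetical `golden_path`, `created_at`, and `first_deploy_at` fields, and uses the median for time-to-first-deploy so that one slow outlier does not dominate.

```python
from datetime import datetime, timedelta
from statistics import median


def adoption_rate(services: list[dict]) -> float:
    """Percentage of services created from the golden path template."""
    if not services:
        raise ValueError("no services in catalog")
    on_path = sum(1 for s in services if s["golden_path"])
    return 100 * on_path / len(services)


def median_time_to_first_deploy(services: list[dict]) -> timedelta:
    """Median time from service creation to first production deploy.

    Services that have not deployed yet are left out of the calculation.
    """
    durations = [
        s["first_deploy_at"] - s["created_at"]
        for s in services
        if s.get("first_deploy_at")
    ]
    if not durations:
        raise ValueError("no service has deployed yet")
    return median(durations)
```

Comparing the results with the targets in the table (above 80% adoption, under one day to first deploy) gives a quick health check for the platform.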

Implementation Roadmap

Building a platform is a multi-year journey. The following phased approach avoids the common pitfall of over-engineering before validating need:

Phase 1: Discover Pain Points (Months 1-2)

Conduct developer interviews across all stream-aligned teams
Audit current deployment times, on-call burden, and ticket sources
Map the current state of infrastructure provisioning (how long does a new env take?)
Identify the top 3 pain points by frequency and severity
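
Ranking by frequency and severity can be as simple as summing the severity scores from every interview in which a pain point came up. The sketch below assumes interview notes have been reduced to hypothetical `(pain_point, severity)` pairs with severity on a 1 to 5 scale. Other weightings are equally reasonable.

```python
from collections import defaultdict


def top_pain_points(reports: list[tuple[str, int]], limit: int = 3) -> list[str]:
    """Rank pain points by total reported severity (frequency x severity)."""
    totals: dict[str, int] = defaultdict(int)
    for pain_point, severity in reports:
        totals[pain_point] += severity
    # Highest total first; ties are broken alphabetically for a stable order.
    ranked = sorted(totals, key=lambda p: (-totals[p], p))
    return ranked[:limit]
```

A pain point mentioned often at moderate severity can outrank one mentioned once at high severity, which matches the goal of removing the most pain for the most teams.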

Phase 2: Foundational Platform (Months 3-6)

Establish a Kubernetes platform (EKS/GKE/AKS) with GitOps (ArgoCD)
Build initial Terraform modules for the 3 most-used infrastructure components
Create a basic CI/CD pipeline template for the dominant language/framework
Set up centralized logging and metrics (Loki, Prometheus, Grafana)
Deploy Backstage with catalog populated from existing repositories

Phase 3: Golden Paths (Months 7-12)

Build Backstage Software Templates for common service archetypes
Integrate secrets management (Vault) with the platform
Implement service mesh (Istio) for mTLS and traffic management
Build the observability defaults into the golden path template
Establish developer NPS baseline and begin quarterly surveys

Phase 4: Self-Service at Scale (Month 13+)

Full self-service environment provisioning (no tickets, no wait)
Policy-as-Code enforcement (OPA/Gatekeeper) across all clusters
Cost allocation and showback dashboards per team
Chaos engineering tooling integrated with platform
Platform roadmap driven by NPS data and DORA trend analysis

Key principle: Ship something useful in the first 30 days. A single Terraform module that saves every team 4 hours of setup builds more trust than a 6-month platform project that developers haven't seen yet. Incremental delivery is not just efficient — it is how you earn the organizational mandate to keep building.