Platform Engineering Overview

Platform Engineering builds Internal Developer Platforms (IDPs) that abstract infrastructure complexity and provide self-service capabilities to application developers — enabling them to move faster without needing deep ops expertise.

What is Platform Engineering?

Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. A Platform Engineering team builds and maintains the Internal Developer Platform (IDP) — a layer of tools, services, and processes that application teams consume to build, deploy, and operate their software.

The core philosophy is building "paved roads": opinionated, well-maintained paths that developers can follow to get their work done quickly and safely. Paved roads do not eliminate escape hatches — teams can deviate when genuinely necessary — but they reduce cognitive load for the common case.

Internal Developer Platform (IDP)

An IDP typically encompasses:

  • Self-service infrastructure provisioning — developers request environments, databases, queues without filing tickets
  • Golden path templates — opinionated project scaffolding that encodes security, observability, and compliance by default
  • Developer portal — a unified UI (e.g., Backstage) for discovering services, docs, and runbooks
  • Integrated CI/CD pipelines — reusable pipeline templates that handle build, test, security scan, and deploy
  • Secrets and config management — centralized, auditable secret delivery without manual distribution
  • Observability defaults — logs, metrics, and traces wired automatically for every new service

Key outcome: A developer creating a new microservice goes from zero to production-ready in hours, not weeks, because the platform handles the undifferentiated heavy lifting.

Platform Engineering vs DevOps vs SRE

These roles are complementary, not competing. Understanding the distinction avoids organizational confusion:

| Dimension | DevOps | SRE | Platform Engineering |
|---|---|---|---|
| Primary focus | Culture & collaboration between dev and ops | Reliability, SLOs, incident response | Developer productivity via IDP |
| Customer | The organization as a whole | End users (reliability) | Internal developers |
| Output | Practices and culture | Runbooks, SLOs, on-call | Tools, APIs, self-service workflows |
| Success metric | Deployment frequency, lead time | Error budget, MTTR | Developer NPS, time-to-first-deploy |

In mature organizations, all three exist: SRE defines reliability standards, Platform Engineering implements the tooling that makes those standards easy to meet, and DevOps culture ensures teams actually collaborate around them.

Team Topologies

The Team Topologies framework by Skelton & Pais provides the language for structuring platform organizations. It defines four fundamental team types:

Stream-Aligned Teams

Aligned to a flow of work from a business domain (e.g., "Checkout", "Payments"). They own their service end-to-end. They are the primary consumers of the platform — everything the platform team builds must reduce cognitive load for stream-aligned teams.

Platform Teams

Provide a compelling internal product that stream-aligned teams can consume through self-service. They absorb accidental complexity (Kubernetes, Vault, observability stack) and expose simple, reliable APIs. They must treat internal developers as customers.

Enabling Teams

Help stream-aligned teams acquire missing capabilities (e.g., a security enabling team that helps teams adopt SAST tooling). Enabling teams work in a time-limited, collaborative mode — they upskill and then step back.

Complicated-Subsystem Teams

Own components requiring deep specialist knowledge (e.g., a video encoding pipeline, a trading risk engine). They expose their subsystem as a service to stream-aligned teams, reducing the cognitive load of maintaining specialized expertise broadly.

Interaction Modes

  • Collaboration — two teams work closely for a defined period (high bandwidth, high cost, not sustainable long-term)
  • X-as-a-Service — one team consumes another's output with minimal interaction (low bandwidth, scalable)
  • Facilitating — an enabling team helps another team learn and grow, then steps back

Anti-pattern to avoid: A platform team that only ever collaborates becomes a bottleneck. The goal is to evolve toward X-as-a-Service relationships where stream-aligned teams consume the platform without needing platform team involvement.

DORA Metrics and Platform Engineering

The DORA (DevOps Research and Assessment) four key metrics measure software delivery performance. Platform Engineering directly influences all four:

Deployment Frequency

How often does your organization deploy to production? Elite performers deploy multiple times per day. Platform Engineering improves this by providing standardized CI/CD pipelines that remove manual gates and reduce friction. When deploying is easy, teams deploy more often.

Lead Time for Changes

The time from code commit to running in production. Golden path templates with pre-wired pipelines eliminate hours of setup per service, and automated security scanning integrated into the pipeline catches issues early enough to prevent late-stage rework.

Mean Time to Restore (MTTR)

How quickly can you recover from a failure? Platform-provided observability (centralized logs, distributed traces, dashboards) means engineers spend minutes diagnosing rather than hours instrumenting. Runbook automation and self-healing infrastructure reduce MTTR further.

Change Failure Rate (CFR)

What percentage of deployments cause a failure? Platform-enforced testing gates, progressive delivery (canary/blue-green), and automated rollback reduce both the frequency and the blast radius of bad deployments.
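
As a rough illustration, the four metrics described above can be derived from a simple deployment log. The sketch below is not a DORA reference implementation: the Deployment record shape, its field names, and the choice of a median for lead time are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    failed: bool = False                       # did it cause a production failure?
    restored_time: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four key metrics over a reporting window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deploy_time - d.commit_time for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_time - d.deploy_time
                for d in failures if d.restored_time is not None]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (sum(restores, timedelta()) / len(restores)
                                 if restores else None),
    }


# Example: four deployments over a two-day window, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
log = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_time=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
m = dora_metrics(log, window_days=2)
```

In practice the records would come from the CI/CD system and the incident tracker. Keeping the calculation in one place lets the platform team report the same numbers for every service.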

The Golden Path

A Golden Path is an opinionated, supported software delivery path that balances speed and correctness. It is not mandatory — teams can diverge — but divergence means accepting higher cognitive load and losing platform support.

Anatomy of a Golden Path

  • Project scaffolding — a Backstage Software Template that generates a repository with Dockerfile, CI pipeline, Helm chart, and catalog-info.yaml pre-configured
  • Build defaults — pinned base images, mandatory SBOM generation, SAST/SCA scanning in CI
  • Deploy defaults — GitOps via ArgoCD, rolling update strategy, resource requests/limits, PodDisruptionBudget
  • Observability defaults — Prometheus scrape annotations, structured JSON logging, OpenTelemetry SDK wired
  • Security defaults — non-root container, read-only filesystem, NetworkPolicy baseline, Vault sidecar for secrets
  • Escape hatches — any default can be overridden with a documented, reviewable reason

Escape hatch governance: Escape hatches should require a code comment or PR description explaining why the default is inappropriate. This preserves the audit trail without creating a bureaucratic approval process.
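
One way to picture the "override with a documented reason" rule is a small merge function that refuses undocumented deviations. This is only a sketch: the setting names in PLATFORM_DEFAULTS and the Override type are hypothetical, loosely modelled on the deploy and security defaults listed above.

```python
from dataclasses import dataclass

# Hypothetical platform defaults for illustration only.
PLATFORM_DEFAULTS = {
    "run_as_non_root": True,
    "read_only_root_filesystem": True,
    "update_strategy": "rolling",
}


@dataclass(frozen=True)
class Override:
    key: str
    value: object
    reason: str  # free-text justification, reviewed in the PR


def apply_overrides(defaults: dict, overrides: list[Override]) -> tuple[dict, list[str]]:
    """Return the effective config plus an audit trail of justified deviations."""
    effective = dict(defaults)  # copy, so the shared defaults are never mutated
    audit = []
    for o in overrides:
        if o.key not in defaults:
            raise KeyError(f"unknown setting: {o.key}")
        if not o.reason.strip():
            raise ValueError(f"override of {o.key!r} needs a documented reason")
        effective[o.key] = o.value
        audit.append(f"{o.key}: {defaults[o.key]!r} -> {o.value!r} ({o.reason})")
    return effective, audit


# Example: one team opts out of the read-only filesystem default.
config, audit = apply_overrides(
    PLATFORM_DEFAULTS,
    [Override("read_only_root_filesystem", False, "legacy library writes to /tmp")],
)
```

The check only requires that a reason exists. Judging whether the reason is good enough stays with the reviewer, which keeps the process lightweight.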

Backstage — Developer Portal

Backstage is an open-source developer portal framework from Spotify, now a CNCF incubating project. It provides a unified frontend for the IDP.

Core Architecture

Software Catalog

A central registry of all software assets (services, libraries, websites, APIs, ML models). Each entity is described by a catalog-info.yaml file in its repository. Teams discover ownership, dependencies, and documentation here.

TechDocs

Documentation-as-Code: Backstage renders MkDocs-based documentation from the same repository as the service. This keeps docs co-located with code and reduces the risk of stale wiki pages.

Scaffolder (Software Templates)

The golden path engine. Templates define a sequence of steps (fetch template, run scripts, create repository, register entity) that produce a fully configured new service in minutes.

Plugins

Backstage's extensibility model. Frontend and backend plugins surface data from external systems (ArgoCD, PagerDuty, Vault, GitHub Actions, SonarQube) directly in the portal, so developers rarely need to switch contexts.

Entity Descriptors — catalog-info.yaml

Every entity in the Backstage catalog is described by a YAML file committed to its source repository. The Backstage catalog continuously reconciles from these files.

Microservice entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing and refunds
  annotations:
    github.com/project-slug: acme/payment-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/scrape: "true"
    argocd/app-name: payment-service-prod
  tags:
    - go
    - payments
    - pci-in-scope
  links:
    - url: https://grafana.internal/d/payment-service
      title: Grafana Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payment-service
      title: Runbook
      icon: book
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: payment-platform
  dependsOn:
    - component:postgres-payments
    - component:kafka-cluster
  providesApis:
    - payment-api-v2

Library entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: acme-observability-sdk
  description: Shared Go library for OpenTelemetry instrumentation
  annotations:
    github.com/project-slug: acme/observability-sdk
  tags:
    - go
    - library
    - observability
spec:
  type: library
  lifecycle: production
  owner: group:platform-team

Website entity:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: customer-portal
  description: Customer-facing web application
  annotations:
    github.com/project-slug: acme/customer-portal
  tags:
    - react
    - frontend
spec:
  type: website
  lifecycle: production
  owner: group:frontend-team
  system: customer-experience
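
A lightweight pre-merge check can catch incomplete descriptors before the catalog tries to ingest them. The sketch below assumes the YAML has already been parsed into a dict by whatever parser the pipeline uses. The function name is hypothetical, and the rules are a simplified reading of the Component descriptor format (required spec fields type, lifecycle, and owner; names of at most 63 characters), not a replacement for Backstage's own validation.

```python
import re

# Spec fields expected for kind: Component.
REQUIRED_COMPONENT_SPEC = ("type", "lifecycle", "owner")
# Simplified name check: alphanumeric runs separated by a single -, _ or .
NAME_RE = re.compile(r"^[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*$")


def validate_component(entity: dict) -> list[str]:
    """Return a list of problems; an empty list means the descriptor looks sane."""
    problems = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion should be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind should be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (NAME_RE.match(name) and len(name) <= 63):
        problems.append(f"invalid metadata.name: {name!r}")
    spec = entity.get("spec", {})
    for field in REQUIRED_COMPONENT_SPEC:
        if not spec.get(field):
            problems.append(f"missing spec.{field}")
    return problems


# The library entity from above, as a parsed dict.
library = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "acme-observability-sdk"},
    "spec": {"type": "library", "lifecycle": "production", "owner": "group:platform-team"},
}
problems = validate_component(library)  # empty list: nothing to report
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a single run.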

Backstage Software Template

A template.yaml defines the golden path for creating a new Go microservice:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Creates a production-ready Go microservice with CI/CD, Helm chart, and observability
  tags:
    - go
    - microservice
    - recommended
spec:
  owner: group:platform-team
  type: service

  parameters:
    - title: Service Information
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: Lowercase, hyphen-separated (e.g. payment-service)
        description:
          title: Description
          type: string
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            allowedKinds: [Group]
    - title: Infrastructure
      properties:
        database:
          title: Provision PostgreSQL database?
          type: boolean
          default: false
        queue:
          title: Provision Kafka topic?
          type: boolean
          default: false

  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
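        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          # Pass the infrastructure choices through so the skeleton can act on them
          database: ${{ parameters.database }}
          queue: ${{ parameters.queue }}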

    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
        defaultBranch: main
        repoVisibility: private

    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: Open in Catalog
        icon: catalog
        entityRef: ${{ steps['register'].output.entityRef }}
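
To see what the name pattern and the repoUrl format in the template accept, the same rules can be exercised outside Backstage. This is a standalone sketch: the helper names build_repo_url and parse_repo_url are hypothetical and are not part of the scaffolder. Only the regular expression and the host?owner=...&repo=... shape are taken from the template above.

```python
import re
from urllib.parse import parse_qs

# Same pattern as the template's parameters.name field.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")


def build_repo_url(name: str, org: str = "acme") -> str:
    """Validate the service name, then build the repoUrl string used by the publish step."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid service name: {name!r}")
    return f"github.com?owner={org}&repo={name}"


def parse_repo_url(repo_url: str) -> dict:
    """Split 'host?owner=...&repo=...' back into its parts."""
    host, _, query = repo_url.partition("?")
    params = parse_qs(query)
    return {"host": host, "owner": params["owner"][0], "repo": params["repo"][0]}


url = build_repo_url("payment-service")
parts = parse_repo_url(url)
```

Names such as "Payment_Service" or "9lives" are rejected. This matches the template's intent of lowercase, hyphen-separated names that start with a letter.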

Platform as Product

The most common failure mode for platform teams is building what they think developers need rather than what developers actually need. Treating the platform as a product means applying product management discipline to internal tooling.

Product Thinking for Platforms

  • Identify internal customers — stream-aligned developers are the users. Segment by team size, tech stack, maturity
  • Conduct user research — developer interviews, friction logs, shadowing on-call rotations
  • Define a product roadmap — prioritize by impact on DORA metrics and reduction in support tickets
  • Measure adoption — track what percentage of teams use the golden path vs. rolling their own
  • Collect NPS — quarterly developer satisfaction surveys surface pain points before they become exodus risk
  • Deprecation as a product decision — old platform versions need sunset plans, migration guides, and communication campaigns

The "thinnest viable platform" principle: Start with the smallest platform that removes the most pain. A well-documented shared Terraform module may deliver more value than a complex self-service portal that takes a year to build.

Self-Service Capabilities

Infrastructure Provisioning — Terraform Modules

Platform teams build opinionated Terraform modules that encode best practices. Teams consume them without needing deep Terraform expertise:

# teams consume platform modules — they don't write Terraform from scratch
module "service_database" {
  source  = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=v2.3.0"

  service_name        = "payment-service"
  environment         = "production"
  instance_class      = "db.t3.medium"
  allocated_storage   = 100
  multi_az            = true
  deletion_protection = true
  backup_retention    = 7

  # Automatic: security group rules, parameter group, monitoring, tagging
}

CI/CD Pipeline Templates

Reusable GitHub Actions workflows that teams reference rather than copy-paste:

# .github/workflows/deploy.yaml in a service repo
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    uses: acme/platform-workflows/.github/workflows/go-service-deploy.yaml@v1
    with:
      service-name: payment-service
      environment: production
      helm-chart-path: ./charts/payment-service
    secrets: inherit
    # Platform workflow handles: build, SAST, container scan, SBOM, push to ECR,
    # ArgoCD sync, smoke test, rollback on failure

Observability Setup

Observability is automatic for golden path services. A service that carries the platform's standard label gets:

  • Prometheus scraping configured via ServiceMonitor CRD
  • A pre-built Grafana dashboard (request rate, error rate, latency — RED method)
  • Log aggregation to the central Loki stack via the node log agent
  • Distributed tracing via OpenTelemetry collector sidecar
  • PagerDuty alert routing based on the catalog owner field
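
For readers unfamiliar with the RED method behind the pre-built dashboard, the three signals can be computed from raw request records as shown below. This is an illustrative sketch only. A real setup would derive these from Prometheus metrics, and the Request type, the 5xx error rule, and the nearest-rank percentile are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict:
    """Rate, Errors, Duration over one window (5xx responses count as errors)."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank 95th percentile, using integer arithmetic for ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# Example: 20 requests in a 10-second window, one of them a server error.
sample = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 21)]
sample[4].status = 503
summary = red_summary(sample, window_seconds=10)
```

The dashboard then only needs these three numbers per service to show whether it is healthy at a glance.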

Platform Engineering Metrics

Beyond DORA, platform teams should track metrics specific to platform health and adoption:

| Metric | Description | Target |
|---|---|---|
| Golden path adoption rate | % of services using the standard template | >80% |
| Time-to-first-deploy | Time from "new service created" to first production deploy | <1 day |
| Developer NPS | Net Promoter Score from quarterly survey | >40 |
| Platform ticket volume | Support tickets routed to platform team (lower = more self-service) | Declining trend |
| Cognitive load index | Survey: how many tools/systems must a developer understand to deploy? | Declining trend |
| Infrastructure provisioning time | Time from request to usable resource | <15 minutes |
| Pipeline success rate | % of CI runs that succeed (infra flakiness excluded) | >95% |
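
Two of the metrics above reduce to simple arithmetic, sketched here under stated assumptions: NPS uses the conventional 0-10 scale, with 9-10 as promoters and 0-6 as detractors. The function names and sample numbers are hypothetical.

```python
def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def adoption_rate(services_on_golden_path: int, total_services: int) -> float:
    """Share of services built from the standard template, as a percentage."""
    return 100.0 * services_on_golden_path / total_services


def meets_targets(nps_score: float, adoption_pct: float) -> dict:
    """Compare against the targets listed above (NPS > 40, adoption > 80%)."""
    return {
        "developer_nps": nps_score > 40,
        "golden_path_adoption": adoption_pct > 80,
    }


# Example quarter: ten survey responses, 42 of 50 services on the golden path.
survey = [10, 9, 9, 8, 7, 6, 10, 9, 3, 10]
score = nps(survey)               # 6 promoters, 2 detractors out of 10 responses
adoption = adoption_rate(42, 50)
status = meets_targets(score, adoption)
```

In this example the NPS lands exactly on 40, which does not satisfy a ">40" target. It helps to agree up front whether such thresholds are strict.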

Implementation Roadmap

Building a platform is a multi-year journey. The following phased approach avoids the common pitfall of over-engineering before validating need:

Phase 1: Discover Pain Points (Month 1-2)

  • Conduct developer interviews across all stream-aligned teams
  • Audit current deployment times, on-call burden, and ticket sources
  • Map the current state of infrastructure provisioning (how long does a new env take?)
  • Identify the top 3 pain points by frequency and severity

Phase 2: Foundational Platform (Month 3-6)

  • Establish a Kubernetes platform (EKS/GKE/AKS) with GitOps (ArgoCD)
  • Build initial Terraform modules for the 3 most-used infrastructure components
  • Create a basic CI/CD pipeline template for the dominant language/framework
  • Set up centralized logging and metrics (Loki, Prometheus, Grafana)
  • Deploy Backstage with catalog populated from existing repositories

Phase 3: Golden Paths (Month 7-12)

  • Build Backstage Software Templates for common service archetypes
  • Integrate secrets management (Vault) with the platform
  • Implement service mesh (Istio) for mTLS and traffic management
  • Build the observability defaults into the golden path template
  • Establish developer NPS baseline and begin quarterly surveys

Phase 4: Self-Service at Scale (Month 13+)

  • Full self-service environment provisioning (no tickets, no wait)
  • Policy-as-Code enforcement (OPA/Gatekeeper) across all clusters
  • Cost allocation and showback dashboards per team
  • Chaos engineering tooling integrated with platform
  • Platform roadmap driven by NPS data and DORA trend analysis

Key principle: Ship something useful in the first 30 days. A single Terraform module that saves every team 4 hours of setup builds more trust than a 6-month platform project that developers haven't seen yet. Incremental delivery is not just efficient; it is how you earn the organizational mandate to keep building.