Service Mesh — Istio
Architecture
Istio follows a split control-plane / data-plane architecture. Understanding this split is essential for debugging and capacity planning.
Control Plane — istiod
Starting with Istio 1.5, the three separate control-plane components (Pilot, Citadel, Galley) were merged into a single binary called istiod. This cuts the number of deployments to install, upgrade, and monitor down to one while preserving each component's functionality; the names below survive as functional roles inside istiod.
Pilot (xDS configuration)
Translates Istio CRDs (VirtualService, DestinationRule, etc.) into Envoy xDS (Discovery Service) configuration and pushes it to the data-plane proxies. Manages service discovery by watching the Kubernetes API for Endpoints, Services, and Pods. Implements the xDS protocol: LDS (listeners), RDS (routes), CDS (clusters), EDS (endpoints).
Citadel (Certificate Authority)
Issues and rotates X.509 certificates for every workload in the mesh. Each pod gets a SPIFFE-compliant identity (spiffe://cluster.local/ns/<namespace>/sa/<service-account>). Certificates are rotated before expiry (default: 24 hours) without service disruption. Powers mTLS between all mesh participants.
Galley (Config Validation)
Validates Istio configuration resources before they reach Pilot, preventing misconfigured VirtualServices or DestinationRules from causing outages. Implemented as a Kubernetes Admission Webhook — invalid configs are rejected at apply time.
Data Plane — Envoy Proxy
Every pod in the mesh gets an Envoy proxy injected as a sidecar container. All inbound and outbound traffic flows through the sidecar — the application container has no awareness of the mesh. Envoy handles TLS termination/origination, load balancing, circuit breaking, retries, and telemetry collection at L4 and L7.
Key Envoy concepts: listeners (accept traffic), filters (process traffic — HTTP Connection Manager, gRPC transcoding, Lua), clusters (upstream services), endpoints (individual instances).
Key CRDs
| CRD | Purpose |
|---|---|
| VirtualService | HTTP/TCP routing rules: weights, retries, timeouts, fault injection |
| DestinationRule | Traffic policy for a destination: load balancing, connection pool, circuit breaker, subsets |
| Gateway | Configures the ingress/egress gateway (ports, protocols, TLS) |
| ServiceEntry | Registers external services in the mesh registry for routing and policy |
| PeerAuthentication | Configures mTLS mode (STRICT, PERMISSIVE, DISABLE) per namespace/workload |
| AuthorizationPolicy | L7 access control: allow/deny rules based on source, method, path, JWT claims |
Installation
istioctl Install Profiles
Istio ships with named profiles that configure component sets and resource defaults:
| Profile | Use Case | Components |
|---|---|---|
minimal | Control plane only, no ingress gateway | istiod |
default | Production baseline | istiod + ingress gateway |
demo | All features, for learning (high resources) | istiod + ingress + egress |
ambient | Sidecarless ambient mode | istiod + ztunnel + waypoint
# Download istioctl (replace VERSION with target, e.g. 1.22.0)
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.22.0 sh -
export PATH="$PWD/istio-1.22.0/bin:$PATH"
# Pre-flight check
istioctl x precheck
# Install with default profile
istioctl install --set profile=default -y
# Install with custom overlay (recommended for production)
istioctl install -f istio-operator.yaml -y
# Verify installation
istioctl verify-install
# Check component status
kubectl get pods -n istio-system
# Upgrade in place (for lower-risk canary upgrades, install a second revision with --revision and migrate workloads instead)
istioctl upgrade --set profile=default -y
Production IstioOperator overlay:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
name: production
spec:
profile: default
meshConfig:
accessLogFile: /dev/stdout
defaultConfig:
tracing:
sampling: 1.0 # 1% sampling in production
zipkin:
address: jaeger-collector.monitoring:9411
outboundTrafficPolicy:
mode: REGISTRY_ONLY # Block unregistered egress traffic
components:
pilot:
k8s:
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 1000m
memory: 4Gi
hpaSpec:
minReplicas: 2
maxReplicas: 5
ingressGateways:
- name: istio-ingressgateway
enabled: true
k8s:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 2000m
memory: 1Gi
hpaSpec:
minReplicas: 2
maxReplicas: 10
values:
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
Sidecar Injection
# Enable automatic sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled
# Verify injection label
kubectl get namespace production --show-labels
# Manually inject sidecar into existing deployment (for testing)
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
# Opt a specific pod OUT of injection (e.g., for a database sidecar)
# Add this annotation to the Pod spec:
# annotations:
# sidecar.istio.io/inject: "false"
# Check which pods have sidecars injected
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
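The opt-out annotation noted above belongs on the pod template, not on the Deployment's own metadata — the injection webhook only sees pods. A minimal sketch (workload name and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-db-client            # hypothetical workload
  namespace: production
spec:
  selector:
    matchLabels:
      app: legacy-db-client
  template:
    metadata:
      labels:
        app: legacy-db-client
      annotations:
        sidecar.istio.io/inject: "false"   # skip injection for these pods only
    spec:
      containers:
      - name: app
        image: legacy-db-client:1.0        # hypothetical image
```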
Traffic Management
VirtualService — HTTP Routing
A VirtualService defines routing rules for traffic addressed to a host (typically a Kubernetes service). The rules are applied by the Envoy sidecar or gateway handling the request, so the application itself needs no changes.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service
namespace: production
spec:
hosts:
- payment-service
http:
# Header-based routing (e.g., for internal QA traffic)
- match:
- headers:
x-env:
exact: staging
route:
- destination:
host: payment-service
subset: v2-canary
# Default weighted routing: 90% v1, 10% v2
- route:
- destination:
host: payment-service
subset: v1
weight: 90
- destination:
host: payment-service
subset: v2
weight: 10
# Retry configuration
retries:
attempts: 3
perTryTimeout: 5s
retryOn: "connect-failure,refused-stream,unavailable,cancelled,5xx"
# Timeout for entire request
timeout: 30s
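The 90/10 split above is probabilistic per request, not pinned per client — each request is an independent weighted choice among subsets. A rough Python illustration of the weight semantics (not Istio code; Envoy's actual balancer works over endpoint sets):

```python
import random

# Subset weights from the VirtualService above: v1=90, v2=10
subsets = ["v1", "v2"]
weights = [90, 10]

random.seed(7)  # deterministic for the example
n = 100_000
hits = {"v1": 0, "v2": 0}
for _ in range(n):
    # each request independently picks a subset in proportion to its weight
    hits[random.choices(subsets, weights=weights)[0]] += 1

share_v2 = hits["v2"] / n
print(f"v2 share ~ {share_v2:.3f}")  # close to 0.10
```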
Fault Injection
Inject faults in production-like environments to test service resilience without modifying application code:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service-chaos
spec:
hosts:
- payment-service
http:
- match:
- headers:
x-chaos-test:
exact: "true"
fault:
delay:
percentage:
value: 10.0 # inject 500ms delay for 10% of requests
fixedDelay: 500ms
abort:
percentage:
value: 5.0 # return HTTP 503 for 5% of requests
httpStatus: 503
route:
- destination:
host: payment-service
- route:
- destination:
host: payment-service
DestinationRule — Traffic Policy & Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: payment-service
namespace: production
spec:
host: payment-service
trafficPolicy:
loadBalancer:
simple: LEAST_CONN # ROUND_ROBIN | LEAST_CONN | RANDOM | PASSTHROUGH
connectionPool:
tcp:
maxConnections: 100
connectTimeout: 3s
http:
http1MaxPendingRequests: 64
http2MaxRequests: 1000
maxRequestsPerConnection: 10
maxRetries: 3
outlierDetection:
# Circuit breaker: eject hosts with too many errors
consecutiveGatewayErrors: 5
consecutive5xxErrors: 5
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 50
minHealthPercent: 30
subsets:
- name: v1
labels:
version: v1
trafficPolicy:
loadBalancer:
simple: ROUND_ROBIN
- name: v2
labels:
version: v2
- name: v2-canary
labels:
version: v2-canary
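The outlierDetection settings are easy to misread: ejection is per-host and triggered by consecutive errors, not by an error rate. A simplified sketch of the logic (illustrative only — Envoy's real implementation also enforces interval, maxEjectionPercent, minHealthPercent, and grows the ejection time on repeat ejections):

```python
CONSECUTIVE_5XX_LIMIT = 5      # consecutive5xxErrors
BASE_EJECTION_SECONDS = 30     # baseEjectionTime

class Host:
    def __init__(self, name):
        self.name = name
        self.consecutive_5xx = 0
        self.ejected_until = 0.0   # timestamp until which the host is ejected

    def record(self, status_code, now):
        if 500 <= status_code < 600:
            self.consecutive_5xx += 1
            if self.consecutive_5xx >= CONSECUTIVE_5XX_LIMIT:
                # eject: stop sending traffic for the ejection window
                self.ejected_until = now + BASE_EJECTION_SECONDS
                self.consecutive_5xx = 0
        else:
            self.consecutive_5xx = 0   # any success resets the streak

    def available(self, now):
        return now >= self.ejected_until

h = Host("payment-v1-pod-a")   # hypothetical pod name
for _ in range(5):
    h.record(503, now=100.0)
print(h.available(100.0))   # False: ejected after 5 consecutive 5xx
print(h.available(131.0))   # True: ejection window elapsed
```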
Gateway — Ingress TLS
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
name: production-gateway
namespace: istio-system
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: https
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: production-tls-cert # K8s secret with tls.crt/tls.key
hosts:
- api.example.com
- admin.example.com
- port:
number: 80
name: http
protocol: HTTP
tls:
httpsRedirect: true # redirect all HTTP to HTTPS
hosts:
- api.example.com
- admin.example.com
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: api-routing
namespace: production
spec:
hosts:
- api.example.com
gateways:
- istio-system/production-gateway
http:
- match:
- uri:
prefix: /payment
route:
- destination:
host: payment-service
port:
number: 8080
- match:
- uri:
prefix: /orders
route:
- destination:
host: order-service
port:
number: 8080
Canary Deployment
Progressive traffic shifting with Istio — each step only changes VirtualService weights; the Deployments and pods stay untouched until the final cleanup:
# Step 1: Deploy v2 pods (but send 0% traffic)
kubectl apply -f payment-service-v2-deployment.yaml
# Step 2: Configure initial 5% canary
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service
spec:
hosts: [payment-service]
http:
- route:
- destination:
host: payment-service
subset: v1
weight: 95
- destination:
host: payment-service
subset: v2
weight: 5
EOF
# Step 3: Watch error rate in Prometheus before progressing
# istio_requests_total{destination_service="payment-service",response_code=~"5.."}
# Step 4: Shift to 50/50
# (patch VirtualService weight: v1=50, v2=50)
# Step 5: Full cutover
# (patch VirtualService weight: v1=0, v2=100)
# Step 6: Decommission v1 pods
kubectl delete deployment payment-service-v1
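Steps 4 and 5 are plain edits to the same VirtualService; for the 50/50 shift in step 4, only the weights in the http route change:

```yaml
http:
- route:
  - destination:
      host: payment-service
      subset: v1
    weight: 50
  - destination:
      host: payment-service
      subset: v2
    weight: 50
```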
Traffic Mirroring (Shadow Traffic)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service
spec:
hosts:
- payment-service
http:
- route:
- destination:
host: payment-service
subset: v1
weight: 100
mirror:
host: payment-service
subset: v2-shadow # receives a copy of 20% of live traffic
mirrorPercentage:
value: 20.0
Security
mTLS — PeerAuthentication
mTLS ensures all service-to-service communication is encrypted and mutually authenticated using SPIFFE identities. Enforce STRICT mode after migration is complete:
# STRICT mode for entire namespace — all traffic must be mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: STRICT
---
# PERMISSIVE during migration — accept both mTLS and plaintext
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: PERMISSIVE
---
# Workload-specific override — exempt a legacy pod
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: legacy-exception
namespace: production
spec:
selector:
matchLabels:
app: legacy-db-client
mtls:
mode: PERMISSIVE
Use istioctl x authz check <pod> to debug which authorization policies apply to a workload.
AuthorizationPolicy — Service-Level RBAC
AuthorizationPolicy implements L7 access control. The default (no policy) is allow-all. A deny-all baseline with explicit allows is the production-safe pattern:
# Step 1: Deny all inbound traffic to production namespace
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-all
namespace: production
spec:
{} # no selector = applies to all workloads in namespace
# no rules = deny all
---
# Step 2: Allow payment-service to be called from order-service and api-gateway only
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: payment-service-access
namespace: production
spec:
selector:
matchLabels:
app: payment-service
action: ALLOW
rules:
- from:
- source:
principals:
- "cluster.local/ns/production/sa/order-service"
- "cluster.local/ns/production/sa/api-gateway"
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/v1/payments/*", "/api/v1/refunds/*"]
- from:
- source:
namespaces: ["monitoring"]
to:
- operation:
methods: ["GET"]
paths: ["/metrics"]
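Evaluation order matters once ALLOW and DENY policies coexist: DENY policies are checked first; if none match and at least one ALLOW policy selects the workload, the request must match some ALLOW rule. A simplified sketch of that decision (illustrative, not Istio source; CUSTOM policies are omitted):

```python
def authorize(request, deny_rules, allow_rules):
    """Simplified AuthorizationPolicy evaluation for one workload.

    deny_rules / allow_rules are lists of predicates over the request.
    """
    # 1. Any matching DENY policy rejects the request outright.
    if any(rule(request) for rule in deny_rules):
        return False
    # 2. With no ALLOW policies targeting the workload, the default is allow.
    if not allow_rules:
        return True
    # 3. Otherwise the request must match at least one ALLOW rule.
    return any(rule(request) for rule in allow_rules)

# Hypothetical predicates mirroring the payment-service policy above
from_order_service = lambda r: r["principal"].endswith("/sa/order-service")
metrics_scrape = lambda r: r["path"] == "/metrics" and r["method"] == "GET"
allow = [from_order_service, metrics_scrape]

print(authorize({"principal": "cluster.local/ns/production/sa/order-service",
                 "path": "/api/v1/payments/x", "method": "POST"}, [], allow))  # True
print(authorize({"principal": "cluster.local/ns/production/sa/unknown",
                 "path": "/api/v1/payments/x", "method": "POST"}, [], allow))  # False
```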
JWT Validation
# Validate JWTs issued by your IdP
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
name: jwt-validation
namespace: production
spec:
selector:
matchLabels:
app: api-gateway
jwtRules:
- issuer: "https://auth.example.com"
jwksUri: "https://auth.example.com/.well-known/jwks.json"
audiences:
- "api.example.com"
forwardOriginalToken: true
---
# Require valid JWT for all paths except /health
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: require-jwt
namespace: production
spec:
selector:
matchLabels:
app: api-gateway
action: ALLOW
rules:
- from:
- source:
requestPrincipals: ["https://auth.example.com/*"]
- to:
- operation:
paths: ["/health", "/ready"]
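The issuer and audiences that RequestAuthentication checks live in the token's payload (the middle base64url segment). A minimal inspection helper — it does no signature verification, so it is for illustration only; real validation happens against the keys at jwksUri:

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url requires padding to be restored before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy unsigned token carrying the claims the policy above expects
claims = {"iss": "https://auth.example.com", "aud": "api.example.com", "sub": "user-123"}
header = base64.urlsafe_b64encode(json.dumps({"alg": "none"}).encode()).rstrip(b"=").decode()
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"{header}.{body}."

decoded = jwt_payload(token)
print(decoded["iss"])  # https://auth.example.com
```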
Egress Control — ServiceEntry
With outboundTrafficPolicy: REGISTRY_ONLY, all egress to unregistered hosts is blocked. Register external services explicitly:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
name: external-payment-gateway
namespace: production
spec:
hosts:
- api.stripe.com
ports:
- number: 443
name: https
protocol: HTTPS
location: MESH_EXTERNAL
resolution: DNS
---
# Route egress through egress gateway for auditing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: stripe-via-egress
namespace: production
spec:
hosts:
- api.stripe.com
tls:
- match:
- port: 443
sniHosts: [api.stripe.com]
route:
- destination:
host: istio-egressgateway.istio-system.svc.cluster.local
port:
number: 443
Observability
Automatic Prometheus Metrics
Istio generates the following L7 metrics automatically for every request, labeled by source and destination workload:
- istio_requests_total — counter of requests (labels: reporter, source/dest workload, namespace, response_code)
- istio_request_duration_milliseconds — histogram of request latencies
- istio_request_bytes — histogram of request body sizes
- istio_response_bytes — histogram of response body sizes
- istio_tcp_sent_bytes_total / istio_tcp_received_bytes_total — TCP-level byte counters
Useful PromQL queries:
# Request rate per service (RED: Rate)
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate per service (RED: Errors)
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# P99 latency per service (RED: Duration)
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (destination_service_name, le)
)
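histogram_quantile operates on cumulative buckets and interpolates linearly inside the bucket where the target rank falls. A small Python sketch of the computation (simplified — Prometheus additionally applies rate() first and handles edge cases around the +Inf bucket):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs ending with +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # rank falls in the open-ended bucket
            width = count - prev_count
            frac = (rank - prev_count) / width if width else 1.0
            return prev_le + frac * (le - prev_le)  # linear interpolation
        prev_le, prev_count = le, count

# Latency buckets in ms: 60 requests <=100ms, 90 <=250ms, 99 <=500ms, 100 total
buckets = [(100.0, 60), (250.0, 90), (500.0, 99), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))  # 500.0
```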
Distributed Tracing — Jaeger
# Deploy the Jaeger operator (use the all-in-one strategy for dev, the production strategy for prod)
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability
# Jaeger instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: https://elasticsearch:9200
collector:
maxReplicas: 5
# Configure Istio to send traces to Jaeger
# In IstioOperator:
# meshConfig:
# defaultConfig:
# tracing:
# sampling: 1.0
# zipkin:
# address: jaeger-collector.observability:9411
Kiali — Service Graph
# Install Kiali via Helm
helm repo add kiali https://kiali.org/helm-charts
helm install kiali-server kiali/kiali-server \
--namespace istio-system \
--set auth.strategy="token" \
--set external_services.prometheus.url="http://prometheus.monitoring:9090" \
--set external_services.jaeger.url="http://jaeger-query.observability:16686"
# Access Kiali UI
kubectl port-forward svc/kiali -n istio-system 20001:20001
# open http://localhost:20001
Kiali provides: service topology graph with real-time traffic flow, error rate heat maps, configuration validation (detects broken VirtualServices), mTLS status per edge, and workload health indicators.
Production Tips & Best Practices
Envoy Sidecar Resources
# Set sidecar resource defaults via annotation on Deployment/Pod
annotations:
proxy.istio.io/config: |
concurrency: 4
sidecar.istio.io/proxyCPU: "200m"
sidecar.istio.io/proxyMemory: "256Mi"
sidecar.istio.io/proxyCPULimit: "500m"
sidecar.istio.io/proxyMemoryLimit: "512Mi"
Graceful Termination
Envoy needs time to drain in-flight connections on pod shutdown. Without this, you get 503s during rolling deploys:
# Add a preStop lifecycle hook to your application container
# to give Envoy time to drain connections
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5"]
# Also configure Envoy drain duration via annotation
annotations:
proxy.istio.io/config: |
terminationDrainDuration: 30s
# Set pod terminationGracePeriodSeconds to exceed drain duration
spec:
terminationGracePeriodSeconds: 60
Ambient Mode (Sidecarless)
Istio Ambient Mode (Beta as of Istio 1.22, GA since 1.24) eliminates per-pod sidecars. Instead:
- ztunnel — a DaemonSet node proxy handling L4 mTLS and basic L4 authorization
- Waypoint proxy — an optional, namespace-scoped Envoy deployment for L7 features (only deploy when you need L7 routing/auth)
Benefits: ~50% lower resource overhead, no pod restart required to add/remove from mesh, simpler upgrade path. Trade-off: L7 features require explicit waypoint deployment and add a hop. Ambient is ideal for large clusters where sidecar overhead is a major cost driver.
# Enable ambient mode for a namespace
kubectl label namespace production istio.io/dataplane-mode=ambient
# Deploy a waypoint proxy for L7 features
istioctl waypoint apply --namespace production --enroll-namespace
Debugging Commands
# Analyze Istio configuration for issues
istioctl analyze -n production
# Check proxy config for a specific pod
istioctl proxy-config listener <pod-name> -n production
istioctl proxy-config route <pod-name> -n production
istioctl proxy-config cluster <pod-name> -n production
istioctl proxy-config endpoint <pod-name> -n production
# Check authorization policy application
istioctl x authz check <pod-name> -n production
# View sidecar access logs (must have accessLogFile set)
kubectl logs <pod-name> -n production -c istio-proxy --tail=100
# Check mTLS status between services
istioctl x describe pod <pod-name> -n production