Multi-cloud & Hybrid Infrastructure

Scope: This section covers multi-cloud strategy, hybrid connectivity patterns, landing zone governance, and the tooling required to operate workloads consistently across AWS, GCP, and on-premises environments.

Definitions and Differences

Multi-Cloud

The deliberate use of cloud services from two or more independent public cloud providers (e.g., AWS and GCP) to run distinct workloads or replicate workloads for resilience. Each cloud operates independently — there is no required network connectivity between them, though cross-cloud integration is common.

Key trait: Multiple public cloud providers, same or different workloads per provider.

Hybrid Cloud

An architecture that integrates at least one public cloud environment with on-premises infrastructure (private cloud or traditional data center) through persistent, private network connectivity. Applications and data can flow between environments as a unified operational model.

Key trait: On-premises + public cloud with deep network and identity integration.

Dimension | Multi-Cloud | Hybrid Cloud | Both
Primary goal | Best-of-breed services, resilience, vendor independence | Integrate on-premises legacy with cloud scalability | Data sovereignty, disaster recovery
Connectivity requirement | Optional (workloads may be fully independent) | Mandatory (dedicated link or VPN) | Required for data federation scenarios
Primary driver | Vendor lock-in avoidance, regulatory | Legacy modernization, compliance, burst capacity | M&A integration, edge computing
Operational complexity | High (multiple control planes) | High (network, identity bridging) | Very High

Why Multi-Cloud?

Enterprises adopt multi-cloud for business, technical, and regulatory reasons — rarely for a single motive.

Avoid Vendor Lock-in

Dependence on a single provider's proprietary services (e.g., AWS DynamoDB, GCP BigQuery) makes migration expensive and slow. A multi-cloud strategy encourages use of portable standards (Kubernetes, Terraform, open data formats) and preserves negotiation leverage at contract renewal.
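As a sketch of the portability argument, one Terraform root module can target both providers with the same language, state handling, and review workflow; the project ID, bucket names, and regions below are illustrative.

```hcl
# Both providers configured in one root module. Each resource uses the
# provider-native type, but the IaC language and workflow stay identical.
provider "aws" {
  region = "ap-southeast-1"
}

provider "google" {
  project = "example-project" # illustrative project ID
  region  = "asia-southeast1"
}

# Equivalent object stores, one per cloud, tagged/labeled consistently.
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-aws" # illustrative name
  tags   = { team = "platform", env = "prod" }
}

resource "google_storage_bucket" "artifacts" {
  name     = "example-artifacts-gcp" # illustrative name
  location = "ASIA-SOUTHEAST1"
  labels   = { team = "platform", env = "prod" }
}
```

The portable layer here is the language, tooling, and review process, not the resources themselves; the bucket types remain provider-specific.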

Best-of-Breed Services

No single cloud leads in every domain. GCP's BigQuery and Vertex AI are widely regarded as best-in-class for analytics and ML. AWS leads in breadth of services and global infrastructure maturity. Azure dominates in hybrid Active Directory integration and Microsoft workload licensing. Multi-cloud enables teams to use the optimal service for each workload.

Data Residency and Compliance

Regulations such as GDPR (EU), PDPA (Thailand/Singapore), and financial sector requirements mandate that data remain within specific geographic or political boundaries. Some countries are served by only one major cloud provider's local region — operating on a second cloud can satisfy residency requirements that a single provider cannot.

M&A Integration

Acquisitions frequently result in inheriting a different cloud provider's estate. Rather than forcing costly re-platforming, organizations operate the acquired company on its existing cloud while establishing governance bridges (federated identity, shared observability, cross-cloud networking) as an interim state during integration.

Resilience Against Provider Outages

A major cloud provider outage affecting an entire region or global control plane (such as the AWS us-east-1 incidents of November 2020 and December 2021) is mitigated when critical workloads are distributed across independent providers with separate failure domains. Active-active multi-cloud deployments can maintain availability during such events.

Multi-Cloud Challenges

Operational Complexity

Each cloud provider has distinct APIs, CLIs, IAM models, networking primitives, and console UIs. Platform teams must maintain expertise and tooling across multiple stacks. Incident response procedures, runbooks, and on-call rotations all multiply in scope.

Skills and Staffing

Deep expertise in AWS and GCP simultaneously is rare. Training, certification, and hiring costs increase. Team silos can form around cloud providers, creating knowledge bottlenecks and inconsistent practices.

Networking Complexity

Cross-cloud traffic flows over the public internet unless dedicated interconnects are established. IP address management, BGP routing, DNS resolution, and firewall rules must be coordinated across entirely different network control planes.

Security Consistency

Applying uniform security policies (IAM least privilege, encryption standards, network micro-segmentation, vulnerability scanning) across AWS and GCP requires abstraction layers and consolidated tooling. A gap in one provider's posture creates risk for the whole estate.
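One concrete way to keep posture aligned is to express the same control, for example a region restriction, as policy objects on both providers from a single Terraform module. This is a sketch; the policy names, project ID, and allowed regions are illustrative.

```hcl
# AWS: Service Control Policy denying actions outside approved regions.
resource "aws_organizations_policy" "region_lock" {
  name = "deny-unapproved-regions" # illustrative name
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Deny"
      Action   = "*"
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "aws:RequestedRegion" = ["ap-southeast-1", "ap-southeast-2"]
        }
      }
    }]
  })
}

# GCP: Organization Policy restricting resource locations.
resource "google_org_policy_policy" "region_lock" {
  name   = "projects/example-project/policies/gcp.resourceLocations"
  parent = "projects/example-project" # illustrative project

  spec {
    rules {
      values {
        allowed_values = [
          "in:asia-southeast1-locations",
          "in:asia-southeast2-locations",
        ]
      }
    }
  }
}
```

In practice the AWS SCP needs `NotAction` exemptions for global services (IAM, CloudFront, Route 53), so treat the deny statement above as a starting point rather than a drop-in policy.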

Cost Visibility

Each provider has a separate billing system, cost taxonomy, and discount model (Reserved Instances, Committed Use Discounts). Achieving a unified FinOps view requires aggregation tooling (e.g., CloudHealth, Apptio Cloudability, or self-built via billing exports to BigQuery/S3).
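On the AWS side, for example, the Cost and Usage Report (CUR) can be delivered to S3 with Terraform so a FinOps tool or warehouse can ingest it; the report and bucket names below are illustrative, and the target bucket must already exist with the CUR bucket policy attached.

```hcl
# Deliver the AWS Cost and Usage Report to S3 for ingestion by a
# FinOps aggregation pipeline. Note: this resource must be created
# through the us-east-1 API endpoint.
resource "aws_cur_report_definition" "finops" {
  report_name                = "org-cur" # illustrative
  time_unit                  = "HOURLY"
  format                     = "textORcsv"
  compression                = "GZIP"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = "example-finops-billing" # illustrative
  s3_prefix                  = "cur"
  s3_region                  = "us-east-1"
}
```

The GCP counterpart, billing export to BigQuery, is enabled at the billing-account level (console or API) rather than as a plain Terraform resource, which is one reason unified FinOps views require aggregation tooling on top.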

Data Egress Costs

Transferring large data sets between cloud providers generates significant egress charges. Architectures must minimize cross-cloud data movement, keeping compute close to data and using replication only where justified by availability or compliance requirements.

Multi-Cloud Maturity Model

Stage | Characteristics | Governance | Tooling | Typical Org
1 — Cloud-First | Single primary cloud; all new workloads go to cloud. On-premises being drained. | Basic SCPs / Org Policies per provider. Manual compliance checks. | Cloud-native CLI and console. Early Terraform adoption. | SMB or early-stage enterprise with one strategic cloud vendor
2 — Multi-Cloud Aware | Two or more clouds in production. Workloads segregated by cloud. Minimal cross-cloud integration. | Landing zones defined per cloud. Separate IAM and policy management. | Terraform for all clouds. Basic unified monitoring (Datadog or Grafana). Cloud SSO per provider. | Enterprise post-acquisition, or teams adopting GCP ML alongside AWS production
3 — Multi-Cloud Optimized | Workload placement is policy-driven. Cross-cloud networking and identity are federated. FinOps is unified. | Policy-as-code (OPA/Sentinel). Unified SCP + Org Policy framework. Automated compliance scanning across providers. | Platform engineering team. GitOps (ArgoCD). Anthos or Arc for Kubernetes. HashiCorp Vault for secrets. Unified cost dashboard. | Large enterprise with a dedicated Platform Engineering or CCoE function

Workload Placement Decision Framework

Deciding which cloud to place a workload on requires evaluating multiple factors systematically. Apply the following criteria as a scoring model or decision gate.

Criterion | Prefer AWS | Prefer GCP | Either / Neutral
ML / AI workload | SageMaker (managed MLOps, broad model support) | Vertex AI, TPU access, BigQuery ML (tighter data integration) | Custom model on GPU VMs
Analytics / Data warehouse | Redshift + Glue + Lake Formation | BigQuery (serverless, petabyte-scale, strong ad-hoc query performance) | Open-source (Spark on either)
Kubernetes at scale | EKS (mature managed control plane, broad add-on ecosystem) | GKE (lowest ops overhead with Autopilot mode) | Either, with Anthos/Arc unification
Microsoft workloads (AD, SQL Server) | Strong (native AD integration, RDS for SQL Server) | Possible but not optimal | Azure would be preferred over both
Existing team expertise | Team holds AWS certifications / significant AWS experience | Team holds GCP certifications / GCP background | Platform-agnostic Terraform / Kubernetes skills
Data residency (SEA/APAC) | ap-southeast-1 (Singapore), ap-southeast-2 (Sydney), local zones | asia-southeast1 (Singapore), asia-southeast2 (Jakarta) | Evaluate specific country availability per provider
Serverless / Event-driven | Lambda + SQS + EventBridge (mature, wide trigger support) | Cloud Run + Pub/Sub + Eventarc (simpler scaling model) | Both are production-grade
Cost at scale | Savings Plans + Spot; flexible, better for diverse instance mix | CUDs + Spot VMs; compute pricing often lower for sustained workloads | Negotiate enterprise agreements on both
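The criteria above can be encoded as a simple weighted scoring gate that teams fill in during architecture review. The weights and per-provider scores below are illustrative placeholders, not recommendations.

```hcl
# Weighted placement score per provider. Scores (0-5) are assigned per
# workload during review; weights reflect organizational priorities.
# All values shown here are placeholders.
locals {
  weights = {
    service_fit    = 0.4
    team_expertise = 0.3
    residency      = 0.2
    cost           = 0.1
  }

  scores = {
    aws = { service_fit = 3, team_expertise = 5, residency = 4, cost = 3 }
    gcp = { service_fit = 5, team_expertise = 2, residency = 4, cost = 5 }
  }

  # Weighted total per provider.
  totals = {
    for provider, s in local.scores :
    provider => sum([for k, w in local.weights : w * s[k]])
  }
}

output "placement_scores" {
  value = local.totals # weighted totals per provider
}
```

Keeping the gate in code makes placement decisions reviewable and auditable alongside the infrastructure they affect.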

Common Multi-Cloud Patterns

Active-Active

Production traffic is served simultaneously from two cloud providers. A global load balancer (e.g., Cloudflare, AWS Route 53 with latency routing, or GCP Cloud Load Balancing) routes users to the nearest healthy endpoint. Both clouds carry live traffic and are continuously synchronized. Requires a distributed data strategy (CRDTs, multi-master databases, or event streaming with Kafka or Pub/Sub).

Use case: Mission-critical SaaS with SLA > 99.99%, global user base.
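For the global load balancer layer, a Route 53 latency-routed record pair pointing at an AWS endpoint and a GCP endpoint might look like the following sketch; the zone ID, hostnames, and health check path are illustrative.

```hcl
# Health check for the GCP-hosted endpoint; unhealthy endpoints are
# withdrawn from DNS answers.
resource "aws_route53_health_check" "gcp_endpoint" {
  fqdn              = "gcp.app.example.com" # illustrative
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

# Latency-based routing: users get the endpoint that answers fastest
# from their network location.
resource "aws_route53_record" "app_aws" {
  zone_id        = "Z123456EXAMPLE" # illustrative hosted zone ID
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "aws-ap-southeast-1"
  records        = ["aws.app.example.com"]

  latency_routing_policy {
    region = "ap-southeast-1"
  }
}

resource "aws_route53_record" "app_gcp" {
  zone_id         = "Z123456EXAMPLE"
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "gcp-asia-southeast1"
  records         = ["gcp.app.example.com"]
  health_check_id = aws_route53_health_check.gcp_endpoint.id

  latency_routing_policy {
    region = "ap-southeast-1" # Route 53 regions are AWS regions; pick the nearest
  }
}
```

DNS here only solves routing; the data-synchronization layer the pattern requires is a separate design problem.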

Active-Passive

Primary cloud handles all production traffic. Secondary cloud maintains a warm standby (replicated data, pre-provisioned infrastructure in a halted or minimal state). On failure, a failover procedure (automated or manual) promotes the secondary cloud. RTO is typically minutes; RPO depends on replication lag.

Use case: Disaster recovery, business continuity for tier-1 applications where active-active cost is not justified.
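The DNS side of failover can be sketched with Route 53 failover routing, where the secondary record answers only when the primary's health check fails; hostnames and the zone ID are illustrative.

```hcl
# Health check gating the primary record.
resource "aws_route53_health_check" "primary" {
  fqdn              = "aws.app.example.com" # illustrative
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

# Failover routing: all traffic goes to PRIMARY while its health check
# passes; Route 53 answers with SECONDARY otherwise.
resource "aws_route53_record" "primary" {
  zone_id         = "Z123456EXAMPLE" # illustrative hosted zone ID
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary-aws"
  records         = ["aws.app.example.com"]
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = "Z123456EXAMPLE"
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "secondary-gcp"
  records        = ["gcp.app.example.com"]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

DNS failover bounds RTO by health-check detection plus record TTL; data promotion on the standby side usually dominates the actual recovery time.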

Cloud Bursting

Normal capacity runs on primary infrastructure (on-premises or primary cloud). During demand spikes, overflow workloads burst to a secondary cloud. Requires pre-provisioned network connectivity and identical container images or AMIs on the burst target. Kubernetes with cluster federation or KEDA-based autoscaling can orchestrate cross-cloud bursting.

Use case: Batch processing, seasonal e-commerce peaks, financial end-of-day processing.

Data Federation

Data is distributed across clouds or between cloud and on-premises. Query engines (BigQuery Omni, AWS Athena federated queries, Starburst, Trino) access data in place without movement. A data catalog (Dataplex, AWS Glue Data Catalog) provides unified metadata. This minimizes egress costs while enabling cross-cloud analytics.

Use case: Analytics on data that must remain in a specific cloud due to compliance, but needs to be joined with data from another provider.
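As one example of the tooling, a BigQuery Omni connection lets BigQuery query data sitting in S3 without copying it into GCP. This is a sketch; the project ID and IAM role ARN are illustrative, and Omni is only available in specific AWS regions.

```hcl
# BigQuery Omni connection to an AWS account. BigQuery assumes the
# referenced IAM role to read S3 data in place.
resource "google_bigquery_connection" "omni_aws" {
  connection_id = "omni-aws"        # illustrative
  project       = "example-project" # illustrative
  location      = "aws-us-east-1"   # an Omni-enabled AWS region

  aws {
    access_role {
      iam_role_id = "arn:aws:iam::123456789012:role/bigquery-omni" # illustrative role
    }
  }
}
```

The corresponding AWS IAM role must grant read access to the target S3 buckets and trust BigQuery's identity, which keeps the cross-cloud access auditable on both sides.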

Tool Ecosystem

A consistent toolchain across cloud providers reduces cognitive overhead and enforces standard practices.

Tool | Category | Role in Multi-Cloud
Terraform | Infrastructure as Code | Single IaC language for AWS, GCP, and on-premises resources. Provider plugins abstract cloud-specific APIs. State management via an S3/GCS backend enables shared infrastructure state across teams.
Google Anthos / Azure Arc | Kubernetes Control Plane | Extends Google Cloud (Anthos) or Azure (Arc) management to Kubernetes clusters running on AWS, on-premises, or edge. Unified policy, RBAC, and service mesh configuration across all registered clusters.
HashiCorp Vault | Secrets Management | Central secrets broker that speaks to both AWS IAM (dynamic credentials via the AWS secrets engine) and GCP IAM (service account keys or Workload Identity). Applications retrieve short-lived credentials without storing secrets in config.
Datadog / Grafana | Observability | Unified dashboards ingesting metrics, logs, and traces from AWS CloudWatch, GCP Cloud Monitoring, and Kubernetes. Single pane of glass for cross-cloud SLO tracking and incident correlation.
Open Policy Agent (OPA) | Policy as Code | Language-agnostic policy engine. Enforce Terraform plan policies (via Conftest), Kubernetes admission (via Gatekeeper), and API authorization across environments with a single Rego policy library.
Crossplane | Cloud Resource Composition | Kubernetes-native control plane that provisions AWS and GCP resources via Custom Resource Definitions (CRDs). Teams provision cloud infrastructure through the same GitOps pipeline used for application deployment.
SOPS / External Secrets Operator | Secrets Injection | SOPS encrypts secrets in Git using AWS KMS or GCP KMS. External Secrets Operator syncs secrets from AWS Secrets Manager and GCP Secret Manager into Kubernetes Secrets at runtime.
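The Terraform row above mentions shared state; a remote backend block is how that is wired. The sketch below uses a GCS backend (an S3 backend with a DynamoDB lock table is the AWS-side equivalent); the bucket name is illustrative.

```hcl
terraform {
  # Remote state shared across the team; the backend handles state
  # locking. Swap for backend "s3" plus a DynamoDB lock table when
  # state lives on the AWS side.
  backend "gcs" {
    bucket = "example-tf-state" # illustrative bucket
    prefix = "platform/prod"
  }

  # Pin both providers so plans are reproducible across machines.
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}
```

Holding both providers' state in one backend gives a single audit trail for cross-cloud changes, at the cost of making that backend a dependency for every apply.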

Identity Federation: AWS IAM and GCP Workload Identity

Federated identity eliminates long-lived credential sharing between cloud environments. GCP Workload Identity Federation can trust AWS identities directly, and AWS IAM can trust Google-issued OIDC tokens, so workloads on either side authenticate without static keys.

Concept: AWS to GCP Keyless Authentication

A workload running on AWS (e.g., an EC2 instance, ECS task, or Lambda) needs to call a GCP API without storing a GCP Service Account key. For AWS callers, GCP's Workload Identity Federation does not consume an OIDC token; it verifies a SigV4-signed AWS STS GetCallerIdentity request, so no identity provider needs to be hosted on the AWS side.

Token Exchange Flow (AWS workload calling GCP):
  1. AWS workload obtains temporary AWS credentials (e.g., from the EC2 instance metadata service, an ECS task role, or a Lambda execution role).
  2. Workload builds a SigV4-signed sts:GetCallerIdentity request that proves its AWS identity.
  3. Workload sends the serialized signed request to GCP's Security Token Service endpoint: https://sts.googleapis.com/v1/token.
  4. GCP STS verifies the signature and checks the caller against the AWS account configured in the Workload Identity Pool provider.
  5. GCP returns a short-lived federated access token, which the workload exchanges for an access token of the bound GCP Service Account (IAM Credentials generateAccessToken).
  6. Workload uses that token to call GCP APIs. No GCP Service Account key was created or stored.

Terraform: GCP Workload Identity Pool for AWS

# Create a Workload Identity Pool
resource "google_iam_workload_identity_pool" "aws_pool" {
  project                   = var.gcp_project_id
  workload_identity_pool_id = "aws-workload-pool"
  display_name              = "AWS Workload Identity Pool"
  description               = "Pool for AWS workloads to authenticate to GCP"
}

# Add AWS as a trusted OIDC provider within the pool
resource "google_iam_workload_identity_pool_provider" "aws_provider" {
  project                            = var.gcp_project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.aws_pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "aws-provider"
  display_name                       = "AWS Provider"

  aws {
    account_id = var.aws_account_id   # Trust tokens from this AWS account only
  }

  # Attribute mapping: map AWS caller identity to GCP subject
  attribute_mapping = {
    "google.subject"        = "assertion.arn"
    "attribute.aws_account" = "assertion.account"
    "attribute.aws_role"    = "assertion.arn.contains('assumed-role') ? assertion.arn.extract('assumed-role/{role}/') : ''"
  }

  # Condition: only allow tokens from a specific AWS role
  attribute_condition = "attribute.aws_role == 'my-app-role'"
}

# Bind the GCP Service Account to identities from the pool
resource "google_service_account_iam_member" "workload_binding" {
  service_account_id = google_service_account.app_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.aws_pool.name}/attribute.aws_role/my-app-role"
}

# The GCP Service Account used by the AWS workload
resource "google_service_account" "app_sa" {
  project      = var.gcp_project_id
  account_id   = "aws-app-service-account"
  display_name = "Service Account for AWS Application"
}

# Grant the SA only the permissions it needs (least privilege)
resource "google_project_iam_member" "app_sa_storage" {
  project = var.gcp_project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.app_sa.email}"
}

GCP to AWS: OIDC Federation Using Google-Issued Identity Tokens

# On the AWS side: create an OIDC identity provider for GCP
resource "aws_iam_openid_connect_provider" "gcp" {
  url = "https://accounts.google.com"
  # Audience the GCP workload must request in its identity token.
  client_id_list = ["sts.amazonaws.com"]
  # Root CA thumbprint for accounts.google.com; verify the current
  # value before use, as certificate rotations can change it.
  thumbprint_list = ["08745487e891c19e3078c1f2a07e452950ef36f6"]
}

# IAM role trusted by the GCP OIDC provider
resource "aws_iam_role" "gcp_workload_role" {
  name = "gcp-workload-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.gcp.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # The sub claim carries the GCP service account's numeric
        # unique ID (not its email address).
        StringEquals = {
          "accounts.google.com:sub" = "SERVICE_ACCOUNT_UNIQUE_ID"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "gcp_workload_s3" {
  role       = aws_iam_role.gcp_workload_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

Continue reading:
  • Landing Zone Design — Account/project structure, guardrails, Terraform modules for multi-account AWS and GCP
  • Hybrid Connectivity — VPN, Direct Connect, Cloud Interconnect, cross-cloud BGP, DNS resolution, AWS Outposts, Anthos