Multi-cloud & Hybrid Infrastructure
Definitions and Differences
Multi-Cloud
The deliberate use of cloud services from two or more independent public cloud providers (e.g., AWS and GCP) to run distinct workloads or replicate workloads for resilience. Each cloud operates independently — there is no required network connectivity between them, though cross-cloud integration is common.
Key trait: Multiple public cloud providers, same or different workloads per provider.
Hybrid Cloud
An architecture that integrates at least one public cloud environment with on-premises infrastructure (private cloud or traditional data center) through persistent, private network connectivity. Applications and data can flow between environments as a unified operational model.
Key trait: On-premises + public cloud with deep network and identity integration.
| Dimension | Multi-Cloud | Hybrid Cloud | Both |
|---|---|---|---|
| Primary goal | Best-of-breed services, resilience, vendor independence | Integrate on-premises legacy with cloud scalability | Data sovereignty, disaster recovery |
| Connectivity requirement | Optional (workloads may be fully independent) | Mandatory (dedicated link or VPN) | Required for data federation scenarios |
| Primary driver | Vendor lock-in avoidance, regulatory | Legacy modernization, compliance, burst capacity | M&A integration, edge computing |
| Operational complexity | High (multiple control planes) | High (network, identity bridging) | Very High |
Why Multi-Cloud?
Enterprises adopt multi-cloud for business, technical, and regulatory reasons — rarely for a single motive.
Avoid Vendor Lock-in
Dependence on a single provider's proprietary services (e.g., AWS DynamoDB, GCP BigQuery) makes migration expensive and slow. A multi-cloud strategy encourages use of portable standards (Kubernetes, Terraform, open data formats) and preserves negotiation leverage at contract renewal.
Best-of-Breed Services
No single cloud leads in every domain. GCP's BigQuery and Vertex AI are widely regarded as best-in-class for analytics and ML. AWS leads in breadth of services and global infrastructure maturity. Azure dominates in hybrid Active Directory integration and Microsoft workload licensing. Multi-cloud enables teams to use the optimal service for each workload.
Data Residency and Compliance
Regulations such as GDPR (EU), PDPA (Thailand/Singapore), and financial sector requirements mandate that data remain within specific geographic or political boundaries. Some countries are served by only one major cloud provider's local region — operating on a second cloud can satisfy residency requirements that a single provider cannot.
M&A Integration
Acquisitions frequently result in inheriting a different cloud provider's estate. Rather than forcing costly re-platforming, organizations operate the acquired company on its existing cloud while establishing governance bridges (federated identity, shared observability, cross-cloud networking) as an interim state during integration.
Resilience Against Provider Outages
A major cloud provider outage affecting an entire region or global control plane (as seen in historical incidents) is mitigated when critical workloads are distributed across independent providers with separate failure domains. Active-active multi-cloud deployments can maintain availability during such events.
Multi-Cloud Challenges
Operational Complexity
Each cloud provider has distinct APIs, CLIs, IAM models, networking primitives, and console UIs. Platform teams must maintain expertise and tooling across multiple stacks. Incident response procedures, runbooks, and on-call rotations all multiply in scope.
Skills and Staffing
Deep expertise in AWS and GCP simultaneously is rare. Training, certification, and hiring costs increase. Team silos can form around cloud providers, creating knowledge bottlenecks and inconsistent practices.
Networking
Cross-cloud traffic flows over the public internet unless dedicated interconnects are established. IP address management, BGP routing, DNS resolution, and firewall rules must be coordinated across entirely different network control planes.
Security Consistency
Applying uniform security policies (IAM least privilege, encryption standards, network micro-segmentation, vulnerability scanning) across AWS and GCP requires abstraction layers and consolidated tooling. A gap in one provider's posture creates risk for the whole estate.
Cost Visibility
Each provider has a separate billing system, cost taxonomy, and discount model (Reserved Instances, Committed Use Discounts). Achieving a unified FinOps view requires aggregation tooling (e.g., CloudHealth, Apptio Cloudability, or self-built via billing exports to BigQuery/S3).
Data Egress Costs
Transferring large data sets between cloud providers generates significant egress charges. Architectures must minimize cross-cloud data movement, keeping compute close to data and using replication only where justified by availability or compliance requirements.
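The egress point can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch assuming an illustrative internet-egress rate of $0.09/GB; actual rates vary by provider, region, destination, and committed discounts:

```python
# Rough cross-cloud egress cost estimator. The per-GB rate below is an
# assumption for illustration only; check each provider's current pricing.

ASSUMED_EGRESS_USD_PER_GB = 0.09  # illustrative ballpark for internet egress

def monthly_egress_cost(gb_per_day: float,
                        rate: float = ASSUMED_EGRESS_USD_PER_GB) -> float:
    """Approximate monthly cost of moving gb_per_day between clouds."""
    return gb_per_day * 30 * rate

# Replicating 500 GB/day between clouds at the assumed rate costs on the
# order of $1,350/month -- a strong argument for keeping compute next to
# the data rather than shipping data to the compute.
```

Even at modest volumes the recurring cost dominates quickly, which is why the federation pattern below queries data in place.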
Multi-Cloud Maturity Model
| Stage | Characteristics | Governance | Tooling | Typical Org |
|---|---|---|---|---|
| 1 — Cloud-First | Single primary cloud; all new workloads go to cloud. On-premises being drained. | Basic SCPs / Org Policies per provider. Manual compliance checks. | Cloud-native CLI and console. Early Terraform adoption. | SMB or early-stage enterprise with one strategic cloud vendor |
| 2 — Multi-Cloud Aware | Two or more clouds in production. Workloads segregated by cloud. Minimal cross-cloud integration. | Landing zones defined per cloud. Separate IAM and policy management. | Terraform for all clouds. Basic unified monitoring (Datadog or Grafana). Cloud SSO per provider. | Enterprise post-acquisition, or teams adopting GCP ML alongside AWS production |
| 3 — Multi-Cloud Optimized | Workload placement is policy-driven. Cross-cloud networking and identity are federated. FinOps is unified. | Policy-as-code (OPA/Sentinel). Unified SCP + Org Policy framework. Automated compliance scanning across providers. | Platform engineering team. GitOps (ArgoCD). Anthos or Arc for Kubernetes. HashiCorp Vault for secrets. Unified cost dashboard. | Large enterprise with a dedicated Platform Engineering or CCoE function |
Workload Placement Decision Framework
Deciding which cloud to place a workload on requires evaluating multiple factors systematically. Apply the following criteria as a scoring model or decision gate.
| Criterion | Prefer AWS | Prefer GCP | Either / Neutral |
|---|---|---|---|
| ML / AI workload | SageMaker (managed MLOps, broad model support) | Vertex AI, TPU access, BigQuery ML (tighter data integration) | Custom model on GPU VMs |
| Analytics / Data warehouse | Redshift + Glue + Lake Formation | BigQuery (serverless, petabyte-scale, strong price-performance) | Open-source (Spark on either) |
| Kubernetes at scale | EKS (mature managed control plane, broad add-on ecosystem) | GKE (Autopilot mode for lowest ops overhead) | Either with Anthos/Arc unification |
| Microsoft workloads (AD, SQL Server) | Strong (native AD integration, RDS for SQL Server) | Possible but not optimal | Azure would be preferred over both |
| Existing team expertise | Team holds AWS certifications / significant AWS experience | Team holds GCP certifications / GCP background | Platform-agnostic Terraform / Kubernetes skills |
| Data residency (SEA/APAC) | ap-southeast-1 (Singapore), ap-southeast-2 (Sydney), local zones | asia-southeast1 (Singapore), asia-southeast2 (Jakarta) | Evaluate specific country availability per provider |
| Serverless / Event-driven | Lambda + SQS + EventBridge (mature, wide trigger support) | Cloud Run + Pub/Sub + Eventarc (simpler scaling model) | Both are production-grade |
| Cost at scale | Savings Plans + Spot flexible; better for diverse instance mix | CUDs + Spot VMs; compute pricing often lower for sustained workloads | Negotiate enterprise agreements on both |
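The criteria above can be operationalized as a weighted scoring model. The sketch below is illustrative, not a standard algorithm: the criterion keys, weights, and tie threshold are assumptions to adapt to your own decision framework:

```python
# Hypothetical weighted scoring model for workload placement.
# Weights reflect how strongly each criterion should steer placement;
# all values here are illustrative assumptions.

CRITERIA_WEIGHTS = {
    "ml_ai": 3,
    "analytics": 3,
    "kubernetes": 2,
    "team_expertise": 2,
    "data_residency": 3,
    "cost": 1,
}

def score_placement(ratings):
    """ratings maps criterion -> {"aws": 0-5, "gcp": 0-5} fit scores."""
    totals = {"aws": 0, "gcp": 0}
    for criterion, weight in CRITERIA_WEIGHTS.items():
        r = ratings.get(criterion, {})
        for cloud in totals:
            totals[cloud] += weight * r.get(cloud, 0)
    # Treat near-ties as neutral: either cloud is acceptable.
    if abs(totals["aws"] - totals["gcp"]) <= 2:
        return "either"
    return max(totals, key=totals.get)

# Example: a heavily analytics-driven workload with equal team expertise
# lands on GCP under these weights.
decision = score_placement({
    "analytics": {"aws": 2, "gcp": 5},
    "team_expertise": {"aws": 4, "gcp": 4},
})
```

A scoring model makes placement debates auditable; the weights themselves become a governance artifact that the platform team can version and review.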
Common Multi-Cloud Patterns
Active-Active
Production traffic is served simultaneously from two cloud providers. A global load balancer (e.g., Cloudflare, AWS Route 53 with latency routing, or GCP Cloud Load Balancing) routes users to the nearest healthy endpoint. Both clouds carry live traffic and are continuously synchronized. Requires a distributed data strategy (CRDTs, multi-master databases, or event streaming with Kafka or Pub/Sub).
Use case: Mission-critical SaaS with SLA > 99.99%, global user base.
Active-Passive
Primary cloud handles all production traffic. Secondary cloud maintains a warm standby (replicated data, pre-provisioned infrastructure in a halted or minimal state). On failure, a failover procedure (automated or manual) promotes the secondary cloud. RTO is typically minutes; RPO depends on replication lag.
Use case: Disaster recovery, business continuity for tier-1 applications where active-active cost is not justified.
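The failover trigger in an active-passive setup can be sketched as a simple health-check gate. This is an illustrative sketch (the class name and threshold are hypothetical); in production the equivalent logic usually lives in DNS or load-balancer health checks, or in a dedicated failover orchestrator:

```python
# Minimal failover gate for an active-passive topology (illustrative).
# Promotes the secondary after N consecutive failed health checks on the
# primary, trading faster RTO against the risk of false-positive failover.

class FailoverGate:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold   # consecutive failures before promotion
        self.failures = 0
        self.active = "primary"

    def record_health(self, primary_healthy: bool) -> str:
        """Record one health-check result; return the currently active side."""
        if primary_healthy:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold and self.active == "primary":
                # In a real system this is where the DNS / load-balancer
                # switch and data-promotion runbook would be triggered.
                self.active = "secondary"
        return self.active
```

Note the gate deliberately does not fail back automatically: returning to the primary is usually a controlled, human-approved step because data must be reconciled first.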
Cloud Bursting
Normal capacity runs on primary infrastructure (on-premises or primary cloud). During demand spikes, overflow workloads burst to a secondary cloud. Requires pre-provisioned network connectivity and identical container images or AMIs on the burst target. Kubernetes with cluster federation or KEDA-based autoscaling can orchestrate cross-cloud bursting.
Use case: Batch processing, seasonal e-commerce peaks, financial end-of-day processing.
Data Federation
Data is distributed across clouds or between cloud and on-premises. Query engines (BigQuery Omni, AWS Athena Federated Query, Starburst, Trino) access data in place without movement. A data catalog (Dataplex, AWS Glue Data Catalog) provides unified metadata. This minimizes egress costs while enabling cross-cloud analytics.
Use case: Analytics on data that must remain in a specific cloud due to compliance, but needs to be joined with data from another provider.
Tool Ecosystem
A consistent toolchain across cloud providers reduces cognitive overhead and enforces standard practices.
| Tool | Category | Role in Multi-Cloud |
|---|---|---|
| Terraform | Infrastructure as Code | Single IaC language for AWS, GCP, and on-premises resources. Provider plugins abstract cloud-specific APIs. State management via S3/GCS backend enables shared infrastructure state across teams. |
| Google Anthos / Azure Arc | Kubernetes Control Plane | Extends GKE (Anthos) or Azure AKS (Arc) management to clusters running on AWS, on-premises, or edge. Unified policy, RBAC, and service mesh configuration across all registered clusters. |
| HashiCorp Vault | Secrets Management | Central secrets broker that speaks to both AWS IAM (dynamic credentials via AWS Secrets Engine) and GCP IAM (Service Account keys or Workload Identity). Applications retrieve short-lived credentials without storing secrets in config. |
| Datadog / Grafana | Observability | Unified dashboards ingesting metrics, logs, and traces from AWS CloudWatch, GCP Cloud Monitoring, and Kubernetes. Single pane of glass for cross-cloud SLO tracking and incident correlation. |
| Open Policy Agent (OPA) | Policy as Code | Language-agnostic policy engine. Enforce Terraform plan policies (via Conftest), Kubernetes admission (via Gatekeeper), and API authorization across environments with a single Rego policy library. |
| Crossplane | Cloud Resource Composition | Kubernetes-native control plane that provisions AWS and GCP resources via Custom Resource Definitions (CRDs). Teams provision cloud infrastructure through the same GitOps pipeline used for application deployment. |
| SOPS / External Secrets Operator | Secrets Injection | SOPS encrypts secrets in Git using AWS KMS or GCP KMS. External Secrets Operator syncs secrets from AWS Secrets Manager and GCP Secret Manager into Kubernetes Secrets at runtime. |
Identity Federation: AWS IAM and GCP Workload Identity
Federated identity eliminates long-lived credential sharing between cloud environments. AWS and GCP both support OIDC-based Workload Identity Federation.
Concept: AWS to GCP Keyless Authentication
A workload running on AWS (e.g., an EC2 instance, ECS task, or Lambda) needs to call a GCP API without storing a GCP Service Account key. Note that for AWS sources, GCP's Workload Identity Federation does not consume an OIDC JWT (AWS STS does not issue one); it verifies an AWS-signed identity proof:
- The AWS workload obtains temporary AWS credentials from its attached IAM role (for example, via the EC2 instance metadata endpoint or the ECS task credentials endpoint).
- The workload constructs a Signature V4-signed sts:GetCallerIdentity request and serializes it as the subject token (type urn:ietf:params:aws:token-type:aws4_request).
- The workload exchanges this subject token at GCP's Security Token Service endpoint: https://sts.googleapis.com/v1/token.
- GCP STS verifies the signed request against AWS STS and checks that the caller matches the AWS account and attribute conditions configured in the Workload Identity Pool.
- GCP returns a short-lived federated access token scoped to the bound GCP Service Account.
- The workload uses the federated token (directly, or after impersonating the Service Account) to call GCP APIs. No GCP Service Account key was created or stored.
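The exchange against https://sts.googleapis.com/v1/token follows the OAuth 2.0 token-exchange grant. Below is a minimal sketch of the request body, with illustrative project, pool, and provider values; for AWS sources the subject token is a serialized, Signature V4-signed GetCallerIdentity request. In practice, google-auth's external_account credential type automates this whole flow:

```python
# Sketch of the OAuth 2.0 token-exchange request sent to GCP STS.
# Project number, pool ID, and provider ID below are illustrative.

GCP_STS_ENDPOINT = "https://sts.googleapis.com/v1/token"

def build_gcp_sts_request(project_number, pool_id, provider_id, subject_token):
    # The audience identifies the Workload Identity Pool provider.
    audience = (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )
    return {
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        # For AWS sources: a serialized, SigV4-signed GetCallerIdentity request.
        "subjectTokenType": "urn:ietf:params:aws:token-type:aws4_request",
        "subjectToken": subject_token,
    }

# POST this payload as JSON to GCP_STS_ENDPOINT; the response carries a
# short-lived access token usable for Service Account impersonation.
```

Application code should not hand-roll this exchange; the sketch only makes the wire format of the flow above concrete.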
Terraform: GCP Workload Identity Pool for AWS
```hcl
# Create a Workload Identity Pool
resource "google_iam_workload_identity_pool" "aws_pool" {
  project                   = var.gcp_project_id
  workload_identity_pool_id = "aws-workload-pool"
  display_name              = "AWS Workload Identity Pool"
  description               = "Pool for AWS workloads to authenticate to GCP"
}

# Add AWS as a trusted provider within the pool
resource "google_iam_workload_identity_pool_provider" "aws_provider" {
  project                            = var.gcp_project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.aws_pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "aws-provider"
  display_name                       = "AWS Provider"

  aws {
    account_id = var.aws_account_id # Trust tokens from this AWS account only
  }

  # Attribute mapping: map AWS caller identity to GCP subject
  attribute_mapping = {
    "google.subject"        = "assertion.arn"
    "attribute.aws_account" = "assertion.account"
    "attribute.aws_role"    = "assertion.arn.contains('assumed-role') ? assertion.arn.extract('assumed-role/{role}/') : ''"
  }

  # Condition: only allow tokens from a specific AWS role
  attribute_condition = "attribute.aws_role == 'my-app-role'"
}

# Bind the GCP Service Account to identities from the pool
resource "google_service_account_iam_member" "workload_binding" {
  service_account_id = google_service_account.app_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.aws_pool.name}/attribute.aws_role/my-app-role"
}

# The GCP Service Account used by the AWS workload
resource "google_service_account" "app_sa" {
  project      = var.gcp_project_id
  account_id   = "aws-app-service-account"
  display_name = "Service Account for AWS Application"
}

# Grant the SA only the permissions it needs (least privilege)
resource "google_project_iam_member" "app_sa_storage" {
  project = var.gcp_project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.app_sa.email}"
}
```
GCP to AWS: Using GCP Service Account OIDC Tokens
```hcl
# On the AWS side: create an OIDC identity provider for GCP
resource "aws_iam_openid_connect_provider" "gcp" {
  url             = "https://accounts.google.com"
  client_id_list  = ["sts.amazonaws.com"] # must match the aud claim of the GCP-issued token
  thumbprint_list = ["08745487e891c19e3078c1f2a07e452950ef36f6"]
}

# IAM role trusted by the GCP OIDC provider
resource "aws_iam_role" "gcp_workload_role" {
  name = "gcp-workload-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.gcp.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Unique ID of the GCP Service Account allowed to assume this role
          "accounts.google.com:sub" = "SERVICE_ACCOUNT_UNIQUE_ID"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "gcp_workload_s3" {
  role       = aws_iam_role.gcp_workload_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
```
- Landing Zone Design — Account/project structure, guardrails, Terraform modules for multi-account AWS and GCP
- Hybrid Connectivity — VPN, Direct Connect, Cloud Interconnect, cross-cloud BGP, DNS resolution, AWS Outposts, Anthos