Cloud Overview

Cloud infrastructure — design for scalability, security, and cost efficiency. Build systems that grow with demand, protect data at every layer, and eliminate waste.

What is Cloud Computing?

Cloud computing delivers on-demand IT resources over the internet with pay-as-you-go pricing. Instead of buying and maintaining physical data centers, you access technology services from a cloud provider. The primary service models differ in how much of the stack is managed for you.

| Model | Full Name | You Manage | Provider Manages | Examples |
|-------|-----------|------------|------------------|----------|
| IaaS | Infrastructure as a Service | OS, runtime, data, apps | Servers, storage, networking, virtualization | AWS EC2, GCP Compute Engine, Azure VMs |
| PaaS | Platform as a Service | Data and apps | OS, runtime, middleware, servers, storage | AWS Elastic Beanstalk, GCP App Engine, Heroku |
| SaaS | Software as a Service | Configuration and data input | Everything | Gmail, Salesforce, Slack, Datadog |
| FaaS | Function as a Service | Code logic only | Infrastructure, scaling, OS, runtime | AWS Lambda, GCP Cloud Functions, Azure Functions |
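
The split of responsibilities in the table can be written down as a small lookup. This is only an illustrative sketch: the layer names and the exact boundaries are simplifying assumptions, and real offerings blur them.

```python
# Layers of a typical stack, ordered roughly from hardware up to data.
# The layer names are illustrative, not an official taxonomy.
LAYERS = ["networking", "storage", "servers", "virtualization",
          "os", "runtime", "apps", "data"]

# Layers the customer still manages under each service model (simplified).
CUSTOMER_MANAGED = {
    "IaaS": {"os", "runtime", "apps", "data"},
    "PaaS": {"apps", "data"},
    "FaaS": {"apps"},   # "code logic only"
    "SaaS": set(),      # only configuration and data input remain
}


def provider_managed(model: str) -> list[str]:
    """Return the layers the provider manages for a given service model."""
    customer_layers = CUSTOMER_MANAGED[model]
    return [layer for layer in LAYERS if layer not in customer_layers]
```

Calling `provider_managed("IaaS")` returns the four infrastructure layers, while `provider_managed("SaaS")` returns every layer.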

Multi-Cloud vs Hybrid Cloud

Multi-Cloud Strategy

Multi-cloud uses services from two or more public cloud providers (e.g., AWS + GCP) simultaneously for different workloads or as redundancy.

Pros:
  • Avoid vendor lock-in
  • Use best-of-breed services per cloud
  • Geographic redundancy and compliance
  • Leverage competitive pricing
  • Resilience against cloud provider outages
Cons:
  • Operational complexity increases significantly
  • Skill requirements multiply
  • Data egress costs between clouds
  • Inconsistent tooling and APIs
  • Harder to optimize costs holistically

Hybrid Cloud Strategy

Hybrid cloud connects on-premises infrastructure with one or more public clouds, enabling data and applications to move between environments.

Use Cases:
  • Regulatory compliance (data residency)
  • Legacy systems modernization
  • Burst capacity for peak demand
  • Disaster recovery to cloud
Anti-Patterns:
  • Treating cloud as "just another data center"
  • Lift-and-shift without optimization
  • Ignoring cloud-native services
  • No unified identity management

AWS Well-Architected Framework — 6 Pillars

The AWS Well-Architected Framework provides architectural best practices across six pillars. Each pillar includes design principles, questions, and best practices to help evaluate and improve cloud architectures.

1. Operational Excellence

Focus on running and monitoring systems to deliver business value and continually improve processes. Key practices include performing operations as code (IaC), annotating documentation, making frequent small reversible changes, refining operations procedures frequently, and anticipating failure.

Key services: AWS CloudFormation, Systems Manager, CloudWatch, X-Ray, CodePipeline

2. Security

Protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. Apply security at all layers, enable traceability, implement least-privilege access, secure data in transit and at rest, and automate security best practices.

Key services: IAM, KMS, CloudTrail, Config, GuardDuty, Security Hub, WAF, Shield

3. Reliability

Ensure a workload performs its intended function correctly and consistently when expected. Key practices include recovering from failures automatically, scaling horizontally to increase aggregate workload availability, not guessing capacity, and managing change with automation.

Key services: Route 53, ELB, Auto Scaling, CloudWatch, S3, Glacier

4. Performance Efficiency

Use IT and computing resources efficiently to meet system requirements, and maintain that efficiency as demand changes and technologies evolve. Key practices include democratizing advanced technologies, going global in minutes, using serverless architectures, and experimenting more often.

Key services: EC2 (right-sizing), ElastiCache, CloudFront, Lambda, EKS

5. Cost Optimization

Run systems to deliver business value at the lowest price point. Adopt a consumption model, measure overall efficiency, stop spending money on undifferentiated heavy lifting, and analyze and attribute expenditure.

Key services: Cost Explorer, Budgets, Trusted Advisor, Reserved Instances, Savings Plans, Spot Instances

6. Sustainability

Minimize the environmental impact of running cloud workloads. Understand your impact, establish sustainability goals, maximize utilization, anticipate and adopt new hardware/software, use managed services, and reduce the downstream impact of your cloud workloads.

Key services: Compute Optimizer, Graviton instances, Lambda, Fargate

GCP Architecture Framework — Pillar Mapping

Google Cloud's Architecture Framework aligns closely with the AWS model. The six categories below map roughly one-to-one onto the AWS pillars, adapted to Google's ecosystem.

| Pillar | AWS WA Equivalent | Key GCP Services |
|--------|-------------------|------------------|
| Operational Excellence | Operational Excellence | Cloud Deployment Manager, Cloud Build, Cloud Monitoring, Error Reporting |
| Security, Privacy, Compliance | Security | IAM, Cloud KMS, Security Command Center, Cloud Armor, VPC Service Controls |
| Reliability | Reliability | Cloud Load Balancing, Cloud DNS, Cloud Storage, Spanner (global), GKE Autopilot |
| Performance Optimization | Performance Efficiency | Cloud CDN, Memorystore, Bigtable, BigQuery, Cloud Trace, Profiler |
| Cost Optimization | Cost Optimization | Billing Reports, Recommender, Committed Use Discounts, Spot VMs, Budgets |
| Sustainability | Sustainability | Carbon Footprint dashboard, Tau T2A (Arm), serverless products, region selection |

Cloud Design Patterns

Active-Active Multi-AZ

Traffic is distributed across multiple availability zones simultaneously. All instances serve production traffic. Failure of one AZ does not cause downtime — load balancers automatically route to healthy AZs. Best for high-throughput, low-latency applications.
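
A minimal sketch of the routing behaviour described above, assuming a simple round-robin policy. The class name and zone names are hypothetical, and a real load balancer also weighs capacity, connection counts, and health-check history.

```python
from itertools import cycle


class MultiAZBalancer:
    """Round-robin over zones, skipping any zone marked unhealthy."""

    def __init__(self, zones: list[str]) -> None:
        self.zones = list(zones)
        self.healthy = set(zones)
        self._ring = cycle(self.zones)

    def mark_unhealthy(self, zone: str) -> None:
        self.healthy.discard(zone)

    def mark_healthy(self, zone: str) -> None:
        if zone in self.zones:
            self.healthy.add(zone)

    def route(self) -> str:
        """Return the next healthy zone, or raise if none is available."""
        if not self.healthy:
            raise RuntimeError("no healthy availability zones")
        # At least one zone is healthy, so this loop always returns.
        for zone in self._ring:
            if zone in self.healthy:
                return zone
        raise RuntimeError("unreachable")
```

With three zones, marking one unhealthy leaves traffic alternating between the remaining two and raises no error, which is the "no downtime" property of the pattern.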

Active-Passive Failover

Primary AZ/region handles all traffic; secondary stands by in a warm or hot standby state. On failure, DNS failover or a load balancer health check promotes the passive to active. Simpler and cheaper than active-active but has a failover lag (typically 30–120 seconds for DNS TTL).
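
The failover lag can be estimated with simple arithmetic. The sketch below assumes DNS-based failover where detection takes a fixed number of consecutive failed health checks; the function and parameter names are illustrative. It ignores the time needed to promote the standby and resolvers that cache beyond the TTL.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                dns_ttl_s: int) -> int:
    """Rough upper bound on failover lag for DNS-based failover.

    detection = interval between health checks * consecutive failures needed
    propagation = DNS TTL (clients may keep the old record until it expires)
    """
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s
```

For example, a 10-second check interval, a threshold of 3 failures, and a 60-second TTL give 10 × 3 + 60 = 90 seconds, which falls inside the range quoted above.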

Multi-Region Deployment

Application and data are replicated across geographically separate cloud regions. Serves users from the nearest region for latency reduction. Critical for global compliance, data residency requirements, and ultra-high availability (SLA > 99.99%).
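
To see why adding a region raises availability, here is a back-of-the-envelope calculation. It assumes regions fail independently and that failover between them always works; both assumptions are optimistic, so treat the result as an upper bound. The function names are illustrative.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability when the service is up if at least one region is up."""
    return 1 - (1 - per_region) ** regions


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per 365-day year at the given availability."""
    return (1 - availability) * 365 * 24 * 60
```

Under these assumptions, two regions at 99.9% each combine to roughly 99.9999%, and a 99.99% target allows about 52.6 minutes of downtime per year.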

Blue-Green Deployment

Two identical production environments run in parallel (blue = current, green = new). Deploy the new version to green, validate it, then switch traffic via DNS or a load balancer. This gives zero-downtime deployments, and rollback means switching traffic back to blue (near-instant at a load balancer; a DNS switch takes effect only as cached records expire).
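
The switch-and-rollback flow can be sketched as a tiny state machine. Everything here is hypothetical: the class name, the `validate` callback, and the version strings are placeholders for whatever your deployment tooling provides.

```python
from typing import Callable, Optional


class BlueGreenRouter:
    """Tracks which environment is live and which holds the candidate."""

    def __init__(self, current_version: str) -> None:
        self.live = "blue"
        self.versions: dict[str, Optional[str]] = {
            "blue": current_version,
            "green": None,
        }

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        """Deploy to the idle environment; live traffic is untouched."""
        self.versions[self.idle] = version

    def switch(self, validate: Callable[[str], bool]) -> bool:
        """Send traffic to the idle environment only if validation passes."""
        candidate = self.idle
        if self.versions[candidate] is None or not validate(candidate):
            return False
        self.live = candidate
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment."""
        self.live = self.idle
```

Because the old environment keeps running after the switch, `rollback()` only flips the pointer back; nothing has to be redeployed.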

Cloud Migration Strategies — The 6 Rs

When moving workloads to the cloud, choose the migration strategy based on business goals, time constraints, and technical complexity.

1. Rehost (Lift and Shift)

Move applications to the cloud without changes — same OS, same configuration. Fast and low-risk but misses cloud-native benefits. Use for quick migrations under deadline pressure or for legacy apps that can't be easily modified.

Example: Move an on-premises RHEL MySQL server to an AWS EC2 instance with the same configuration using VM Import/Export.

2. Replatform (Lift, Tinker, and Shift)

Make a few cloud optimizations without changing core architecture. Examples: migrate MySQL to Amazon RDS, move app server to Elastic Beanstalk, use managed container services instead of self-managed containers.

Example: Move a self-managed PostgreSQL server to Cloud SQL (GCP) with automated backups and HA.

3. Repurchase (Drop and Shop)

Replace an existing application with a different product, usually a SaaS solution. Common for CRM, ERP, and email systems where the SaaS equivalent is more cost-effective and feature-rich.

Example: Replace on-premises Exchange with Microsoft 365 or replace Jira Data Center with Jira Cloud.

4. Refactor / Re-architect

Redesign the application using cloud-native capabilities: decompose a monolith into microservices, or adopt serverless, event-driven architecture, or containerization. Highest effort but greatest long-term benefit.

Example: Break a monolithic e-commerce app into services — order service (Lambda), product catalog (DynamoDB + API Gateway), authentication (Cognito).

5. Retire

Decommission applications that are no longer needed or are redundant after evaluating the portfolio. Reduces cost and complexity of managing unnecessary workloads.

Example: During assessment, identify 20% of the application portfolio as rarely used and shut those applications down post-migration.

6. Retain

Keep applications on-premises or in their current state if migration doesn't make business sense yet. Usually for apps with upcoming major version updates or pending compliance approvals, or apps that are too costly to migrate at this time.

Example: Retain a legacy COBOL mainframe application that will be replaced by a SaaS vendor in 18 months.
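
One way to make the choice between the six strategies explicit is a small decision helper. The attributes and their ordering below are assumptions made for illustration; a real assessment weighs cost, risk, and timelines in far more detail.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical assessment result for one application."""
    still_needed: bool = True
    migration_blocked: bool = False   # e.g. pending compliance approval
    saas_alternative: bool = False    # a suitable SaaS product exists
    can_modify: bool = True           # the team is able to change the app
    worth_redesign: bool = False      # long-term benefit justifies a rewrite


def suggest_strategy(app: AppProfile) -> str:
    """Return one of the 6 Rs using a simple, ordered set of rules."""
    if not app.still_needed:
        return "Retire"
    if app.migration_blocked:
        return "Retain"
    if app.saas_alternative:
        return "Repurchase"
    if not app.can_modify:
        return "Rehost"
    if app.worth_redesign:
        return "Refactor"
    return "Replatform"
```

The rules are ordered so that the cheapest outcomes (Retire, Retain) are considered first and the most expensive one (Refactor) is chosen only when the app can be modified and a redesign is judged worthwhile.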

Landing Zone Concept

A Landing Zone is a pre-configured, secure, multi-account cloud environment based on best practices. It provides the foundation upon which all cloud workloads are deployed.

Account / Project Structure

# AWS Organization Structure
Root (Management Account)
├── Security OU
│   ├── Audit Account        # CloudTrail, Config aggregation
│   └── Log Archive Account  # Centralized logging (S3)
├── Infrastructure OU
│   ├── Network Account      # Transit Gateway, Direct Connect
│   └── Shared Services      # Active Directory, DNS, AMI catalog
├── Workloads OU
│   ├── Production OU
│   │   ├── Prod-AppA Account
│   │   └── Prod-AppB Account
│   ├── Staging OU
│   │   └── Staging-AppA Account
│   └── Dev OU
│       └── Dev-Sandbox Account
└── Sandbox OU
    └── Individual developer sandboxes

Network Topology

# Hub-and-Spoke with Transit Gateway (AWS)
Network Account (Hub)
  ├── Transit Gateway
  ├── Shared VPC (10.0.0.0/16)
  │   ├── VPN / Direct Connect endpoints
  │   └── Centralized Egress (NAT Gateway)
  └── DNS VPC (Route 53 Resolver)

Spoke VPCs (per workload account)
  ├── Prod VPC: 10.10.0.0/16 → TGW attachment
  ├── Staging VPC: 10.20.0.0/16 → TGW attachment
  └── Dev VPC: 10.30.0.0/16 → TGW attachment

# GCP Shared VPC equivalent
Host Project (VPC host)
  └── Shared VPC → subnets delegated to service projects
Service Project A (prod workloads) → uses shared subnets
Service Project B (dev workloads) → uses shared subnets
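
Hub-and-spoke routing through a Transit Gateway only works cleanly when the VPC address ranges do not overlap. A quick check with Python's standard `ipaddress` module, using the CIDR blocks from the topology above (the dictionary keys are illustrative labels):

```python
from ipaddress import ip_network
from itertools import combinations

# CIDR blocks taken from the hub-and-spoke layout above.
VPC_CIDRS = {
    "shared": "10.0.0.0/16",
    "prod": "10.10.0.0/16",
    "staging": "10.20.0.0/16",
    "dev": "10.30.0.0/16",
}


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]
```

`find_overlaps(VPC_CIDRS)` returns an empty list for the ranges shown. Running a check like this before adding a new spoke catches conflicts early.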

IAM Hierarchy

# AWS IAM hierarchy via Organizations
- SCPs (Service Control Policies) at OU level: deny guardrails
- Permissions Boundaries: limit max IAM permissions per role
- Role: specific task roles (Developer, ReadOnly, Admin)
- IAM Identity Center (SSO): federated access across accounts

# GCP IAM hierarchy
Organization → Folders → Projects → Resources
- Org Policies: constraints applied top-down
- IAM bindings at each level (inherited downward)
- Service Accounts: workload identity at resource level
- Workload Identity Federation: keyless auth for CI/CD
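
How the AWS layers combine can be modelled in a few lines. This is a deliberately simplified sketch, not the full IAM policy evaluation logic: it treats SCPs purely as deny guardrails (as described above), ignores resource policies, conditions, and session policies, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(action: str, patterns: list[str]) -> bool:
    """True if the action matches any pattern; '*' and '?' are wildcards."""
    # IAM action names are case-insensitive, so compare in lowercase.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action: str,
               role_allows: list[str],
               boundary_allows: list[str],
               scp_denies: list[str]) -> bool:
    """An SCP deny always wins; otherwise role AND boundary must both allow."""
    if matches(action, scp_denies):
        return False
    return matches(action, role_allows) and matches(action, boundary_allows)
```

For instance, a role that allows `ec2:*` still cannot call `ec2:RunInstances` if its permissions boundary only allows `ec2:Describe*`, and no role can perform an action that an SCP denies at the OU level.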

FinOps Integration — Cost from Day 1

Principle: Cost visibility and optimization should be embedded from the architecture design phase, not bolted on after deployment.

Tagging Strategy

# Required tags for every resource (enforce via SCPs / Org Policies)
Environment:    prod | staging | dev | sandbox
Owner:          team-platform | team-backend | team-data
Project:        project-name or cost-center-code
Application:    app-name
ManagedBy:      terraform | manual | cloudformation
CostCenter:     CC-12345

# AWS Tag Policy example (JSON)
{
  "tags": {
    "Environment": {
      "tag_key": { "@@assign": "Environment" },
      "tag_value": {
        "@@assign": ["prod", "staging", "dev", "sandbox"]
      },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db", "s3:bucket"]
      }
    }
  }
}

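# Terraform tagging (locals block)
locals {
  # Tags shared by every resource; mirrors the required tag list above.
  # var.application is assumed to be declared alongside the other variables.
  common_tags = {
    Environment = var.environment
    Owner       = var.team
    Project     = var.project
    Application = var.application
    ManagedBy   = "terraform"
    CostCenter  = var.cost_center
  }
}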
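
Tag rules are easiest to keep honest when they are also checked in CI or in a scheduled audit. The following is a minimal sketch of such a check, using the required tag list and the allowed `Environment` values from above; the function name and the shape of the `tags` dictionary are assumptions.

```python
REQUIRED_TAGS = {
    "Environment", "Owner", "Project",
    "Application", "ManagedBy", "CostCenter",
}
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "sandbox"}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means compliant."""
    missing = sorted(REQUIRED_TAGS - set(tags))
    problems = [f"missing tag: {key}" for key in missing]
    env = tags.get("Environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems
```

A resource tagged with every required key and `Environment = "prod"` yields an empty list, while one that lacks `CostCenter` or uses `Environment = "production"` is reported.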

Key Tooling Ecosystem

| Tool | Category | Purpose |
|------|----------|---------|
| Terraform | IaC | Provision and manage cloud infrastructure declaratively across AWS, GCP, Azure |
| Ansible | Configuration Management | Configure OS, install packages, manage files on cloud VMs and on-premises servers |
| Helm | Package Manager | Deploy and manage Kubernetes applications using versioned charts |
| ArgoCD | GitOps / CD | Continuous delivery for Kubernetes; syncs cluster state with a Git repository |
| Prometheus | Monitoring | Metrics collection and alerting for cloud-native and Kubernetes workloads |
| CloudWatch | Monitoring (AWS) | AWS-native metrics, logs, alarms, dashboards, and tracing (X-Ray integration) |
| Cloud Monitoring | Monitoring (GCP) | GCP-native metrics collection, alerting policies, and uptime checks |
| Grafana | Visualization | Unified dashboards for Prometheus, CloudWatch, and Cloud Monitoring data sources |