Cloud Infrastructure and SRE Engineering Roadmap

Overview

This roadmap guides you from zero to production-ready Site Reliability Engineer across 36 months. The path is deliberately sequential: cloud operations without Linux fluency leads to cargo-culting, and SRE practices without observability fundamentals are empty process. Each phase builds on the last. Monthly milestones are calibrated for someone studying 10–15 hours per week alongside a full-time job; dedicated full-time study can compress each phase by roughly half.


Phase 1 — Foundations (Months 0–3)

Goal: Operate confidently in a Linux terminal, understand how packets move through a network, provision cloud resources, and automate repetitive tasks with bash.

Linux Fundamentals (LPIC-1 Level)

Start with the Linux command line before touching a cloud console. The ability to navigate the filesystem, manage processes, edit files, and read logs is a prerequisite for every subsequent skill. Work through the following areas:

  • Filesystem hierarchy: /proc, /sys, /dev, /etc, /var/log, mount points, inodes, permissions.
  • Process management: ps, top/htop, kill, signals, nice/renice, /proc/PID/.
  • User and group management, sudo, PAM basics, SSH key authentication.
  • Package management on Debian (apt) and Red Hat (dnf/yum) systems.
  • Systemd: units, targets, systemctl, journalctl, writing a simple service file.
  • File permissions, ACLs, chmod, chown, umask.
  • Storage: lsblk, fdisk/parted, LVM basics, df, du, mount.
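
The interaction between umask and default permissions is easy to get wrong, so here is a minimal sketch of the arithmetic. It assumes the usual convention that programs request 0o666 for new files and 0o777 for new directories; the helper name effective_mode is illustrative only.

```python
import stat


def effective_mode(requested: int, umask: int) -> int:
    """Return the permission bits left after the umask bits are cleared."""
    return requested & ~umask & 0o777


# Conventional requests: 0o666 for files, 0o777 for directories.
file_mode = effective_mode(0o666, 0o022)
dir_mode = effective_mode(0o777, 0o022)

# stat.filemode renders the bits the way `ls -l` does.
print(oct(file_mode), stat.filemode(stat.S_IFREG | file_mode))
print(oct(dir_mode), stat.filemode(stat.S_IFDIR | dir_mode))
```

The umask only ever removes bits; it never adds them, which is why a file requested with 0o600 stays private regardless of the umask.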

Target certification: LPIC-1 or CompTIA Linux+ (LPI Linux Essentials is a gentler starting point if you need one). These exams are not required, but studying their syllabi ensures comprehensive coverage.

Monthly milestone (Month 1): Set up a personal Linux VM (Ubuntu LTS on VirtualBox or Multipass). You can SSH in, install packages, write a systemd unit, and diagnose a failed service using journalctl.

Networking Basics

A cloud engineer who cannot read a packet capture is permanently limited. Focus on:

  • TCP/IP model: IP addressing, subnets (CIDR notation), routing tables, default gateways.
  • DNS: recursive vs authoritative, A/AAAA/CNAME/MX/TXT records, TTL, dig and nslookup.
  • HTTP/HTTPS: request/response cycle, status codes, headers, TLS handshake, certificates.
  • Firewalls and security groups: iptables rules, stateful vs stateless filtering.
  • Tools: ping, traceroute, netstat/ss, tcpdump, curl -v.
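
CIDR arithmetic is worth practicing until it is automatic. The sketch below uses Python's standard ipaddress module to split a block into subnets and test membership; the 10.0.0.0/16 range and the helper in_subnet are hypothetical examples, not values taken from any particular environment.

```python
import ipaddress

# Hypothetical address block, split into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

for net in subnets[:4]:
    # num_addresses counts every address in the block, including the
    # network and broadcast addresses.
    print(net, net.num_addresses)


def in_subnet(addr: str, cidr: str) -> bool:
    """Return True if addr falls inside the CIDR block."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(cidr)


print(in_subnet("10.0.1.57", "10.0.1.0/24"))
print(in_subnet("10.0.2.57", "10.0.1.0/24"))
```

Cloud providers typically reserve a few addresses in each subnet, so the usable count is somewhat lower than num_addresses suggests.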

Monthly milestone (Month 2): Capture a TLS handshake with tcpdump and identify the ClientHello, ServerHello, and certificate exchange. Explain what happens at each step.

AWS or GCP Fundamentals

Pick one cloud provider and go deep before touching the other. AWS has the broader job market; GCP has cleaner networking primitives. Either is fine.

  • Core compute: EC2/Compute Engine, instance types, AMIs/images, user data scripts.
  • Storage: S3/Cloud Storage (buckets, object lifecycle), EBS/persistent disks, EFS/Filestore.
  • Networking: VPC, subnets (public/private), internet gateway, NAT gateway, route tables, security groups.
  • IAM: users, roles, policies, least-privilege principle, instance profiles.
  • CLI fluency: aws or gcloud CLI for all operations — avoid clicking in the console once the concept is understood.
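
To make the least-privilege principle concrete, here is a minimal sketch that builds an AWS-style IAM policy document granting read-only access to a single S3 bucket. The function name and the bucket name are hypothetical; the point is that each statement names specific actions and specific resources rather than wildcards.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build a policy that allows listing one bucket and reading its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # Listing applies to the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Reading applies to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = read_only_bucket_policy("example-app-assets")  # hypothetical bucket
print(json.dumps(policy, indent=2))
```

A useful review habit is to search every policy for "*" in the Action list and justify each occurrence.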

Target certification: AWS Cloud Practitioner (CCP) or GCP Cloud Digital Leader. These are entry-level but valuable for establishing vocabulary.

Monthly milestone (Month 3): Deploy a two-tier web application (web server + database) in a VPC with public/private subnet separation, a NAT gateway for private subnet egress, and appropriate security groups. No console clicks — automate with the CLI or CloudFormation/Deployment Manager.

Bash Scripting

Write production-quality scripts from the start:

  • Shebang, set -euo pipefail, quoting variables, $() vs backticks.
  • Control flow: if/elif/else, for/while/until, case.
  • Functions, local variables, return codes.
  • String manipulation: grep, sed, awk, parameter expansion.
  • Error handling: checking exit codes, trap for cleanup.
  • Working with JSON: jq for parsing cloud CLI output.

Book: "The Linux Command Line" by William Shotts (free at linuxcommand.org).


Phase 2 — Core Platform Skills (Months 3–9)

Goal: Operate Kubernetes clusters, write infrastructure as code, build an observability stack, and respond to production incidents.

Kubernetes (CKA Exam Path)

Kubernetes is the de facto deployment platform. Learn it from the inside out:

  • Architecture: API server, etcd, scheduler, controller manager, kubelet, kube-proxy.
  • Core objects: Pod, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob.
  • Networking: Services (ClusterIP, NodePort, LoadBalancer), Endpoints, DNS (CoreDNS), Ingress.
  • Storage: PersistentVolume, PersistentVolumeClaim, StorageClass, CSI.
  • Configuration: ConfigMap, Secret, resource requests/limits.
  • RBAC: ClusterRole, Role, ClusterRoleBinding, ServiceAccount.
  • Cluster operations: kubectl proficiency, node drain/cordon, rolling updates, rollbacks.

Target certification: Certified Kubernetes Administrator (CKA). This is a hands-on exam; practice with killer.sh and the CKA practice environments.

Monthly milestone (Month 5): Deploy a stateful application (PostgreSQL with a PVC) behind a ClusterIP service, with a web front end exposed via an Ingress controller. Perform a rolling update with zero downtime and verify it with kubectl rollout status.
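
Whether a rolling update is truly zero-downtime depends largely on the Deployment's maxSurge and maxUnavailable settings. The sketch below computes the pod-count bounds for percentage values, following the documented rounding (maxSurge rounds up, maxUnavailable rounds down); the function name is illustrative and the code ignores readiness probes, which matter just as much in practice.

```python
def rollout_bounds(replicas: int, max_surge_pct: int = 25,
                   max_unavailable_pct: int = 25) -> dict:
    """Return the pod-count envelope during a rolling update."""
    # Integer arithmetic avoids floating-point rounding surprises.
    surge = -(-replicas * max_surge_pct // 100)            # round up
    unavailable = replicas * max_unavailable_pct // 100     # round down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


# Defaults (25% / 25%) on a 10-replica Deployment.
print(rollout_bounds(10))

# Surge-only rollout: capacity never drops below the desired replica count.
print(rollout_bounds(4, max_surge_pct=25, max_unavailable_pct=0))
```

Setting maxUnavailable to 0 trades a slower rollout and temporary extra capacity for never dipping below the desired replica count.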

Book: "Kubernetes in Action" by Marko Luksa (Manning, 2nd ed.). This is the best in-depth treatment of how Kubernetes internals work.

Terraform Infrastructure as Code

Avoid clicking in cloud consoles; define infrastructure as versioned code:

  • HCL syntax: resources, variables, outputs, locals, data sources.
  • State management: terraform.tfstate, remote backends (S3 + DynamoDB lock, GCS), terraform import.
  • Modules: writing reusable modules, the Terraform Registry.
  • Workspaces vs directory-based environment separation.
  • terraform plan / apply / destroy workflow.
  • Drift detection with terraform plan -refresh-only (the older terraform refresh command is deprecated).
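
Conceptually, drift detection is a diff between the attributes recorded in code or state and the attributes the provider reports right now. The following is a toy sketch of that idea only; it is not how Terraform is implemented, and the attribute names and values are invented for illustration.

```python
_MISSING = object()  # sentinel for "attribute not present on this side"


def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, _MISSING)
        have = actual.get(key, _MISSING)
        if want != have:
            drift[key] = (
                None if want is _MISSING else want,
                None if have is _MISSING else have,
            )
    return drift


# Hypothetical instance attributes: someone resized it in the console
# and added a tag by hand.
desired = {"instance_type": "t3.small", "tags": {"env": "prod"}}
actual = {"instance_type": "t3.large", "tags": {"env": "prod"}, "owner": "alice"}

print(detect_drift(desired, actual))
```

The operational question after detecting drift is which side is correct: either update the code to match reality, or apply the code to undo the manual change.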

Monthly milestone (Month 6): Write a Terraform module that provisions a production-ready VPC with public/private subnets, a Kubernetes cluster (EKS or GKE), and a managed database (RDS or Cloud SQL). Store state in a remote backend. Parameterize environment (dev/staging/prod) via variable files.

Observability Stack (Prometheus + Grafana + Loki)

You cannot operate what you cannot observe:

  • Prometheus: metrics collection, PromQL query language, recording rules, alerting rules, Alertmanager routing.
  • Grafana: dashboards, panels, variables, alert annotations, data source configuration.
  • Loki: log aggregation, LogQL, label strategy, Promtail/agent configuration.
  • Instrumentation: exposing /metrics endpoints in applications, prometheus_client libraries, histogram vs summary vs counter vs gauge.
  • Tracing concepts: distributed tracing with OpenTelemetry, trace context propagation (though full tracing implementation is Phase 3).

Project: Build an observability stack from scratch on a Kubernetes cluster. Deploy Prometheus via the kube-prometheus-stack Helm chart, configure scrape targets for all cluster components, write three custom dashboards (node resources, application RED metrics, database replication lag), and configure PagerDuty or Slack alerts for critical conditions.

Monthly milestone (Month 8): Stack is running; you can answer "what is the p99 latency of the checkout service over the last 6 hours?" and "which pod is generating the most log errors?" without SSH-ing into any machine.
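
Answering a p99 question from a Prometheus histogram means estimating a quantile from cumulative bucket counts, which is what PromQL's histogram_quantile function does. The sketch below reimplements the core linear interpolation in plain Python to show why bucket boundaries limit accuracy; the bucket values are invented, and the real function handles more edge cases than this does.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency histogram in seconds (cumulative counts).
latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, latency))
print(histogram_quantile(0.90, latency))
print(histogram_quantile(0.99, latency))
```

Because the estimate interpolates within a bucket, the result can never be more precise than the bucket layout; choose boundaries that bracket your latency targets.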

Incident Response Basics

Learn the mechanics before you're on call:

  • Incident severity classification (P1–P4 definitions and response SLAs).
  • On-call rotation structure, escalation paths, paging tools (PagerDuty, Opsgenie).
  • Incident command: roles (IC, comms lead, scribe), blameless culture.
  • Runbooks: anatomy of a good runbook, when to use them, keeping them current.
  • Post-mortems: 5-whys analysis, action items with owners and due dates, sharing findings.

Monthly milestone (Month 9): Write three runbooks for services you operate. Run a game-day exercise: intentionally break something (kill a pod, exhaust disk on a node) and practice detecting, diagnosing, and resolving it while writing an incident timeline.


Phase 3 — SRE Principles (Months 9–18)

Goal: Implement SLI/SLO frameworks, apply chaos engineering, and understand distributed systems theory.

SRE Principles (Google SRE Book)

Book: "Site Reliability Engineering" by Beyer, Jones, Petoff, Murphy (O'Reilly, free at sre.google). Read cover to cover; pay special attention to chapters on eliminating toil, SLO-based alerting, and the production readiness review process.

Key concepts to internalize and implement:

  • Toil identification: quantify toil as a percentage of each engineer's working time; keep it below 50% so that at least half of the time goes to engineering work.
  • Error budgets: derive from SLOs; when the budget is spent, reliability work takes priority over feature work.
  • SLI/SLO/SLA hierarchy: SLIs are metrics (request success rate); SLOs are targets (99.9% over 28 days); SLAs are contractual commitments.
  • Release engineering: hermetic builds, progressive rollouts (canarying), automated rollback triggers.
  • Capacity planning: load testing at 120% of peak, N+2 redundancy targets.
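
The error budget follows directly from the SLO target and the window. As a small worked example, using the 99.9% over 28 days figure from the list above (function names are illustrative):

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO permits over the window."""
    return (1 - slo) * window_days * 24 * 60


def allowed_failures(slo: float, total_requests: int) -> float:
    """Failed requests the SLO permits for a request-based SLI."""
    return (1 - slo) * total_requests


print(round(error_budget_minutes(0.999, 28), 2))   # minutes of downtime
print(round(allowed_failures(0.999, 10_000_000)))  # failed requests
```

Each additional nine cuts the budget by a factor of ten, which is why the choice between 99.9% and 99.99% is a product decision and not only an engineering one.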

Monthly milestone (Month 11): Implement SLOs for one service you operate. Define three SLIs (availability, latency, error rate), set SLO targets, create a Grafana dashboard showing current SLO compliance and remaining error budget, and configure alerts to fire when the burn rate exceeds 5x (fast-burn) or 1x (slow-burn).
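
Burn rate is the observed error ratio divided by the error ratio the SLO allows; a burn rate of 1 spends the budget exactly over the SLO window. A minimal sketch of the arithmetic, using the 5x and 1x thresholds from this milestone and an assumed 28-day window:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until a full budget is gone if this burn rate persists."""
    return window_days / rate


# Hypothetical measurement: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo=0.999)
print(round(rate, 2), round(days_to_exhaustion(rate), 2))

# Classify against the fast-burn and slow-burn thresholds from the text.
for label, threshold in (("fast-burn", 5), ("slow-burn", 1)):
    print(label, "firing" if rate > threshold else "ok")
```

In practice each threshold is paired with a lookback window (short for fast-burn, long for slow-burn) so that brief spikes do not page anyone.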

Book: "The Site Reliability Workbook" (companion to SRE book, also free at sre.google) — practical implementation guide for the concepts in the main book.

Chaos Engineering

Chaos engineering is the discipline of injecting controlled failures to verify that systems behave as expected under real-world degraded conditions:

  • Principles: define steady state, hypothesize about behavior, run experiments in production (or staging), minimize blast radius.
  • Tools: Chaos Monkey (Netflix, originally built for AWS), LitmusChaos (Kubernetes), Gremlin (SaaS), tc (network traffic control) for manual experiments.
  • Experiment types: pod/node failure, network partition, latency injection, CPU/memory pressure, dependency failure.

Monthly milestone (Month 14): Run five chaos experiments on a staging cluster. For each: document the hypothesis, inject the fault, observe system behavior via dashboards, compare against steady state, fix any gaps in resilience or observability discovered.
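
The experiment loop described here can be captured in a small harness, which also forces you to write down the hypothesis before injecting anything. The sketch below is illustrative only: the Experiment fields and the simulated three-replica service are assumptions, and in a real experiment the probe would query your metrics and the injection would call a chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    name: str
    hypothesis: str
    probe: Callable[[], float]      # returns the steady-state metric
    inject: Callable[[], None]      # introduces the fault
    rollback: Callable[[], None]    # removes the fault
    min_steady_state: float         # lowest acceptable probe value


def run_experiment(exp: Experiment) -> Dict[str, object]:
    baseline = exp.probe()
    if baseline < exp.min_steady_state:
        # Never inject faults into a system that is already unhealthy.
        return {"name": exp.name, "status": "aborted",
                "baseline": baseline, "during": None}
    exp.inject()
    try:
        during = exp.probe()
    finally:
        exp.rollback()  # always clean up, even if the probe raises
    status = "hypothesis held" if during >= exp.min_steady_state else "gap found"
    return {"name": exp.name, "status": status,
            "baseline": baseline, "during": during}


# Simulated service: requests succeed while at least one replica is healthy.
replicas = {"healthy": 3, "total": 3}


def success_ratio() -> float:
    return 1.0 if replicas["healthy"] >= 1 else 0.0


def kill_one_replica() -> None:
    replicas["healthy"] -= 1


def restore_replicas() -> None:
    replicas["healthy"] = replicas["total"]


result = run_experiment(Experiment(
    name="single-replica-failure",
    hypothesis="Losing one of three replicas does not reduce success ratio",
    probe=success_ratio,
    inject=kill_one_replica,
    rollback=restore_replicas,
    min_steady_state=0.999,
))
print(result)
```

The abort-on-unhealthy-baseline check is the code form of minimizing blast radius: an experiment that starts from a degraded state teaches little and risks a real outage.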

Distributed Systems Fundamentals

Book: "Designing Data-Intensive Applications" (DDIA) by Martin Kleppmann (O'Reilly). This is essential reading — arguably the single most important technical book for anyone operating distributed systems. Read it linearly; it builds systematically from data models through replication, partitioning, transactions, and consistency guarantees.

Key topics to understand deeply:

  • Replication: single-leader, multi-leader, leaderless; replication lag and its consequences.
  • Consensus: why it is hard, Paxos intuition, Raft overview.
  • Consistency models: eventual, read-your-writes, monotonic reads, causal, linearizability, serializability.
  • CAP theorem and PACELC: what the theorems actually say vs. the common misunderstandings.
  • Distributed transactions: two-phase commit, its failure modes, why most systems avoid it.
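
Two pieces of arithmetic underpin much of this material: the quorum overlap condition for leaderless replication (w + r > n) and the majority size used by consensus protocols such as Raft. A minimal sketch, with illustrative function names:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return w + r > n


def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains reachable."""
    return n - majority(n)


print(quorum_overlaps(n=3, w=2, r=2))  # reads overlap the latest write set
print(quorum_overlaps(n=3, w=1, r=1))  # reads may miss the latest write

for size in (3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
```

Note that a 4-node cluster tolerates no more failures than a 3-node one, which is why consensus clusters are usually sized with odd numbers.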

Monthly milestone (Month 18): You can explain at a whiteboard why distributed transactions are hard, what "linearizable" means and when you need it, and how Raft achieves consensus. These questions appear routinely in senior SRE interviews.


Phase 4 — Advanced Production Skills (Months 18–36)

Goal: Operate at scale — advanced Kubernetes, cloud networking, FinOps, multi-region design, production databases.

Advanced Kubernetes Internals

Go beyond user-level operations:

  • API server request lifecycle: admission webhooks, CRDs, the controller pattern, client-go informers.
  • Scheduler: scoring/filtering plugins, pod affinity/anti-affinity, topology spread constraints.
  • CNI deep dive: understand how VXLAN overlay networks work, compare Calico (BGP-based), Cilium (eBPF-based).
  • Security: Pod Security Standards, OPA/Gatekeeper policies, Falco runtime security, supply chain (Sigstore/Cosign, SBOM).
  • Cluster upgrades: control plane upgrade sequence, node drain strategy, etcd backup/restore.

Target certification: Certified Kubernetes Security Specialist (CKS) or Certified Kubernetes Application Developer (CKAD) as a complement to CKA.

Cloud Networking Deep Dive

  • VPC peering vs hub-and-spoke connectivity with Transit Gateway (AWS) / Network Connectivity Center (GCP): when to use each, cost implications.
  • PrivateLink / Private Service Connect: service exposure without public IPs.
  • BGP fundamentals: AS numbers, route advertisement, communities — needed for direct cloud interconnects.
  • Load balancers: L4 (for example AWS Network Load Balancer) vs L7 (for example AWS Application Load Balancer); health checks, connection draining, sticky sessions.
  • Service mesh: Istio or Linkerd — mTLS, traffic splitting, circuit breaking, observability at the mesh layer.

FinOps

Cloud bills grow unbounded without deliberate cost management:

  • Tagging strategy: enforced via IaC, used for cost allocation by team/service/environment.
  • Reserved Instances / Committed Use Discounts vs Spot/Preemptible: breakeven analysis.
  • Right-sizing: identify over-provisioned instances with CloudWatch/Cloud Monitoring metrics.
  • Storage tiering: lifecycle policies to move objects to infrequent-access and archive tiers.
  • Cost anomaly detection: set up AWS Cost Anomaly Detection or GCP budgets with alert thresholds.
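
The breakeven analysis for commitments reduces to one ratio: a commitment is billed for every hour, while on-demand is billed only for hours used. The sketch below uses made-up hourly prices; substitute the real rates for your instance type and region.

```python
def breakeven_utilization(on_demand_hourly: float, committed_hourly: float) -> float:
    """Fraction of hours an instance must run for the commitment to pay off."""
    return committed_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly: float, committed_hourly: float,
                   expected_utilization: float) -> str:
    """Compare effective hourly cost at the expected utilization."""
    on_demand_cost = on_demand_hourly * expected_utilization
    return "commitment" if committed_hourly < on_demand_cost else "on-demand"


# Hypothetical prices in dollars per hour, not real list prices.
ON_DEMAND = 0.10
COMMITTED = 0.062

print(round(breakeven_utilization(ON_DEMAND, COMMITTED), 2))
print(cheaper_option(ON_DEMAND, COMMITTED, expected_utilization=0.90))
print(cheaper_option(ON_DEMAND, COMMITTED, expected_utilization=0.40))
```

Workloads that run below the breakeven fraction, or that tolerate interruption, are usually better candidates for on-demand or Spot/Preemptible capacity.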

Target certification: FinOps Certified Practitioner (FOCP).

Multi-Region Architecture

Target certification: AWS Solutions Architect — Associate (SAA) or GCP Professional Cloud Architect (PCA). These exams test multi-region design patterns directly.

Key patterns:

  • Active-active vs active-passive: RTO/RPO implications, traffic routing with Route 53 / Cloud DNS health checks.
  • Data replication: cross-region S3 replication, RDS read replicas, Aurora Global Database.
  • Failure domain isolation: why multi-AZ is not multi-region, cascading failure risks.

Monthly milestone (Month 28): Design a multi-region deployment for a stateful application. Write an architecture document covering: traffic routing (Route 53 health checks), database failover (a cross-region RDS read replica or Aurora Global Database), stateless compute scaling (an Auto Scaling Group in each region), and the runbook for a regional failover drill.

Production Database Operations

SREs are often the last line of defense before database incidents become customer-visible:

  • PostgreSQL operations: replication slots, WAL archiving, pg_stat_activity, vacuum, bloat, index usage.
  • Connection pooling: PgBouncer, connection storms, max_connections tuning.
  • Backup and restore: pg_dump, WAL-G continuous archiving, PITR (point-in-time recovery) testing.
  • MySQL/Aurora: binlog-based replication, GTID, read replica promotion.
  • Redis: persistence (RDB vs AOF), cluster mode, eviction policies, key expiry monitoring.

Monthly milestone (Month 36): Perform a chaos experiment against a production-like database: kill the primary, observe replication failover, validate that application traffic resumes within your defined RTO. Measure actual RTO vs target and document gaps.
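
Measuring actual RTO is simplest when an external probe records success or failure at a fixed interval during the drill. The sketch below computes the outage duration from such samples; the sample data and the 30-second target are invented, and the measurement can only be as precise as the probe interval.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def measure_outage(samples: List[Tuple[datetime, bool]]) -> Optional[timedelta]:
    """Time from the first failed probe to the first success after it.

    Samples must be sorted by timestamp. Returns None if there was no
    failure or if the service never recovered within the samples.
    """
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            return timestamp - outage_start
    return None


# Hypothetical probe results at 5-second intervals around a primary kill.
start = datetime(2024, 1, 1, 12, 0, 0)
results = [True, True, False, False, False, False, True, True]
samples = [(start + timedelta(seconds=5 * i), ok) for i, ok in enumerate(results)]

actual_rto = measure_outage(samples)
target_rto = timedelta(seconds=30)  # assumed target for illustration
print(actual_rto, "within target" if actual_rto <= target_rto else "exceeds target")
```

Probe from where your users are: a health check inside the cluster can report recovery before DNS or connection pools have actually caught up.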


Book                                    Author(s)           Phase   Priority
The Linux Command Line                  William Shotts      1       Essential
Kubernetes in Action                    Marko Luksa         2       Essential
Site Reliability Engineering            Beyer et al.        3       Essential
The Site Reliability Workbook           Beyer et al.        3       Essential
Designing Data-Intensive Applications   Kleppmann           3       Essential
Cloud Native Patterns                   Cornelia Davis      2       Recommended
Production Kubernetes                   Josh Rosso et al.   4       Recommended
Database Reliability Engineering        Campbell & Majors   4       Recommended

Certification Sequence

Certification                                 Phase   Approximate Study Time
AWS Cloud Practitioner (CCP)                  1       40 hours
Certified Kubernetes Administrator (CKA)      2       80 hours
AWS Solutions Architect - Associate (SAA)     4       80 hours
GCP Professional Cloud Architect (PCA)        4       100 hours
FinOps Certified Practitioner (FOCP)          4       30 hours

Summary of Key Projects

  1. Two-tier VPC deployment (Phase 1): automated, no console clicks, demonstrates networking and IAM basics.
  2. Observability stack from scratch (Phase 2): full Prometheus + Grafana + Loki on Kubernetes with custom dashboards and alerts.
  3. SLO implementation (Phase 3): three SLIs, Grafana dashboard with error budget burn rate, fast/slow burn alerts.
  4. Chaos engineering campaign (Phase 3): five documented experiments with hypothesis, injection, observation, and remediation.
  5. Multi-region deployment design (Phase 4): architecture document, Terraform code, failover runbook, RTO/RPO validation.