Skip to content

13 — Service Discovery and Coordination

Technical Overview

Service discovery is the mechanism by which services in a distributed system find each other. In a static system, you hardcode IP addresses and ports. In a dynamic cloud environment with autoscaling, container orchestration, and rolling deployments, service locations change constantly. Service discovery provides a registry that maps service names to current locations, automatically updated as services start and stop. Coordination services extend this with distributed primitives: distributed locks, leader election, queues, and barriers — building blocks that other distributed algorithms require.


Prerequisites

  • Distributed systems fundamentals (01-distributed-systems-fundamentals.md)
  • Leader election (07-leader-election.md)
  • Paxos and Raft (05-paxos.md, 06-raft.md)
  • Basic networking: DNS, load balancers, TCP
  • Containerization concepts (helpful but not required)

Core Content

The Service Discovery Problem

Static Configuration (the old way):

  service_a.conf:
    backend_host = "192.168.1.100"
    backend_port = 8080

  Problems:
  - What if 192.168.1.100 goes down? Config must be manually updated.
  - What if you scale out to 10 backend instances? Can't hardcode all 10.
  - What if the instance IP changes after a restart?
  - In containers: IP addresses are ephemeral and assigned at runtime.

Dynamic Service Discovery (the new way):

  Service registers: "I am backend-service, I'm at 10.0.1.15:8080, I'm healthy"
  Registry: stores current service locations
  Client: "Where is backend-service?" → Registry → "10.0.1.15:8080"

  When backend-service moves: deregisters old, registers new location
  When backend-service fails: health check fails, removed from registry
  Client: "Where is backend-service?" → Registry → "10.0.1.22:8080" (new)

DNS-Based Service Discovery

DNS is the original service discovery mechanism. SRV records extend DNS for service location:

DNS SRV Record:
  _http._tcp.backend-service.example.com. 300 IN SRV 0 5 8080 10.0.1.15.
  _http._tcp.backend-service.example.com. 300 IN SRV 0 5 8080 10.0.1.22.
  _http._tcp.backend-service.example.com. 300 IN SRV 0 5 8080 10.0.1.30.

  Format: priority weight port target

  Client: DNS lookup → get all SRV records → pick one (or load balance)
  TTL: 300 seconds → DNS cache for 5 minutes

Limitations:
  - DNS caching: up to 5 minutes before clients see changes
  - Limited metadata: no health status, no weights, no datacenter info
  - No watch/push: clients must poll
  - Negative caching: "service not found" cached even after service appears

Kubernetes uses DNS via kube-dns/CoreDNS:
  my-service.my-namespace.svc.cluster.local → ClusterIP
  my-pod.my-namespace.pod.cluster.local    → Pod IP (headless services)

Kubernetes CoreDNS: In Kubernetes, every service gets a DNS name automatically. CoreDNS (the default DNS resolver in Kubernetes 1.12+) reads from the Kubernetes API server and serves DNS records. It uses the API server's watch mechanism (not gossip) to receive service change notifications. CoreDNS is not strongly consistent — there's a brief lag between service creation and DNS availability.

Client-Side vs Server-Side Load Balancing

Server-Side Load Balancing (traditional):

  Client → [Load Balancer (e.g., HAProxy, AWS ALB)] → [Backend 1]
                                                      → [Backend 2]
                                                      → [Backend 3]

  Pros: Simple client; load balancer has full visibility.
  Cons: Load balancer is a bottleneck and single point of failure.
        Additional network hop. LB must understand application protocol.

Client-Side Load Balancing (service mesh style):

  Client [knows about Backend 1, 2, 3 via discovery]
    Client itself picks backend using round-robin, random, or EWMA.
    Client → Backend 1 (directly, no LB in the path)

  Pros: No bottleneck; clients can use smarter algorithms.
        Better for gRPC (connection-level load balancing).
  Cons: Load balancing logic in every client.
        Service registry must be consistent for all clients.

  Examples: Netflix Ribbon, gRPC client-side LB, Envoy as a sidecar.

ZooKeeper: The Coordination Service

Apache ZooKeeper (Hunt, Konar, Junqueira, Reed, 2010) is a coordination service for distributed applications. It provides a hierarchical namespace (like a filesystem) with strong consistency (sequential consistency) for small metadata objects.

ZooKeeper Data Model:

  /
  ├── /zookeeper
  │    └── /quota
  ├── /myapp
  │    ├── /myapp/config
  │    │    ├── /myapp/config/db_host  → "primary.db.example.com"
  │    │    └── /myapp/config/db_port  → "5432"
  │    ├── /myapp/workers
  │    │    ├── /myapp/workers/worker-0001 (ephemeral, sequential)
  │    │    └── /myapp/workers/worker-0002 (ephemeral, sequential)
  │    └── /myapp/leader             (ephemeral)

Znode types:
  Persistent:          Survives client disconnect. Standard data node.
  Ephemeral:           Deleted when creating client's session ends.
                       Used for leader election, service registration.
  Sequential:          ZooKeeper appends a monotonic number to the name.
  Ephemeral+Sequential: Combination. Core of ZooKeeper recipe for locks/election.

ZooKeeper Watches:

A client can set a "watch" on a znode. When the znode changes (data changed, znode created/deleted, children changed), the client receives a notification.

import kazoo.client as kz

client = kz.KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
client.start()

# Watch a configuration value
@client.DataWatch("/myapp/config/db_host")
def watch_db_host(data, stat, event=None):
    if data:
        print(f"DB host changed to: {data.decode()}")
    elif event and event.type == "DELETED":
        print("DB host deleted!")

# Watch children (service instances)
@client.ChildrenWatch("/myapp/workers")
def watch_workers(children):
    print(f"Active workers: {children}")
    # children = ["worker-0001", "worker-0003"] (worker-0002 disappeared)

Watches are one-shot: A watch fires once and must be re-registered. If you miss events between re-registration, you might miss a change. The solution: always re-read the state when setting a watch (watch + read atomically).

ZooKeeper Recipes (distributed primitives built on ZooKeeper):

Distributed Lock:

1. Create ephemeral+sequential node: /locks/lock-0001
2. List /locks and get all children.
3. If I have the lowest number: I have the lock.
4. Else: watch the node just below mine (watch /locks/lock-0000 if I have 0001)
5. When watched node disappears: go back to step 2.

Distributed Queue:

1. Producer: Create sequential znode /queue/task-0001 with data.
2. Consumer: List /queue, take the lowest numbered task.
3. Consumer watches /queue if empty.
4. On watch fire: check for new tasks.

Service Registration with ZooKeeper:

Service starts:
  Creates ephemeral znode: /services/backend/backend-01 with data {ip, port, metadata}
  ZooKeeper session is maintained with heartbeats.

Service discovery client:
  Lists /services/backend/ → gets all current backend instances
  Watches /services/backend/ for children changes
  On change: refresh the backend list

Service dies or crashes:
  ZooKeeper session expires → ephemeral znode is deleted
  Watch fires for all clients watching /services/backend/
  Clients refresh and remove the dead instance

ZooKeeper's consistency model: Sequential consistency. Reads may be slightly stale (from a follower that hasn't received the latest update). For strong consistency reads, use sync() before reading, which flushes the follower to the leader's latest state.

etcd: Kubernetes' Backbone

etcd is a distributed key-value store built on Raft for strong consistency. It is the persistent store for all Kubernetes cluster state.

etcd Architecture:

  ┌─────────────────────────────────────────────────────┐
  │                  etcd Cluster                       │
  │                                                     │
  │  ┌─────────┐    ┌─────────┐    ┌─────────┐         │
  │  │  etcd1  │    │  etcd2  │    │  etcd3  │         │
  │  │(Leader) │←──→│(Follow) │←──→│(Follow) │         │
  │  └────┬────┘    └─────────┘    └─────────┘         │
  │       │                                             │
  │     Raft consensus for all writes                   │
  └───────┬─────────────────────────────────────────────┘
          │
  ┌───────┴─────────────────────────┐
  │    Kubernetes API Server        │
  │    Reads/writes cluster state   │
  │    Scheduler, controllers watch │
  └─────────────────────────────────┘

etcd stores:
  /registry/pods/default/my-pod       → Pod spec
  /registry/services/default/my-svc  → Service spec
  /registry/nodes/my-node             → Node status
  /registry/secrets/default/my-secret → Secret data (encrypted)
  ...

etcd MVCC and Watch:

etcd uses MVCC (Multi-Version Concurrency Control). Every write creates a new version (revision number). Clients can: - Read at a specific revision (consistent snapshot) - Watch for changes from a specific revision (stream-based, not poll-based)

// etcd v3 client in Go
cli, _ := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})

// Put a key
cli.Put(ctx, "/services/backend/instance-01", `{"ip":"10.0.1.15","port":8080}`)

// Get with revision
resp, _ := cli.Get(ctx, "/services/backend/", clientv3.WithPrefix())
revision := resp.Header.Revision  // Current revision

// Watch from this revision (persistent, streaming)
watchChan := cli.Watch(ctx, "/services/backend/", 
    clientv3.WithPrefix(),
    clientv3.WithRev(revision))

for wresp := range watchChan {
    for _, ev := range wresp.Events {
        switch ev.Type {
        case clientv3.EventTypePut:
            fmt.Printf("New/updated: %s → %s\n", ev.Kv.Key, ev.Kv.Value)
        case clientv3.EventTypeDelete:
            fmt.Printf("Deleted: %s\n", ev.Kv.Key)
        }
    }
}

etcd Leases: Similar to ZooKeeper ephemeral nodes. A lease has a TTL; all keys attached to the lease are deleted when the lease expires.

// Service registration with etcd lease
lease, _ := cli.Grant(ctx, 10)  // 10-second TTL

// Register service
cli.Put(ctx, "/services/backend/instance-01", myInfo, 
    clientv3.WithLease(lease.ID))

// Keep alive (refreshes the lease)
keepaliveCh, _ := cli.KeepAlive(ctx, lease.ID)
// If KeepAlive stops (process crashes), lease expires, key is deleted.

etcd transactions: etcd supports multi-key transactions with compare-and-swap semantics:

// Atomic compare-and-swap
txn := cli.Txn(ctx).
    If(clientv3.Compare(clientv3.Value("/election/leader"), "=", "")).
    Then(clientv3.OpPut("/election/leader", myID, clientv3.WithLease(lease.ID))).
    Else(clientv3.OpGet("/election/leader"))

resp, _ := txn.Commit()
if resp.Succeeded {
    fmt.Println("I am the leader!")
} else {
    current := resp.Responses[0].GetResponseRange().Kvs[0].Value
    fmt.Printf("Leader is: %s\n", current)
}

Consul: Service Catalog + Health Checking + KV

HashiCorp Consul combines service discovery, health checking, and a KV store in one system.

Consul Architecture:

  ┌──────────────────────────────────────────────────────┐
  │  Consul Cluster (3 or 5 server nodes)               │
  │  Uses Raft for consensus                            │
  └──────────────────────────────────────────────────────┘
         ↑ API calls from agents

  ┌────────────┐ ┌────────────┐ ┌────────────┐
  │ Consul     │ │ Consul     │ │ Consul     │
  │ Agent      │ │ Agent      │ │ Agent      │
  │(on each VM)│ │(on each VM)│ │(on each VM)│
  └────────────┘ └────────────┘ └────────────┘
         ↑ Gossip (memberlist)          ↑ Health checks (local)

  Three components:
  1. Gossip layer: memberlist for node membership + failure detection
  2. Raft layer: strong consistency for service catalog
  3. Health checking: agents run checks locally, report to Raft cluster

Consul Service Registration:

// service.json
{
  "service": {
    "name": "backend",
    "id": "backend-01",
    "address": "10.0.1.15",
    "port": 8080,
    "tags": ["v2", "production"],
    "check": {
      "http": "http://10.0.1.15:8080/health",
      "interval": "10s",
      "timeout": "5s",
      "deregister_critical_service_after": "30s"
    }
  }
}

Consul DNS: Consul provides a DNS interface. Service A can look up backend.service.consul via DNS and get all healthy backend instances.

Consul DNS:
  backend.service.consul         → All healthy backend instances (A records)
  backend.service.dc1.consul     → Backends in dc1 specifically
  _http._tcp.backend.service.consul → SRV records with port info
  v2.backend.service.consul      → Backends tagged "v2"

Consul KV: Consul includes a key-value store (backed by Raft) for distributed configuration management. Common use: store configuration that services need to start, store feature flags, store service metadata.

Consul Connect (service mesh): Consul can act as a service mesh control plane, automatically injecting Envoy proxies and managing mTLS certificates between services.

Distributed Configuration Management

Beyond service discovery, coordination services are widely used for distributed configuration:

Distributed Configuration Patterns:

1. Externalized Configuration (12-Factor App):
   Configuration stored in environment variables or remote config service.
   Services read config on start; restart to pick up changes.
   Simple but requires restart for config changes.

2. Dynamic Configuration (push):
   Config service pushes changes to all interested services via watch/callback.
   No restart needed.
   Services watch for config changes and reconfigure dynamically.

   ZooKeeper watch: node sees config change, reconfigures itself
   etcd watch: rolling config update propagates via watch stream
   Consul KV watch: long-poll for changes

3. Feature Flags:
   Boolean config values that enable/disable features.
   LaunchDarkly, Split, etcd-based feature flags.
   Updated dynamically; changes visible within milliseconds.
   Critical for canary deployments and A/B testing.

4. Service Mesh Configuration:
   Envoy's xDS (discovery service) protocol streams routing configuration.
   Istio control plane pushes route/load-balancing/TLS config via xDS.
   Dynamic; no restarts needed for routing changes.

Service Mesh for Discovery

A service mesh is an infrastructure layer that handles service-to-service communication. It provides discovery, load balancing, mTLS, retries, and observability.

Service Mesh Architecture (Istio):

  ┌────────────────────────────────────────────────────────────────────┐
  │  Control Plane (Istiod)                                           │
  │  - Service registry (via Kubernetes API)                          │
  │  - Certificate authority (mTLS cert issuance)                     │
  │  - xDS (Envoy Discovery Service) for configuration distribution   │
  └───────────────────────────────────┬───────────────────────────────┘
                                      │ xDS push
  ┌───────────────────────────────────▼───────────────────────────────┐
  │  Data Plane: Envoy Sidecars (injected into each Pod)             │
  │                                                                   │
  │  [Service A] ←→ [Envoy Proxy A] ────TLS──→ [Envoy Proxy B] ←→ [Service B]
  │                        ↑                          ↑               │
  │                 local traffic             routing + LB             │
  │                 interception              + retry + circuit break  │
  └───────────────────────────────────────────────────────────────────┘

  Envoy handles:
  - Service discovery (from control plane xDS stream)
  - Load balancing (round-robin, least-connections, EWMA)
  - mTLS (mutual TLS, cert rotation)
  - Retries and circuit breaking
  - Distributed tracing (headers injected automatically)
  - Traffic management (canary, A/B, shadow traffic)

Linkerd is the other major service mesh. Lighter-weight than Istio; uses Rust-based proxy (not Envoy). Better suited for small-to-medium deployments. Strong focus on simplicity and observability (automatic Prometheus metrics for all services).

Kubernetes Service Discovery

Kubernetes combines DNS, environment variables, and the API server for discovery:

Kubernetes Service Discovery Mechanisms:

1. Environment Variables (legacy):
   Every pod gets env vars for all services in the namespace:
   MYSERVICE_SERVICE_HOST=10.96.0.1
   MYSERVICE_SERVICE_PORT=8080
   Simple, but requires pod restart when new service added.

2. DNS (primary):
   kube-dns/CoreDNS resolves:
   <service-name>.<namespace>.svc.cluster.local → ClusterIP

   ClusterIP: a stable virtual IP for the service.
   kube-proxy (on each node) maintains iptables/IPVS rules to
   forward traffic from ClusterIP → actual pod IPs.

3. Headless Services (statefulsets, direct pod access):
   DNS returns pod IPs directly (no ClusterIP).
   Used for databases where you need to address specific pods.
   my-db-0.my-db.default.svc.cluster.local → Pod 0's IP
   my-db-1.my-db.default.svc.cluster.local → Pod 1's IP

4. Kubernetes API Watch (for custom controllers):
   Controllers watch the API for service/endpoint changes.
   More responsive than DNS (no TTL).
   Used by kube-proxy, Istio, and other infrastructure components.

kube-proxy modes:
  iptables mode: creates iptables rules for each service → each backend pod.
  IPVS mode: uses Linux IPVS (IP Virtual Server) for O(1) rule lookup 
             vs O(N) for iptables (N = number of services).
  eBPF mode (Cilium): programs eBPF maps in kernel; fastest and most flexible.
Kubernetes Service Lifecycle:

  1. Deploy my-app (3 pods created, IPs: 10.0.0.1, 10.0.0.2, 10.0.0.3)
  2. Create Service: name=my-app, selector={app: my-app}, port=8080
  3. API server creates Endpoints object: {10.0.0.1:8080, 10.0.0.2:8080, 10.0.0.3:8080}
  4. CoreDNS watches Endpoints → serves my-app.default.svc.cluster.local → ClusterIP
  5. kube-proxy watches Endpoints → updates iptables/IPVS rules

  Pod 10.0.0.2 crashes:
  6. kubelet detects failure → updates Pod status to Failed
  7. Endpoint controller removes 10.0.0.2 from Endpoints
  8. kube-proxy removes iptables rules for 10.0.0.2 → traffic no longer routed there

  Delay: steps 6→7→8 typically take 5-15 seconds.
  During this window, some requests to ClusterIP may be sent to 10.0.0.2 and fail.
  Solution: readiness probes (mark pod not-ready before it actually dies);
            client retry logic (retry on connection failure).

Historical Context

1983: DNS (Domain Name System) is designed by Paul Mockapetris and Jon Postel (RFC 882, RFC 883). Originally intended as a scalable replacement for the hosts.txt file.

1990s: Service registries and configuration management emerge as enterprise problems. CORBA naming service, JNDI (Java Naming and Directory Interface).

2006: Apache ZooKeeper begins development at Yahoo! for distributed coordination of Hadoop clusters. Open-sourced as an Apache project in 2008.

2010: Hunt, Konar, Junqueira, and Reed publish "ZooKeeper: Wait-free Coordination for Internet-scale Systems" at USENIX ATC, establishing ZooKeeper as the standard coordination service.

2013: etcd is created by CoreOS for configuration management in their fleet cluster scheduler. Becomes the Kubernetes state store in 2014.

2014: HashiCorp releases Consul, combining service discovery, health checking, and KV in one tool.

2015: Kubernetes reaches 1.0, with etcd and kube-dns as the discovery backbone. The concept of "service mesh" begins to emerge.

2018: Istio 1.0 is released, establishing the service mesh pattern as an industry standard. Envoy proxy (created at Lyft by Matt Klein, 2015, open-sourced 2016) becomes the de facto data plane.


Production Examples

Netflix Eureka + Ribbon: Netflix's original service discovery stack. Eureka is a Java-based service registry; Ribbon is a client-side load balancer. Ribbon fetches the service registry from Eureka and load-balances directly. Both are being sunset in favor of service mesh approaches.

Lyft + Envoy: Lyft created Envoy as their service proxy. Their architecture uses Envoy sidecar proxies for service discovery, load balancing, and circuit-breaking. Envoy's xDS API became the standard for service mesh control planes.

Uber + Ringpop: Uber's Ringpop is a gossip-based service discovery library for Node.js. Services use consistent hashing to shard work; Ringpop gossips membership to all nodes. Used for Uber's real-time dispatch system.

Google internal (Stubby/gRPC + Borgmaster): Google's internal systems use gRPC (open-sourced from Stubby) with client-side load balancing. Service locations are resolved via Google's internal name resolution service (backed by Borg, their cluster manager). Externally, they open-sourced gRPC's name resolution API for integration with external service registries.

Kubernetes at Monzo: Monzo (UK challenger bank) uses Kubernetes with Linkerd service mesh. All service-to-service communication goes through Linkerd proxies. Discovery is via Kubernetes Services + Linkerd's traffic management. They have 1500+ microservices, all discovered via the service mesh.


Debugging Notes

Service discovery debugging workflow:

# Kubernetes: debug service discovery
kubectl get endpoints my-service  # Are there healthy endpoints?
kubectl describe service my-service  # Check selector, ports
kubectl get pods -l app=my-app      # Are pods running?

# CoreDNS lookup
kubectl run -it --rm debug --image=alpine --restart=Never -- sh
  nslookup my-service.default.svc.cluster.local
  # Should return ClusterIP

# Consul: debug service registration
consul catalog services  # List registered services
consul health service backend  # Show health status per instance
consul members  # Gossip membership view

# etcd: debug service state
etcdctl get /registry/services/default/my-service
etcdctl get /registry/endpoints/default/my-service

Common issues: 1. Stale endpoints: Pod died but endpoints not updated. Check endpoint controller health; check kubelet's ability to update pod status. 2. DNS negative caching: A service that doesn't exist is DNS-queried, gets NXDOMAIN, is cached for TTL seconds. Even after the service is created, old instances cache the NXDOMAIN. 3. ZooKeeper session expiration: If the ZK session expires (network partition, GC pause), the client must re-register. During re-registration, the service is absent from the registry. 4. Consul deregistration delay: deregister_critical_service_after (default: none in older versions) controls when a failed service is deregistered. Without this, a crashed service remains in the registry indefinitely.


Security Implications

  • Service registry as a target: The service registry is the single source of truth for where services live. If compromised, an attacker can redirect traffic to malicious services. Protect the registry with authentication, authorization, and network segmentation.
  • mTLS as the correct answer: Service mesh mTLS (mutual TLS) provides both encryption and service identity verification. Each service has a certificate issued by the mesh's CA; they verify each other's identity before communicating. This prevents man-in-the-middle attacks via poisoned service registries.
  • etcd encryption: Kubernetes stores secrets, tokens, and certificates in etcd. etcd encryption-at-rest is critical; without it, anyone with etcd access has all cluster secrets.
  • ZooKeeper ACLs: ZooKeeper supports ACLs (Access Control Lists) on znodes. Without ACLs, any client can read or modify any znode, including sensitive configuration or ephemeral locks.
  • DNS spoofing: DNS-based service discovery is vulnerable to DNS spoofing. Use DNSSEC for external resolution; for internal Kubernetes use, trust the cluster's CoreDNS (it reads directly from the API server, not from external DNS).

Performance Implications

Service discovery latency (first request to a new service):

DNS (cached):      <1ms  (in-process or OS DNS cache)
DNS (uncached):    1-5ms (CoreDNS lookup + iptables traversal)
Consul DNS:        1-5ms (Consul DNS agent)
ZooKeeper watch:  <1ms  (watch fires immediately on change)
etcd watch:       <1ms  (watch stream delivers event)
Envoy xDS:        <1ms  (streaming update from Istiod)

Service registry scalability: - ZooKeeper: handles ~100,000 znodes, ~60,000 watches comfortably. Beyond this, watch storms (many clients watching the same popular znode) create problems. - etcd: handles ~8GB total data; ~100K keys. Large numbers of watch connections (>10K) can create memory pressure on the etcd leader. - Consul: scales to ~5000 service instances. Gossip-based membership is O(N log N) messages; KV store (Raft-backed) is limited by Raft write throughput (~10K writes/sec). - Kubernetes: CoreDNS handles millions of DNS queries per second in large clusters with caching.


Failure Modes and Real Incidents

2016, ZooKeeper Watch Storm at LinkedIn: LinkedIn's Kafka cluster (1000+ brokers) had all brokers watching the same ZooKeeper path for controller election. When the ZooKeeper leader changed, all 1000+ watchers were notified simultaneously, each re-registering watches. The surge of re-registration requests overwhelmed ZooKeeper, causing it to be temporarily unavailable. Kafka's controller election failed repeatedly. Fix: implement backoff in ZooKeeper watch re-registration.

2018, Kubernetes Service Discovery Lag: A large Kubernetes deployment had 2000+ pods and a slow endpoint update pipeline. When a deployment rolled out, there was a 15-30 second window where some pods had already started (and were receiving traffic via the new deployment) but their Endpoints hadn't been added to the service yet. Traffic was failing for requests that needed to be routed to the old pods during this window. Fix: tune readiness probe timing and use preStop lifecycle hooks to delay removal from the endpoint list.

2019, Consul Gossip Partition: A network event caused a subset of Consul agents to be unable to communicate with the Consul server cluster. The gossip failure detector (memberlist) marked the server nodes as failed. Consul agents in the affected partition started serving stale service catalog data from their local cache. Services were routing to deregistered instances. Fix: Consul's anti-entropy mechanism re-synced with the server cluster when the partition healed; operator action was needed to flush stale cache entries.

2020, etcd Memory OOM Under Watch Storm: A Kubernetes API server bug caused it to open an unlimited number of watches against etcd (one per informer, per resync). After 24 hours, etcd had 50,000+ open watch streams and ran out of memory. Kubernetes controllers lost connectivity to etcd, causing a control plane outage. Fix: patch the API server to limit watch reconnection rate; upgrade etcd to a version with watch stream memory optimization.


Modern Usage

The service discovery landscape in 2024:

  • Kubernetes + CoreDNS: The default for containerized workloads. DNS-based discovery with automatic service registration.
  • Service mesh (Istio, Linkerd): Adds mTLS, traffic management, and observability on top of Kubernetes discovery.
  • Consul Service Mesh: Multi-datacenter, multi-platform service mesh (works on bare metal, VMs, and Kubernetes).
  • AWS Cloud Map: AWS-native service registry. Integrates with ECS, EKS, EC2. Supports DNS and API-based discovery.
  • Azure Service Fabric: Microsoft's microservice platform with built-in service discovery.
  • Kubernetes Gateway API: Successor to Ingress, providing richer service mesh-like routing configuration via Kubernetes CRDs.

Future Directions

  • eBPF-native service discovery: Cilium (eBPF-based Kubernetes network plugin) implements service discovery at the kernel level. No kube-proxy; no iptables; O(1) service lookup via eBPF maps. The future of Kubernetes networking is eBPF.
  • WebAssembly for service mesh sidecars: Envoy supports Wasm filters. Service mesh logic could be distributed as Wasm modules, reducing the need for sidecar proxy restarts when policy changes.
  • Zero-trust service discovery: Rather than trusting that the service registry is correct, zero-trust architectures validate service identity at every request (via SPIFFE/SPIRE, a standard for workload identity). Discovery and identity become inseparable.
  • Multi-cloud service mesh: Connecting services across multiple cloud providers (AWS, GCP, Azure) with a unified discovery namespace. HashiCorp Consul and Google Anthos provide early examples.

Exercises

  1. ZooKeeper Recipes: Using Kazoo (Python ZooKeeper client), implement: (a) a distributed lock using sequential ephemeral nodes, (b) a service registry where "services" register/deregister, (c) a leader election. Test all three by running 5 concurrent clients and verifying correct behavior (no two locks held simultaneously, etc.).

  2. etcd Watch and Lease: Build a service registry using etcd. Each "service instance" creates a key with a lease. A "service consumer" watches for key changes and maintains a current list of healthy instances. Simulate an instance crashing (stop the KeepAlive). Measure how long until the consumer detects the failure (lease expiry).

  3. Consul Health Checking: Deploy 3 "service instances" (simple HTTP servers) with Consul agents. Register them with HTTP health checks. Kill one instance. Observe how long Consul takes to mark it critical and deregister it. Experiment with different interval, timeout, and deregister_critical_service_after settings.

  4. DNS vs etcd Watch Latency: Build a benchmark that measures: (a) time for a new Kubernetes service to become resolvable via DNS, (b) time for an etcd watch to fire when a key changes. Compare the two. At what TTL does DNS become competitive with etcd watch? When would you choose each?

  5. Service Mesh Traffic Shaping: Deploy two versions of a service (v1 and v2) behind Istio or Linkerd. Configure traffic splitting: 90% to v1, 10% to v2 (canary deployment). Verify the split is correct by sending 1000 requests and counting responses from each version. Then shift to 50/50. Measure the time for the traffic split to stabilize after changing the configuration.


References

  1. Hunt, P., Konar, M., Junqueira, F. P., & Reed, B. (2010). "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX ATC 2010.
  2. Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes." ACM Queue, 14(1).
  3. Klein, M. (2015). "Envoy Proxy: Building a Service Mesh." Lyft Engineering Blog.
  4. Beyer, B., Jones, C., Petoff, J., & Murphy, R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. (Chapter 22: Addressing Cascading Failures.)
  5. Richardson, C. (2018). Microservices Patterns. Manning. (Chapter 3: Interprocess Communication.)
  6. Kelsey Hightower, Joe Beda, & Brendan Burns. (2019). Kubernetes Up and Running (2nd ed.). O'Reilly.
  7. Junqueira, F., & Reed, B. (2013). ZooKeeper: Distributed Process Coordination. O'Reilly.
  8. Burrows, M. (2006). "The Chubby Lock Service for Loosely-Coupled Distributed Systems." OSDI 2006.