Why Service Discovery Matters
The Problem: A monolith binds to a port and lives at one address forever. A microservice fleet has hundreds of instances, each one a short-lived container with an ephemeral IP that the orchestrator assigns at scheduling time. By the time you have written the IP into a config file, the pod is already gone.
The Solution: A discovery layer that answers one question well: “where are the healthy instances of service X right now?” Everything else — load balancing, failover, traffic shifting — is built on top of that answer.
Real Impact: Kubernetes redeploys can churn through every pod in a service inside 60 seconds. Without discovery, the deploy is the outage.
Real-World Analogy
Think of the ways people have found each other's phone numbers:
- No phone book = you have to memorize every number, and when someone moves you have no way to find them again
- Phone book = central directory: name → current number, updated regularly
- Directory assistance (411) = ask an operator who looks it up for you — you never see the book yourself
- Speed dial = local cache of the numbers you call most, refreshed when one stops working
Service discovery is the phone book, the operator, and the speed dial — all rolled into one. Hardcoding an IP address is the equivalent of memorizing every number in a city that re-numbers itself every Tuesday.
Discovery exists because the assumptions of the network changed. In a static datacenter you knew, when you deployed, exactly which box ran which service and on which port. You wrote that into /etc/hosts, an HAProxy config, or a wiki page, and it stayed true for months.
Containers broke every part of that. Pods are scheduled on whatever node has capacity. They get a new IP each time they restart. A horizontal scale-up triples the instance count in 30 seconds; a rolling deploy replaces every one of them inside two minutes. Auto-healing kills a sick instance and the replacement comes up somewhere else entirely. There is no longer a stable place where order-service lives. There is only a stable name — order-service — and a system that knows, right now, which IPs answer to that name.
What “discovery” actually means
The job decomposes into four responsibilities. Different systems wire them up differently, but the responsibilities are universal:
| Responsibility | What It Does | Failure Mode If Missing |
|---|---|---|
| Registration | A new instance announces “I exist at 10.0.4.23:8080” | Traffic never reaches new pods |
| Health checking | The system continuously verifies which instances are alive | Traffic keeps going to dead pods |
| Lookup | A caller asks “give me the healthy IPs of service X” | Callers cannot route at all |
| De-registration | A shutting-down instance gets removed from the pool | 5xx during every deploy |
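The four responsibilities in the table can be sketched as a toy in-memory registry. This is an illustration only, not how any real registry is built: the `Registry` class, its method names, and the 30-second heartbeat TTL are all assumptions, and health checking is reduced to "has this instance sent a heartbeat recently?".

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Registry:
    """Toy registry: register, heartbeat (health), lookup, deregister."""
    heartbeat_ttl: float = 30.0                      # assumed TTL, seconds
    clock: Callable[[], float] = time.monotonic      # injectable for tests
    # service name -> {instance id -> (address, time of last heartbeat)}
    _services: Dict[str, Dict[str, Tuple[str, float]]] = field(default_factory=dict)

    def register(self, service: str, instance_id: str, address: str) -> None:
        self._services.setdefault(service, {})[instance_id] = (address, self.clock())

    def heartbeat(self, service: str, instance_id: str) -> None:
        address, _ = self._services[service][instance_id]
        self._services[service][instance_id] = (address, self.clock())

    def deregister(self, service: str, instance_id: str) -> None:
        self._services.get(service, {}).pop(instance_id, None)

    def lookup(self, service: str) -> List[str]:
        """Return addresses of instances whose heartbeat is still fresh."""
        now = self.clock()
        return sorted(
            address
            for address, seen in self._services.get(service, {}).values()
            if now - seen <= self.heartbeat_ttl
        )
```

An instance that crashes without calling `deregister` simply ages out of `lookup` once its heartbeat is older than the TTL, which is the same safety net that real registries implement with leases or check timeouts.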
Client-Side vs Server-Side Discovery
Why the Distinction Matters
The Problem: Once you have a registry, somebody still has to use it — pick an instance, balance load, handle the failover when the picked instance is dead.
The Solution: Two architectural choices: clients query the registry themselves and load-balance in-process, or clients send to a stable load balancer address and let the LB do the discovery.
Client-side discovery puts intelligence in the caller. The caller pulls the registry list, caches it, picks an instance, and retries on failure. Netflix Eureka was the canonical example: the Java client embedded the registry list and rebuilt it from the Eureka server every 30 seconds.
Server-side discovery hides the registry behind a stable address. The caller talks to one well-known endpoint — an AWS ELB, an NGINX, a Kubernetes Service IP — and the LB does the lookup. The caller knows nothing about pods, instances, or registries.
| Aspect | Client-Side | Server-Side |
|---|---|---|
| Where the logic lives | Inside every client (library / SDK) | Inside the LB / proxy / mesh |
| Network hops | 1 hop (client → instance) | 2 hops (client → LB → instance) |
| Polyglot cost | Need a discovery client per language | Free — everyone speaks HTTP/DNS |
| Failure of the discovery layer | Cached list keeps working | LB outage = service outage |
| Custom routing logic | Trivial (it’s your code) | Limited to what the LB exposes |
| Examples | Netflix Eureka + Ribbon, Finagle | AWS ALB, Kubernetes Services, NGINX, Envoy |
The industry has converged on server-side — specifically the service-mesh flavor of it — for one reason: you don’t want to ship a Eureka client for every language your team writes in. The mesh sidecar gives every pod “client-side” smarts without requiring any language-specific library.
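To make "intelligence in the caller" concrete, here is a minimal sketch of the pick-and-retry half of client-side discovery. It assumes the instance list has already been fetched from the registry; `call_with_failover`, `NoHealthyInstance`, and the `send` callback are hypothetical names, not part of Eureka, Ribbon, or any other library.

```python
import random
from typing import Callable, List, Optional


class NoHealthyInstance(Exception):
    """Raised when every attempted instance refused the request."""


def call_with_failover(
    instances: List[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Try up to max_attempts distinct instances, in random order."""
    rng = rng or random.Random()
    candidates = list(instances)
    rng.shuffle(candidates)          # crude load balancing: random order
    last_error = None
    for address in candidates[:max_attempts]:
        try:
            return send(address)
        except ConnectionError as exc:
            last_error = exc         # instance looks dead; try the next one
    raise NoHealthyInstance(f"all attempts failed: {last_error}")
```

This is exactly the logic a sidecar proxy takes off your hands: the same shuffle, pick, and retry loop, implemented once in the proxy instead of once per language.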
Service Registries
Why a Registry at All
The Problem: Something has to be the source of truth for “which instances exist and which are healthy.” Distribute that knowledge naively and you’ll have stale state somewhere causing 5xx.
The Solution: A registry — usually a strongly-consistent KV store — that everyone reads from and writes to. The hard part isn’t storing the data; it’s keeping it accurate.
The classic service registries:
| Registry | Backed By | Built For | Notes |
|---|---|---|---|
| HashiCorp Consul | Raft consensus | Multi-DC service discovery + KV + ACL | The general-purpose pick. Built-in health checks, DNS interface, mesh. |
| etcd | Raft consensus | Coordination + config | What Kubernetes itself uses internally for the API server’s state. |
| Apache ZooKeeper | ZAB consensus | Coordination, leader election | The grandparent. Still common in Kafka, HBase, older Java stacks. |
| Netflix Eureka | AP-style replication | Cloud-scale, partition-tolerant | Trades consistency for availability — was deliberate at Netflix. |
| Kubernetes API + etcd | etcd | K8s-native discovery via Endpoints | You don’t install it; you get it. |
| AWS Cloud Map | Route 53 + AWS internals | Managed AWS-native discovery | Pairs with ECS Service Discovery. Zero infra to run. |
Self-registration vs third-party registration
Two ways to get an instance into the registry:
- Self-registration: The service registers itself on startup and de-registers on shutdown. Simple, but every service needs the registry client baked in, and a hard crash skips de-registration.
- Third-party registration: An external agent — the K8s controller, an ECS scheduler, a Consul agent — watches the orchestrator and registers/de-registers on the service’s behalf. More moving parts; cleaner failure semantics.
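Modern stacks default to third-party. Kubernetes does it for you: when a pod passes its readiness probe, the kubelet reports the Ready condition to the API server, and the control plane's endpoints controller updates the Service's Endpoints (EndpointSlice) objects automatically. You write zero code for that.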
Consul self-registration via JSON
File: /etc/consul.d/order-service.json (picked up by the local Consul agent on startup; JSON has no comment syntax, so this note stays outside the file).
{
  "service": {
    "id": "order-service-7f3a",
    "name": "order-service",
    "address": "10.0.4.23",
    "port": 8080,
    "tags": ["v2", "prod", "region=us-east-1"],
    "meta": {
      "version": "2.14.0",
      "git_sha": "a91b3cf"
    },
    "checks": [
      {
        "name": "http-health",
        "http": "http://10.0.4.23:8080/health",
        "interval": "10s",
        "timeout": "2s",
        "deregister_critical_service_after": "1m"
      },
      {
        "name": "tcp-port",
        "tcp": "10.0.4.23:8080",
        "interval": "5s"
      }
    ]
  }
}
Two checks because they fail for different reasons. The TCP check only proves that something accepts connections on the port, so it catches “the process is gone or never bound the port.” The HTTP check catches “the port answers but the app is wedged or broken.” deregister_critical_service_after is the safety net: Consul's DNS interface already skips instances whose checks are critical, and once a check has stayed critical for longer than a minute the service is deregistered entirely, so a hard-crashed instance does not linger in the catalog.
DNS-Based Discovery
Why DNS Is Awkward Here
The Problem: DNS is the universal name-resolution protocol — every language, every runtime, every OS speaks it. So why not use it for discovery?
The Solution: Use it — but understand its baggage. DNS caches aggressively at every layer (the OS resolver, the JVM, the LB), and it has no concept of “this instance is unhealthy.” You make it work by setting tiny TTLs, controlling the resolver, and accepting some delay during failover.
Kubernetes’ built-in discovery is DNS-based. Every Service object becomes a DNS record inside the cluster:
# A normal ClusterIP Service: stable virtual IP, hides individual pods.
# DNS: order-service.payments.svc.cluster.local → 10.96.42.1
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: payments
spec:
  selector:
    app: order-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
# A *headless* Service: no virtual IP, DNS returns one A record per pod.
# DNS: order-service-headless.payments.svc.cluster.local →
#   10.0.4.23, 10.0.4.24, 10.0.4.25
apiVersion: v1
kind: Service
metadata:
  name: order-service-headless
  namespace: payments
spec:
  clusterIP: None   # the magic flag
  selector:
    app: order-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
The headless variant is what you use when the client wants to do its own load balancing — gRPC clients, StatefulSet members talking to each other, sharded systems where the caller needs to know exactly which pod owns which key.
The TTL trap
DNS caches lie to you for as long as you let them
If a pod dies and you haven’t been careful, traffic keeps going to its old IP for as long as the DNS TTL says it’s valid. The JVM caches successful lookups forever when a security manager is installed, and for an implementation-specific period otherwise (30 seconds in OpenJDK), unless you set networkaddress.cache.ttl. The OS resolver respects TTL only if the upstream resolver does. If you set TTL=60s, plan for failover to take 60s on a bad day, sometimes longer.
Kubernetes’ CoreDNS uses a 5-second TTL for service records, which is why pod failover feels nearly instant inside a cluster. But the moment you point an external client at service.example.com with a 5-minute TTL on Route 53, that 5-second illusion disappears.
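One way to stay out of the TTL trap in client code is to bound cache age yourself. The sketch below is an assumption-laden illustration: `TtlCache` and `resolve_a_records` are hypothetical names, and the 5-second `max_age` is chosen only to mirror the CoreDNS figure above. Python's `socket.getaddrinfo` does not expose the record's TTL, so the client enforces its own maximum age instead.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def resolve_a_records(hostname: str) -> List[str]:
    """Resolve a name to its IPv4 addresses using the OS resolver."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class TtlCache:
    """Caches lookups, but never serves an answer older than max_age seconds."""

    def __init__(
        self,
        resolver: Callable[[str], List[str]] = resolve_a_records,
        max_age: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._resolver = resolver
        self._max_age = max_age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, hostname: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]                      # still fresh enough
        addresses = self._resolver(hostname)      # expired or never seen
        self._entries[hostname] = (now, addresses)
        return addresses
```

Pointed at a headless Service name, `lookup` would return one address per pod, and a dead pod's IP can survive in the cache for at most `max_age` seconds plus whatever the resolvers upstream of the OS add.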
ECS Service Discovery is the AWS-managed equivalent. You attach a service to an AWS Cloud Map namespace; ECS keeps Route 53 records in sync with the running tasks. Same trade-offs, plus the AWS bill.
SRV records vs A records
Plain A records give you an IP. SRV records give you a target hostname and port plus weight and priority (the target still has to be resolved to an IP), which is useful when ports are dynamic and common in Consul setups (_order-service._tcp.service.consul). Most modern clients still prefer A + a known port; SRV support is uneven across language stdlibs.
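How a client uses priority and weight can be sketched as follows. This is a simplified take on the selection rule in RFC 2782, not a complete implementation: the lowest priority value wins, and weight biases the random pick within that group. The `SrvRecord` type, `pick_srv`, and the hostnames are hypothetical; fetching the records would need a DNS library and is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int   # lower value = preferred
    weight: int     # relative share within the same priority
    port: int
    target: str     # hostname, still to be resolved to an IP


def pick_srv(records: List[SrvRecord], rng: Optional[random.Random] = None) -> SrvRecord:
    rng = rng or random.Random()
    lowest = min(record.priority for record in records)
    group = [record for record in records if record.priority == lowest]
    weights = [record.weight for record in group]
    if sum(weights) == 0:
        return rng.choice(group)                 # all weights zero: uniform pick
    return rng.choices(group, weights=weights, k=1)[0]
```

Records with a higher priority value act as a fallback tier: a fuller client would only consider them after every target in the preferred tier has failed.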
Service Mesh Discovery
Why Sidecars Won
The Problem: Client-side discovery libraries are great until you have eight languages in production. Server-side LBs are great until you need fine-grained routing — canary, mTLS, retries, traffic mirroring — that the LB doesn’t support.
The Solution: Run a proxy as a sidecar next to every pod. The application talks to localhost; the proxy handles discovery, load balancing, retries, and security. Your code knows nothing.
A service mesh splits in two:
- Data plane: the sidecars (Envoy in Istio, linkerd2-proxy in Linkerd). Every byte of service-to-service traffic flows through them. They handle discovery, load balancing, mTLS, retries, circuit breaking.
- Control plane: the brain (istiod, Linkerd’s control plane). It watches the orchestrator, computes routing tables, and pushes them to every sidecar via xDS or its equivalent.
Discovery becomes a non-issue from the application’s point of view. You write http://order-service/orders/123. The sidecar intercepts it, checks its locally cached endpoint table (pushed from the control plane), picks an instance, opens an mTLS connection, and proxies the request. The pod IP changes? The control plane notices, recomputes, pushes — usually within a second. Your code never knew there was a problem.
Linkerd traffic split
# Send 90% of traffic to v1, 10% to v2: discovery + weighted routing
# done by the mesh, completely transparent to clients.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: order-service-rollout
  namespace: payments
spec:
  service: order-service        # the apex service clients call
  backends:
    - service: order-service-v1
      weight: 90
    - service: order-service-v2
      weight: 10
The application keeps calling order-service. The mesh splits the traffic. To migrate to v2 you change the weights; no redeploy of any caller.
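What a 90/10 weight means per request can be illustrated with a small deterministic sketch. It uses smooth weighted round-robin, a technique swapped in here for clarity; it is not a description of how Linkerd's proxy actually distributes traffic, and `weighted_sequence` is a hypothetical helper.

```python
from typing import Dict, List


def weighted_sequence(weights: Dict[str, int], count: int) -> List[str]:
    """Return `count` backend picks that honor the given integer weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(count):
        # Every backend earns credit proportional to its weight...
        for name, weight in weights.items():
            current[name] += weight
        # ...the one with the most credit serves the request and pays it back.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


split = {"order-service-v1": 90, "order-service-v2": 10}
sequence = weighted_sequence(split, 100)
print(sequence.count("order-service-v1"), sequence.count("order-service-v2"))
```

Changing the rollout from 90/10 to 50/50 is a change to the `split` dictionary only, which mirrors the point above: the weights live in mesh configuration, not in any caller.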
What the mesh gives you that DNS can’t
- Per-request load balancing instead of per-connection (HTTP/2 and gRPC multiplex many requests over one long-lived connection, so DNS-based balancing, which only acts when a connection is opened, skews badly).
- Endpoint-level health awareness — if instance B is slow, take it out of rotation; DNS has no way to express that.
- mTLS between every pair of services, automatic.
- Retries, timeouts, circuit breaking as policy — not as application code.
- Observability for free: metrics for every request between every pair of services.
Health Checks
Why “Healthy” Is Not One Question
The Problem: A pod can be running, listening on a port, returning 200 from /health — and still be unable to do useful work because its database connection pool is exhausted. Conversely, a pod can be loading 4 GB of model weights at startup and look unhealthy for two minutes when it’s actually fine.
The Solution: Three different probes that ask three different questions. Confusing them is one of the most common Kubernetes outage patterns.
| Probe | Question | Failure Action |
|---|---|---|
| Startup | “Is the app done initializing?” | Keep waiting; don’t kill yet |
| Liveness | “Is the process wedged?” | Restart the container |
| Readiness | “Should I receive traffic right now?” | Remove from Service endpoints |
# Each probe answers a different question; wire them up correctly.
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
    - name: app
      image: orderservice:2.14.0
      ports: [{ containerPort: 8080 }]
      startupProbe:                 # app needs <= 60s to load caches
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5
        failureThreshold: 12        # 12 * 5s = 60s grace
      livenessProbe:                # cheap, port-level: just “am I alive”
        tcpSocket: { port: 8080 }
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:               # real work: am I able to serve?
        httpGet:
          path: /ready              # checks DB pool, downstream deps, queue depth
          port: 8080
        periodSeconds: 5
        failureThreshold: 2
        successThreshold: 1
Application-level vs port-level checks
A tcpSocket check is cheap and tells you the process is up. An HTTP check on a real path tells you the process is up and the framework is routing requests and the handler is returning. Port-level for liveness, application-level for readiness, is a sane default.
The “healthy but degraded” problem
The most dangerous check is the one that returns 200 because the check itself is trivial. A common bug: /health just returns 200 with no body. Meanwhile the database is gone, the cache is gone, the downstream is timing out — and your pod happily keeps receiving traffic because its health endpoint is “working.”
A real readiness probe checks the things that, if broken, make this pod useless: DB connectivity, downstream availability, queue depth below threshold, free disk above threshold. But beware the inverse trap — a probe that flips to unhealthy when a single downstream slows down can turn a small problem into a cluster-wide failover stampede.
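One way to avoid both traps is to split dependencies into those that make the pod useless and those that merely degrade it. The sketch below is a hypothetical handler body, not a framework API: `readiness` and the check names are assumptions, and each check is any zero-argument callable that returns a truthy value or raises.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def _passes(check: Check) -> bool:
    """A check that raises (timeout, refused connection) counts as failed."""
    try:
        return bool(check())
    except Exception:
        return False


def readiness(critical: Dict[str, Check], optional: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, per-check report). Only critical checks gate traffic."""
    report = {}
    ready = True
    for name, check in critical.items():
        ok = _passes(check)
        report[name] = "ok" if ok else "failed"
        ready = ready and ok
    for name, check in optional.items():
        # Reported for operators, but never flips the pod to unready.
        report[name] = "ok" if _passes(check) else "degraded"
    return (200 if ready else 503), report
```

With this split, a slow recommendations service shows up as `degraded` in the report body while the pod stays in rotation, whereas a lost database connection returns 503 and removes the pod from the Service endpoints.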
Load Balancing Algorithms
Why Algorithm Choice Matters
The Problem: Round-robin is the obvious default and it’s wrong for at least three common cases: HTTP/2 long-lived connections, cache-locality workloads, and multi-AZ deployments where cross-AZ bytes cost real money.
The Solution: Pick the algorithm to match the workload. Most modern proxies let you swap the algorithm with one line of config.
| Algorithm | Picks Instance By | Best For | Watch Out For |
|---|---|---|---|
| Round-robin | Next in the list | Stateless, uniform requests | HTTP/2 + few clients = uneven load |
| Least-connections | Fewest active connections | Variable request duration | Needs accurate connection counts; tricky behind a NAT |
| Least-request | Fewest in-flight requests | HTTP/2 multiplexing | Slightly more state to maintain |
| Weighted | Operator-assigned weights | Heterogeneous instance sizes, canary rollouts | Weights drift if you forget to update them |
| Random / power-of-two | Random pick (or best of two random) | Stateless at large fleet size | Surprisingly good; underused |
| Consistent hashing | Hash of a key (user ID, cache key) | Cache locality, sticky sessions | Hot keys still hit one instance hard |
| Zone/region affinity | Closest healthy instance | Cost reduction, latency | Single-zone failure can overload the next zone |
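The "best of two random" row deserves a concrete picture because it is so short to write. A minimal sketch, assuming the caller tracks in-flight request counts per instance; `pick_p2c` and the `in_flight` map are hypothetical names.

```python
import random
from typing import Dict, Optional


def pick_p2c(in_flight: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Power-of-two-choices: sample two instances, keep the less loaded one."""
    rng = rng or random.Random()
    names = list(in_flight)
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return first if in_flight[first] <= in_flight[second] else second
```

The appeal is that the most loaded instance can never win a comparison, yet no global sort or shared counter is needed: two lookups per request are enough.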
Consistent hashing for cache locality
If your service has a per-user cache, sending the same user to the same instance every time gives you a hot cache instead of three cold ones. Consistent hashing on user_id does that, and survives instance churn without invalidating most of the cache. Envoy and HAProxy both support it natively.
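The "survives instance churn" property is easiest to see in code. Below is a minimal hash ring with virtual nodes; it is a sketch under assumptions (SHA-256 as the hash, 100 virtual nodes per instance, the `HashRing` name) and not the algorithm Envoy or HAProxy actually ship.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 100):
        # Each node owns `vnodes` points so load spreads evenly.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next node point."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Build one ring with three instances and another with two, and only the keys that lived on the removed instance change owner; everything else keeps its warm cache. A modulo-based scheme (`hash(key) % n`) would remap most keys instead.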
Zone affinity for cost
Cross-AZ traffic on AWS is billed at $0.01/GB each way. For a service that does 10 TB/day of internal calls, all of it crossing AZs, that’s 10,000 GB × $0.02 = $200/day in network fees alone. Zone-aware load balancing (prefer a same-AZ instance unless none is healthy) cuts that to a fraction. Istio’s locality load balancing and Linkerd’s topology-aware routing both implement this with a small amount of configuration.
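Both the selection rule and the cost arithmetic fit in a few lines. This is a sketch with hypothetical names (`pick_zone_aware`, `cross_az_cost_per_day`); the default price is the $0.01/GB-each-way figure quoted above and should be treated as an input, not a constant.

```python
import random
from typing import Dict, List, Optional


def pick_zone_aware(
    healthy_by_zone: Dict[str, List[str]],
    local_zone: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Prefer a healthy same-zone instance; fall back to any zone."""
    rng = rng or random.Random()
    local = healthy_by_zone.get(local_zone, [])
    if local:
        return rng.choice(local)
    everywhere = [addr for addrs in healthy_by_zone.values() for addr in addrs]
    if not everywhere:
        raise LookupError("no healthy instances in any zone")
    return rng.choice(everywhere)


def cross_az_cost_per_day(
    gb_per_day: float,
    cross_az_fraction: float,
    usd_per_gb_each_way: float = 0.01,
) -> float:
    """Daily transfer fee; traffic is billed once on each side of the AZ boundary."""
    return gb_per_day * cross_az_fraction * usd_per_gb_each_way * 2
```

With 10,000 GB/day and every byte crossing zones the function gives $200/day; if zone affinity keeps 90% of calls local, the same formula gives about $20/day.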
Eureka client config (Java)
# application.yml: Spring Cloud Netflix Eureka client
spring:
  application:
    name: order-service
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka-1:8761/eureka/,http://eureka-2:8761/eureka/
    registryFetchIntervalSeconds: 30          # pull registry every 30s
    instanceInfoReplicationIntervalSeconds: 30
    fetchRegistry: true
    registerWithEureka: true
  instance:
    preferIpAddress: true
    leaseRenewalIntervalInSeconds: 10         # heartbeat every 10s
    leaseExpirationDurationInSeconds: 30      # expire after 30s of silence
    metadataMap:
      version: "2.14.0"
      zone: us-east-1a
ribbon:
  NFLoadBalancerRuleClassName: com.netflix.loadbalancer.ZoneAvoidanceRule
Failure Modes
Why Discovery Fails Catastrophically
The Problem: The discovery layer becomes a single point of failure for everything that talks to anything. When it’s wrong, the whole mesh is wrong — and the failure modes are subtle.
The Solution: Understand the failure shapes. Cache aggressively. Build for graceful degradation. Treat the registry like the production database it really is.
Split-brain in the registry
Strongly consistent registries (Consul, etcd, ZooKeeper) rely on a majority quorum (Raft for Consul and etcd, ZAB for ZooKeeper). Lose quorum, usually because a network partition splits the cluster, and writes stop. New pods can’t register. Health checks can’t update state. Reads that tolerate staleness may still be served, which is its own problem because you’re routing on stale data.
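The quorum rule is plain majority arithmetic, and writing it out shows when writes stop. The helper names below are hypothetical.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster of `nodes` members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members can be lost while a majority still exists."""
    return nodes - quorum(nodes)


for size in (1, 2, 3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

A 2-node cluster tolerates zero failures, the same as a single node, and a 4-node cluster tolerates no more than a 3-node one. In a partition, only the side that still holds `quorum(n)` members can accept writes.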
Eventually consistent registries (Eureka by design, Cloud Map in some modes) prefer to keep working with stale data over going read-only. That means you’ll occasionally route to a dead instance during a partition — but the failure is local, not global. Netflix chose this trade-off explicitly.
The registry as a single point of failure
Never make the registry your single point of failure
If a Consul outage takes down every service that does a registry lookup at request time, you have built a more fragile system than you started with. Mitigations:
- Cache locally and use stale. Every client should cache the endpoint list and fall back to it when the registry is unreachable. Stale routing is almost always better than no routing.
- Cache at the proxy. Envoy/Linkerd sidecars do this by default — the control plane can be down for minutes while the data plane keeps routing.
- Run the registry redundantly. 3 or 5 nodes for Raft, never 1, never 2. Spread across AZs.
- Don’t put the registry behind itself. If clients need to discover the registry through DNS, your bootstrap had better not depend on services the registry advertises.
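The first mitigation, cache locally and use stale, can be sketched as a small wrapper around whatever call fetches endpoints from the registry. `EndpointCache` and its `fetch` callback are hypothetical; the important design choice is that `refresh` runs on a background timer while request handling only ever reads the cached list.

```python
from typing import Callable, List


class EndpointCache:
    """Serves the last known good endpoint list when the registry is down."""

    def __init__(self, fetch: Callable[[], List[str]]):
        self._fetch = fetch
        self._endpoints: List[str] = []
        self.stale = False            # expose this as a metric or alert

    def refresh(self) -> None:
        """Call periodically from a background loop, never per request."""
        try:
            self._endpoints = list(self._fetch())
            self.stale = False
        except Exception:
            # Registry unreachable: keep routing from the previous list.
            self.stale = True

    def endpoints(self) -> List[str]:
        return list(self._endpoints)
```

A registry outage then shows up as a `stale` flag rather than as failed requests, which leaves time to repair the registry before the cached list drifts too far from reality.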
Stampedes when a popular service flaps
Imagine a popular downstream — auth-service — whose health check briefly flaps from healthy to unhealthy and back. Every caller that was holding a connection drops it. Every caller that needs auth opens a new connection. The replacement instances get hit by the entire fleet’s reconnection load, half of them buckle under the spike, more flapping ensues. This is a stampede.
Defenses: stagger reconnect with jitter, cap the rate of endpoint-table updates pushed to clients, prefer slow-start load balancing (give a newly-Ready instance a couple of seconds to warm up before it gets full load), and don’t mark a downstream as unhealthy on a single failed probe — require N consecutive failures.
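Two of these defenses are small enough to sketch directly: jittered reconnect delays and an N-consecutive-failures rule. The names and defaults here (`reconnect_delay`, `FlapDamper`, a threshold of 3, a 30-second cap) are illustrative assumptions, not values taken from any particular proxy.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class FlapDamper:
    """Marks an instance unhealthy only after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3):
        self._threshold = failure_threshold
        self._consecutive_failures = 0
        self.healthy = True

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._consecutive_failures = 0
            self.healthy = True
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                self.healthy = False
        return self.healthy
```

The jitter spreads a fleet's reconnects across the whole backoff window instead of aligning them on the same instant, and the damper means a single dropped probe never triggers the flap in the first place.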
Real-World Examples
Netflix Eureka. The reference for client-side discovery at cloud scale. Eureka servers cluster in a peer-to-peer ring, replicate registrations eventually, and explicitly choose availability over consistency. Eureka’s self-preservation mode is the famous quirk: if too many heartbeats stop arriving at once, it cannot tell a network partition from mass instance death, so it stops expiring registrations rather than risk emptying the registry. Its companion libraries Ribbon and Hystrix are in maintenance mode now; the patterns (client-side load balancing, circuit breakers in front) live on in Spring Cloud and Resilience4j.
Airbnb’s SmartStack (deprecated). A clever pre-mesh design: Nerve registered services to ZooKeeper, and Synapse on each host watched ZooKeeper and rewrote local HAProxy configs. Apps talked to localhost:port via HAProxy. It was a sidecar pattern before sidecars had a name. Airbnb retired it as Kubernetes and Envoy made it redundant — but it’s the clearest historical example of why the mesh model won.
HashiCorp Consul at scale. Consul scales by federating: each datacenter runs its own Consul cluster, and they gossip. Cross-DC service lookups are first-class. Big shops — Roblox, Cruise, Cloudflare — run Consul as the source of truth for tens of thousands of services across many datacenters and clouds. Consul Connect adds mTLS and an Envoy sidecar, turning Consul into a full mesh.
Kubernetes’ built-in discovery. The default for new systems. The kubelet, the control plane’s endpoints controller, kube-proxy, and CoreDNS together implement registration, health, lookup, and load balancing without any third-party software. Service of type ClusterIP gets a stable virtual IP backed by iptables/IPVS rules; CoreDNS resolves the service name to that VIP; kube-proxy spreads connections across the current pod IPs (random selection in iptables mode, round-robin by default in IPVS mode). It’s not the most flexible system, but for “I just want service A to call service B”, it’s zero-friction.
AWS Cloud Map + ECS Service Discovery. The managed-AWS equivalent. ECS tasks auto-register into a Cloud Map namespace; Cloud Map keeps Route 53 records (or instance attributes for an API-based lookup) in sync. Pairs naturally with App Mesh or Service Connect when you want sidecar-style discovery without running Istio yourself.
Best Practices
The short list
- Use names, never IPs. Anywhere a hostname can go, put a hostname. Reserve IPs for the registry itself and a handful of cluster-bootstrap addresses.
- Cache the registry locally. Every client (or sidecar) should keep a local copy and route from it. Treat the registry as the source of truth, not as a request-time dependency.
- Use stale data over no data. If the registry is unreachable, route from your last known good list. A wrong route is recoverable; no route is an outage.
- Probe the things that matter. A liveness check that returns 200 unconditionally is worse than no check. A readiness check that fails on every minor downstream blip causes stampedes. Tune both.
- Three replicas of the registry, minimum. Across AZs. Use Raft-based consensus for anything strongly consistent. Run chaos tests against losing one of them.
- Pick the load-balancing algorithm to match the workload. Round-robin for stateless and uniform; least-request for HTTP/2; consistent hashing for cache locality; zone-aware for cost.
- Pair discovery with resilience patterns. Discovery tells you where to send the request. Circuit breakers, retries, and timeouts handle what happens when you get there. Both layers are mandatory.
- Default to the mesh. If you’re building greenfield on Kubernetes, use the built-in discovery and add Linkerd or Istio when you need policy. Don’t roll your own client-side library unless you have a very specific reason.
The single most useful sentence about discovery
Service discovery is not a feature you add — it’s a property of your deployment system. The day you stop being able to find services is the day you stop being able to operate services. Build the discovery layer with the same care you’d build the database, because functionally it is one.