Service Discovery

In a world where containers come and go every few minutes, hardcoded IP addresses are a liability. Service discovery is how a fleet of microservices keeps finding each other while the ground keeps shifting underneath them.

Why Service Discovery Matters

The Problem: A monolith binds to a port and lives at one address for as long as it runs. A microservice fleet has hundreds of instances, each a short-lived container with an ephemeral IP that the orchestrator assigns from whatever range is free. By the time you have written the IP into a config file, the pod may already be gone.

The Solution: A discovery layer that answers one question well: “where are the healthy instances of service X right now?” Everything else — load balancing, failover, traffic shifting — is built on top of that answer.

Real Impact: Kubernetes redeploys can churn through every pod in a service inside 60 seconds. Without discovery, the deploy is the outage.

Real-World Analogy

Think of how you found someone's phone number in the era of landlines:

  • No phone book = you have to memorize every number, and when someone moves you have no way to find them again
  • Phone book = central directory: name → current number, updated regularly
  • Directory assistance (411) = ask an operator who looks it up for you — you never see the book yourself
  • Speed dial = local cache of the numbers you call most, refreshed when one stops working

Service discovery is the phone book, the operator, and the speed dial — all rolled into one. Hardcoding an IP address is the equivalent of memorizing every number in a city that re-numbers itself every Tuesday.

Discovery exists because the assumptions of the network changed. In a static datacenter you knew, when you deployed, exactly which box ran which service and on which port. You wrote that into /etc/hosts, an HAProxy config, or a wiki page, and it stayed true for months.

Containers broke every part of that. Pods are scheduled on whatever node has capacity. They get a new IP each time they are recreated. A horizontal scale-up can triple the instance count in 30 seconds; a rolling deploy can replace every one of them inside two minutes. Auto-healing kills a sick instance and the replacement comes up somewhere else entirely. There is no longer a stable place where order-service lives. There is only a stable name, order-service, and a system that knows, right now, which IPs answer to that name.

What “discovery” actually means

The job decomposes into four responsibilities. Different systems wire them up differently, but the responsibilities are universal:

| Responsibility | What It Does | Failure Mode If Missing |
| --- | --- | --- |
| Registration | A new instance announces “I exist at 10.0.4.23:8080” | Traffic never reaches new pods |
| Health checking | The system continuously verifies which instances are alive | Traffic keeps going to dead pods |
| Lookup | A caller asks “give me the healthy IPs of service X” | Callers cannot route at all |
| De-registration | A shutting-down instance gets removed from the pool | 5xx during every deploy |
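The four responsibilities can be sketched as a toy in-memory registry. This is an illustration only: the class and method names are hypothetical, health checking is modeled as push-style heartbeats with a TTL, and a real registry would add replication and persistence.

```python
import time


class Registry:
    """Toy in-memory registry. All names here are hypothetical."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._last_seen = {}   # (service, address) -> time of last heartbeat

    def register(self, service, address):
        # Registration: a new instance announces itself.
        self._last_seen[(service, address)] = self.clock()

    def heartbeat(self, service, address):
        # Health checking (push style): the instance proves it is still alive.
        if (service, address) in self._last_seen:
            self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        # De-registration: a clean shutdown removes the instance immediately.
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        # Lookup: return only instances whose heartbeat is recent enough.
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl_seconds
        )
```

Injecting the clock keeps the sketch testable: advance a fake clock past the TTL and an instance that stopped sending heartbeats drops out of lookup results without ever deregistering.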

Client-Side vs Server-Side Discovery

Why the Distinction Matters

The Problem: Once you have a registry, somebody still has to use it — pick an instance, balance load, handle the failover when the picked instance is dead.

The Solution: Two architectural choices: clients query the registry themselves and load-balance in-process, or clients send to a stable load balancer address and let the LB do the discovery.

Client-side discovery puts intelligence in the caller. The caller pulls the registry list, caches it, picks an instance, and retries on failure. Netflix Eureka was the canonical example: the Java client embedded the registry list and rebuilt it from the Eureka server every 30 seconds.

Server-side discovery hides the registry behind a stable address. The caller talks to one well-known endpoint — an AWS ELB, an NGINX, a Kubernetes Service IP — and the LB does the lookup. The caller knows nothing about pods, instances, or registries.

[Figure: Client-Side vs Server-Side Discovery. Client-side: the client (1) looks up instances A, B, C in the registry (Eureka/Consul), then (2) calls the instance it picks. Server-side: the client (1) calls the load balancer (ELB/NGINX/K8s), and (2) the LB picks among instances A, B, C.]

| Aspect | Client-Side | Server-Side |
| --- | --- | --- |
| Where the logic lives | Inside every client (library / SDK) | Inside the LB / proxy / mesh |
| Network hops | 1 hop (client → instance) | 2 hops (client → LB → instance) |
| Polyglot cost | Need a discovery client per language | Free: everyone speaks HTTP/DNS |
| Failure of the discovery layer | Cached list keeps working | LB outage = service outage |
| Custom routing logic | Trivial (it’s your code) | Limited to what the LB exposes |
| Examples | Netflix Eureka + Ribbon, Finagle | AWS ALB, Kubernetes Services, NGINX, Envoy |

The industry has converged on server-side — specifically the service-mesh flavor of it — for one reason: you don’t want to ship a Eureka client for every language your team writes in. The mesh sidecar gives every pod “client-side” smarts without requiring any language-specific library.
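To make the client-side model concrete, here is a minimal sketch of what a discovery library does inside the caller: cache the instance list, pick an instance, and retry on failure. The names (ClientSideBalancer, fetch_instances, send) are hypothetical and do not correspond to Eureka's or Ribbon's actual API.

```python
import random


class ClientSideBalancer:
    """Sketch of a client-side discovery loop; names are hypothetical."""

    def __init__(self, fetch_instances, rng=None):
        self.fetch_instances = fetch_instances   # callable returning a list of addresses
        self.rng = rng if rng is not None else random.Random()
        self.cached = []

    def refresh(self):
        # A real client runs this on a timer (Eureka's client used 30 seconds).
        self.cached = list(self.fetch_instances())

    def call(self, send, max_attempts=3):
        # Try distinct instances in random order, up to max_attempts of them.
        candidates = self.cached[:]
        self.rng.shuffle(candidates)
        last_error = None
        for address in candidates[:max_attempts]:
            try:
                return send(address)
            except ConnectionError as exc:
                last_error = exc   # dead instance: move on to the next one
        raise RuntimeError("no instance answered") from last_error
```

The request path never touches the registry: call() works entirely from the cached list, which is why a registry outage does not immediately break a client-side setup.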

Service Registries

Why a Registry at All

The Problem: Something has to be the source of truth for “which instances exist and which are healthy.” Distribute that knowledge naively and you’ll have stale state somewhere causing 5xx.

The Solution: A registry — usually a strongly-consistent KV store — that everyone reads from and writes to. The hard part isn’t storing the data; it’s keeping it accurate.

The classic service registries:

| Registry | Backed By | Built For | Notes |
| --- | --- | --- | --- |
| HashiCorp Consul | Raft consensus | Multi-DC service discovery + KV + ACL | The general-purpose pick. Built-in health checks, DNS interface, mesh. |
| etcd | Raft consensus | Coordination + config | What Kubernetes itself uses internally for the API server’s state. |
| Apache ZooKeeper | ZAB consensus | Coordination, leader election | The grandparent. Still common in Kafka, HBase, older Java stacks. |
| Netflix Eureka | AP-style replication | Cloud-scale, partition-tolerant | Trades consistency for availability, a deliberate choice at Netflix. |
| Kubernetes API + etcd | etcd | K8s-native discovery via Endpoints | You don’t install it; you get it. |
| AWS Cloud Map | Route 53 + AWS internals | Managed AWS-native discovery | Pairs with ECS Service Discovery. Zero infra to run. |

Self-registration vs third-party registration

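Two ways to get an instance into the registry:

  • Self-registration = the instance registers itself on startup, sends its own heartbeats, and deregisters on shutdown
  • Third-party registration = a separate registrar (the orchestrator or a host agent) watches instances come and go and updates the registry on their behalf

Modern stacks default to third-party. Kubernetes does it for you: when a pod becomes Ready, the kubelet reports that status to the API server and the control plane updates the Service’s Endpoints object automatically. You write zero code for that.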

Consul self-registration via JSON

Save this as /etc/consul.d/order-service.json; the local Consul agent picks it up on startup.

{
  "service": {
    "id":      "order-service-7f3a",
    "name":    "order-service",
    "address": "10.0.4.23",
    "port":    8080,
    "tags":    ["v2", "prod", "region=us-east-1"],

    "meta": {
      "version": "2.14.0",
      "git_sha": "a91b3cf"
    },

    "checks": [
      {
        "name":     "http-health",
        "http":     "http://10.0.4.23:8080/health",
        "interval": "10s",
        "timeout":  "2s",
        "deregister_critical_service_after": "1m"
      },
      {
        "name":     "tcp-port",
        "tcp":      "10.0.4.23:8080",
        "interval": "5s"
      }
    ]
  }
}

Two checks because they fail for different reasons. The TCP check catches “nothing is listening on the port,” for example because the process has died. The HTTP check catches “the port still accepts connections but the app is wedged or broken.” deregister_critical_service_after is the safety net: a critical instance is already filtered out of health-aware lookups, and once it has been critical for a minute Consul removes it from the catalog entirely, even if the agent never heard about a clean shutdown.

DNS-Based Discovery

Why DNS Is Awkward Here

The Problem: DNS is the universal name-resolution protocol — every language, every runtime, every OS speaks it. So why not use it for discovery?

The Solution: Use it — but understand its baggage. DNS caches aggressively at every layer (the OS resolver, the JVM, the LB), and it has no concept of “this instance is unhealthy.” You make it work by setting tiny TTLs, controlling the resolver, and accepting some delay during failover.

Kubernetes’ built-in discovery is DNS-based. Every Service object becomes a DNS record inside the cluster:

# A normal ClusterIP Service: stable virtual IP, hides individual pods.
# DNS: order-service.payments.svc.cluster.local → 10.96.42.1
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: payments
spec:
  selector:
    app: order-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
  type: ClusterIP
---
# A *headless* Service: no virtual IP, DNS returns one A record per pod.
# DNS: order-service-headless.payments.svc.cluster.local →
#      10.0.4.23, 10.0.4.24, 10.0.4.25
apiVersion: v1
kind: Service
metadata:
  name: order-service-headless
  namespace: payments
spec:
  clusterIP: None              # the magic flag
  selector:
    app: order-service
  ports:
    - name: http
      port: 80
      targetPort: 8080

The headless variant is what you use when the client wants to do its own load balancing — gRPC clients, StatefulSet members talking to each other, sharded systems where the caller needs to know exactly which pod owns which key.

The TTL trap

DNS caches lie to you for as long as you let them

If a pod dies and you haven’t been careful, traffic keeps going to its old IP for as long as the DNS TTL says it’s valid. The JVM is the classic offender: it caches successful lookups forever when a security manager is installed, and its networkaddress.cache.ttl setting is easy to forget otherwise. The OS resolver respects TTL only if the upstream resolver does. If you set TTL=60s, plan for failover to take 60s on a bad day, sometimes longer.

Kubernetes’ CoreDNS uses a 5-second TTL for service records, which is why pod failover feels nearly instant inside a cluster. But the moment you point an external client at service.example.com with a 5-minute TTL on Route 53, that 5-second illusion disappears.
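The stale window is easy to see in a sketch of a TTL-respecting resolver cache. Everything here is hypothetical: `resolve` stands in for the upstream lookup and returns a list of IPs plus a TTL in seconds.

```python
import time


class TtlDnsCache:
    """Sketch of a resolver cache; `resolve` is a hypothetical upstream lookup."""

    def __init__(self, resolve, clock=time.monotonic):
        self.resolve = resolve     # name -> (list_of_ips, ttl_seconds)
        self.clock = clock
        self._entries = {}         # name -> (ips, expires_at)

    def lookup(self, name):
        now = self.clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            # Still within TTL: the upstream is not consulted, so this answer
            # may point at a pod that has already died.
            return cached[0]
        ips, ttl = self.resolve(name)
        self._entries[name] = (ips, now + ttl)
        return ips
```

With a 60-second TTL, a change made upstream 10 seconds after the first lookup stays invisible to this client for the next 50 seconds. Stack a second cache like this in front of the first and the worst case adds up.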

ECS Service Discovery is the AWS-managed equivalent. You attach a service to an AWS Cloud Map namespace; ECS keeps Route 53 records in sync with the running tasks. Same trade-offs, plus the AWS bill.

SRV records vs A records

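Plain A records give you an IP. SRV records give you a target hostname and port plus weight and priority, which is useful when ports are dynamic and common in Consul setups (_order-service._tcp.service.consul). The client then resolves the target to an IP with a second A/AAAA lookup, or reads it from the additional section of the response. Most modern clients still prefer A + a known port; SRV support is uneven across language stdlibs.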
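How a client might use priority and weight can be sketched as follows. This is a simplification of the selection procedure in RFC 2782 (lowest priority first, then a weighted random pick within that group); the SrvRecord tuple and pick_srv function are illustrative names, not a DNS library API.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def pick_srv(records, rng=random):
    """Pick one SRV record: lowest priority wins, then weighted random.

    Simplified from RFC 2782: if every weight in the chosen priority
    group is zero, fall back to a uniform pick.
    """
    if not records:
        raise ValueError("no SRV records")
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]
```

Records with a higher priority number act as backups: they are only considered once every record in the preferred group has been removed from the list.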

Service Mesh Discovery

Why Sidecars Won

The Problem: Client-side discovery libraries are great until you have eight languages in production. Server-side LBs are great until you need fine-grained routing — canary, mTLS, retries, traffic mirroring — that the LB doesn’t support.

The Solution: Run a proxy as a sidecar next to every pod. The application talks to localhost; the proxy handles discovery, load balancing, retries, and security. Your code knows nothing.

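A service mesh splits in two:

  • Data plane: the sidecar proxies that sit next to every pod and carry every request
  • Control plane: the component that watches the registry and pushes endpoint tables and policy to the proxies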

Discovery becomes a non-issue from the application’s point of view. You write http://order-service/orders/123. The sidecar intercepts it, checks its locally cached endpoint table (pushed from the control plane), picks an instance, opens an mTLS connection, and proxies the request. The pod IP changes? The control plane notices, recomputes, pushes — usually within a second. Your code never knew there was a problem.

Linkerd traffic split

# Send 90% of traffic to v1, 10% to v2 — discovery + weighted routing
# done by the mesh, completely transparent to clients.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: order-service-rollout
  namespace: payments
spec:
  service: order-service          # the apex service clients call
  backends:
  - service: order-service-v1
    weight: 90
  - service: order-service-v2
    weight: 10

The application keeps calling order-service. The mesh splits the traffic. To migrate to v2 you change the weights; no redeploy of any caller.

What the mesh gives you that DNS can’t

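  • Per-request load balancing instead of per-connection. HTTP/2 multiplexes many requests over one long-lived connection, so balancing at connection time (which is all DNS can do) pins a client’s traffic to a single backend and skews load badly.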
  • Endpoint-level health awareness — if instance B is slow, take it out of rotation; DNS has no way to express that.
  • mTLS between every pair of services, automatic.
  • Retries, timeouts, circuit breaking as policy — not as application code.
  • Observability for free: metrics for every request between every pair of services.

Health Checks

Why “Healthy” Is Not One Question

The Problem: A pod can be running, listening on a port, returning 200 from /health — and still be unable to do useful work because its database connection pool is exhausted. Conversely, a pod can be loading 4 GB of model weights at startup and look unhealthy for two minutes when it’s actually fine.

The Solution: Three different probes that ask three different questions. Confusing them is one of the most common Kubernetes outage patterns.

| Probe | Question | Failure Action |
| --- | --- | --- |
| Startup | “Is the app done initializing?” | Keep waiting; don’t kill yet |
| Liveness | “Is the process wedged?” | Restart the container |
| Readiness | “Should I receive traffic right now?” | Remove from Service endpoints |

# Each probe answers a different question — wire them up correctly.
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
  - name: app
    image: orderservice:2.14.0
    ports: [{ containerPort: 8080 }]

    startupProbe:               # app needs <= 60s to load caches
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 12      # 12 * 5s = 60s grace

    livenessProbe:              # cheap, port-level — just “am I alive”
      tcpSocket: { port: 8080 }
      periodSeconds: 10
      failureThreshold: 3

    readinessProbe:             # real work — am I able to serve?
      httpGet:
        path: /ready            # checks DB pool, downstream deps, queue depth
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
      successThreshold: 1

Application-level vs port-level checks

A tcpSocket check is cheap and tells you the process is up. An HTTP check on a real path tells you the process is up and the framework is routing requests and the handler is returning. Port-level for liveness, application-level for readiness, is a sane default.

The “healthy but degraded” problem

The most dangerous check is the one that returns 200 because the check itself is trivial. A common bug: /health just returns 200 with no body. Meanwhile the database is gone, the cache is gone, the downstream is timing out — and your pod happily keeps receiving traffic because its health endpoint is “working.”

A real readiness probe checks the things that, if broken, make this pod useless: DB connectivity, downstream availability, queue depth below threshold, free disk above threshold. But beware the inverse trap — a probe that flips to unhealthy when a single downstream slows down can turn a small problem into a cluster-wide failover stampede.
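One way to balance those two traps is to split dependencies into critical and non-critical, and let only the critical ones fail the probe. The sketch below assumes a hypothetical `readiness` helper behind a /ready handler; the dependency names are made up.

```python
def readiness(checks, critical):
    """Return (http_status, details) for a /ready endpoint.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when healthy. Only names listed in `critical` can fail
    the probe; the rest are reported but do not pull the pod from rotation.
    """
    details = {}
    for name, check in checks.items():
        try:
            details[name] = bool(check())
        except Exception:
            details[name] = False   # a check that raises counts as a failure
    # A critical dependency with no registered check is treated as failing.
    ready = all(details.get(name, False) for name in critical)
    return (200 if ready else 503), details
```

Here a slow recommendations service would show up in the response body but leave the status at 200, while a lost database connection flips the pod to 503 and out of the Service endpoints.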

Load Balancing Algorithms

Why Algorithm Choice Matters

The Problem: Round-robin is the obvious default and it’s wrong for at least three common cases: HTTP/2 long-lived connections, cache-locality workloads, and multi-AZ deployments where cross-AZ bytes cost real money.

The Solution: Pick the algorithm to match the workload. Most modern proxies let you swap the algorithm with one line of config.

| Algorithm | Picks Instance By | Best For | Watch Out For |
| --- | --- | --- | --- |
| Round-robin | Next in the list | Stateless, uniform requests | HTTP/2 + few clients = uneven load |
| Least-connections | Fewest active connections | Variable request duration | Needs accurate connection counts; tricky behind a NAT |
| Least-request | Fewest in-flight requests | HTTP/2 multiplexing | Slightly more state to maintain |
| Weighted | Operator-assigned weights | Heterogeneous instance sizes, canary rollouts | Weights drift if you forget to update them |
| Random / power-of-two | Random pick (or best of two random) | Stateless at large fleet size | Surprisingly good; underused |
| Consistent hashing | Hash of a key (user ID, cache key) | Cache locality, sticky sessions | Hot keys still hit one instance hard |
| Zone/region affinity | Closest healthy instance | Cost reduction, latency | Single-zone failure can overload the next zone |
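The power-of-two row is the least familiar, so here is a minimal sketch combining it with least-request: sample two backends at random and keep the one with fewer in-flight requests. The function name and the `in_flight` mapping are assumptions for illustration.

```python
import random


def pick_power_of_two(in_flight, rng=random):
    """Pick a backend: sample two at random, keep the less loaded one.

    `in_flight` maps a backend address to its current in-flight request count.
    """
    backends = list(in_flight)
    if not backends:
        raise ValueError("no backends")
    if len(backends) == 1:
        return backends[0]
    first, second = rng.sample(backends, 2)
    return first if in_flight[first] <= in_flight[second] else second
```

The busiest backend can never win a comparison, so it receives no new requests until its load drops, yet the caller only ever inspects two counters per pick.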

Consistent hashing for cache locality

If your service has a per-user cache, sending the same user to the same instance every time gives you a hot cache instead of three cold ones. Consistent hashing on user_id does that, and survives instance churn without invalidating most of the cache. Envoy and HAProxy both support it natively.
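A minimal sketch of the idea, assuming a hash ring with virtual nodes. This is an illustration, not how Envoy or HAProxy implement it internally; the HashRing class and its parameters are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []   # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Each node appears at many points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing an instance only remaps the keys that instance owned; every other user keeps landing on the same instance, which is what keeps most of the cache warm through churn.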

Zone affinity for cost

Cross-AZ traffic on AWS is billed at $0.01/GB each way, so every gigabyte that crosses a zone boundary costs $0.02. For a service that does 10 TB/day of internal calls, all of it cross-AZ, that’s 10,000 GB × $0.02 = $200/day in network fees alone. Zone-aware load balancing (prefer a same-AZ instance unless none is healthy) cuts that to a fraction. Istio’s locality load balancing and Linkerd’s topology-aware routing both implement this with a small amount of configuration.
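The arithmetic and the routing rule fit in a few lines. The function names and the tuple layout are hypothetical; the price is the $0.01/GB each-way figure quoted above, and the fraction of traffic that crosses zones is an input you would measure, not a constant.

```python
def cross_az_cost_per_day(gb_per_day, cross_az_fraction, usd_per_gb_each_way=0.01):
    """Daily cross-AZ transfer cost: each GB is billed once out and once in."""
    return gb_per_day * cross_az_fraction * usd_per_gb_each_way * 2


def pick_zone_aware(instances, caller_zone):
    """Prefer healthy same-zone instances; fall back to any healthy one.

    `instances` is a list of (address, zone, healthy) tuples.
    """
    healthy = [inst for inst in instances if inst[2]]
    local = [inst for inst in healthy if inst[1] == caller_zone]
    return local or healthy
```

The fallback matters as much as the preference: when the caller's zone has no healthy instance, pick_zone_aware returns the healthy instances from other zones rather than an empty list.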

Eureka client config (Spring Cloud, YAML)

# application.yml: Spring Cloud Netflix Eureka client
spring:
  application:
    name: order-service

eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka-1:8761/eureka/,http://eureka-2:8761/eureka/
    registryFetchIntervalSeconds: 30     # pull registry every 30s
    instanceInfoReplicationIntervalSeconds: 30
    fetchRegistry: true
    registerWithEureka: true

  instance:
    preferIpAddress: true
    leaseRenewalIntervalInSeconds: 10      # heartbeat every 10s
    leaseExpirationDurationInSeconds: 30   # expire after 30s of silence
    metadataMap:
      version: "2.14.0"
      zone: us-east-1a

ribbon:
  NFLoadBalancerRuleClassName: com.netflix.loadbalancer.ZoneAvoidanceRule

Failure Modes

Why Discovery Fails Catastrophically

The Problem: The discovery layer becomes a single point of failure for everything that talks to anything. When it’s wrong, the whole mesh is wrong — and the failure modes are subtle.

The Solution: Understand the failure shapes. Cache aggressively. Build for graceful degradation. Treat the registry like the production database it really is.

Split-brain in the registry

Strongly consistent registries (Consul, etcd, ZooKeeper) rely on a Raft or ZAB quorum. The quorum is what prevents true split-brain: when a network partition splits the cluster, any side without a majority cannot commit writes. Lose quorum everywhere and writes stop. New pods can’t register. Health checks can’t update state. Reads that tolerate staleness still work, which is its own problem because you’re routing on stale data.

Eventually consistent registries (Eureka by design, Cloud Map in some modes) prefer to keep working with stale data over going read-only. That means you’ll occasionally route to a dead instance during a partition — but the failure is local, not global. Netflix chose this trade-off explicitly.

The registry as a single point of failure

Never make the registry your single point of failure

If a Consul outage takes down every service that does a registry lookup at request time, you have built a more fragile system than you started with. Mitigations:

  • Cache locally and use stale. Every client should cache the endpoint list and fall back to it when the registry is unreachable. Stale routing is almost always better than no routing.
  • Cache at the proxy. Envoy/Linkerd sidecars do this by default — the control plane can be down for minutes while the data plane keeps routing.
  • Run the registry redundantly. 3 or 5 nodes for Raft, never 1, never 2. Spread across AZs.
  • Don’t put the registry behind itself. If clients need to discover the registry through DNS, your bootstrap had better not depend on services the registry advertises.
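The first mitigation, cache locally and use stale, can be sketched as a thin wrapper around the registry client. The class name and the `fetch` callable are hypothetical, and a production client would refresh on a background timer rather than on every call.

```python
class StaleOkEndpointCache:
    """Serve the last known good endpoint list when the registry is down."""

    def __init__(self, fetch):
        self.fetch = fetch        # hypothetical callable: service -> list of addresses
        self._last_good = {}      # service -> last non-empty answer

    def endpoints(self, service):
        """Return (addresses, is_stale)."""
        try:
            fresh = list(self.fetch(service))
        except Exception:
            fresh = None          # registry unreachable or erroring
        if fresh:
            self._last_good[service] = fresh
            return fresh, False
        # Failed or empty answer: prefer stale data over no data.
        return self._last_good.get(service, []), True
```

Treating an empty answer like a failure is a deliberate assumption here: a registry that suddenly reports zero instances of a busy service is more likely broken than right.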

Stampedes when a popular service flaps

Imagine a popular downstream — auth-service — whose health check briefly flaps from healthy to unhealthy and back. Every caller that was holding a connection drops it. Every caller that needs auth opens a new connection. The replacement instances get hit by the entire fleet’s reconnection load, half of them buckle under the spike, more flapping ensues. This is a stampede.

Defenses: stagger reconnect with jitter, cap the rate of endpoint-table updates pushed to clients, prefer slow-start load balancing (give a newly-Ready instance a couple of seconds to warm up before it gets full load), and don’t mark a downstream as unhealthy on a single failed probe — require N consecutive failures.
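Two of those defenses are small enough to sketch directly: jittered reconnect delays and a consecutive-failure threshold. The names and default values below are illustrative assumptions, not settings from any particular proxy.

```python
import random


def backoff_with_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class FlapDamper:
    """Mark an endpoint unhealthy only after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.healthy = True

    def record(self, probe_ok):
        if probe_ok:
            self.failures = 0      # any success resets the streak
            self.healthy = True
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.healthy = False
        return self.healthy
```

Randomizing the delay spreads a fleet's reconnects over the whole backoff window instead of lining them up on the same tick, and the damper turns a single dropped probe into a non-event.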

Real-World Examples

Netflix Eureka. The reference for client-side discovery at cloud scale. Eureka servers run as peers, replicate registrations to each other eventually, and explicitly choose availability over consistency. Eureka’s self-preservation mode is the famous quirk: if too many heartbeats stop arriving at once, it cannot tell a network partition from mass instance death, so it stops expiring registrations rather than risk emptying the registry. The Netflix client stack around it (Ribbon for client-side load balancing, Hystrix for circuit breaking) is in maintenance mode now; the patterns live on in Spring Cloud and Resilience4j.

Airbnb’s SmartStack (deprecated). A clever pre-mesh design: Nerve registered services to ZooKeeper, and Synapse on each host watched ZooKeeper and rewrote local HAProxy configs. Apps talked to localhost:port via HAProxy. It was a sidecar pattern before sidecars had a name. Airbnb retired it as Kubernetes and Envoy made it redundant — but it’s the clearest historical example of why the mesh model won.

HashiCorp Consul at scale. Consul scales by federating: each datacenter runs its own Consul cluster, and they gossip. Cross-DC service lookups are first-class. Big shops — Roblox, Cruise, Cloudflare — run Consul as the source of truth for tens of thousands of services across many datacenters and clouds. Consul Connect adds mTLS and an Envoy sidecar, turning Consul into a full mesh.

Kubernetes’ built-in discovery. The default for new systems. The kubelet, the control plane’s endpoints controller, kube-proxy, and CoreDNS together implement registration, health, lookup, and load balancing without any third-party software. A Service of type ClusterIP gets a stable virtual IP backed by iptables/IPVS rules; CoreDNS resolves the service name to that VIP; kube-proxy spreads new connections across the current pod IPs (random selection in iptables mode, round-robin by default in IPVS mode). It’s not the most flexible system, but for “I just want service A to call service B”, it’s zero-friction.

AWS Cloud Map + ECS Service Discovery. The managed-AWS equivalent. ECS tasks auto-register into a Cloud Map namespace; Cloud Map keeps Route 53 records (or instance attributes for an API-based lookup) in sync. Pairs naturally with App Mesh or Service Connect when you want sidecar-style discovery without running Istio yourself.

Best Practices

The short list

  • Use names, never IPs. Anywhere a hostname can go, put a hostname. Reserve IPs for the registry itself and a handful of cluster-bootstrap addresses.
  • Cache the registry locally. Every client (or sidecar) should keep a local copy and route from it. Treat the registry as the source of truth, not as a request-time dependency.
  • Use stale data over no data. If the registry is unreachable, route from your last known good list. A wrong route is recoverable; no route is an outage.
  • Probe the things that matter. A liveness check that returns 200 unconditionally is worse than no check. A readiness check that fails on every minor downstream blip causes stampedes. Tune both.
  • Three replicas of the registry, minimum. Across AZs. Use Raft-based consensus for anything strongly consistent. Run chaos tests against losing one of them.
  • Pick the load-balancing algorithm to match the workload. Round-robin for stateless and uniform; least-request for HTTP/2; consistent hashing for cache locality; zone-aware for cost.
  • Pair discovery with resilience patterns. Discovery tells you where to send the request. Circuit breakers, retries, and timeouts handle what happens when you get there. Both layers are mandatory.
  • Default to the mesh. If you’re building greenfield on Kubernetes, use the built-in discovery and add Linkerd or Istio when you need policy. Don’t roll your own client-side library unless you have a very specific reason.

The single most useful sentence about discovery

Service discovery is not a feature you add — it’s a property of your deployment system. The day you stop being able to find services is the day you stop being able to operate services. Build the discovery layer with the same care you’d build the database, because functionally it is one.