Performance & Scalability

Latency budgets exist whether you set them or not. The teams that ship fast services are the ones that decide where every millisecond goes — before users do it for them.

Hard~30 min read

Why Performance & Scalability Matter

Why Performance & Scalability Matter

The Problem: A monolith’s performance is a single curve you can profile. A microservices request fans out across five, ten, sometimes thirty internal calls — and the user only sees the slowest path. Adding capacity at the wrong service does nothing; adding capacity at the right one is the difference between a working checkout and a 30-second spinner.

The Solution: Treat performance as a budget you allocate, scalability as a property you design for, and both as numbers you measure continuously. Set latency SLOs, give every service a slice, cache aggressively, push slow work off the request path, and autoscale on the metric that actually drives queueing.

Real Impact: Amazon famously found that every 100 ms of latency costs ~1% in sales. Google measured that an extra 500 ms of search latency dropped traffic 20%. Stripe’s API SLO is single-digit milliseconds at P99 because every basis point of latency is money paid into the wrong pocket.

Real-World Analogy

Think of a highway at rush hour:

  • Vertical scaling = raising the speed limit. A single car goes faster, but the road still jams at the same number of vehicles.
  • Horizontal scaling = adding more lanes. The same speed limit, but many more cars move through per minute.
  • Latency = how long your trip takes from on-ramp to off-ramp.
  • Throughput = how many cars cross the bridge per minute.
  • Caching = letting most cars skip the bridge by using a side road they’ve already mapped.

Performance work is choosing which lever applies, in which order, and at what cost. Adding lanes is expensive; raising the speed limit has hard physical limits; teaching everyone the side road is cheap once and then pays forever.

Failures of performance are not always crashes. More often they are brownouts: requests still complete, but slowly enough that timeouts fire upstream, retries pile up, queues fill, and the user sees a half-loaded page. Brownouts are harder to diagnose than outages because everything is technically “up.” A throughput SLO (“500 RPS sustained”) and a latency SLO (“P99 < 250 ms”) together define the contract that prevents brownouts from becoming the steady state.

Two numbers that matter

MetricWhat It MeasuresWhy It Matters
LatencyHow long a single request takesWhat the user feels
ThroughputHow many requests per second the system absorbsWhat capacity planning drives
ConcurrencyRequests in flight at any instantLittle’s Law: concurrency = throughput × latency
SaturationHow full the bottleneck resource isThe leading indicator of latency cliff

Little’s Law is the only equation in this tutorial you have to memorize. If a service handles 1,000 RPS at 200 ms average latency, it has 200 requests in flight on average. Push latency to 2 seconds and that becomes 2,000 in flight — ten times the threads, ten times the memory, often ten times the database connections. Latency and throughput are not independent. They are two sides of the same equation.

Latency Budgets

Why Decompose the Budget

The Problem: You can’t hit a P99 of 250 ms at the edge if any internal service takes 400 ms. The slowest hop in the critical path is a hard ceiling.

The Solution: Pick an end-to-end target, decompose it across every hop on the critical path, give each service a sub-budget, and track each service against its own slice. The budget is the contract.

A latency budget starts with the number the user feels. For an interactive web request that’s typically 250–500 ms at P95. For a mobile API call that backs a UI animation it might be 50–100 ms. For a payment authorization with a hardware terminal in the loop it might be 2 s at P99. The number is non-negotiable; it falls out of human perception research and product requirements.

Once you have the end-to-end number, you decompose it. Suppose checkout has a 300 ms P95 budget and the request fans out as: edge → auth → cart → pricing → inventory → payment → database.

HopBudget (P95)Notes
Edge / TLS / routing20 msCDN-terminated TLS, regional gateway
Auth verification15 msJWT verify, token cache
Cart service40 msOne DB read
Pricing service50 msCache hit path; cache miss path is much slower
Inventory check40 msParallel with pricing where possible
Payment authorization120 msExternal call to payment processor
Reserve / serialize / network15 msSlack for variance

Notice three things. First, payment is the dominant slice — an external dependency you don’t control. Second, inventory and pricing can be parallel; a sequential implementation would blow the budget. Third, there is deliberate slack at the bottom. A budget with no slack is a budget you will miss every time someone runs a GC.

P50, P95, P99, and the long tail

Percentiles, in plain terms

  • P50 (median): half your requests are faster than this. Useful for “what does it normally feel like?”
  • P95: 95 of every 100 requests beat this number. The boundary between “normal” and “slow.”
  • P99: 99 of 100. The boundary between “slow” and “the user reloaded the page.”
  • P99.9: 1 in 1,000. At a million requests per day, 1,000 users hit it daily. The CEO is in this group.

Averages lie. A P50 of 80 ms with a P99 of 4 s is a system in trouble — the average is fine and a meaningful slice of users hate it. Track P95 and P99 as first-class metrics; track the average only as a sanity check.

The other reason percentiles matter for microservices: tail latencies amplify on fan-out. If one downstream call has a P99 of 1%, and your request makes ten parallel calls, the probability that at least one of those calls hits the P99 tail is roughly 1 - 0.99^10 ≈ 9.6%. Your effective P99 at the request level is closer to your downstreams’ P90. The more services on the critical path, the more aggressively you must trim tails.

Horizontal vs Vertical Scaling

Why Microservices Favor Horizontal

The Problem: Vertical scaling — bigger boxes — hits a ceiling. The biggest VM you can rent has a price that scales worse than linearly, and a single failure takes the whole thing offline.

The Solution: Stateless services scale horizontally by adding replicas. A load balancer fans traffic across them, and a single instance dying is a non-event because the others absorb the load.

DimensionVertical (scale up)Horizontal (scale out)
MechanismBigger CPU / RAM / disk on one nodeMore replicas behind a load balancer
Best forStateful workloads, in-memory analytics, big DBsStateless services, web tiers, workers
Failure domainOne large blast radiusOne small replica
Cost curveSuper-linear after a pointRoughly linear with some overhead
CapLargest available SKUCapped by coordination cost & downstreams
Provisioning speedReboot — minutes, sometimes downtimeAdd a pod — seconds

Microservices are designed to favor horizontal scaling because every service is small enough to hold in memory on commodity hardware. The work of horizontal scaling is in the architecture: keep services stateless so that any replica can serve any request, push state into databases, caches, and object stores, and make sure routing doesn’t pin users to a specific replica unless you genuinely need session affinity.

Stateful services scale differently

Databases, caches, and message brokers are stateful. Adding a replica doesn’t double capacity for free — the replica has to either replicate the existing data (read replicas, easy) or take ownership of a partition of the keyspace (sharding, hard). The right model:

Stateless does not mean memoryless

A “stateless” service can still hold per-process caches, connection pools, and warm JIT code. Killing a replica throws all of that away. If your service takes 60 seconds to warm up after a restart, your autoscaler will look like it works in tests but spike P99 every time it scales. Either pre-warm new replicas, hold them out of rotation until ready, or accept that “stateless” is a deployment property, not a runtime one.

Caching Strategies

Why Caching Is the First Lever

The Problem: Most services in most companies serve the same data over and over. Every request that hits the database for data that hasn’t changed in an hour is wasted CPU, wasted network, and wasted latency.

The Solution: A tiered cache between the user and the database. Browser cache, CDN edge cache, in-memory app cache, and a distributed cache (Redis, Memcached) before you finally hit the source of truth. Each layer that hits is one fewer downstream call.

Caching is the cheapest performance work you can do per dollar — and the easiest to get wrong. The patterns to know:

PatternHow It WorksTrade-off
Cache aside (lazy)App reads cache; on miss, reads DB and populates cacheSimple; first request after expiry is slow
Read throughCache library transparently reads DB on missCleaner code; couples cache and DB
Write throughWrites go to cache, which writes to DB synchronouslyCache always fresh; writes are slower
Write behindWrites go to cache, which flushes to DB asynchronouslyFast writes; data loss if cache dies before flush
Refresh aheadCache proactively refreshes hot keys before TTL expiresBest latency; wasted refresh on cold keys

Cache aside in Python with Redis

import json
import random
import redis
from typing import Optional

r = redis.Redis(host="redis-prod.svc", port=6379, decode_responses=True)

def get_product(product_id: str) -> Optional[dict]:
    key = f"product:{product_id}"

    # 1. Try cache
    cached = r.get(key)
    if cached is not None:
        if cached == "__miss__":
            return None
        return json.loads(cached)

    # 2. Miss — load from DB
    product = db.fetch_product(product_id)
    if product is None:
        # Negative caching: remember misses too, briefly,
        # so a hot 404 doesn’t hammer the DB.
        r.setex(key, 30, "__miss__")
        return None

    # 3. Populate cache with TTL + small jitter to avoid stampedes
    ttl = 300 + random.randint(0, 30)
    r.setex(key, ttl, json.dumps(product))
    return product

Two details that separate a working cache from a production cache: a small jitter on TTLs so that 10,000 keys created in the same second don’t all expire in the same second, and negative caching so a flood of requests for a deleted product doesn’t become a flood of database lookups for nothing.

Cache stampede and request coalescing

The Thundering Herd

The Problem: A hot key expires at 12:00:00. Between 12:00:00 and 12:00:01, ten thousand requests miss the cache, all hit the database, and the database falls over. The cache was the only thing protecting it.

The Solution: Request coalescing. The first miss takes a lock and does the work; the others wait for the result instead of duplicating it. Combined with TTL jitter and refresh-ahead for the hottest keys, the herd becomes a single request.

import threading

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def get_with_coalescing(key: str):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # One lock per key; only one thread per process loads.
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())

    with lock:
        cached = r.get(key)                # double-check after acquiring
        if cached is not None:
            return json.loads(cached)
        value = db.fetch(key)
        r.setex(key, 300, json.dumps(value))
        return value

For coalescing across processes, you need a distributed lock (Redis SET NX EX) or a queue-based approach where the first miss writes a sentinel like __loading__ and others poll briefly. Either way, the rule is: one origin call per key per refresh window, no matter how many concurrent misses you get.

The cache hierarchy

From farthest from origin to closest

  1. Browser cache — controlled by Cache-Control headers. Free hits; users never even reach your network.
  2. CDN / edge cache — Cloudflare, Fastly, CloudFront. Hit rates of 90%+ on static assets and many API responses.
  3. Reverse proxy / gateway cache — nginx, Varnish, the API gateway. Catches anything the CDN missed.
  4. In-process cache — Caffeine (JVM), functools.lru_cache, Ristretto (Go). Microsecond hits, no network.
  5. Distributed cache — Redis, Memcached. Sub-millisecond hits, shared across replicas.
  6. Database — the source of truth. The thing every other layer exists to protect.

The closer to the user a cache lives, the cheaper the hit. The closer to the source, the easier it is to keep fresh. In practice you use multiple layers and accept that invalidation gets harder with each one you add.

Async Processing and Backpressure

Why Move Work Off the Request Path

The Problem: A user uploads an image and waits while the service resizes, virus-scans, OCRs, and copies to three storage tiers. Three seconds. They reload, then leave.

The Solution: The request returns as soon as the image is durably persisted. Everything else — resizing, scanning, indexing — happens on a queue. The user gets 100 ms; the work still happens.

Anything that doesn’t need to complete before you respond to the user should not run on the request path. The basic shape:

  1. Request handler does the minimum — validate, persist, enqueue.
  2. Worker pool drains the queue at its own pace.
  3. Client polls for status, or the system pushes a webhook / WebSocket update when done.

This converts a latency problem into a throughput problem — which is much easier to scale. Workers can batch, debounce, and run at full CPU without anyone watching a spinner.

Backpressure: when the producer outruns the consumer

If producers can submit faster than consumers can drain, the queue grows without bound until you run out of memory or disk. Backpressure is the mechanism that pushes the “slow down” signal back to the producer.

StrategyWhat It DoesWhen to Use
BlockProducer waits when queue is fullInternal pipelines where back-pressure is correctness
Reject (load-shed)Producer gets 429 / 503 immediatelyPublic APIs — better fast-fail than slow-success
Drop oldestDiscard the front of the queueTelemetry — freshness beats completeness
Drop newestDiscard the most recent submissionIdempotent retried work
Spill to diskQueue overflows to durable storageBrokers (Kafka) — you genuinely cannot lose anything

The rule: bounded queues only. An unbounded in-memory queue is a memory leak with a polite name. Pick a bound, pick a strategy for what to do when you hit it, and surface a metric so you know when you’re hitting it.

Reactive frameworks — Project Reactor, RxJava, Akka Streams — build backpressure into their type system: a downstream operator requests N items from upstream, and upstream is forbidden from emitting more. Plain queues require you to enforce this manually. Batching and debouncing on the consumer side compound the savings: a worker that flushes every 100 ms or every 500 messages, whichever comes first, can sustain 10× the throughput of one that processes one message at a time.

Database Performance

Why the Database Is Usually the Bottleneck

The Problem: Application servers are stateless and scale horizontally in seconds. Databases are stateful, scale slowly, and have a ceiling. When something is slow, it’s almost always the database or a downstream of the database.

The Solution: Treat database performance as a first-class engineering concern: connection pooling, indexes that match access patterns, eliminate N+1 queries, read replicas for read load, sharding for write load, and watch out for hot keys.

Connection pooling

A database connection is expensive to create — TCP handshake, TLS, authentication, session setup — and Postgres in particular allocates a process per connection. Opening a connection per request will destroy a database long before traffic does. Pool them.

# Python: SQLAlchemy connection pool tuned for a microservice
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg://user:pass@db.svc:5432/orders",
    pool_size=20,            # steady-state connections held open per process
    max_overflow=10,         # burst capacity above pool_size
    pool_timeout=3,          # seconds to wait for a connection — fail fast
    pool_recycle=1800,       # recycle every 30 min to dodge stale connections
    pool_pre_ping=True,      # validate before lending (cheap SELECT 1)
)

Math the pool against the database, not the app

If your service runs 50 pods, each with pool_size=20, you are reserving 1,000 connections on the database. Postgres defaults to max_connections = 100. You will exhaust it before lunch. Either lower the per-pod pool, run a connection multiplexer in front (PgBouncer in transaction mode is the standard), or both. The database’s max_connections is the real budget; everything in front of it has to fit.

The N+1 problem

The single most common slow-query bug. You list 100 orders, then for each order you look up the customer — that’s 1 query for the list and 100 queries for the customers, hence “N+1.” The fix is always the same: fetch the related data in one query (JOIN, IN (...), an ORM select_related / preload), or batch the lookups (DataLoader pattern).

# BAD — 1 + N queries
orders = session.query(Order).limit(100).all()
for o in orders:
    print(o.customer.name)        # lazy-loaded query per order

# GOOD — 1 query with join
from sqlalchemy.orm import joinedload
orders = (session.query(Order)
                 .options(joinedload(Order.customer))
                 .limit(100)
                 .all())

Indexes for the access pattern

Indexes make reads fast and writes a bit slower. The mistake is to index every column “just in case.” Index for the queries you actually run, in the order they actually filter and sort. A composite index on (tenant_id, created_at DESC) serves WHERE tenant_id = ? ORDER BY created_at DESC LIMIT 50 directly. The same index does almost nothing for WHERE created_at > ? alone, because Postgres can’t skip the leading column.

Read replicas, sharding, and hot keys

Autoscaling

Why Autoscale

The Problem: Static capacity is wrong twice a day — over-provisioned at 4 AM, under-provisioned at peak. Manual scaling is correct briefly and wrong the rest of the time.

The Solution: Let the platform add and remove replicas based on a leading indicator. On Kubernetes that’s the Horizontal Pod Autoscaler (HPA), driven by metrics that actually predict queueing.

HPA on a custom metric

The default HPA watches CPU. CPU is fine for CPU-bound work. For I/O-bound services — most microservices — CPU is a lagging indicator. By the time CPU climbs, your queue depth is already five seconds deep. Scale on the metric that predicts queueing.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-worker
  minReplicas: 3
  maxReplicas: 200
  metrics:
    # Primary signal: how many messages are sitting in the queue
    - type: External
      external:
        metric:
          name: kafka_consumergroup_lag
          selector:
            matchLabels:
              topic: "orders"
              consumergroup: "order-worker"
        target:
          type: AverageValue
          averageValue: "500"      # aim for ~500 messages of lag per pod
    # Safety net so we don’t pin CPUs at 100% under burst
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up fast
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300   # scale down slow
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Two parts of this YAML deserve attention. First, the metric is queue lag, not CPU. For a worker, lag is what the user feels. Second, the behavior block makes scale-up fast and scale-down slow — an asymmetry that matters because adding a pod that you didn’t need costs cents, while removing a pod you did need costs an outage.

Never autoscale on CPU alone for I/O-bound workloads

If your service spends 80% of its time waiting on the database or another service, CPU stays low even when latency explodes. The HPA is happy; the user is not. For I/O-bound work, scale on queue depth, in-flight requests, latency P95, or downstream concurrency — never CPU alone. Tools like KEDA make custom metrics easy; use them.

Cold starts and predictive scaling

A new replica is not instantly useful. It pulls an image, starts the runtime, JIT-compiles, fills caches, opens connection pools. For a JVM service this can be 30–90 seconds; for serverless functions, often a few seconds; for a heavy Python service with model weights, minutes. During that warmup, the new pod is essentially a hot spare with bad latency.

Mitigations: readiness probes that hold the pod out of rotation until ready; over-provisioning to absorb traffic during scale-up; predictive autoscaling (Knative, KEDA Cron-based scalers, AWS predictive scaling) that scales ahead of known traffic patterns based on time of day rather than waiting for the metric to spike.

Profiling and Load Testing

Why Measure Before Optimizing

The Problem: Engineers will optimize the part of the code they recently looked at, not the part that’s actually slow. The result is faster code that doesn’t move the latency number.

The Solution: Profile to find where the time goes. Trace to find which service in the fan-out owns the tail. Load test to know how many requests per second you can take before you fall over — before production tells you.

Profiling: where does the time actually go

Load testing with k6

// k6 load test — ramp to 500 RPS, hold, then ramp down.
// k6 is the modern default; Locust, Gatling, Vegeta cover the rest.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    checkout: {
      executor: 'ramping-arrival-rate',
      startRate: 10,
      timeUnit: '1s',
      preAllocatedVUs: 100,
      maxVUs: 2000,
      stages: [
        { target: 100, duration: '2m'  },   // warm up to 100 RPS
        { target: 500, duration: '5m'  },   // climb to 500 RPS
        { target: 500, duration: '10m' },   // hold at 500 RPS
        { target: 0,   duration: '2m'  },   // ramp down
      ],
    },
  },
  thresholds: {
    'http_req_duration{status:200}': ['p(95)<250', 'p(99)<500'],
    'http_req_failed': ['rate<0.001'],   // fail the run if >0.1% errors
  },
};

export default function () {
  const r = http.post('https://api.example.com/checkout', JSON.stringify({
    cart_id: 'c-12345',
    payment_token: 'tok_test',
  }), { headers: { 'Content-Type': 'application/json' } });

  check(r, {
    'status is 200': (res) => res.status === 200,
    'body has order_id': (res) => res.json('order_id') !== undefined,
  });
  sleep(1);
}

The thresholds block is what makes a load test useful in CI. The test fails if P95 exceeds 250 ms or error rate exceeds 0.1%. Tie that to a pull-request check and performance regressions get caught at the same speed bug regressions do. Locust covers the same ground in Python; Gatling in Scala for higher concurrency on a single load generator; Vegeta for simple constant-rate hammering from the CLI.

Querying P99 in Prometheus

# P99 latency over the last 5 minutes, per service.
# Assumes a histogram named http_request_duration_seconds with le buckets.
histogram_quantile(
  0.99,
  sum by (service, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Burn-rate alert: P99 over the last 5m vs. the SLO of 250ms,
# multiplied by request volume so noisy low-traffic windows don’t fire.
(
  histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  > 0.25
) and (
  sum(rate(http_request_duration_seconds_count[5m])) > 10
)

Test in production with shadow traffic

Synthetic load tests cover the cases you imagined. Real traffic is weirder. The pattern: run a new version of a service in parallel with the old one, mirror a percentage of real production requests to it (Envoy and nginx both support this natively), and compare latency, errors, and outputs — without letting the shadow service’s response reach the user. You catch regressions on the actual workload before flipping any traffic to the new version.

Real-World Examples

Twitter’s timeline fan-out. When a user posts a tweet, Twitter doesn’t compute timelines on read. It pre-computes them on write — the “fan-out on write” model. Each follower’s home timeline is a Redis list; a new tweet is appended to every follower’s list. Reads become an O(1) lookup instead of an expensive query that joins follows and tweets. The wrinkle is celebrities: fanning out a Taylor Swift tweet to 90 million followers synchronously is impossible. Twitter switches to fan-out on read for high-degree accounts — a hybrid that trades write cost for the worst tail.

Discord’s Elixir-based fan-out. Discord runs millions of concurrent WebSocket connections per server using Elixir’s lightweight processes (BEAM). When a message is posted in a channel, the channel process broadcasts to every subscriber process — cheaply, because each “process” is a few KB instead of an OS thread. Their published numbers describe handling five million concurrent users on a relatively small fleet. The lesson: language and runtime choice can be a performance lever an order of magnitude bigger than micro-optimizations.

Netflix pre-warming for the Friday night peak. Netflix knows traffic is going to triple at 8pm Eastern. They don’t wait for the autoscaler to figure that out. They pre-scale the fleet ahead of the curve, pre-warm caches with predicted hot content, and run chaos experiments in the morning when scale-up is cheap. Predictive scaling combined with pre-warming turns a steep latency cliff into a gentle hill.

Cloudflare’s edge caching. Cloudflare serves a meaningful fraction of the public internet from edge nodes within ~50 ms of every user. Static assets, then API responses with sensible Cache-Control, then increasingly dynamic content via Workers and KV at the edge. The architecture is “cache as much as possible as close as possible to the user” taken to the global limit. The result: most requests never traverse the public internet to an origin.

Stripe’s API latency culture. Stripe publishes their P99 latency targets and treats regressions as outage-grade incidents. The discipline shows up everywhere: aggressive timeouts, idempotency keys for safe retries, hot-path code reviewed line by line, custom database routing to keep critical writes off congested shards. The aggregate effect is an API that feels boringly fast even at peak. Performance is a culture, not a project.

Best Practices

The short list

  • Define a latency SLO before you optimize. Without a target, every optimization is “faster than now” — which is unbounded work.
  • Decompose the budget per hop. The slowest service on the critical path is the ceiling for everyone.
  • Track P95 and P99, not averages. Averages hide the requests that lose you customers.
  • Cache before you optimize the database. A cache hit is always faster than the most cleverly-tuned query.
  • Bound every queue. Unbounded queues are memory leaks that show up as outages.
  • Pool connections; size the pool against the database, not the app. A thousand pods times twenty connections is a thousand reasons your DB is dying.
  • Autoscale on a leading indicator. Queue depth, in-flight requests, or latency — not CPU alone for I/O-bound work.
  • Make scale-up fast and scale-down slow. Adding capacity costs cents; removing it too soon costs incidents.
  • Profile in production-like environments. Laptop benchmarks lie about cache behavior, NUMA, and contention.
  • Run load tests in CI with thresholds. Performance regressions deserve the same blocking treatment as logic regressions.
  • Shadow real traffic for new versions. Synthetic tests can’t simulate the long tail of real-world inputs.
  • Treat tail latency as a fan-out problem. The more services on the critical path, the harder you have to trim each tail.

The single most useful sentence about performance

You don’t have a performance problem — you have a queue somewhere. Find the queue, measure how long requests sit in it, and either drain it faster (more capacity, faster work) or stop putting work into it (cache, async, shed load). Every other technique in this tutorial is a special case of those two moves.