Why Microservices Security Matters
The Problem: A monolith has one process, one auth boundary, one set of in-memory function calls. Decompose it into 200 services and you have 200 network endpoints, 200 credential stores, 200 places where TLS can be misconfigured, and an east-west traffic graph that no firewall has ever seen.
The Solution: Stop defending a perimeter. Authenticate every request, encrypt every hop, attach a workload identity to every service, and let a policy engine answer “is this allowed?” uniformly for the entire fleet.
Real Impact: Capital One’s 2019 breach exposed roughly 100M customer records and led to an $80M regulatory fine plus a $190M class-action settlement. The cause was one SSRF on one service combined with one over-broad IAM role. The blast radius of a single compromised microservice is whatever you let it touch.
Real-World Analogy
Old security thinking is a medieval castle: thick walls, one drawbridge, archers on the parapet. Once you’re inside the wall, you can wander freely.
- The castle = your monolith behind a corporate firewall
- The drawbridge = the perimeter API gateway
- Inside the wall = the “trusted internal network”
- Once a thief gets past the gate = they own everything inside, because nobody checks IDs
Microservices security is more like a bank with hundreds of separate vaults. Every door has a guard, every guard checks ID, and the ID itself is signed and short-lived. Even if the front door fails, the gold in vault #47 is still locked behind its own authentication. That’s zero trust — and it’s the only model that survives east-west traffic.
The decomposition that gave you independently deployable services also gave you an explosion in east-west traffic — calls between services inside the cluster. In a monolith, calls between modules were function calls in the same address space. In microservices, every one of those is now a network round trip. Each round trip is a place an attacker can intercept, replay, or impersonate.
What changed when you decomposed
| Concern | Monolith | Microservices |
|---|---|---|
| Authentication | One login, in-process session | Every hop must prove identity |
| Authorization | One codebase, one rule set | N services, one policy engine (if you’re lucky) |
| Secrets | One app.yml | Dynamic, short-lived, per-workload |
| Network | localhost | Hostile by default |
| Attack surface | One ingress, a few endpoints | Every service, every endpoint, every dependency |
| Blast radius | The whole app | Whatever the compromised pod can reach — ideally one cell |
The hard part is not any one of these — it’s that they all change at once.
Authentication Between Services
Why mTLS, Not Shared Secrets
The Problem: A shared API key in SERVICE_B_TOKEN environment variables means every pod that ever runs Service A can impersonate Service A forever. Rotation is a deploy, leakage is a breach, and audit is impossible.
The Solution: Cryptographic workload identity. Each pod gets a short-lived X.509 certificate or JWT-SVID issued by a trusted authority based on what the pod is, not a secret it was handed.
The standard for service-to-service authentication today is mutual TLS (mTLS). Both ends present certificates, both ends verify, and the connection is encrypted in transit. SPIFFE (the spec) and SPIRE (the implementation) define how those certificates are issued: the SPIRE agent on the node attests the workload (“this process runs in this pod, in this namespace, with this service account”), and SPIRE issues a short-lived identity of the form spiffe://<trust-domain>/<path>.
Letting the mesh do the work
You almost never want to implement mTLS in application code. A service mesh (Istio, Linkerd, Consul Connect) injects a sidecar proxy (Envoy in most cases) that terminates mTLS on behalf of the application, rotates short-lived certificates automatically (every few hours to a day, depending on the mesh’s defaults), and exposes the verified peer identity to the app via a header.
# Istio: enforce strict mTLS for everything in the 'payments' namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT # reject any plaintext connection
---
# Then constrain who can call whom, by SPIFFE identity
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels: { app: payments-api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/checkout/sa/checkout-api"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/v1/charges"]
That YAML does three things you should never do by hand: enforces mTLS, attaches a verified workload identity to every request, and authorizes by identity rather than by IP. Certificate rotation, SAN management, and CA trust are all somebody else’s problem now — specifically, the mesh control plane’s.
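The peer identity the sidecar verified usually reaches the application in a request header. Below is a minimal sketch of reading it, assuming an Envoy-based mesh that forwards the X-Forwarded-Client-Cert (XFCC) header and appends the immediate caller as the last element. The function names are illustrative and not part of any mesh API.

```javascript
// Split on `sep`, ignoring separators that appear inside double quotes.
function splitOutsideQuotes(text, sep) {
  const parts = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '\\' && inQuotes && i + 1 < text.length) {
      current += ch + text[i + 1]; // keep escaped character as-is
      i++;
      continue;
    }
    if (ch === '"') inQuotes = !inQuotes;
    if (ch === sep && !inQuotes) {
      parts.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

// Parse an XFCC header into one { key: value } object per proxy hop.
function parseXfcc(header) {
  return splitOutsideQuotes(header, ',').map((element) => {
    const fields = {};
    for (const pair of splitOutsideQuotes(element, ';')) {
      const eq = pair.indexOf('=');
      if (eq === -1) continue;
      const key = pair.slice(0, eq).trim();
      fields[key] = pair.slice(eq + 1).trim().replace(/^"|"$/g, '');
    }
    return fields;
  });
}

// Return the SPIFFE ID of the immediate caller, or throw if there is none.
// Assumption: the last element describes the peer the sidecar just verified.
function callerSpiffeId(header) {
  const elements = parseXfcc(header);
  const uri = elements[elements.length - 1].URI;
  if (!uri || !uri.startsWith('spiffe://')) {
    throw new Error('no verified SPIFFE identity in XFCC header');
  }
  return uri;
}
```

This header is only trustworthy when the sidecar strips any client-supplied copy of it, so the application should accept it solely on connections that arrive through the mesh.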
Stop passing user JWTs through service hops
A common antipattern: the gateway validates a user JWT, then forwards the same token down every internal hop. Every service must now know the user’s OIDC issuer, every service has access to the full token (including any sensitive claims), and any compromised internal service can replay the token. The fix is token exchange: the gateway swaps the user token for a narrowly-scoped internal token (or just propagates a signed claim header on top of mTLS), and downstream services trust the identity of the caller, not a token they cannot validate.
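A sketch of the down-scoping step in that exchange, under stated assumptions: the audience names, the scope map, and the 60-second lifetime are all hypothetical, and signing the resulting claims is deliberately left out.

```javascript
// Hypothetical mapping: which user scopes each downstream service may receive.
const SCOPES_BY_AUDIENCE = {
  'orders-api': ['orders:read', 'orders:write'],
  'payments-api': ['payments:charge'],
};

// Build claims for a narrowly scoped internal token from an already
// verified user identity ({ userId, tenantId, scopes }).
function toInternalClaims(user, audience, nowSeconds, ttlSeconds = 60) {
  const permitted = SCOPES_BY_AUDIENCE[audience];
  if (!permitted) throw new Error(`unknown audience: ${audience}`);
  return {
    sub: user.userId,
    tid: user.tenantId,
    aud: audience, // valid for exactly one downstream service
    scope: user.scopes.filter((s) => permitted.includes(s)).join(' '),
    iat: nowSeconds,
    exp: nowSeconds + ttlSeconds, // much shorter than the user token
  };
}
```

The point of the design is that a compromised orders-api only ever holds a token that is useless against payments-api and expires within a minute.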
End-User Authentication
Why Validate at the Edge, Not Everywhere
The Problem: If every microservice independently downloads JWKS, validates JWTs, and re-implements OIDC, you have 200 places to ship a CVE fix and 200 places to misconfigure a clock skew.
The Solution: Validate the user token once, at the edge gateway. Translate it into a small, internal, signed identity that downstream services trust by virtue of mTLS — not by re-parsing the original JWT.
The protocols you actually use:
- OAuth 2.1: the in-progress consolidation of OAuth 2.0 best practice (PKCE for everything, no implicit flow, no resource-owner password grant). An authorization framework, not an authentication protocol.
- OIDC (OpenID Connect): an identity layer on top of OAuth 2.0. Adds the ID token (a JWT describing the user) and a /userinfo endpoint.
- JWT: signed, self-contained token. Stateless to validate; impossible to revoke immediately. Use short lifetimes (5–15 min).
- Refresh tokens: long-lived, tracked server-side, used to mint new access tokens. Rotate on every use.
- Opaque tokens: random strings; require introspection against the issuer to validate. Slower, but revocable instantly. Pick opaque for highly sensitive APIs, JWT for high-throughput public ones.
JWT validation done correctly (Node, at the gateway)
import { createRemoteJWKSet, jwtVerify } from 'jose';

// Cache the JWKS in memory; jose handles refresh and key rotation.
const JWKS = createRemoteJWKSet(
  new URL('https://auth.example.com/.well-known/jwks.json')
);

export async function verifyAccessToken(token) {
  // jwtVerify checks the signature, exp/nbf, issuer, and audience.
  const { payload } = await jwtVerify(token, JWKS, {
    issuer: 'https://auth.example.com/',
    audience: 'https://api.example.com',
    algorithms: ['RS256', 'ES256'], // pin algorithms; never accept 'none'
    clockTolerance: '30s',
  });

  // Defense-in-depth: explicitly check the claims this service requires.
  if (!payload.sub) throw new Error('missing sub');
  if (typeof payload.scope !== 'string') throw new Error('missing scope');

  return {
    userId: payload.sub,
    scopes: payload.scope.split(' '),
    tenantId: payload.tid,
  };
}
Three details that production code gets wrong constantly: pinning algorithms (the JWT spec defines alg: none for unsecured tokens, and libraries that accepted it by default have been a recurring source of CVEs), validating aud (without it, an access token for service X is also valid for service Y), and caching JWKS with refresh (because the IdP will rotate keys without warning).
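To make the first of those details concrete, here is a small sketch of what pinning means: read the unverified token header and refuse anything not on an allowlist. With jose the algorithms option already covers this; the helper name below is hypothetical and the check runs before signature verification, never instead of it.

```javascript
const ALLOWED_ALGS = new Set(['RS256', 'ES256']);

// Decode the (unverified) JOSE header and reject tokens whose `alg`
// is missing or not on the allowlist. Returns the accepted algorithm.
function assertPinnedAlgorithm(token) {
  const [headerB64] = String(token).split('.');
  let header;
  try {
    header = JSON.parse(Buffer.from(headerB64, 'base64url').toString('utf8'));
  } catch {
    throw new Error('malformed JWT header');
  }
  const alg = header && header.alg;
  if (typeof alg !== 'string' || !ALLOWED_ALGS.has(alg)) {
    throw new Error(`algorithm not allowed: ${String(alg)}`);
  }
  return alg;
}
```

Because the allowlist is a closed set, both alg: none and a symmetric algorithm such as HS256 are rejected the same way.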
Token introspection for opaque tokens
# RFC 7662 introspection: opaque token in, JSON about it out
POST /oauth2/introspect HTTP/1.1
Host: auth.example.com
Authorization: Basic <client-creds>
Content-Type: application/x-www-form-urlencoded

token=d4Z9-Q8wXrK…&token_type_hint=access_token

HTTP/1.1 200 OK

{
  "active": true,
  "sub": "user_42",
  "scope": "orders:read orders:write",
  "client_id": "checkout-web",
  "exp": 1736452820
}
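A sketch of how a gateway might act on that response. The function name and the shape of the return value are assumptions for illustration; the fields it reads (active, exp, scope, sub, client_id) are the ones shown in the response above.

```javascript
// Decide whether an introspection response authorizes a request that
// needs `requiredScope`. `nowSeconds` is the current Unix time.
function checkIntrospection(response, requiredScope, nowSeconds) {
  if (!response || response.active !== true) {
    return { ok: false, reason: 'inactive' };
  }
  if (typeof response.exp === 'number' && response.exp <= nowSeconds) {
    return { ok: false, reason: 'expired' };
  }
  const scopes =
    typeof response.scope === 'string' ? response.scope.split(' ') : [];
  if (!scopes.includes(requiredScope)) {
    return { ok: false, reason: 'missing_scope' };
  }
  return { ok: true, subject: response.sub, clientId: response.client_id };
}
```

Treating anything other than active: true as a denial keeps the check fail-closed when the issuer returns an unexpected body.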
JWT vs opaque, decided
- Use JWT when scale matters and you can live with up to access-token-lifetime seconds of revocation lag. Keep them short (5–15 min).
- Use opaque + introspection when immediate revocation matters more than throughput — admin sessions, payment flows, anything where “log out everywhere now” is a real requirement.
- Always rotate refresh tokens on use. A stolen refresh token used twice should trigger a session kill.
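Rotation with reuse detection can be sketched as follows. This is an in-memory stand-in, not a production store: the newToken generator is injected (real code would use a cryptographically secure random source) and all names are hypothetical.

```javascript
// Refresh-token store with rotate-on-use and reuse detection.
function createRefreshTokenStore(newToken) {
  const tokens = new Map(); // token -> { sessionId, used }
  const revokedSessions = new Set();

  function issue(sessionId) {
    const token = newToken();
    tokens.set(token, { sessionId, used: false });
    return token;
  }

  // Exchange a refresh token for a new one. A token can be used once.
  function rotate(presented) {
    const record = tokens.get(presented);
    if (!record || revokedSessions.has(record.sessionId)) {
      throw new Error('invalid refresh token');
    }
    if (record.used) {
      // Second use of a rotated token: assume theft and kill the session.
      revokedSessions.add(record.sessionId);
      throw new Error('refresh token reuse detected; session revoked');
    }
    record.used = true;
    return issue(record.sessionId);
  }

  return { issue, rotate, isRevoked: (id) => revokedSessions.has(id) };
}
```

Revoking the whole session on reuse matters because the server cannot tell which of the two presenters is the legitimate client.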
Authorization
Why Policy Belongs Outside the Code
The Problem: “If user is admin or owns the record or is in the same tenant…” sprinkled across 200 services. Auditing is impossible. Changing one rule is a fleet-wide deploy.
The Solution: A policy engine. Services ask “can subject S do action A on resource R?” and the engine answers yes or no. Policy lives in version control alongside code, but separate from it.
| Model | Decision based on | Good for | Limitation |
|---|---|---|---|
| RBAC | Roles assigned to users | Coarse-grained, internal tools | Role explosion at scale |
| ABAC | Attributes of subject, action, resource, environment | Multi-tenant SaaS, fine-grained | Harder to reason about |
| ReBAC | Relationships in a graph (Zanzibar-style) | Document/folder sharing, social graphs | Specialized infra |
| Policy as code | Declarative rules evaluated against context | All of the above, uniformly | Requires a policy engine |
The two engines you’ll see most often in production: OPA (Open Policy Agent) with its Rego language, commonly deployed as a sidecar or host-level daemon; and Cedar, AWS’s policy language used in Amazon Verified Permissions (AVP). Both are designed to load policies once and evaluate each request locally with low latency.
OPA / Rego: a real policy
# package = the namespace this policy lives in
package orders.authz

import rego.v1

# Default: deny.
default allow := false

# Owners can always read their own orders.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.owner == input.subject.user_id
}

# Tenant admins can read any order in their tenant.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.tenant == input.subject.tenant_id
    "tenant_admin" in input.subject.roles
}

# Refunds require the 'orders:refund' scope AND the order must be < 90 days old.
allow if {
    input.action == "refund"
    input.resource.kind == "order"
    "orders:refund" in input.subject.scopes
    time.now_ns() - input.resource.created_ns < 90 * 24 * 3600 * 1e9
}
The application asks OPA “can input.subject do input.action on input.resource?” and gets back a boolean decision (when you query the allow rule, OPA’s Data API returns it as {"result": true} or {"result": false}). The application contains zero authorization logic. Auditors read the Rego, not your code.
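On the application side, the call to the sidecar can be as small as the sketch below. It assumes OPA listens on its default port 8181 on localhost and that the policy lives in the orders.authz package; OPA’s Data API wraps the decision in a result field. The function name and the injectable fetchImpl parameter are illustrative.

```javascript
// Ask a local OPA sidecar whether `input` is allowed. Fails closed:
// any HTTP error, network error, or undefined result counts as a deny.
async function isAllowed(input, fetchImpl = fetch, baseUrl = 'http://localhost:8181') {
  try {
    const res = await fetchImpl(`${baseUrl}/v1/data/orders/authz/allow`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input }),
    });
    if (!res.ok) return false;
    const body = await res.json();
    return body.result === true;
  } catch {
    return false;
  }
}
```

A caller would pass something like { action: "read", subject: { user_id, tenant_id, roles, scopes }, resource: { kind, owner, tenant } }, matching the fields the Rego rules read.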
Centralized policy, decentralized enforcement
Policy definition should be central — a single repo, code-reviewed, signed, distributed. Policy evaluation should be decentralized — OPA as a sidecar, evaluating locally, no network call to a central PDP per request. The bundle API distributes policy; the sidecar evaluates it. You get auditable rules without coupling the request path to a remote service.
Zero Trust Architecture
Why “Trusted Internal Network” Is Dead
The Problem: Once an attacker is inside your VPC — through a phished employee, a compromised CI runner, an SSRF on one service — the “internal network” gives them implicit trust to talk to everything.
The Solution: Treat every network as hostile. Every request — including service-to-service inside your own cluster — must carry verifiable identity, must be authorized, must be encrypted, and must be logged.
Zero trust is not a product, it’s a posture. The five principles:
- Never trust, always verify. No implicit trust based on network location.
- Authenticate every request. mTLS for service-to-service, OIDC for users, on every hop.
- Authorize every request. A policy decision — not just “they have a token”.
- Least privilege. Each workload gets only the permissions and network reach it needs.
- Assume breach. Log everything, segment everything, rotate everything — so when (not if) a service is compromised, the blast radius is bounded.
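Those principles compose into one per-request pipeline. The sketch below is only a shape, with three hypothetical hooks: authenticate returns a subject or null, authorize returns a boolean, and audit records the outcome.

```javascript
// Authenticate, authorize, and audit a single request.
// Denies by default and always emits an audit record.
function guardRequest(request, { authenticate, authorize, audit }) {
  const subject = authenticate(request);
  const allowed =
    subject !== null &&
    authorize(subject, request.action, request.resource) === true;
  audit({
    subject,
    action: request.action,
    resource: request.resource,
    decision: allowed ? 'allow' : 'deny',
  });
  return allowed;
}
```

Note that the audit hook runs on denials as well as grants, since denied requests are often the more interesting signal.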
The network is hostile by default
If your security model relies on “the firewall keeps bad people out”, you do not have a security model — you have hope. Cloud workloads, third-party SaaS, contractor laptops, vendor integrations, and CI runners all routinely traverse your “internal” network. The day someone owns one of them, the only thing standing between them and your data is per-request authentication and authorization. Build for that day.
What zero trust looks like in practice
| Layer | What enforces it | What you check |
|---|---|---|
| Transport | mTLS via service mesh | Both ends present valid SPIFFE certs |
| Workload identity | SPIRE / SPIFFE | Pod attestation matches expected SA + namespace |
| User identity | OIDC at the edge | Signature, issuer, audience, scopes, expiry |
| Authorization | OPA / Cedar | Policy decision per request |
| Network policy | Kubernetes NetworkPolicy / Cilium | L3/L4 allowlists (L7 with Cilium), default-deny |
| Audit | Mesh access logs + SIEM | Every request, with subject and decision |
Secrets Management
Why Long-Lived Secrets Are a Liability
The Problem: A database password baked into a Docker image lives forever. If the image leaks, a contractor laptop is stolen, or the registry is misconfigured public, the password is gone. Rotation requires a fleet-wide redeploy.
The Solution: Dynamic, short-lived credentials, fetched at workload startup (and refreshed before expiry) from a secrets backend — HashiCorp Vault, AWS Secrets Manager, AWS KMS, GCP Secret Manager. The workload never sees a long-lived value.
The tools and what they’re for:
- HashiCorp Vault: the heavyweight, covering dynamic database creds, PKI, transit encryption, and KV. Auth via Kubernetes service accounts, AWS IAM, JWT, etc.
- AWS Secrets Manager / Parameter Store: AWS-native, integrates with IAM. Use Secrets Manager for rotation, Parameter Store for cheap config.
- AWS KMS: envelope encryption. Don’t store plaintext data keys; let KMS hold the root key and wrap your data keys, then store only the wrapped keys alongside the data.
- External Secrets Operator / sealed-secrets: in GitOps you cannot commit plaintext secrets. sealed-secrets encrypts with a key only the cluster holds; ESO syncs from Vault/ASM/GSM into Kubernetes Secrets at runtime.
- cert-manager: same idea for X.509 certs.
Vault dynamic database credentials
# 1) Configure Vault to issue per-workload Postgres creds.
$ vault write database/config/orders-db \
    plugin_name=postgresql-database-plugin \
    allowed_roles="orders-app" \
    connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/orders" \
    username="vault-admin" \
    password="…"

$ vault write database/roles/orders-app \
    db_name=orders-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
        GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app TO \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"

# 2) The pod authenticates with its Kubernetes SA, then asks for creds.
$ vault write auth/kubernetes/login role=orders-app jwt=@/var/run/secrets/…/token
$ vault read database/creds/orders-app
# => { username: "v-orders-app-tH8…", password: "…", lease_duration: 3600 }
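That credential is valid for one hour by default and is tied to a single lease requested by one pod. If the pod is compromised, the attacker holds a working credential for at most the remaining lease time (capped by max_ttl if the lease is renewed). If the credential leaks into a log file, it has most likely expired before anyone reads it. The database, meanwhile, sees a fresh role per lease, so auditing actually works.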
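Short leases only work if the workload refreshes before expiry. A minimal sketch of a lease-aware cache, assuming a hypothetical synchronous fetchCreds function that returns { username, password, lease_duration } with the duration in seconds, as in the Vault output above; the 0.7 refresh fraction is an arbitrary illustrative choice.

```javascript
// Cache dynamic credentials and refetch once a fraction of the lease
// has elapsed, so the workload never presents an expired credential.
function createCredentialCache(fetchCreds, now = () => Date.now(), refreshFraction = 0.7) {
  let cached = null;
  let refreshAtMs = 0;

  return function getCredentials() {
    if (cached === null || now() >= refreshAtMs) {
      cached = fetchCreds();
      refreshAtMs = now() + cached.lease_duration * 1000 * refreshFraction;
    }
    return cached;
  };
}
```

A real client would fetch asynchronously and also rebuild its database connection pool when the credential changes; both are omitted here for brevity.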
Never bake secrets into images
Not in ENV, not in ARG, not in build-time files, not in “dev only” images. Layers are cached, registries get mirrored, images get pulled by people who shouldn’t have them. Treat the image as public and inject every secret at runtime. The corollary: never commit secrets to Git, even in private repos — rotate any value that has ever been committed.
Supply Chain Security
Why Your Build Pipeline Is an Attack Surface
The Problem: SolarWinds, Codecov, event-stream, log4shell, xz-utils. Every one of these started with code you didn’t write running in your binary because of a transitive dependency or a compromised build.
The Solution: Sign what you build, generate provenance about how you built it, scan what you depend on, and verify all of it before deploying.
The pieces of a modern supply chain story:
- SBOM (Software Bill of Materials): SPDX or CycloneDX, generated per build, lists every dependency and version. Without an SBOM you can’t answer “am I running affected log4j?”
- Image signing: cosign (Sigstore) signs container images, the signature lives in the registry, and admission policy verifies it before scheduling.
- SLSA (Supply chain Levels for Software Artifacts): a framework of graded build levels defining what provenance you produce and how tamper-resistant the build is. Build Level 3 in SLSA v1.0 means a hardened build platform that produces signed, non-forgeable provenance.
- Dependency scanning: Dependabot, Snyk, Trivy, Grype. Scan on every PR and on a schedule for newly disclosed CVEs in already-shipped artifacts.
- Admission control: the cluster refuses to schedule unsigned or unscanned images. Kyverno, Gatekeeper, or Sigstore policy-controller.
Signing and verifying with cosign
# Keyless signing: cosign uses an OIDC identity (your CI's OIDC token)
# and a short-lived Fulcio cert. Signature is logged in Rekor.
# In GitHub Actions with the `id-token: write` permission, cosign obtains
# the OIDC token itself. ACTIONS_ID_TOKEN_REQUEST_TOKEN is only the bearer
# token used to request it, so it must not be passed as --identity-token.
$ cosign sign --yes \
    ghcr.io/acme/orders-api@sha256:9c4f…

# Generate and attach an SBOM (SPDX JSON predicate)
$ syft ghcr.io/acme/orders-api@sha256:9c4f… -o spdx-json > sbom.spdx.json
$ cosign attest --yes --predicate sbom.spdx.json --type spdxjson \
    ghcr.io/acme/orders-api@sha256:9c4f…

# Verify that the image carries a signature from the expected GitHub Actions
# workflow. An admission controller applies the same check before scheduling.
$ cosign verify \
    --certificate-identity-regexp "https://github.com/acme/.+/.github/workflows/release.yml@.+" \
    --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
    ghcr.io/acme/orders-api@sha256:9c4f…
Three things this gets you that nothing else does: the image came from your CI workflow (not a developer’s laptop), it was built from a specific commit (provenance is in the Rekor transparency log), and the SBOM is cryptographically attached so you can answer the next CVE question in seconds, not days.
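Answering the CVE question from an SBOM is a lookup. The sketch below assumes an SPDX 2.x JSON document like the sbom.spdx.json generated above, where each entry in packages carries name and versionInfo; the helper name is hypothetical.

```javascript
// Return every package in an SPDX JSON SBOM whose name matches exactly.
function findPackages(sbom, name) {
  return (sbom.packages || [])
    .filter((pkg) => pkg.name === name)
    .map((pkg) => ({ name: pkg.name, version: pkg.versionInfo }));
}
```

Run against the parsed SBOM of each deployed image, this gives a list of affected versions per service. Package naming varies by ecosystem, so a real query usually matches on the package URL (purl) rather than the bare name.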
API Security
Why the OWASP API Top 10 Exists
The Problem: Most breaches in microservices aren’t exotic crypto attacks — they’re broken object-level authorization, mass assignment, missing rate limits, and unbounded resource consumption. The boring stuff.
The Solution: Defense in depth at the API edge: a WAF for known-bad shapes, schema validation for inputs, rate limits per identity (not per IP), idempotency keys for writes, and request-size caps for everything.
The 2023 OWASP API Security Top 10 — in priority order:
| # | Risk | What it is |
|---|---|---|
| API1 | BOLA | Broken object-level authorization — GET /orders/123 doesn’t check the order belongs to you |
| API2 | Broken authentication | Weak JWT validation, no rate limit on login, password reset flaws |
| API3 | BOPLA | Broken object property-level authorization — mass assignment lets you set isAdmin: true |
| API4 | Unrestricted resource consumption | No rate limits, no size caps, no query-complexity limits |
| API5 | Broken function-level authorization | Admin endpoint shipped without role check |
| API6 | Unrestricted access to sensitive business flows | Account takeover, signup abuse, gift-card enumeration |
| API7 | SSRF | Server-side request forgery — the Capital One vector |
| API8 | Security misconfiguration | Default creds, verbose errors, missing security headers |
| API9 | Improper inventory management | Old /v1 endpoint nobody patched |
| API10 | Unsafe consumption of APIs | You trust an upstream you shouldn’t |
The non-negotiable controls at the edge
- Rate limit per identity, not per IP. One subject, one tenant, one API key — not the IP of the proxy in front of you.
- Schema validation. OpenAPI / JSON Schema at the gateway. Reject unknown fields (additionalProperties: false) to defeat mass assignment.
- Request size limits. Body size, header size, URL length, multipart parts. The default in most frameworks is “way too large.”
- Idempotency keys on every write. Idempotency-Key: <uuid>. The server stores the response for 24h and replays it on retry. Without this, a network blip plus a retry equals a duplicate charge.
- WAF. AWS WAF, Cloudflare, or similar. Catches known-bad shapes (SQLi, XSS, LFI) before they reach your code. Not a substitute for input validation; a layer in front of it.
- Output filtering. Never return more than the caller asked for. Field-level allowlists at the serializer.
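Rate limiting per identity can be sketched as a token bucket keyed by whatever identifies the caller (subject, tenant, or API key). This is an in-memory, single-process illustration with hypothetical names; a fleet of gateways would need shared state.

```javascript
// Token bucket per identity. `capacity` is the burst size and
// `refillPerSecond` the sustained rate. `now` is injectable for testing.
function createRateLimiter({ capacity, refillPerSecond, now = () => Date.now() }) {
  const buckets = new Map(); // identity -> { tokens, lastMs }

  return function allow(identity) {
    const t = now();
    const bucket = buckets.get(identity) || { tokens: capacity, lastMs: t };
    const elapsedSeconds = (t - bucket.lastMs) / 1000;
    bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
    bucket.lastMs = t;
    const permitted = bucket.tokens >= 1;
    if (permitted) bucket.tokens -= 1;
    buckets.set(identity, bucket);
    return permitted;
  };
}
```

Because the key is the authenticated identity, one noisy tenant exhausts only its own bucket, regardless of how many clients share the proxy’s IP address.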
Idempotency in one paragraph
The client generates a UUID per logical operation and sends it as Idempotency-Key. The server stores (key, request hash, response) for some window (24h is a common choice). On a retry with the same key and the same request hash, the server returns the stored response: no double charge, no double order, no double anything. Stripe’s API is the best-known implementation of this pattern; you should adopt it too.
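A minimal sketch of the server side of that flow, under simplifying assumptions: the store is an in-memory Map, the request fingerprint is a JSON string rather than a real hash, and the handler is synchronous. All names are hypothetical.

```javascript
// Wrap a side-effecting handler so that retries with the same key
// replay the stored response instead of repeating the side effect.
function createIdempotencyStore(ttlMs = 24 * 60 * 60 * 1000, now = () => Date.now()) {
  const entries = new Map(); // key -> { fingerprint, response, expiresAt }

  return function handle(key, requestBody, handler) {
    const fingerprint = JSON.stringify(requestBody); // stand-in for a request hash
    const existing = entries.get(key);
    if (existing && existing.expiresAt > now()) {
      if (existing.fingerprint !== fingerprint) {
        throw new Error('idempotency key reused with a different request');
      }
      return existing.response; // replay, no second side effect
    }
    const response = handler(requestBody);
    entries.set(key, { fingerprint, response, expiresAt: now() + ttlMs });
    return response;
  };
}
```

A production version needs an atomic insert-if-absent in a shared store, so that two concurrent retries cannot both run the handler.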
Real-World Examples
Google — BeyondCorp
Google moved access to internal tools off the corporate VPN. Internal applications are reached through an internet-facing, identity-aware proxy, and the decision to grant access is made per request, based on user identity, device posture, and the resource being requested. There is no privileged “inside the network” at Google. The BeyondCorp papers, first published in 2014, are the foundational description of this zero-trust model, which the rest of the industry has slowly adopted in products such as Cloudflare Access, Tailscale, and AWS Verified Access.
Netflix — service-to-service auth at scale
Netflix runs on the order of 1,000 services. Service-to-service authentication is mTLS via their internal mesh, with workload identities issued automatically. End-user tokens are translated at the edge (Zuul) into internal “Passport” tokens that downstream services trust. Every internal call carries a signed identity; no service has implicit trust because of where it runs. This is what zero trust looks like at scale.
Cloudflare Access
The commodified version of BeyondCorp. You point an internal app at Cloudflare, configure an identity provider, and write policy: “engineers in the platform group on a managed device can reach Grafana.” Cloudflare terminates TLS, authenticates the user, evaluates the policy, and forwards the request. Your app sees a verified identity header. No VPN, no IP allowlist, no shared bastion to compromise.
Capital One — the cautionary tale
2019: an attacker exploited an SSRF vulnerability in a misconfigured WAF (ModSecurity on an EC2 instance) to call the EC2 metadata service. The metadata service returned temporary IAM credentials for a role with s3:ListAllMyBuckets and broad s3:GetObject. ~100 million records exfiltrated. Two failures stacked: SSRF (an application bug) and an over-privileged IAM role on a workload that didn’t need it. The fix is IMDSv2 (which requires a session token obtained through a separate PUT request and so blocks the simple SSRF pattern) and least-privilege IAM. AWS released IMDSv2 a few months after the breach, but neither control is automatic: IMDSv2 has to be enforced and roles have to be scoped deliberately.
The lesson Capital One teaches
No single control is sufficient. The WAF was misconfigured, so input validation didn’t catch it. The metadata service was IMDSv1, so SSRF reached it. The IAM role was over-broad, so the credentials it returned could read S3. Defense in depth means any one of those being right would have stopped the breach. Build the system so that no single failure produces a breach.
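One more layer that fits this lesson is an egress allowlist in any code path that fetches a caller-supplied URL. The sketch below is illustrative only: the allowed hostname is hypothetical, and the check does not cover redirects or DNS rebinding, which need to be handled where the request is actually made.

```javascript
// Hypothetical set of hosts this service is allowed to call out to.
const ALLOWED_HOSTS = new Set(['api.partner.example']);

// Validate a caller-supplied URL before fetching it. Anything not
// explicitly allowlisted is refused, including the metadata address.
function assertAllowedDestination(rawUrl) {
  const url = new URL(rawUrl); // throws on malformed input
  if (url.protocol !== 'https:') {
    throw new Error('only https destinations are allowed');
  }
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`destination not allowed: ${url.hostname}`);
  }
  return url;
}
```

An allowlist is used rather than a blocklist of known-bad addresses because the set of internal endpoints worth attacking is open-ended, while the set of legitimate destinations is usually small.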
Best Practices
The short list
- mTLS everywhere by default. Plaintext between services in production is malpractice. Let the mesh do it.
- One workload, one identity. SPIFFE/SPIRE or your cloud’s equivalent. Never a shared service account.
- Validate user tokens at the edge, exchange for internal identity. Don’t propagate user JWTs through every hop.
- Pin JWT algorithms; validate iss, aud, exp, nbf. Reject alg: none and unrecognized algorithms.
- Policy as code. OPA or Cedar; rules in version control, evaluated at the sidecar.
- Default-deny network policy. Then allowlist what each service actually needs.
- Short-lived, dynamic credentials. Vault or your cloud’s secrets manager. Hours, not months.
- Never bake secrets into images. Inject at runtime. Treat the image as public.
- Sign every image, attest every build, verify at admission. Cosign (part of Sigstore) plus Kyverno is a common modern stack.
- SBOM per build. When the next log4shell drops, you want minutes-to-answer, not days.
- Rate limit per identity, not per IP. Idempotency keys on every write. Schema validation on every input.
- Assume breach. Log every authn and authz decision; ship to a SIEM; alert on anomalies.
- Run a tabletop exercise quarterly. “The CI signing key has leaked.” If you can’t answer in an hour, you have homework.
The mindset
The thread running through every section of this tutorial is the same: do not trust the network; do not trust the caller; do not trust the artifact. Verify identity, verify intent, verify provenance — on every request, every workload, every deploy. That is the entire job.
Security in microservices is not a feature you add at the end. It’s a property of how you compose identity, transport, policy, and supply chain from day one. Bolt-on security is the most expensive security — you pay for it twice, once for the bolts and once for the breach.
The single most useful sentence about microservices security
If a single compromised service can reach — or impersonate, or read the secrets of — any other service in your fleet, you do not have a microservices architecture. You have a monolith with extra latency and worse failure modes.