Security

In a monolith you defended a wall. In microservices there is no wall — every hop is the internet. Authenticate every request, sign every workload, scope every credential, and assume the network is hostile.

Hard · 30 min read

Why Microservices Security Matters

The Problem: A monolith has one process, one auth boundary, one set of in-memory function calls. Decompose it into 200 services and you have 200 network endpoints, 200 credential stores, 200 places where TLS can be misconfigured, and an east-west traffic graph that no firewall has ever seen.

The Solution: Stop defending a perimeter. Authenticate every request, encrypt every hop, attach a workload identity to every service, and let a policy engine answer “is this allowed?” uniformly for the entire fleet.

Real Impact: Capital One’s 2019 breach exposed roughly 100M records and led to a settlement of about $190M. The cause was one SSRF on one service combined with one over-broad IAM role. The blast radius of a single compromised microservice is whatever you let it touch.

Real-World Analogy

Old security thinking is a medieval castle: thick walls, one drawbridge, archers on the parapet. Once you’re inside the wall, you can wander freely.

  • The castle = your monolith behind a corporate firewall
  • The drawbridge = the perimeter API gateway
  • Inside the wall = the “trusted internal network”
  • Once a thief gets past the gate = they own everything inside, because nobody checks IDs

Microservices security is more like a bank with hundreds of separate vaults. Every door has a guard, every guard checks ID, and the ID itself is signed and short-lived. Even if the front door fails, the gold in vault #47 is still locked behind its own authentication. That’s zero trust — and it’s the only model that survives east-west traffic.

The decomposition that gave you independently deployable services also gave you an explosion in east-west traffic — calls between services inside the cluster. In a monolith, calls between modules were function calls in the same address space. In microservices, every one of those is now a network round trip. Each round trip is a place an attacker can intercept, replay, or impersonate.

What changed when you decomposed

Concern | Monolith | Microservices
Authentication | One login, in-process session | Every hop must prove identity
Authorization | One codebase, one rule set | N services, one policy engine (if you’re lucky)
Secrets | One app.yml | Dynamic, short-lived, per-workload
Network | localhost | Hostile by default
Attack surface | One ingress, a few endpoints | Every service, every endpoint, every dependency
Blast radius | The whole app | Whatever the compromised pod can reach (ideally one cell)

The hard part is not any one of these — it’s that they all change at once.

Authentication Between Services

Why mTLS, Not Shared Secrets

The Problem: A shared API key in SERVICE_B_TOKEN environment variables means every pod that ever runs Service A can impersonate Service A forever. Rotation is a deploy, leakage is a breach, and audit is impossible.

The Solution: Cryptographic workload identity. Each pod gets a short-lived X.509 certificate or JWT-SVID issued by a trusted authority based on what the pod is, not a secret it was handed.

The standard for service-to-service authentication today is mutual TLS (mTLS). Both ends present certificates, both ends verify, and the connection is encrypted in transit. SPIFFE (the spec) and SPIRE (its reference implementation) define how those certificates are issued: the workload calls the SPIRE agent on its node, the agent attests the workload (“this process runs in this pod, in this namespace, with this service account”), and the workload receives a short-lived SVID carrying a spiffe://<trust-domain>/<path> identity.
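To make the identity format concrete, here is a minimal sketch of splitting a SPIFFE ID into its trust domain and path. The function name parseSpiffeId and the example ID are hypothetical, the character rules are a simplified reading of the SPIFFE ID format, and the sketch assumes the certificate carrying the ID has already been verified by the mesh or a SPIFFE library.

```javascript
// Minimal sketch: split a SPIFFE ID into trust domain and path.
// Assumes the certificate carrying the ID was already verified elsewhere.
function parseSpiffeId(id) {
    // Simplified pattern: lowercase trust domain, then one or more path segments.
    const match = /^spiffe:\/\/([a-z0-9._-]+)((?:\/[A-Za-z0-9._-]+)+)$/.exec(id);
    if (!match) {
        throw new Error(`not a valid SPIFFE ID: ${id}`);
    }
    return { trustDomain: match[1], path: match[2] };
}

// Hypothetical identity for a checkout service account.
const id = parseSpiffeId('spiffe://cluster.local/ns/checkout/sa/checkout-api');
console.log(id.trustDomain); // cluster.local
console.log(id.path);        // /ns/checkout/sa/checkout-api
```

Authorization rules can then match on the trust domain and path rather than on an IP address or a shared secret.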

Letting the mesh do the work

You almost never want to implement mTLS in application code. A service mesh such as Istio, Linkerd, or Consul Connect injects a sidecar proxy (Envoy in most cases) that terminates mTLS on behalf of the application, rotates short-lived certificates automatically (how often depends on the mesh and its configuration), and can expose the verified peer identity to the app via a header.

# Istio: enforce strict mTLS for everything in the ‘payments’ namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT      # reject any plaintext connection
---
# Then constrain who can call who, by SPIFFE identity
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels: { app: payments-api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/checkout/sa/checkout-api"]
      to:
        - operation:
            methods: ["POST"]
            paths:   ["/v1/charges"]

That YAML does three things you should never do by hand: enforces mTLS, attaches a verified workload identity to every request, and authorizes by identity rather than by IP. Certificate rotation, SAN management, and CA trust are all somebody else’s problem now — specifically, the mesh control plane’s.
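If the application wants to use the verified peer identity itself, for logging or as a second check, it can read it from the header the sidecar sets. The sketch below assumes Envoy's x-forwarded-client-cert (XFCC) header with a single element (one proxy hop) and no quoted values containing semicolons or commas. The function names and the allowlist are hypothetical, and the header is only trustworthy when the sidecar strips any client-supplied copy.

```javascript
// Minimal sketch: read the caller's SPIFFE URI from an XFCC header.
// Assumes one XFCC element and simple, unquoted URI values.
function callerSpiffeUri(xfccHeader) {
    const match = /(?:^|;)URI=([^;,]+)/.exec(xfccHeader);
    return match ? match[1] : null;
}

// Hypothetical allowlist mirroring the AuthorizationPolicy above.
const ALLOWED_CALLERS = new Set([
    'spiffe://cluster.local/ns/checkout/sa/checkout-api',
]);

function isAllowedCaller(xfccHeader) {
    const uri = callerSpiffeUri(xfccHeader);
    return uri !== null && ALLOWED_CALLERS.has(uri);
}

const header =
    'By=spiffe://cluster.local/ns/payments/sa/payments-api;' +
    'Hash=abc123;' +
    'URI=spiffe://cluster.local/ns/checkout/sa/checkout-api';
console.log(isAllowedCaller(header)); // true
```

This duplicates what the mesh policy already enforces, which is the point: the application does not become the only line of defense, and the mesh does not either.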

Stop passing user JWTs through service hops

A common antipattern: the gateway validates a user JWT, then forwards the same token down every internal hop. Every service must now know the user’s OIDC issuer, every service has access to the full token (including any sensitive claims), and any compromised internal service can replay the token. The fix is token exchange: the gateway swaps the user token for a narrowly-scoped internal token (or just propagates a signed claim header on top of mTLS), and downstream services trust the identity of the caller, not a token they cannot validate.
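One standardized way to do the swap is OAuth 2.0 Token Exchange (RFC 8693). The sketch below only builds the form body the gateway would send to its token endpoint. Whether your identity provider supports this grant is an assumption, and the function name, audience, and scope values are hypothetical.

```javascript
// Minimal sketch: build an RFC 8693 token-exchange request body.
// The gateway sends this to the IdP's token endpoint and receives a
// narrowly scoped token for exactly one downstream audience.
function buildTokenExchangeBody(userToken, audience, scopes) {
    return new URLSearchParams({
        grant_type: 'urn:ietf:params:oauth:grant-type:token-exchange',
        subject_token: userToken,
        subject_token_type: 'urn:ietf:params:oauth:token-type:access_token',
        audience,                 // the single internal service this token is for
        scope: scopes.join(' '),  // only what that service needs
    });
}

const body = buildTokenExchangeBody('<user-access-token>', 'orders-api', ['orders:read']);
console.log(body.get('audience')); // orders-api
```

Because the exchanged token names one audience and a reduced scope, a compromised downstream service cannot replay it against the rest of the fleet.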

End-User Authentication

Why Validate at the Edge, Not Everywhere

The Problem: If every microservice independently downloads JWKS, validates JWTs, and re-implements OIDC, you have 200 places to ship a CVE fix and 200 places to misconfigure a clock skew.

The Solution: Validate the user token once, at the edge gateway. Translate it into a small, internal, signed identity that downstream services trust by virtue of mTLS — not by re-parsing the original JWT.

The protocols you actually use:

JWT validation done correctly (Node, at the gateway)

import { createRemoteJWKSet, jwtVerify } from 'jose';

// Cache the JWKS in memory; jose handles refresh and key rotation.
const JWKS = createRemoteJWKSet(
    new URL('https://auth.example.com/.well-known/jwks.json')
);

export async function verifyAccessToken(token) {
    const { payload } = await jwtVerify(token, JWKS, {
        issuer:   'https://auth.example.com/',
        audience: 'https://api.example.com',
        algorithms: ['RS256', 'ES256'],   // pin algorithms; never accept ‘none’
        clockTolerance: '30s',
    });

    // jwtVerify has already checked the signature, iss, aud, and exp.
    // Defense-in-depth: also require the claims this service relies on.
    if (!payload.sub) throw new Error('missing sub');
    if (!payload.scope) throw new Error('missing scope');

    return {
        userId: payload.sub,
        scopes: payload.scope.split(' '),
        tenantId: payload.tid,
    };
}

Three details that production code gets wrong constantly: pinning algorithms (the JWT spec defines alg: none for unsecured tokens, and libraries that accepted it by default have been the source of many CVEs), validating aud (without it, an access token for service X is also valid for service Y), and caching JWKS with refresh (because the IdP will rotate keys without warning).
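To see what algorithm pinning protects against, the following sketch decodes a token header and rejects anything outside an allowlist before any signature check is attempted. It is an illustration only: the helper names are hypothetical, it performs no signature verification, and in practice the algorithms option of your JWT library does this job.

```javascript
// Minimal sketch: reject unpinned JWT algorithms before verification.
const PINNED_ALGS = new Set(['RS256', 'ES256']);

function assertPinnedAlgorithm(token) {
    const [headerB64] = token.split('.');
    const header = JSON.parse(Buffer.from(headerB64, 'base64url').toString('utf8'));
    if (!PINNED_ALGS.has(header.alg)) {
        throw new Error(`rejected JWT algorithm: ${header.alg}`);
    }
    return header.alg;
}

// Build an unsigned token the way an attacker would: header says alg "none".
const b64 = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const forged = `${b64({ alg: 'none', typ: 'JWT' })}.${b64({ sub: 'admin' })}.`;

try {
    assertPinnedAlgorithm(forged);
} catch (err) {
    console.log(err.message); // rejected JWT algorithm: none
}
```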

Token introspection for opaque tokens

# RFC 7662 introspection: opaque token in, JSON about it out
POST /oauth2/introspect HTTP/1.1
Host: auth.example.com
Authorization: Basic <client-creds>
Content-Type: application/x-www-form-urlencoded

token=d4Z9-Q8wXrK…&token_type_hint=access_token

HTTP/1.1 200 OK
Content-Type: application/json

{
  "active":    true,
  "sub":       "user_42",
  "scope":     "orders:read orders:write",
  "client_id": "checkout-web",
  "exp":       1736452820
}

JWT vs opaque, decided

  • Use JWT when scale matters and you can live with up to access-token-lifetime seconds of revocation lag. Keep them short (5–15 min).
  • Use opaque + introspection when immediate revocation matters more than throughput — admin sessions, payment flows, anything where “log out everywhere now” is a real requirement.
  • Always rotate refresh tokens on use. A stolen refresh token used twice should trigger a session kill.
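The rotation rule in the last bullet can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical RefreshTokenStore class; a real implementation would persist hashed tokens in a datastore and make the rotate step atomic.

```javascript
// Minimal sketch: rotate refresh tokens on use, revoke the session on reuse.
const { randomUUID } = require('node:crypto');

class RefreshTokenStore {
    constructor() {
        this.tokens = new Map();   // token -> { sessionId, used }
        this.revoked = new Set();  // revoked session ids
    }

    issue(sessionId) {
        const token = randomUUID();
        this.tokens.set(token, { sessionId, used: false });
        return token;
    }

    rotate(token) {
        const record = this.tokens.get(token);
        if (!record || this.revoked.has(record.sessionId)) {
            throw new Error('invalid refresh token');
        }
        if (record.used) {
            // Second use of the same token: treat it as theft and kill the session.
            this.revoked.add(record.sessionId);
            throw new Error('refresh token reuse detected; session revoked');
        }
        record.used = true;
        return this.issue(record.sessionId);
    }
}

const store = new RefreshTokenStore();
const first = store.issue('session-1');
const second = store.rotate(first);   // normal refresh: first is now spent
try {
    store.rotate(first);              // replay of the spent token
} catch (err) {
    console.log(err.message);         // refresh token reuse detected; session revoked
}
```

After the replay, the legitimate successor token (second) is rejected as well, which forces both the attacker and the real user back through login.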

Authorization

Why Policy Belongs Outside the Code

The Problem: “If user is admin or owns the record or is in the same tenant…” sprinkled across 200 services. Auditing is impossible. Changing one rule is a fleet-wide deploy.

The Solution: A policy engine. Services ask “can subject S do action A on resource R?” and the engine answers yes or no. Policy lives in version control alongside code, but separate from it.

Model | Decision based on | Good for | Limitation
RBAC | Roles assigned to users | Coarse-grained, internal tools | Role explosion at scale
ABAC | Attributes of subject, action, resource, environment | Multi-tenant SaaS, fine-grained | Harder to reason about
ReBAC | Relationships in a graph (Zanzibar-style) | Document/folder sharing, social graphs | Specialized infra
Policy as code | Declarative rules evaluated against context | All of the above, uniformly | Requires a policy engine
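The difference between the first two rows is easiest to see side by side. The sketch below is a toy illustration with hypothetical role names, permissions, and attributes; the ABAC function layers one resource attribute (the tenant) on top of the role check.

```javascript
// RBAC: the decision depends only on the roles assigned to the subject.
const ROLE_PERMISSIONS = {
    support: ['order:read'],
    tenant_admin: ['order:read', 'order:refund'],
};

function rbacAllows(subject, permission) {
    return subject.roles.some((role) => (ROLE_PERMISSIONS[role] || []).includes(permission));
}

// ABAC: the decision also depends on attributes of the resource.
function abacAllows(subject, permission, resource) {
    return rbacAllows(subject, permission) && subject.tenantId === resource.tenantId;
}

const agent = { roles: ['support'], tenantId: 'tenant-a' };
const order = { tenantId: 'tenant-b' };
console.log(rbacAllows(agent, 'order:read'));        // true
console.log(abacAllows(agent, 'order:read', order)); // false
```

The role check alone would let a support agent in one tenant read another tenant's order; the attribute check closes that gap, at the cost of a rule that is harder to audit by eye.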

The two engines you’ll see most often in production: OPA (Open Policy Agent) with its Rego language, commonly deployed as a sidecar or host-level daemon; and Cedar, AWS’s policy language used in Amazon Verified Permissions (AVP). Both are built to evaluate policies per request with very low latency.

OPA / Rego: a real policy

# package = the namespace this policy lives in
package orders.authz

import rego.v1

# Default: deny.
default allow := false

# Owners can always read their own orders.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.owner == input.subject.user_id
}

# Tenant admins can read any order in their tenant.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.tenant == input.subject.tenant_id
    "tenant_admin" in input.subject.roles
}

# Refunds require the ‘orders:refund’ scope AND the order must be < 90 days old.
allow if {
    input.action == "refund"
    input.resource.kind == "order"
    "orders:refund" in input.subject.scopes
    time.now_ns() - input.resource.created_ns < 90 * 24 * 3600 * 1e9
}

The application asks OPA “can input.subject do input.action on input.resource?” and gets back a decision: querying the allow rule over OPA’s REST API returns {"result": true} or {"result": false}. The application contains zero authorization logic. Auditors read the Rego, not your code.

Centralized policy, decentralized enforcement

Policy definition should be central — a single repo, code-reviewed, signed, distributed. Policy evaluation should be decentralized — OPA as a sidecar, evaluating locally, no network call to a central PDP per request. The bundle API distributes policy; the sidecar evaluates it. You get auditable rules without coupling the request path to a remote service.
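From the application's side, local evaluation is a single HTTP call to the sidecar. The sketch below assumes OPA listening on its default port 8181 on localhost and serving the orders.authz package; the helper names are hypothetical, and the fetch implementation is injectable so the function can be exercised without a running OPA.

```javascript
// Minimal sketch: ask a local OPA sidecar for a decision over its Data API.
// Assumes OPA on localhost:8181 with package orders.authz loaded.
const OPA_URL = 'http://localhost:8181/v1/data/orders/authz/allow';

function buildOpaRequest(input) {
    return {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ input }),
    };
}

// OPA omits "result" when the rule is undefined; anything but true is a deny.
function parseDecision(responseBody) {
    return responseBody.result === true;
}

async function isAllowed(input, fetchImpl = fetch) {
    const response = await fetchImpl(OPA_URL, buildOpaRequest(input));
    if (!response.ok) return false;   // fail closed if the sidecar errors
    return parseDecision(await response.json());
}
```

A caller would pass something like { action: "read", subject: {...}, resource: {...} } and refuse the request unless isAllowed resolves to true. Failing closed is a design choice in this sketch: an unreachable policy engine denies rather than allows.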

Zero Trust Architecture

Why “Trusted Internal Network” Is Dead

The Problem: Once an attacker is inside your VPC — through a phished employee, a compromised CI runner, an SSRF on one service — the “internal network” gives them implicit trust to talk to everything.

The Solution: Treat every network as hostile. Every request — including service-to-service inside your own cluster — must carry verifiable identity, must be authorized, must be encrypted, and must be logged.

Zero trust is not a product, it’s a posture. The five principles:

  1. Never trust, always verify. No implicit trust based on network location.
  2. Authenticate every request. mTLS for service-to-service, OIDC for users, on every hop.
  3. Authorize every request. A policy decision — not just “they have a token”.
  4. Least privilege. Each workload gets only the permissions and network reach it needs.
  5. Assume breach. Log everything, segment everything, rotate everything — so when (not if) a service is compromised, the blast radius is bounded.
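Principles 2, 3, and 5 combine into a per-request pipeline, sketched below. The authenticate, authorize, and audit callbacks are hypothetical stand-ins for mTLS or OIDC verification, a policy-engine call, and a SIEM shipper; the status codes are illustrative.

```javascript
// Minimal sketch: every request is authenticated, authorized, and logged.
function handleRequest(request, { authenticate, authorize, audit }) {
    const subject = authenticate(request);             // principle 2
    if (!subject) {
        audit({ decision: 'deny', reason: 'unauthenticated', path: request.path });
        return { status: 401 };
    }
    const allowed = authorize(subject, request);       // principle 3
    audit({ decision: allowed ? 'allow' : 'deny', subject, path: request.path }); // principle 5
    return allowed ? { status: 200 } : { status: 403 };
}

// Stub dependencies for illustration only.
const log = [];
const deps = {
    authenticate: (req) => req.peerIdentity || null,
    authorize: (subject, req) =>
        subject === 'spiffe://cluster.local/ns/checkout/sa/checkout-api' && req.path === '/v1/charges',
    audit: (entry) => log.push(entry),
};

console.log(handleRequest({ path: '/v1/charges' }, deps).status); // 401
```

Note that the deny paths are logged too: an audit trail that records only successes cannot show an attacker probing the fleet.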

The network is hostile by default

If your security model relies on “the firewall keeps bad people out”, you do not have a security model — you have hope. Cloud workloads, third-party SaaS, contractor laptops, vendor integrations, and CI runners all routinely traverse your “internal” network. The day someone owns one of them, the only thing standing between them and your data is per-request authentication and authorization. Build for that day.

What zero trust looks like in practice

Layer | What enforces it | What you check
Transport | mTLS via service mesh | Both ends present valid SPIFFE certs
Workload identity | SPIRE / SPIFFE | Pod attestation matches expected SA + namespace
User identity | OIDC at the edge | Signature, issuer, audience, scopes, expiry
Authorization | OPA / Cedar | Policy decision per request
Network policy | Kubernetes NetworkPolicy / Cilium | L3/L4 allowlists (L7 with Cilium), default-deny
Audit | Mesh access logs + SIEM | Every request, with subject and decision

Secrets Management

Why Long-Lived Secrets Are a Liability

The Problem: A database password baked into a Docker image lives forever. If the image leaks, a contractor laptop is stolen, or the registry is misconfigured public, the password is gone. Rotation requires a fleet-wide redeploy.

The Solution: Dynamic, short-lived credentials, fetched at workload startup (and refreshed before expiry) from a secrets backend — HashiCorp Vault, AWS Secrets Manager, AWS KMS, GCP Secret Manager. The workload never sees a long-lived value.

The tools and what they’re for:

Vault dynamic database credentials

# 1) Configure Vault to issue per-workload Postgres creds.
$ vault write database/config/orders-db \
    plugin_name=postgresql-database-plugin \
    allowed_roles="orders-app" \
    connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/orders" \
    username="vault-admin" \
    password="…"

$ vault write database/roles/orders-app \
    db_name=orders-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
                          GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app TO \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"

# 2) The pod authenticates with its Kubernetes SA, then asks for creds.
$ vault write auth/kubernetes/login role=orders-app jwt=@/var/run/secrets/…/token

$ vault read database/creds/orders-app
# => { username: "v-orders-app-tH8…", password: "…", lease_duration: 3600 }

That credential is valid for one hour. It’s tied to one pod. If the pod is compromised the attacker has at most an hour. If the credential leaks into a log file, it’s expired before anyone reads it. The database, meanwhile, has a fresh role per workload — auditing actually works.
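Short leases only help if the workload fetches a fresh credential before the old one expires. The sketch below shows one way to decide when, using the lease_duration value from the output above. Renewing at two-thirds of the TTL is an assumption made for illustration, not a Vault rule, and the function names are hypothetical.

```javascript
// Minimal sketch: decide when to fetch fresh credentials for a lease.
// Renewing at two-thirds of the TTL is an assumption, not a Vault rule.
const RENEW_FRACTION = 2 / 3;

function renewAtMs(issuedAtMs, leaseDurationSeconds) {
    return issuedAtMs + leaseDurationSeconds * 1000 * RENEW_FRACTION;
}

function needsRenewal(lease, nowMs) {
    return nowMs >= renewAtMs(lease.issuedAtMs, lease.leaseDurationSeconds);
}

const lease = { issuedAtMs: 0, leaseDurationSeconds: 3600 };
console.log(needsRenewal(lease, 30 * 60 * 1000)); // false: 30 min into a 1 h lease
console.log(needsRenewal(lease, 45 * 60 * 1000)); // true: past the 40 min mark
```

Renewing well before expiry leaves room for retries if the secrets backend is briefly unavailable.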

Never bake secrets into images

Not in ENV, not in ARG, not in build-time files, not in “dev only” images. Layers are cached, registries get mirrored, images get pulled by people who shouldn’t have them. Treat the image as public and inject every secret at runtime. The corollary: never commit secrets to Git, even in private repos — rotate any value that has ever been committed.

Supply Chain Security

Why Your Build Pipeline Is an Attack Surface

The Problem: SolarWinds, Codecov, event-stream, log4shell, xz-utils. Every one of these started with code you didn’t write running in your binary because of a transitive dependency or a compromised build.

The Solution: Sign what you build, generate provenance about how you built it, scan what you depend on, and verify all of it before deploying.

The pieces of a modern supply chain story:

Signing and verifying with cosign

# Keyless signing: cosign uses an OIDC identity (in GitHub Actions, the job’s
# ambient OIDC token, which requires the id-token: write permission)
# and a short-lived Fulcio cert. Signature is logged in Rekor.
$ cosign sign --yes \
    ghcr.io/acme/orders-api@sha256:9c4f…

# Generate and attach an SBOM
$ syft ghcr.io/acme/orders-api@sha256:9c4f… -o spdx-json > sbom.spdx.json
$ cosign attest --yes --predicate sbom.spdx.json --type spdxjson \
    ghcr.io/acme/orders-api@sha256:9c4f…

# Before deploying, verify the signature came from the expected GitHub Actions
# workflow. An admission controller can enforce the same check in the cluster
# and refuse to schedule anything that fails it.
$ cosign verify \
    --certificate-identity-regexp "https://github.com/acme/.+/.github/workflows/release.yml@.+" \
    --certificate-oidc-issuer       "https://token.actions.githubusercontent.com" \
    ghcr.io/acme/orders-api@sha256:9c4f…

Three things this gets you that nothing else does: the image came from your CI workflow (not a developer’s laptop), it was built from a specific commit (provenance is in the Rekor transparency log), and the SBOM is cryptographically attached so you can answer the next CVE question in seconds, not days.
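Answering the CVE question from an SBOM is a lookup, as this sketch suggests. It assumes the SPDX JSON layout (a top-level packages array whose entries carry name and versionInfo). The sample object is a hand-written stand-in for sbom.spdx.json, and the package names in it are illustrative, not taken from any real image.

```javascript
// Minimal sketch: find every occurrence of a package in an SPDX JSON SBOM.
function findPackages(sbom, name) {
    return (sbom.packages || [])
        .filter((pkg) => pkg.name === name)
        .map((pkg) => ({ name: pkg.name, version: pkg.versionInfo }));
}

// Hand-written stand-in for sbom.spdx.json; real SBOMs list far more packages.
const sbom = {
    spdxVersion: 'SPDX-2.3',
    packages: [
        { name: 'log4j-core', versionInfo: '2.14.1' },
        { name: 'jackson-databind', versionInfo: '2.15.2' },
    ],
};

console.log(findPackages(sbom, 'log4j-core').length); // 1
```

Run across the SBOMs of every deployed image, the same lookup tells you which services ship the affected package and at which version.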

API Security

Why the OWASP API Top 10 Exists

The Problem: Most breaches in microservices aren’t exotic crypto attacks — they’re broken object-level authorization, mass assignment, missing rate limits, and unbounded resource consumption. The boring stuff.

The Solution: Defense in depth at the API edge: a WAF for known-bad shapes, schema validation for inputs, rate limits per identity (not per IP), idempotency keys for writes, and request-size caps for everything.
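Rate limiting per identity rather than per IP can be sketched with a token bucket keyed by the authenticated subject. This is a single-process illustration with a hypothetical class name and an injectable clock; a fleet would keep the buckets in a shared store at the gateway.

```javascript
// Minimal sketch: token-bucket rate limiter keyed by caller identity.
class IdentityRateLimiter {
    constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.now = now;
        this.buckets = new Map(); // identity -> { tokens, updatedMs }
    }

    allow(identity) {
        const nowMs = this.now();
        const bucket = this.buckets.get(identity) || { tokens: this.capacity, updatedMs: nowMs };
        // Refill in proportion to elapsed time, capped at capacity.
        const elapsedSeconds = (nowMs - bucket.updatedMs) / 1000;
        bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
        bucket.updatedMs = nowMs;
        const allowed = bucket.tokens >= 1;
        if (allowed) bucket.tokens -= 1;
        this.buckets.set(identity, bucket);
        return allowed;
    }
}

let fakeNow = 0;
const limiter = new IdentityRateLimiter({ capacity: 2, refillPerSecond: 1, now: () => fakeNow });
console.log(limiter.allow('user_42')); // true
console.log(limiter.allow('user_42')); // true
console.log(limiter.allow('user_42')); // false: bucket is empty
console.log(limiter.allow('user_7'));  // true: separate bucket per identity
```

Keying on identity means one abusive client behind a shared NAT does not throttle its neighbors, and an attacker rotating IPs does not escape the limit.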

The 2023 OWASP API Security Top 10 — in priority order:

# | Risk | What it is
API1 | BOLA | Broken object-level authorization: GET /orders/123 doesn’t check the order belongs to you
API2 | Broken authentication | Weak JWT validation, no rate limit on login, password reset flaws
API3 | BOPLA | Broken object property-level authorization: mass assignment lets you set isAdmin: true
API4 | Unrestricted resource consumption | No rate limits, no size caps, no query-complexity limits
API5 | Broken function-level authorization | Admin endpoint shipped without role check
API6 | Unrestricted access to sensitive business flows | Account takeover, signup abuse, gift-card enumeration
API7 | SSRF | Server-side request forgery, the Capital One vector
API8 | Security misconfiguration | Default creds, verbose errors, missing security headers
API9 | Improper inventory management | Old /v1 endpoint nobody patched
API10 | Unsafe consumption of APIs | You trust an upstream you shouldn’t
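The fixes for API1 and API3 are small enough to sketch. The handler shapes, field names, and the choice of returning 404 for both "missing" and "not yours" are assumptions made for illustration.

```javascript
// API1 (BOLA): check that the object belongs to the caller,
// not merely that the caller holds a valid token.
function getOrder(orders, orderId, caller) {
    const order = orders.get(orderId);
    if (!order || order.ownerId !== caller.userId) {
        return { status: 404 };   // same answer for "missing" and "not yours"
    }
    return { status: 200, body: order };
}

// API3 (BOPLA): copy only the fields a client may set; ignore everything else.
const WRITABLE_FIELDS = ['displayName', 'email'];

function applyProfileUpdate(profile, patch) {
    const next = { ...profile };
    for (const field of WRITABLE_FIELDS) {
        if (Object.prototype.hasOwnProperty.call(patch, field)) {
            next[field] = patch[field];
        }
    }
    return next;
}

const updated = applyProfileUpdate(
    { displayName: 'Ada', email: 'ada@example.com', isAdmin: false },
    { displayName: 'Ada L.', isAdmin: true },   // mass-assignment attempt
);
console.log(updated.isAdmin); // false
```

Both checks are allowlists: the caller gets only objects it owns and can write only fields that are explicitly named.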

The non-negotiable controls at the edge

Idempotency in one paragraph

The client generates a UUID per logical operation and sends it as Idempotency-Key. The server stores (key, request hash, response) for some window (24h is a common choice). On a retry with the same key, the server returns the stored response: no double charge, no double order, no double anything. Stripe, several AWS APIs, and most payment processors implement this pattern; you should too.
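The paragraph above maps to a small store, sketched here in memory. The class name is hypothetical, the operation is synchronous for brevity, and hashing JSON.stringify output assumes the client serializes retries identically. A production version would use a shared datastore and guard against two concurrent first attempts.

```javascript
// Minimal sketch: idempotency store holding (key, request hash, response).
const { createHash } = require('node:crypto');

class IdempotencyStore {
    constructor({ ttlMs = 24 * 60 * 60 * 1000, now = () => Date.now() } = {}) {
        this.ttlMs = ttlMs;
        this.now = now;
        this.entries = new Map(); // key -> { requestHash, response, expiresAtMs }
    }

    execute(key, requestBody, operation) {
        const requestHash = createHash('sha256').update(JSON.stringify(requestBody)).digest('hex');
        const existing = this.entries.get(key);
        if (existing && existing.expiresAtMs > this.now()) {
            if (existing.requestHash !== requestHash) {
                throw new Error('idempotency key reused with a different request');
            }
            return existing.response;   // retry: replay the stored response
        }
        const response = operation(requestBody);
        this.entries.set(key, { requestHash, response, expiresAtMs: this.now() + this.ttlMs });
        return response;
    }
}

let charges = 0;
const idem = new IdempotencyStore();
const charge = (req) => ({ chargeId: ++charges, amount: req.amount });
const a = idem.execute('key-1', { amount: 500 }, charge);
const b = idem.execute('key-1', { amount: 500 }, charge);
console.log(a === b, charges); // true 1
```

Storing the request hash alongside the response is what lets the server distinguish a genuine retry from a client bug that reuses a key for a different operation.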

Real-World Examples

Google — BeyondCorp

Google removed its corporate VPN. Every internal tool is on the public internet, accessed through an identity-aware proxy. The decision to grant access is made per-request, based on user identity, device posture, and the resource being requested. There is no “inside the network” at Google — this is the foundational zero-trust paper, published in 2014, and the model the rest of the industry has slowly adopted as Cloudflare Access, Tailscale, AWS Verified Access, etc.

Netflix — service-to-service auth at scale

Netflix runs on the order of 1,000 services. Service-to-service authentication is mTLS via their internal mesh, with workload identities issued automatically. End-user tokens are translated at the edge (Zuul) into internal “Passport” tokens that downstream services trust. Every internal call carries a signed identity; no service has implicit trust because of where it runs. This is what zero trust looks like at scale.

Cloudflare Access

The commodified version of BeyondCorp. You point an internal app at Cloudflare, configure an identity provider, and write policy: “engineers in the platform group on a managed device can reach Grafana.” Cloudflare terminates TLS, authenticates the user, evaluates the policy, and forwards the request. Your app sees a verified identity header. No VPN, no IP allowlist, no shared bastion to compromise.

Capital One — the cautionary tale

2019: an attacker exploited an SSRF vulnerability in a misconfigured WAF (ModSecurity on an EC2 instance) to call the EC2 metadata service. The metadata service returned temporary IAM credentials for a role with s3:ListAllMyBuckets and broad s3:GetObject. ~100 million records exfiltrated. Two failures stacked: SSRF (an application bug) and an over-privileged IAM role on a workload that didn’t need it. The fix is IMDSv2 (which requires a session token and blocks the SSRF pattern) and least-privilege IAM. AWS introduced IMDSv2 in November 2019, shortly after the breach, but neither control applies itself: you still have to require IMDSv2 on your instances and scope every role deliberately.

The lesson Capital One teaches

No single control is sufficient. The WAF was misconfigured, so input validation didn’t catch it. The metadata service was IMDSv1, so SSRF reached it. The IAM role was over-broad, so the credentials it returned could read S3. Defense in depth means any one of those being right would have stopped the breach. Build the system so that no single failure produces a breach.
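One more layer an application can add is refusing outbound requests to internal address ranges. The sketch below is a deliberately narrow illustration with a hypothetical function name: it checks IPv4 literals and conservatively blocks IPv6 literals, and it does not cover hostnames that resolve to internal addresses, redirects, or DNS rebinding, all of which need a check after resolution.

```javascript
// Minimal sketch: refuse outbound requests to loopback, link-local, and
// private IPv4 literals. Not a complete SSRF defense (see lead-in).
function isBlockedDestination(urlString) {
    const { hostname } = new URL(urlString);
    if (hostname.startsWith('[')) return true;       // conservatively block IPv6 literals
    const octets = hostname.split('.').map(Number);
    const isIpv4 = octets.length === 4 &&
        octets.every((n) => Number.isInteger(n) && n >= 0 && n <= 255);
    if (!isIpv4) return hostname === 'localhost';
    const [a, b] = octets;
    return (
        a === 127 ||                           // loopback
        a === 10 ||                            // private
        (a === 169 && b === 254) ||            // link-local, includes 169.254.169.254
        (a === 172 && b >= 16 && b <= 31) ||   // private
        (a === 192 && b === 168)               // private
    );
}

console.log(isBlockedDestination('http://169.254.169.254/latest/meta-data/')); // true
console.log(isBlockedDestination('https://api.example.com/v1/orders'));        // false
```

On its own this would not have been sufficient either, which is the lesson: it is one more independent layer next to IMDSv2 and a tightly scoped role.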

Best Practices

The short list

  • mTLS everywhere by default. Plaintext between services in production is malpractice. Let the mesh do it.
  • One workload, one identity. SPIFFE/SPIRE or your cloud’s equivalent. Never a shared service account.
  • Validate user tokens at the edge, exchange for internal identity. Don’t propagate user JWTs through every hop.
  • Pin JWT algorithms; validate iss, aud, exp, nbf. Reject alg: none and unrecognized algorithms.
  • Policy as code. OPA or Cedar; rules in version control, evaluated at the sidecar.
  • Default-deny network policy. Then allowlist what each service actually needs.
  • Short-lived, dynamic credentials. Vault or your cloud’s secrets manager. Hours, not months.
  • Never bake secrets into images. Inject at runtime. Treat the image as public.
  • Sign every image, attest every build, verify at admission. Cosign + Sigstore + Kyverno is the modern stack.
  • SBOM per build. When the next log4shell drops, you want minutes-to-answer, not days.
  • Rate limit per identity, not per IP. Idempotency keys on every write. Schema validation on every input.
  • Assume breach. Log every authn and authz decision; ship to a SIEM; alert on anomalies.
  • Run a tabletop exercise quarterly. “The CI signing key has leaked.” If you can’t answer in an hour, you have homework.

The mindset

The thread running through every section of this tutorial is the same: do not trust the network; do not trust the caller; do not trust the artifact. Verify identity, verify intent, verify provenance — on every request, every workload, every deploy. That is the entire job.

Security in microservices is not a feature you add at the end. It’s a property of how you compose identity, transport, policy, and supply chain from day one. Bolt-on security is the most expensive security — you pay for it twice, once for the bolts and once for the breach.

The single most useful sentence about microservices security

If a single compromised service can reach — or impersonate, or read the secrets of — any other service in your fleet, you do not have a microservices architecture. You have a monolith with extra latency and worse failure modes.