Security | LIZIU Microservices

Why Microservices Security Matters

The Problem: A monolith has one process, one auth boundary, one set of in-memory function calls. Decompose it into 200 services and you have 200 network endpoints, 200 credential stores, 200 places where TLS can be misconfigured, and an east-west traffic graph that no firewall has ever seen.

The Solution: Stop defending a perimeter. Authenticate every request, encrypt every hop, attach a workload identity to every service, and let a policy engine answer “is this allowed?” uniformly for the entire fleet.

Real Impact: Capital One’s 2019 breach cost ~$190M and exposed 100M records — one SSRF on one service combined with one over-broad IAM role. The blast radius of a single compromised microservice is whatever you let it touch.

Real-World Analogy

Old security thinking is a medieval castle: thick walls, one drawbridge, archers on the parapet. Once you’re inside the wall, you can wander freely.

The castle = your monolith behind a corporate firewall
The drawbridge = the perimeter API gateway
Inside the wall = the “trusted internal network”
Once a thief gets past the gate = they own everything inside, because nobody checks IDs

Microservices security is more like a bank with hundreds of separate vaults. Every door has a guard, every guard checks ID, and the ID itself is signed and short-lived. Even if the front door fails, the gold in vault #47 is still locked behind its own authentication. That’s zero trust — and it’s the only model that survives east-west traffic.

The decomposition that gave you independently deployable services also gave you an explosion in east-west traffic — calls between services inside the cluster. In a monolith, calls between modules were function calls in the same address space. In microservices, every one of those is now a network round trip. Each round trip is a place an attacker can intercept, replay, or impersonate.

What changed when you decomposed

Concern	Monolith	Microservices
Authentication	One login, in-process session	Every hop must prove identity
Authorization	One codebase, one rule set	N services, one policy engine (if you’re lucky)
Secrets	One `app.yml`	Dynamic, short-lived, per-workload
Network	localhost	Hostile by default
Attack surface	One ingress, a few endpoints	Every service, every endpoint, every dependency
Blast radius	The whole app	Whatever the compromised pod can reach — ideally one cell

The hard part is not any one of these — it’s that they all change at once.

Authentication Between Services

Why mTLS, Not Shared Secrets

The Problem: A shared API key in SERVICE_B_TOKEN environment variables means every pod that ever runs Service A can impersonate Service A forever. Rotation is a deploy, leakage is a breach, and audit is impossible.

The Solution: Cryptographic workload identity. Each pod gets a short-lived X.509 certificate or JWT-SVID issued by a trusted authority based on what the pod is, not a secret it was handed.

The standard for service-to-service authentication today is mutual TLS (mTLS). Both ends present certificates, both ends verify, and the connection is encrypted in transit. SPIFFE (the spec) and SPIRE (a reference implementation) define how those certificates are issued: the workload calls the SPIRE agent on its node, the agent attests the workload (“this process is running in this pod, in this namespace, with this service account”), and SPIRE returns a short-lived certificate (an SVID) carrying a spiffe://<trust-domain>/<path> identity.
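To make the identity format concrete, here is a minimal sketch of parsing a SPIFFE ID in Node. The function names are hypothetical, the character rules are a simplified reading of the SPIFFE ID format, and the /ns/<namespace>/sa/<service-account> path layout is the Kubernetes convention that Istio-style meshes use, assumed here rather than mandated by SPIFFE.

```javascript
// Minimal SPIFFE ID parser (a sketch, not a complete validator).
// Shape: spiffe://<trust-domain>/<path segments>
function parseSpiffeId(id) {
    const match = /^spiffe:\/\/([a-z0-9._-]+)((?:\/[A-Za-z0-9._-]+)*)$/.exec(id);
    if (!match) throw new Error(`invalid SPIFFE ID: ${id}`);
    // Relative path segments are not allowed in a SPIFFE ID.
    const segments = match[2].split('/').slice(1);
    if (segments.some((s) => s === '.' || s === '..')) {
        throw new Error(`invalid path segment in SPIFFE ID: ${id}`);
    }
    return { trustDomain: match[1], path: match[2] };
}

// Hypothetical helper: read namespace and service account from the
// Kubernetes-style path /ns/<namespace>/sa/<service-account>.
function parseKubernetesIdentity(id) {
    const { trustDomain, path } = parseSpiffeId(id);
    const match = /^\/ns\/([^/]+)\/sa\/([^/]+)$/.exec(path);
    if (!match) throw new Error(`not a Kubernetes workload identity: ${id}`);
    return { trustDomain, namespace: match[1], serviceAccount: match[2] };
}

// Hypothetical helper: the same identity without the spiffe:// prefix,
// which is how mesh policies commonly write principals.
function toPrincipal(id) {
    const { trustDomain, path } = parseSpiffeId(id);
    return `${trustDomain}${path}`;
}
```

In a mesh you rarely parse these yourself; the sketch is mainly useful for seeing that the identity is structured data (trust domain, namespace, service account) rather than an opaque string or an IP address.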

Letting the mesh do the work

You almost never want to implement mTLS in application code. A service mesh (Istio, Linkerd, Consul Connect) injects a sidecar proxy (Envoy in most cases) that terminates mTLS on behalf of the application, rotates certificates automatically on a short schedule (24 hours by default in Istio, and configurable to be shorter), and exposes the verified peer identity to the app via a header.

# Istio: enforce strict mTLS for everything in the ‘payments’ namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT      # reject any plaintext connection
---
# Then constrain who can call whom, by SPIFFE identity
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels: { app: payments-api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/checkout/sa/checkout-api"]
      to:
        - operation:
            methods: ["POST"]
            paths:   ["/v1/charges"]

That YAML does three things you should never do by hand: enforces mTLS, attaches a verified workload identity to every request, and authorizes by identity rather than by IP. Certificate rotation, SAN management, and CA trust are all somebody else’s problem now — specifically, the mesh control plane’s.

Stop passing user JWTs through service hops

A common antipattern: the gateway validates a user JWT, then forwards the same token down every internal hop. Every service must now know the user’s OIDC issuer, every service has access to the full token (including any sensitive claims), and any compromised internal service can replay the token. The fix is token exchange: the gateway swaps the user token for a narrowly-scoped internal token (or just propagates a signed claim header on top of mTLS), and downstream services trust the identity of the caller, not a token they cannot validate.
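A minimal sketch of the exchange idea, assuming the gateway holds an Ed25519 key pair and downstream services hold only the public key. The token format here is a deliberately simple two-part string invented for illustration, not a JWT and not the RFC 8693 token-exchange protocol; the function names and claim names are hypothetical.

```javascript
const crypto = require('node:crypto');

// Hypothetical gateway key pair. In practice the private key would sit in a
// KMS or HSM, and only the public key would be distributed to services.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519');

const b64url = (value) => Buffer.from(value).toString('base64url');

// Gateway side: after validating the user's token, mint a short-lived internal
// token that names one audience and carries only the scopes that hop needs.
function mintInternalToken(user, audience, scopes, ttlSeconds = 60) {
    const now = Math.floor(Date.now() / 1000);
    const claims = { sub: user.userId, tid: user.tenantId, aud: audience, scope: scopes, iat: now, exp: now + ttlSeconds };
    const body = b64url(JSON.stringify(claims));
    const signature = crypto.sign(null, Buffer.from(body), privateKey);
    return `${body}.${b64url(signature)}`;
}

// Service side: verify with the gateway's public key. The service needs no
// knowledge of the user's OIDC issuer and cannot mint tokens itself.
function verifyInternalToken(token, expectedAudience, key = publicKey) {
    const [body, signature] = token.split('.');
    if (!body || !signature) throw new Error('malformed token');
    const valid = crypto.verify(null, Buffer.from(body), key, Buffer.from(signature, 'base64url'));
    if (!valid) throw new Error('bad signature');
    const claims = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
    if (claims.aud !== expectedAudience) throw new Error('wrong audience');
    if (claims.exp <= Math.floor(Date.now() / 1000)) throw new Error('expired');
    return claims;
}
```

The audience check is what limits replay: a token minted for one service is rejected by every other service, so a compromised hop cannot reuse what it received.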

End-User Authentication

Why Validate at the Edge, Not Everywhere

The Problem: If every microservice independently downloads JWKS, validates JWTs, and re-implements OIDC, you have 200 places to ship a CVE fix and 200 places to misconfigure a clock skew.

The Solution: Validate the user token once, at the edge gateway. Translate it into a small, internal, signed identity that downstream services trust by virtue of mTLS — not by re-parsing the original JWT.

The protocols you actually use:

OAuth 2.1 — the in-progress consolidation of OAuth 2.0 and its security best practices (PKCE for every authorization-code flow, no implicit flow, no resource-owner password grant). Authorization framework, not authentication.
OIDC (OpenID Connect) — identity layer on top of OAuth 2.0. Adds the ID token (a JWT describing the user) and a /userinfo endpoint.
JWT — signed, self-contained token. Stateless to validate; impossible to revoke immediately. Use short lifetimes (5–15 min).
Refresh tokens — long-lived credentials used to mint new access tokens. Track them server-side so they can be revoked, and rotate on every use.
Opaque tokens — random strings; require introspection against the issuer to validate. Slower, but revocable instantly. Pick opaque for highly sensitive APIs, JWT for high-throughput public ones.

JWT validation done correctly (Node, at the gateway)

import { createRemoteJWKSet, jwtVerify } from 'jose';

// Cache the JWKS in memory; jose handles refresh and key rotation.
const JWKS = createRemoteJWKSet(
    new URL('https://auth.example.com/.well-known/jwks.json')
);

export async function verifyAccessToken(token) {
    const { payload } = await jwtVerify(token, JWKS, {
        issuer:   'https://auth.example.com/',
        audience: 'https://api.example.com',
        algorithms: ['RS256', 'ES256'],   // pin algorithms; never accept ‘none’
        clockTolerance: '30s',
    });

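    // jwtVerify has already checked the signature, iss, aud, and exp/nbf when present.
    // Defense-in-depth: require the claims this gateway depends on.
    if (!payload.exp) throw new Error('missing exp');
    if (!payload.sub) throw new Error('missing sub');
    if (typeof payload.scope !== 'string' || !payload.scope) throw new Error('missing scope');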

    return {
        userId: payload.sub,
        scopes: payload.scope.split(' '),
        tenantId: payload.tid,
    };
}

Three details that production code gets wrong constantly: pinning algorithms (the JOSE specs define alg: none for unsecured tokens, and libraries that honored it by default have been the source of many CVEs), validating aud (without it, an access token minted for service X is also accepted by service Y), and caching JWKS with refresh (because the IdP will rotate keys without warning).

Token introspection for opaque tokens

# RFC 7662 introspection: opaque token in, JSON about it out
POST /oauth2/introspect HTTP/1.1
Host: auth.example.com
Authorization: Basic <client-creds>
Content-Type: application/x-www-form-urlencoded

token=d4Z9-Q8wXrK…&token_type_hint=access_token

HTTP/1.1 200 OK
Content-Type: application/json

{
  "active":    true,
  "sub":       "user_42",
  "scope":     "orders:read orders:write",
  "client_id": "checkout-web",
  "exp":       1736452820
}
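On the caller's side, the exchange above might be sketched like this. The endpoint URL and field names are taken from the example; the function names, the client credentials, and the choice to check a single required scope are assumptions for illustration.

```javascript
// Build the RFC 7662 request from the example above. The client ID and secret
// are whatever the gateway was registered with (placeholders here).
function buildIntrospectionRequest(token, clientId, clientSecret) {
    const basic = Buffer.from(`${clientId}:${clientSecret}`).toString('base64');
    return {
        url: 'https://auth.example.com/oauth2/introspect',
        method: 'POST',
        headers: {
            Authorization: `Basic ${basic}`,
            'Content-Type': 'application/x-www-form-urlencoded',
        },
        body: new URLSearchParams({ token, token_type_hint: 'access_token' }).toString(),
    };
}

// Decide whether an introspection response authorizes a request.
// Anything other than an explicit active: true is treated as a rejection.
function isTokenUsable(response, requiredScope, nowSeconds = Math.floor(Date.now() / 1000)) {
    if (!response || response.active !== true) return false;
    if (typeof response.exp === 'number' && response.exp <= nowSeconds) return false;
    const scopes = typeof response.scope === 'string' ? response.scope.split(' ') : [];
    return scopes.includes(requiredScope);
}
```

The request object can be handed to fetch as `fetch(req.url, req)`. Keeping the decision in a pure function makes the fail-closed behavior easy to test without a live issuer.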

JWT vs opaque, decided

Use JWT when scale matters and you can live with up to access-token-lifetime seconds of revocation lag. Keep them short (5–15 min).
Use opaque + introspection when immediate revocation matters more than throughput — admin sessions, payment flows, anything where “log out everywhere now” is a real requirement.
Always rotate refresh tokens on use. A stolen refresh token used twice should trigger a session kill.
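Rotation with reuse detection can be sketched in a few lines. This is an in-memory stand-in with hypothetical names; a real implementation would persist token hashes rather than raw tokens and would also expire old entries.

```javascript
const crypto = require('node:crypto');

// In-memory stand-in for a server-side refresh-token store (hypothetical).
class RefreshTokenStore {
    constructor() {
        this.tokens = new Map();            // token -> { sessionId, used }
        this.revokedSessions = new Set();
    }

    issue(sessionId) {
        const token = crypto.randomBytes(32).toString('base64url');
        this.tokens.set(token, { sessionId, used: false });
        return token;
    }

    // Exchange a refresh token for a new one. Presenting a token that was
    // already exchanged is treated as theft: the whole session is revoked,
    // which also invalidates the newest token in the chain.
    rotate(token) {
        const record = this.tokens.get(token);
        if (!record) throw new Error('unknown refresh token');
        if (this.revokedSessions.has(record.sessionId)) throw new Error('session revoked');
        if (record.used) {
            this.revokedSessions.add(record.sessionId);
            throw new Error('refresh token reuse detected; session revoked');
        }
        record.used = true;
        return this.issue(record.sessionId);
    }
}
```

The server cannot tell whether the second presenter is the thief or the legitimate client, which is why the safe response is to kill the session and force a fresh login.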

Authorization

Why Policy Belongs Outside the Code

The Problem: “If user is admin or owns the record or is in the same tenant…” sprinkled across 200 services. Auditing is impossible. Changing one rule is a fleet-wide deploy.

The Solution: A policy engine. Services ask “can subject S do action A on resource R?” and the engine answers yes or no. Policy lives in version control alongside code, but separate from it.

Model	Decision based on	Good for	Limitation
RBAC	Roles assigned to users	Coarse-grained, internal tools	Role explosion at scale
ABAC	Attributes of subject, action, resource, environment	Multi-tenant SaaS, fine-grained	Harder to reason about
ReBAC	Relationships in a graph (Zanzibar-style)	Document/folder sharing, social graphs	Specialized infra
Policy as code	Declarative rules evaluated against context	All of the above, uniformly	Requires a policy engine

The two engines you’ll see in production: OPA (Open Policy Agent) with its Rego language, commonly deployed as a sidecar or host-local daemon; and Cedar, AWS’s open-source policy language, used in Amazon Verified Permissions (AVP). Both parse and compile policies ahead of time, so a per-request decision is typically a sub-millisecond in-memory evaluation.

OPA / Rego: a real policy

# package = the namespace this policy lives in
package orders.authz

import rego.v1

# Default: deny.
default allow := false

# Owners can always read their own orders.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.owner == input.subject.user_id
}

# Tenant admins can read any order in their tenant.
allow if {
    input.action == "read"
    input.resource.kind == "order"
    input.resource.tenant == input.subject.tenant_id
    "tenant_admin" in input.subject.roles
}

# Refunds require the ‘orders:refund’ scope AND the order must be < 90 days old.
allow if {
    input.action == "refund"
    input.resource.kind == "order"
    "orders:refund" in input.subject.scopes
    time.now_ns() - input.resource.created_ns < 90 * 24 * 3600 * 1e9
}

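The application sends OPA a JSON input document (“can input.subject do input.action on input.resource?”) and reads the value of the allow rule: over OPA’s REST API, a query to /v1/data/orders/authz/allow returns {"result": true} or {"result": false}. The application contains zero authorization logic. Auditors read the Rego, not your code.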
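From the application's side, the call might look like the sketch below. It assumes OPA is listening on its default port (8181) on localhost and that the policy is loaded under the package name shown above; the shapes of the subject and order objects, and the function names, are hypothetical.

```javascript
// Build the input document the Rego policy expects. createdMs is a
// millisecond timestamp; the policy compares nanoseconds, so it is scaled
// (with the usual floating-point precision limits for values this large).
function buildOpaInput(subject, action, order) {
    return {
        input: {
            subject: {
                user_id: subject.userId,
                tenant_id: subject.tenantId,
                roles: subject.roles ?? [],
                scopes: subject.scopes ?? [],
            },
            action,
            resource: { kind: 'order', owner: order.ownerId, tenant: order.tenantId, created_ns: order.createdMs * 1e6 },
        },
    };
}

// Ask the local OPA for a decision. Fails closed: a network error, a non-2xx
// status, or a missing result all count as "deny". fetchImpl is injectable
// so the function can be exercised without a running OPA.
async function isAllowed(subject, action, order, fetchImpl = fetch) {
    try {
        const res = await fetchImpl('http://localhost:8181/v1/data/orders/authz/allow', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(buildOpaInput(subject, action, order)),
        });
        if (!res.ok) return false;
        const { result } = await res.json();
        return result === true;
    } catch {
        return false;
    }
}
```

Failing closed is the important design choice: if the policy engine is unreachable, the request is denied rather than waved through.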

Centralized policy, decentralized enforcement

Policy definition should be central — a single repo, code-reviewed, signed, distributed. Policy evaluation should be decentralized — OPA as a sidecar, evaluating locally, no network call to a central PDP per request. The bundle API distributes policy; the sidecar evaluates it. You get auditable rules without coupling the request path to a remote service.

Zero Trust Architecture

Why “Trusted Internal Network” Is Dead

The Problem: Once an attacker is inside your VPC — through a phished employee, a compromised CI runner, an SSRF on one service — the “internal network” gives them implicit trust to talk to everything.

The Solution: Treat every network as hostile. Every request — including service-to-service inside your own cluster — must carry verifiable identity, must be authorized, must be encrypted, and must be logged.

Zero trust is not a product, it’s a posture. The five principles:

Never trust, always verify. No implicit trust based on network location.
Authenticate every request. mTLS for service-to-service, OIDC for users, on every hop.
Authorize every request. A policy decision — not just “they have a token”.
Least privilege. Each workload gets only the permissions and network reach it needs.
Assume breach. Log everything, segment everything, rotate everything — so when (not if) a service is compromised, the blast radius is bounded.

The network is hostile by default

If your security model relies on “the firewall keeps bad people out”, you do not have a security model — you have hope. Cloud workloads, third-party SaaS, contractor laptops, vendor integrations, and CI runners all routinely traverse your “internal” network. The day someone owns one of them, the only thing standing between them and your data is per-request authentication and authorization. Build for that day.

What zero trust looks like in practice

Layer	What enforces it	What you check
Transport	mTLS via service mesh	Both ends present valid SPIFFE certs
Workload identity	SPIRE / SPIFFE	Pod attestation matches expected SA + namespace
User identity	OIDC at the edge	Signature, issuer, audience, scopes, expiry
Authorization	OPA / Cedar	Policy decision per request
Network policy	Kubernetes NetworkPolicy / Cilium	L3/L4 allowlists (L7 with Cilium), default-deny
Audit	Mesh access logs + SIEM	Every request, with subject and decision

Secrets Management

Why Long-Lived Secrets Are a Liability

The Problem: A database password baked into a Docker image lives forever. If the image leaks, a contractor laptop is stolen, or the registry is misconfigured public, the password is compromised. Rotation requires a fleet-wide redeploy.

The Solution: Dynamic, short-lived credentials, fetched at workload startup (and refreshed before expiry) from a secrets backend such as HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager. The workload never sees a long-lived value.

The tools and what they’re for:

HashiCorp Vault — the heavyweight: dynamic database creds, PKI, transit encryption, KV. Auth via Kubernetes service accounts, AWS IAM, JWT, etc.
AWS Secrets Manager / Parameter Store — AWS-native, integrates with IAM. Use Secrets Manager for rotation, Parameter Store for cheap config.
AWS KMS — envelope encryption. Don’t store plaintext data keys; let KMS hold the master key, use it to wrap each data key, and store only the wrapped data key next to the ciphertext.
External Secrets Operator / sealed-secrets — in GitOps you cannot commit plaintext secrets. sealed-secrets encrypts with a cluster-only key; ESO syncs from Vault/ASM/GSM into Kubernetes Secrets at runtime.
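cert-manager — the same runtime-provisioning idea for X.509 certificates: it issues and renews them inside the cluster from a configured issuer.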
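Envelope encryption is easier to see in code than in prose. The sketch below uses Node's built-in crypto with a local random key standing in for the KMS master key; with a real KMS the master key never leaves the service, and wrapping and unwrapping would be remote API calls. All function names are hypothetical.

```javascript
const crypto = require('node:crypto');

// AES-256-GCM helpers. Output layout: 12-byte IV | 16-byte auth tag | ciphertext.
function seal(key, plaintext) {
    const iv = crypto.randomBytes(12);
    const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
    const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
    return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

function unseal(key, sealed) {
    const decipher = crypto.createDecipheriv('aes-256-gcm', key, sealed.subarray(0, 12));
    decipher.setAuthTag(sealed.subarray(12, 28));
    // final() throws if the ciphertext or tag was tampered with.
    return Buffer.concat([decipher.update(sealed.subarray(28)), decipher.final()]);
}

// Stand-in for the KMS master key (assumption: held locally only for the demo).
const masterKey = crypto.randomBytes(32);

function encryptRecord(plaintext) {
    const dataKey = crypto.randomBytes(32);            // fresh data key per record
    return {
        wrappedKey: seal(masterKey, dataKey),          // safe to store with the data
        ciphertext: seal(dataKey, Buffer.from(plaintext, 'utf8')),
    };                                                 // plaintext data key is not kept
}

function decryptRecord({ wrappedKey, ciphertext }) {
    const dataKey = unseal(masterKey, wrappedKey);
    return unseal(dataKey, ciphertext).toString('utf8');
}
```

What gets stored is the pair (wrapped key, ciphertext). Whoever steals the database gets neither the data nor a usable key without also getting access to the master key.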

Vault dynamic database credentials

# 1) Configure Vault to issue per-workload Postgres creds.
$ vault write database/config/orders-db \
    plugin_name=postgresql-database-plugin \
    allowed_roles="orders-app" \
    connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/orders" \
    username="vault-admin" \
    password="…"

$ vault write database/roles/orders-app \
    db_name=orders-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
                          GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app TO \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"

# 2) The pod authenticates with its Kubernetes SA, then asks for creds.
$ vault write auth/kubernetes/login role=orders-app jwt=@/var/run/secrets/…/token

$ vault read database/creds/orders-app
# => { username: "v-orders-app-tH8…", password: "…", lease_duration: 3600 }

That credential is valid for one hour. It’s tied to one pod. If the pod is compromised the attacker has at most an hour. If the credential leaks into a log file, it’s expired before anyone reads it. The database, meanwhile, has a fresh role per workload — auditing actually works.
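Short-lived credentials only work if the workload replaces them before they expire. A minimal sketch of that loop follows, assuming a caller-supplied fetchCreds function that reads database/creds/<role> and resolves to an object with a lease_duration in seconds. The two-thirds refresh point, the jitter, and the five-second retry are illustrative choices, not Vault defaults.

```javascript
// Decide when to fetch replacement credentials: after a fixed fraction of the
// lease has elapsed, with random jitter so that pods started together do not
// all hit the secrets backend at the same instant.
function nextRefreshDelayMs(leaseDurationSeconds, { fraction = 2 / 3, jitter = 0.1, random = Math.random } = {}) {
    const base = leaseDurationSeconds * 1000 * fraction;
    const spread = base * jitter;
    return Math.round(base - spread + 2 * spread * random());
}

// Hypothetical refresh loop. onCreds receives each new credential set (for
// example, to rebuild a connection pool). schedule defaults to setTimeout and
// is injectable for testing.
async function keepCredentialsFresh(fetchCreds, onCreds, schedule = setTimeout) {
    let delayMs;
    try {
        const creds = await fetchCreds();
        onCreds(creds);
        delayMs = nextRefreshDelayMs(creds.lease_duration);
    } catch {
        delayMs = 5000;   // backend unavailable: keep current creds, retry soon
    }
    schedule(() => keepCredentialsFresh(fetchCreds, onCreds, schedule), delayMs);
}
```

With a one-hour lease this refreshes roughly every 40 minutes, leaving time to retry before the old credential stops working.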

Never bake secrets into images

Not in ENV, not in ARG, not in build-time files, not in “dev only” images. Layers are cached, registries get mirrored, images get pulled by people who shouldn’t have them. Treat the image as public and inject every secret at runtime. The corollary: never commit secrets to Git, even in private repos — rotate any value that has ever been committed.

Supply Chain Security

Why Your Build Pipeline Is an Attack Surface

The Problem: SolarWinds, Codecov, event-stream, log4shell, xz-utils. Every one of these started with code you didn’t write running in your binary because of a transitive dependency or a compromised build.

The Solution: Sign what you build, generate provenance about how you built it, scan what you depend on, and verify all of it before deploying.

The pieces of a modern supply chain story:

SBOM (Software Bill of Materials) — SPDX or CycloneDX, generated per build, lists every dependency and version. Without an SBOM you can’t answer “am I running affected log4j?”
Image signing — cosign (Sigstore) signs container images, the signature lives in the registry, admission policy verifies before scheduling.
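SLSA (Supply-chain Levels for Software Artifacts) — a leveled framework defining what build provenance you produce and how trustworthy it is. In SLSA v1.0, Build Level 3 means the build ran on a hardened, isolated build platform that generates signed, unforgeable provenance.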
Dependency scanning — Dependabot, Snyk, Trivy, Grype. Scan on every PR and on a schedule for newly disclosed CVEs in already-shipped artifacts.
Admission control — the cluster refuses to schedule unsigned or unscanned images. Kyverno, Gatekeeper, or Sigstore policy-controller.
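The “am I running the affected version?” question becomes a small query once an SBOM exists. The sketch below reads only the packages, name, and versionInfo fields of an SPDX 2.x JSON document; the function names are hypothetical and the version comparison is deliberately naive.

```javascript
// Find packages in an SPDX 2.x JSON document by name, optionally filtered
// by a version predicate. Pass a regex without the "g" flag.
function findPackages(sbom, namePattern, versionMatches = () => true) {
    return (sbom.packages ?? [])
        .filter((p) => namePattern.test(p.name ?? ''))
        .filter((p) => versionMatches(p.versionInfo ?? ''))
        .map((p) => ({ name: p.name, version: p.versionInfo ?? 'unknown' }));
}

// Naive dotted-version comparison (numeric segments only). Real tooling
// should use the package ecosystem's own version semantics.
function versionLessThan(a, b) {
    const pa = a.split('.').map(Number);
    const pb = b.split('.').map(Number);
    for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
        const x = pa[i] ?? 0;
        const y = pb[i] ?? 0;
        if (x !== y) return x < y;
    }
    return false;
}
```

Usage, with an illustrative cutoff version: `findPackages(sbom, /^log4j-core$/, (v) => versionLessThan(v, '2.17.1'))` returns every matching package below that version. Run it across the SBOMs of everything deployed and the answer arrives in minutes.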

Signing and verifying with cosign

# Keyless signing: in GitHub Actions (with the id-token: write permission),
# cosign picks up the workflow's OIDC identity automatically and obtains a
# short-lived Fulcio cert. The signature is logged in Rekor.
$ cosign sign --yes \
    ghcr.io/acme/orders-api@sha256:9c4f…

# Generate and attach an SBOM
$ syft ghcr.io/acme/orders-api@sha256:9c4f… -o spdx-json > sbom.spdx.json
$ cosign attest --yes --predicate sbom.spdx.json --type spdxjson \
    ghcr.io/acme/orders-api@sha256:9c4f…

# What admission control enforces before scheduling, shown as the equivalent
# CLI check: accept only a verifiable signature from the expected
# GitHub Actions workflow.
$ cosign verify \
    --certificate-identity-regexp '^https://github\.com/acme/.+/\.github/workflows/release\.yml@.+$' \
    --certificate-oidc-issuer       "https://token.actions.githubusercontent.com" \
    ghcr.io/acme/orders-api@sha256:9c4f…

Three things this gets you that are hard to get any other way: the image came from your CI workflow (not a developer’s laptop), it was built from a specific commit (the signing certificate records the workflow and commit, and the signature is in the Rekor transparency log), and the SBOM is cryptographically attached so you can answer the next CVE question in seconds, not days.

API Security

Why the OWASP API Top 10 Exists

The Problem: Most breaches in microservices aren’t exotic crypto attacks — they’re broken object-level authorization, mass assignment, missing rate limits, and unbounded resource consumption. The boring stuff.

The Solution: Defense in depth at the API edge: a WAF for known-bad shapes, schema validation for inputs, rate limits per identity (not per IP), idempotency keys for writes, and request-size caps for everything.

The 2023 OWASP API Security Top 10 — in priority order:

#	Risk	What it is
API1	BOLA	Broken object-level authorization — `GET /orders/123` doesn’t check the order belongs to you
API2	Broken authentication	Weak JWT validation, no rate limit on login, password reset flaws
API3	BOPLA	Broken object property-level authorization — mass assignment lets you set `isAdmin: true`
API4	Unrestricted resource consumption	No rate limits, no size caps, no query-complexity limits
API5	Broken function-level authorization	Admin endpoint shipped without role check
API6	Unrestricted access to sensitive business flows	Automated abuse of a legitimate flow: bulk signup, ticket scalping, gift-card enumeration
API7	SSRF	Server-side request forgery — the Capital One vector
API8	Security misconfiguration	Default creds, verbose errors, missing security headers
API9	Improper inventory management	Old `/v1` endpoint nobody patched
API10	Unsafe consumption of APIs	You trust an upstream you shouldn’t

The non-negotiable controls at the edge

Rate limit per identity, not per IP. One subject, one tenant, one API key — not the IP of the proxy in front of you.
Schema validation. OpenAPI / JSON Schema at the gateway. Reject unknown fields (additionalProperties: false) to defeat mass assignment.
Request size limits. Body size, header size, URL length, multipart parts. The default in most frameworks is “way too large.”
Idempotency keys on every write. Idempotency-Key: <uuid> — the server stores the response for 24h and replays on retry. Without this, a network blip plus a retry equals a duplicate charge.
WAF — Cloudflare, AWS WAF (for example in front of CloudFront or a load balancer). Catches known-bad shapes (SQLi, XSS, LFI) before they reach your code. Not a substitute for input validation; a layer in front of it.
Output filtering. Never return more than the caller asked for. Field-level allowlists at the serializer.
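Rate limiting per identity can be sketched as a token bucket keyed by the verified caller. This is a single-process, in-memory illustration with hypothetical names; a fleet would need a shared store, and the bucket map would need eviction.

```javascript
// Token bucket per identity. "key" should come from the verified caller
// (subject, tenant, or API key), never from the source IP.
class IdentityRateLimiter {
    constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.now = now;                 // injectable clock, in milliseconds
        this.buckets = new Map();       // key -> { tokens, updatedAt }
    }

    // Returns true and consumes one token if the caller is within its limit.
    tryAcquire(key) {
        const t = this.now();
        const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: t };
        const elapsedSeconds = (t - bucket.updatedAt) / 1000;
        bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
        bucket.updatedAt = t;
        this.buckets.set(key, bucket);
        if (bucket.tokens < 1) return false;
        bucket.tokens -= 1;
        return true;
    }
}
```

Because the key is the identity, one noisy tenant exhausts only its own bucket, and callers behind the same proxy IP do not throttle each other.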

Idempotency in one paragraph

The client generates a UUID per logical operation and sends it as Idempotency-Key. The server stores (key, request hash, response) for some window (24h is a common choice). On a retry with the same key and the same request, the server returns the stored response: no double charge, no double order, no double anything. The request hash is there so that the same key arriving with a different payload is rejected rather than replayed. Stripe and many other payment APIs implement this; you should too.
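A minimal sketch of the server side, assuming a synchronous handler and an in-memory store. The class name is hypothetical; a production version would use a shared store with an atomic insert so that two concurrent requests with the same key cannot both run the handler, and would hash a canonical form of the request rather than raw JSON.stringify output.

```javascript
const crypto = require('node:crypto');

const WINDOW_MS = 24 * 60 * 60 * 1000;   // retention window for stored responses

class IdempotencyStore {
    constructor(now = () => Date.now()) {
        this.now = now;
        this.entries = new Map();   // key -> { requestHash, response, expiresAt }
    }

    // Runs handler(body) at most once per key within the window.
    execute(key, body, handler) {
        const requestHash = crypto.createHash('sha256').update(JSON.stringify(body)).digest('hex');
        const existing = this.entries.get(key);
        if (existing && existing.expiresAt > this.now()) {
            if (existing.requestHash !== requestHash) {
                throw new Error('Idempotency-Key reused with a different request');
            }
            return existing.response;   // replay the stored response, no side effects
        }
        const response = handler(body);
        this.entries.set(key, { requestHash, response, expiresAt: this.now() + WINDOW_MS });
        return response;
    }
}
```

A retried charge with the same key and body gets the original response back and the handler runs once; the same key with a different amount is an error, not a second charge.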

Real-World Examples

Google — BeyondCorp

Google removed its corporate VPN. Every internal tool is on the public internet, accessed through an identity-aware proxy. The decision to grant access is made per request, based on user identity, device posture, and the resource being requested. There is no “inside the network” at Google. The first BeyondCorp paper, published in 2014, is the foundational zero-trust reference, and the rest of the industry has slowly adopted the model through products such as Cloudflare Access, Tailscale, and AWS Verified Access.

Netflix — service-to-service auth at scale

Netflix runs on the order of 1,000 services. Service-to-service authentication is mTLS via their internal mesh, with workload identities issued automatically. End-user tokens are translated at the edge (Zuul) into internal “Passport” tokens that downstream services trust. Every internal call carries a signed identity; no service has implicit trust because of where it runs. This is what zero trust looks like at scale.

Cloudflare Access

The commodified version of BeyondCorp. You point an internal app at Cloudflare, configure an identity provider, and write policy: “engineers in the platform group on a managed device can reach Grafana.” Cloudflare terminates TLS, authenticates the user, evaluates the policy, and forwards the request. Your app sees a verified identity header. No VPN, no IP allowlist, no shared bastion to compromise.

Capital One — the cautionary tale

2019: an attacker exploited an SSRF vulnerability in a misconfigured WAF (ModSecurity on an EC2 instance) to call the EC2 metadata service. The metadata service returned temporary IAM credentials for a role with s3:ListAllMyBuckets and broad s3:GetObject. ~100 million records exfiltrated. Two failures stacked: SSRF (an application bug) and an over-privileged IAM role on a workload that didn’t need it. The fix is IMDSv2 (which requires a session token obtained with a PUT request, blocking the typical SSRF pattern) and least-privilege IAM. Neither comes for free: require IMDSv2 explicitly on every instance, and scope each role to what its workload actually needs.

The lesson Capital One teaches

No single control is sufficient. The WAF was misconfigured, so instead of filtering the malicious request it relayed it. The metadata service was IMDSv1, so SSRF reached it. The IAM role was over-broad, so the credentials it returned could read S3. Defense in depth means any one of those being right would have stopped the breach. Build the system so that no single failure produces a breach.

Best Practices

The short list

mTLS everywhere by default. Plaintext between services in production is malpractice. Let the mesh do it.
One workload, one identity. SPIFFE/SPIRE or your cloud’s equivalent. Never a shared service account.
Validate user tokens at the edge, exchange for internal identity. Don’t propagate user JWTs through every hop.
Pin JWT algorithms; validate iss, aud, exp, nbf. Reject alg: none and unrecognized algorithms.
Policy as code. OPA or Cedar; rules in version control, evaluated at the sidecar.
Default-deny network policy. Then allowlist what each service actually needs.
Short-lived, dynamic credentials. Vault or your cloud’s secrets manager. Hours, not months.
Never bake secrets into images. Inject at runtime. Treat the image as public.
Sign every image, attest every build, verify at admission. Cosign (part of Sigstore) plus Kyverno is a common modern stack.
SBOM per build. When the next log4shell drops, you want minutes-to-answer, not days.
Rate limit per identity, not per IP. Idempotency keys on every write. Schema validation on every input.
Assume breach. Log every authn and authz decision; ship to a SIEM; alert on anomalies.
Run a tabletop exercise quarterly. “The CI signing key has leaked.” If you can’t answer in an hour, you have homework.

The mindset

The thread running through every section of this tutorial is the same: do not trust the network; do not trust the caller; do not trust the artifact. Verify identity, verify intent, verify provenance — on every request, every workload, every deploy. That is the entire job.

Security in microservices is not a feature you add at the end. It’s a property of how you compose identity, transport, policy, and supply chain from day one. Bolt-on security is the most expensive security — you pay for it twice, once for the bolts and once for the breach.

The single most useful sentence about microservices security

If a single compromised service can reach — or impersonate, or read the secrets of — any other service in your fleet, you do not have a microservices architecture. You have a monolith with extra latency and worse failure modes.