Why Containers Matter for Microservices
Why Containerization Matters
The Problem: A monolith ships as one artifact onto one (well-understood) machine. A microservices estate ships dozens or hundreds of artifacts — written in different languages, with different runtime versions, different system libraries, different OS expectations — onto pools of identical machines. Without a uniform packaging format, “works on my laptop” goes from annoying to fatal at 100 services.
The Solution: A container bundles a service’s code, runtime, and OS dependencies into a single immutable image. The host kernel runs it via Linux namespaces and cgroups. Every environment — dev laptop, CI runner, staging, production — runs the exact same bytes.
Real Impact: Google, Netflix, Stripe, Shopify, and effectively every modern platform ship containers. Kubernetes, the de facto orchestrator, only knows how to run containers. If your service can’t be packaged as an OCI image, it can’t participate in the platform.
Real-World Analogy
Think of a shipping container. Before standardization, freight was loaded loose: barrels, crates, sacks. Each ship and each port had its own handling process. Loading took weeks. Damage was constant.
The standard 20-foot intermodal container changed everything. Same hull, any cargo — coffee beans, washing machines, cars. Cranes, ships, trucks, and trains all speak one interface. The container doesn’t care what’s inside; the port doesn’t care what’s inside; only the shipper and the receiver care.
An OCI image is the same idea for software. Your runtime, language version, and OS libraries live inside the box. Kubernetes, the registry, and the host don’t care — they just move the box.
Before containers, deploying a microservice meant a mix of configuration management (Chef, Puppet, Ansible), language-specific package managers (gem, pip, npm), and a fervent hope that the staging machine matched the production machine matched the laptop where the bug was last reproduced. Snowflakes were everywhere. A new service required a new playbook, a new base AMI, a new on-call rotation that knew which build of OpenSSL was on which subnet.
Containers collapsed that complexity into a single artifact: an OCI image. Build once, run anywhere a kernel and a container runtime exist. The blast radius of “the host environment” shrank from “everything outside your code” to almost zero.
What containers actually buy you
- Reproducibility. The image you tested is the image you ship. Bit-for-bit.
- Density. Containers share the host kernel, so you can pack 50 of them on a node where you might fit 5 VMs.
- Speed. A container starts in hundreds of milliseconds, not minutes. This is what makes autoscaling and rolling deploys feel instant.
- Isolation that’s good enough. Not VM-grade, but enough for trusted multi-tenant workloads on a shared cluster.
- An interface for the platform. Kubernetes, Nomad, ECS — they all consume the same OCI image. You decouple the workload from the runner.
The Container Model
Why You Should Understand the Primitives
The Problem: Treat containers as “tiny VMs” and you’ll be surprised by every networking quirk, every signal-handling bug, and every “why is my PID 1 a shell script” outage.
The Solution: Internalize three Linux primitives — namespaces, cgroups, and the OCI image — and the rest of the runtime stack stops being magic.
A container is not a thing. It’s a process (or a small group of processes) that the kernel pretends is alone. Three Linux features do all the work:
- Namespaces isolate what the process can see: its own PID space, its own mount table, its own network interfaces, its own user IDs, its own hostname. From inside, PID 1 looks like the only process on the system.
- cgroups isolate what the process can use: CPU shares, memory limits, IO bandwidth, PIDs. The kernel enforces these; if you exceed memory you get OOM-killed, regardless of whether the host has free RAM.
- The OCI image is the on-disk format: a JSON manifest plus a stack of tarball layers. The runtime unpacks the layers, sets up namespaces and cgroups, and execs your entrypoint inside that environment.
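The first two primitives are visible from inside any process, which is a quick way to convince yourself there is no magic. The sketch below is a minimal illustration, assuming a Linux host with the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; the function names are made up for this example and the default path is an assumption that will not hold on cgroup v1 hosts.

```python
import os
from typing import Dict, Optional


def parse_memory_max(raw: str) -> Optional[int]:
    """Parse the contents of a cgroup v2 memory.max file.

    The kernel writes the literal string "max" when no limit is set,
    otherwise the limit in bytes.
    """
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def read_memory_limit(path: str = "/sys/fs/cgroup/memory.max") -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or unknown."""
    try:
        with open(path) as f:
            return parse_memory_max(f.read())
    except (OSError, ValueError):
        # Not Linux, not cgroup v2, or the file is not readable.
        return None


def namespace_ids(pid: str = "self") -> Dict[str, str]:
    """Map namespace name to the kernel's identifier for it.

    Two processes share a namespace exactly when these identifiers match.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids: Dict[str, str] = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            ids[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        pass  # /proc is not available (non-Linux host) or not readable
    return ids
```

Run inside a container started with a memory limit, read_memory_limit() should report that limit rather than the host's RAM, and namespace_ids() should differ from what the same call returns on the host.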
The OCI image format
An image is a list of layers, each layer is a tarball of filesystem changes, and a manifest describes the order. When you build, each Dockerfile instruction that touches the filesystem produces a new layer. When you pull, the runtime fetches only the layers you don’t already have. This is why sharing a base image across services is cheap — the base layers are downloaded once and shared on disk.
# Inspect the layer structure of an image
docker image inspect nginx:1.27 --format='{{json .RootFS.Layers}}' | jq
# Each sha256:... is a layer diff ID (digest of the uncompressed tarball); the registry stores the compressed blobs under different digests
[
"sha256:9853575bc4f955c5892dd64187538a6cd02dba6968b32f9b410d8a3fdd2e9d70",
"sha256:577a23b5f0bdfae5ec8c7c1bf5beb4b0e98ce0cb79f6fdc7c8b09f24bfb0bcd1",
"sha256:1f8c4e4a5e23c2d4cf28a0f37c0fbe7a8b69d3b53b1cdda43f4b25e89ea5e0b1"
]
Layer caching is the single most important performance property of the build. Order your Dockerfile so that the layers that change least often come first — system packages, then dependencies, then your code. Get this wrong and every commit invalidates a 200 MB layer; get it right and most builds reuse 99% of bytes.
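Why order matters is easier to see with a toy model. The sketch below is not BuildKit's actual algorithm; it only assumes that each step's cache key depends on the key of the step before it, the instruction text, and a fingerprint of the files the step reads. The step tuples and fingerprints such as "src-1" are invented for illustration.

```python
import hashlib
from typing import List, Tuple

# (instruction text, fingerprint of the files the instruction reads)
Step = Tuple[str, str]


def cache_keys(steps: List[Step]) -> List[str]:
    """Chain one key per step; each key depends on every step before it."""
    keys = []
    parent = ""
    for instruction, inputs in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def reused_steps(old: List[Step], new: List[Step]) -> int:
    """Count the leading steps whose cache key is unchanged between two builds."""
    count = 0
    for before, after in zip(cache_keys(old), cache_keys(new)):
        if before != after:
            break
        count += 1
    return count


def with_source(steps: List[Step], fingerprint: str) -> List[Step]:
    """Return the same steps with a new fingerprint on the 'COPY . .' step."""
    return [(i, fingerprint if i == "COPY . ." else f) for i, f in steps]


# Dependencies first, source last: a code change leaves two steps cached.
GOOD_ORDER = [
    ("COPY go.mod go.sum ./", "mods-1"),
    ("RUN go mod download", ""),
    ("COPY . .", "src-1"),
    ("RUN go build", ""),
]

# Source first: a code change invalidates everything, including the download.
BAD_ORDER = [
    ("COPY . .", "src-1"),
    ("RUN go mod download", ""),
    ("RUN go build", ""),
]
```

Because the keys are chained, the first changed step poisons every step after it. That is the whole argument for copying lockfiles before source.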
Container vs VM
| Property | Container | Virtual Machine |
|---|---|---|
| Isolation boundary | Kernel namespaces & cgroups | Hardware virtualization |
| Kernel | Shared with host | Each VM has its own |
| Boot time | ~100 ms | Tens of seconds |
| Image size | 10s of MB typical | Hundreds of MB to GB |
| Density | 50–200 per node | ~10 per node |
| Security boundary | Adequate for trusted code | Hardware-level; stronger |
| Use when | App-level isolation, scale-out microservices | OS-level isolation, untrusted multi-tenancy |
Modern setups blur the line: Firecracker (the engine behind AWS Lambda) and Kata Containers run an OCI image inside a microVM, getting VM-grade isolation with container-grade startup. For 99% of microservices on a private cluster, plain containers are correct.
Writing a Production Dockerfile
Why the Dockerfile Is Production Code
The Problem: A first-draft Dockerfile usually works. It also leaks 800 MB, runs as root, has no healthcheck, doesn’t forward signals, and rebuilds dependencies on every commit. None of those failures are visible in dev.
The Solution: Treat the Dockerfile as a deliverable, not as glue. Multi-stage build. Pinned base. Non-root user. Cache-friendly layer order. Small final image.
A multi-stage Go Dockerfile
# syntax=docker/dockerfile:1.7
# ---- build stage ----
FROM golang:1.22-alpine@sha256:0d3653dd6f35159ec6e3d10263a42372f6f194c3dea0b35235d72aabde86486e AS build
WORKDIR /src
# Copy module files first so dependency download is cached
# across code-only changes.
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download
# Now copy source. This layer rebuilds on any code change,
# but the dependency layer above stays cached.
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -trimpath -ldflags="-s -w" -o /out/api ./cmd/api
# ---- runtime stage: distroless, ~20 MB total ----
FROM gcr.io/distroless/static-debian12:nonroot@sha256:6ec5aa99dc335666e79dc64e4a6c8b89d8a0a2bcfe9a8cfce4a1f3f3a5b3e7e1
COPY --from=build /out/api /api
# distroless includes a 'nonroot' user (uid 65532) by default
USER nonroot:nonroot
EXPOSE 8080
# Use exec form so PID 1 is the binary, not a shell.
# SIGTERM goes directly to the process; graceful shutdown works.
ENTRYPOINT ["/api"]
Roughly 20 MB final image, no shell, no package manager, no setuid binaries, runs as a non-root user. There is almost nothing inside for an attacker to pivot to. The build stage may be a 400 MB Go toolchain image, but it’s discarded — only the runtime stage ships.
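Exec form only delivers SIGTERM to your process; the process still has to do something with it. The service above is written in Go, but the pattern is the same in any language, so here is a minimal Python sketch of it. The function names and the polling loop are stand-ins for a real server, not part of any framework.

```python
import os
import signal
import threading

shutdown_requested = threading.Event()


def _handle_termination(signum, frame):
    # Only set a flag here; do the real cleanup in the main loop.
    shutdown_requested.set()


def install_handlers() -> None:
    """Register SIGTERM and SIGINT handlers. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, _handle_termination)
    signal.signal(signal.SIGINT, _handle_termination)


def request_shutdown() -> None:
    """Send SIGTERM to this process, simulating what `docker stop` does from outside."""
    os.kill(os.getpid(), signal.SIGTERM)


def serve_until_shutdown(poll_seconds: float = 0.1) -> str:
    """Stand-in for a server loop: run until a termination signal arrives."""
    while not shutdown_requested.wait(poll_seconds):
        pass  # a real service would be handling requests here
    # Drain in-flight work, close connections, flush logs, then return.
    return "drained"
```

If a shell sat at PID 1 instead, the signal would stop at the shell and this handler would never fire; the runtime would wait out the grace period and then send SIGKILL.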
A multi-stage Python Dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim@sha256:4c0a9b2c3a8e84e6f3a8c54f1f0c4b6d2a9b8e7c3d4e5f6a7b8c9d0e1f2a3b4c AS build
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
WORKDIR /app
# Install dependencies into an isolated prefix so the runtime stage copies only that.
# PIP_NO_CACHE_DIR is deliberately not set: it would defeat the cache mount below.
COPY requirements.txt ./
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --prefix=/install -r requirements.txt
# ---- runtime: slim, non-root, no build tools ----
FROM python:3.12-slim@sha256:4c0a9b2c3a8e84e6f3a8c54f1f0c4b6d2a9b8e7c3d4e5f6a7b8c9d0e1f2a3b4c
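# Packages installed with --prefix=/install are not on sys.path by default; PYTHONPATH fixes that.
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PATH=/install/bin:$PATH \
PYTHONPATH=/install/lib/python3.12/site-packages
# tini is not part of python:3.12-slim; install it for the ENTRYPOINT below.
RUN apt-get update \
&& apt-get install -y --no-install-recommends tini \
&& rm -rf /var/lib/apt/lists/*
# Create a real, non-login system user.
RUN groupadd --system --gid 1001 app \
&& useradd --system --uid 1001 --gid app --no-create-home app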
WORKDIR /app
COPY --from=build /install /install
COPY --chown=app:app . .
USER app
EXPOSE 8000
# tini (or dumb-init) handles signal forwarding + zombie reaping
# when your process spawns children. Skip if you're already PID 1-safe.
ENTRYPOINT ["tini", "--"]
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", \
"-b", "0.0.0.0:8000", "app.main:app"]
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD python -c "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8000/healthz', timeout=2).status==200 else 1)"
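The HEALTHCHECK above assumes the app answers GET /healthz with a 200. The real service would do that inside its own web framework; the sketch below uses only the standard library to show the shape of the endpoint and of the probe. HealthHandler, start_health_server, and probe are hypothetical names, and binding to 127.0.0.1 is for local experimentation (a containerized service would bind 0.0.0.0).

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def start_health_server(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(port: int, path: str = "/healthz") -> int:
    """Issue the same kind of request the HEALTHCHECK makes and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Keep the handler cheap: a health endpoint that queries the database turns a database blip into a wave of container restarts.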
What every production Dockerfile gets right
- Multi-stage. Build tools never reach the runtime image.
- Pinned digest, not tag. FROM image:tag@sha256:... Tags can move under you; digests cannot.
- Layer order: least-changing first. Lockfile copy + install before source copy.
- Non-root user. A real UID, set with USER. Default-root is the most common production gotcha.
- Exec-form ENTRYPOINT. Brackets, not strings. Otherwise a shell sits in front of your process and signals get eaten.
- COPY, not ADD. ADD auto-extracts archives and fetches URLs, both surprise behaviors.
- Healthcheck. Defines what “running” means to the platform. Distinct from the readiness probe in Kubernetes, which ignores the Dockerfile HEALTHCHECK and relies on its own probes.
Base image choice
| Base | Size | Has shell? | Has package manager? | When to use |
|---|---|---|---|---|
| distroless/static | ~2 MB | No | No | Static binaries (Go, Rust). Smallest attack surface. |
| distroless/base | ~20 MB | No | No | Binaries that need glibc + a CA bundle. |
| distroless/python3 | ~50 MB | No | No | Python apps that don’t need a shell at runtime. |
| alpine | ~5 MB | Yes (ash) | Yes (apk) | Tiny, but musl libc breaks some wheels & native libs. |
| debian-slim / python:3.12-slim | ~80 MB | Yes | Yes (apt) | Compatibility with anything glibc; sane default. |
| ubuntu / full debian | ~200 MB+ | Yes | Yes | Almost never needed for a service. |
Default to slim. Reach for distroless once the service is stable and you want to harden it. Avoid alpine for Python and Node unless you’ve already paid the musl tax somewhere else.
Never run as root, never bake secrets, never use :latest
Three rules that will cover 80% of incidents:
- Never run as root. A container that runs as UID 0 inside is one kernel CVE away from being UID 0 on the host. Use a non-root user and refuse to deploy anything else. Pod Security Admission (enforcing the restricted Pod Security Standard) or OPA Gatekeeper can enforce this; PodSecurityPolicy was removed in Kubernetes 1.25.
- Never bake secrets into images. Anything in any layer is permanent and visible to anyone who can docker pull. Mount secrets at runtime via Kubernetes Secrets or env vars from a vault; for secrets needed only during the build, use BuildKit’s --mount=type=secret.
- Never deploy :latest to production. Tags float. Two pods of the same Deployment can end up on different image versions. Pin to digests, or to immutable tags like a git SHA.
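The third rule is easy to gate on before anything reaches the cluster. The sketch below classifies an image reference by how it is pinned; classify_image_ref and its three labels are made up for this example, and a real admission policy would parse references more carefully than this.

```python
import re

# A full sha256 digest is 64 lowercase hex characters.
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_image_ref(ref: str) -> str:
    """Return "digest", "tag", or "floating" for an image reference."""
    if DIGEST.search(ref):
        return "digest"
    # Look only at the last path component so a registry port
    # (localhost:5000/api) is not mistaken for a tag.
    last = ref.rsplit("/", 1)[-1]
    _, separator, tag = last.partition(":")
    if not separator or tag == "latest":
        return "floating"  # a missing tag means :latest implicitly
    return "tag"


def allowed_in_production(ref: str) -> bool:
    """Accept digests and explicit tags; reject anything that floats."""
    return classify_image_ref(ref) != "floating"
```

Note that "tag" only means a tag was written down. Whether that tag is immutable is a registry setting and a team convention, not something the reference itself can prove.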
Image Hygiene
Why Image Hygiene Is a Production Concern
The Problem: Most images shipped from a typical CI pipeline contain hundreds of CVEs and dozens of unused packages. Each is an attack surface, a pull-time cost, and a future audit finding.
The Solution: Small images, pinned bases, vulnerability scans on every build, and an SBOM you can grep when the next log4shell hits.
A good rule: aim for <100 MB final images. Sub-50 MB is achievable for Go and Rust. Once you’re over a few hundred megabytes, your pull times dominate cold-start latency and your scan reports become unreadable.
Scanning with Trivy or Grype
# Trivy: by Aqua Security; the most widely deployed scanner
trivy image --severity HIGH,CRITICAL --ignore-unfixed myorg/api:1.4.2
# Fail the CI build if any HIGH or CRITICAL with a fix exists
trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed myorg/api:1.4.2
# Grype: by Anchore; very fast, pairs naturally with Syft for SBOM
grype myorg/api:1.4.2 --fail-on high
# Docker Scout: built into recent Docker Desktop / Docker CLI
docker scout cves myorg/api:1.4.2 --only-severity high,critical
docker scout recommendations myorg/api:1.4.2
Generating an SBOM
# Syft: produces an SBOM in SPDX or CycloneDX format
syft myorg/api:1.4.2 -o spdx-json > api-1.4.2.sbom.json
# Attach the SBOM to the image as an OCI artifact (cosign)
cosign attach sbom --sbom api-1.4.2.sbom.json myorg/api:1.4.2
# Sign the image by digest with a keyless OIDC identity
cosign sign myorg/api@sha256:<digest>
An SBOM is what lets you answer “which of our images contain library X version Y?” in seconds instead of weeks. The day a Heartbleed-class vulnerability drops, that question is the entire incident response.
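With SBOMs stored next to each image, that question becomes a short script. This is a minimal sketch assuming SPDX JSON documents like the one produced by the syft command above, where each entry under "packages" carries a "name" and a "versionInfo" field; the function names and the sample image names in the usage are hypothetical.

```python
import json
from typing import Dict, List


def load_sbom(path: str) -> dict:
    """Load one SPDX JSON SBOM from disk."""
    with open(path) as f:
        return json.load(f)


def images_containing(sboms: Dict[str, dict], name: str, version: str) -> List[str]:
    """Return the images whose SBOM lists package `name` at exactly `version`."""
    hits = []
    for image, sbom in sboms.items():
        for package in sbom.get("packages", []):
            if package.get("name") == name and package.get("versionInfo") == version:
                hits.append(image)
                break  # one match per image is enough
    return sorted(hits)
```

Feed it a dict such as {"myorg/api:1.4.2": load_sbom("api-1.4.2.sbom.json")} for every image you ship. An exact version match is the simplest possible query; a real incident usually needs a version range, which is where a proper scanner earns its keep.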
The hygiene checklist
- Pin base images by digest: FROM image:tag@sha256:.... Renovate / Dependabot can keep digests fresh in PRs.
- Use distroless or slim. If there’s no shell, RCE is much harder to weaponize.
- Drop unused packages. If you installed build-essential only to compile a wheel, do it in a build stage, not in runtime.
- No secrets, no .env, no SSH keys in any layer. Use .dockerignore to keep them out of the build context entirely.
- Scan in CI with Trivy, Grype, or Docker Scout. Block merge on HIGH/CRITICAL with available fixes.
- Generate an SBOM with Syft. Store it next to the image.
- Sign the image with cosign. Verify the signature in admission control.
- Re-scan periodically. A clean image today can have new CVEs tomorrow as new vulnerabilities are disclosed.
Local Development with Containers
Why Dev Should Look Like Prod
The Problem: Developers run services with npm run dev on a Mac, then ship to a Linux container. Subtle differences — case-sensitive filesystems, signal handling, networking — create “works locally, breaks in staging” bugs that are expensive to chase.
The Solution: Run dev inside the same container image, with bind mounts for hot reload. The dev loop stays fast; the runtime matches production.
docker-compose for a multi-service stack
# compose.yaml: api + postgres + redis, plus hot reload
services:
  api:
    build:
      context: ./api
      target: build # use the build stage for dev; full toolchain present
    command: ["go", "run", "./cmd/api"]
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app?sslmode=disable
      REDIS_URL: redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    develop: # compose 'develop' watch mode
      watch:
        - action: sync
          path: ./api
          target: /src
          ignore:
            - "**/*_test.go"
        - action: rebuild
          path: ./api/go.mod
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app"]
      interval: 2s
      timeout: 2s
      retries: 10
    volumes:
      - db-data:/var/lib/postgresql/data
  cache:
    image: redis:7-alpine
    command: ["redis-server", "--save", "", "--appendonly", "no"]
volumes:
  db-data:
docker compose up brings the entire stack online with one command. docker compose watch syncs source changes into the container without a rebuild. Newcomers are productive in minutes; CI can stand the same stack up for integration tests.
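Inside the compose network, service names are hostnames: the api container reaches Postgres at db:5432 and Redis at cache:6379 because that is what the environment block hands it. The api service here is Go, but the idea is language-neutral, so this minimal Python sketch shows a service reading those two variables. Endpoint and endpoint_from_env are hypothetical names, not part of any library.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Endpoint:
    scheme: str
    host: str
    port: int


def endpoint_from_env(var: str, environ: Mapping[str, str] = os.environ) -> Endpoint:
    """Parse a connection URL from the environment, failing loudly if it is unusable."""
    raw = environ.get(var)
    if not raw:
        raise RuntimeError(f"{var} is not set")
    parts = urlsplit(raw)
    if not parts.hostname or parts.port is None:
        raise RuntimeError(f"{var} must include a host and a port")
    return Endpoint(parts.scheme, parts.hostname, parts.port)
```

Failing at startup on a missing variable is the point: the same image runs in dev, staging, and production, and only the environment changes.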
Tilt and Skaffold for Kubernetes-native dev
Once the stack outgrows compose — or you want to develop against the same Kubernetes manifests that ship to prod — reach for Tilt or Skaffold. Both watch your source, rebuild images, and apply manifests to a local cluster (kind, k3d, minikube) automatically. Tilt’s live-update can patch files inside a running pod without rebuilding the image, keeping the dev loop sub-second even with a real Kubernetes runtime.
# Tiltfile
docker_build(
'myorg/api', './api',
live_update=[
sync('./api', '/src'),
run('go build -o /api ./cmd/api', trigger=['./api']),
restart_container(),
],
)
k8s_yaml('k8s/api.yaml')
k8s_resource('api', port_forwards='8080:8080')
Container Runtimes and Orchestration
Why You Should Care About the Runtime
The Problem: “Docker” is shorthand for at least four different things: a CLI, a daemon, a build engine, and a runtime. Most production Kubernetes clusters do not actually run Docker.
The Solution: Know that the OCI image format is the contract. The runtime that actually starts your process is interchangeable: containerd, CRI-O, or runc directly. The image you build with docker build runs on all of them.
The stack, top to bottom:
| Layer | Examples | What it does |
|---|---|---|
| High-level orchestrator | Kubernetes, Nomad, ECS | Decides which node runs what; restarts; rolling deploys. |
| CRI shim | containerd CRI plugin, CRI-O | Implements the Kubernetes Container Runtime Interface. |
| High-level runtime | containerd, CRI-O | Pulls images, manages container lifecycles. |
| Low-level runtime | runc, crun, youki | Sets up namespaces / cgroups, execs the entrypoint. |
| Build engine | BuildKit, Buildpacks, kaniko | Turns source into an OCI image. |
Docker the company donated containerd to the CNCF; Kubernetes deprecated the dockershim in 1.20 and removed it in 1.24, so the kubelet now talks to containerd (or CRI-O) directly through the CRI. The Docker CLI you use locally is now mostly a friendly wrapper that calls into BuildKit and containerd. Nothing about your image changes.
Why Kubernetes won
Three reasons. One: it had Google’s Borg lineage, which gave it an opinionated model for declarative state, controllers, and reconciliation that competitors lacked. Two: it standardized the contract (Pods, Deployments, Services, the CRI), so the ecosystem could build around it without forking. Three: it’s pluggable everywhere — runtime, networking, storage, ingress — which let cloud vendors and platform teams customize without leaving the Kubernetes surface area.
Resource limits and probes (Kubernetes Deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels: { app: api }
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        fsGroup: 65532
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: registry.example.com/api@sha256:8a7d6c5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8c7b6a5f4e3d2c1b0a9f8e7d
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests: # scheduler reserves this much
              cpu: "100m"
              memory: "128Mi"
            limits: # kernel kills you if you exceed memory
              cpu: "500m"
              memory: "256Mi"
          livenessProbe: # restart if this fails
            httpGet: { path: /healthz, port: http }
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe: # take out of service rotation if this fails
            httpGet: { path: /readyz, port: http }
            periodSeconds: 5
          startupProbe: # gives slow starters time before liveness kicks in
            httpGet: { path: /healthz, port: http }
            failureThreshold: 30
            periodSeconds: 2
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
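Requests drive scheduling; limits drive enforcement. Set the memory limit equal to the memory request when you can: the container can then never use more memory than the scheduler reserved for it, which makes OOM kills predictable instead of dependent on what else is running on the node. The Guaranteed QoS class goes a step further and requires both CPU and memory requests to equal their limits for every container in the Pod. The example above sets limits higher than requests, so it lands in Burstable. CPU limits can throttle hot paths; many teams set CPU requests but skip CPU limits intentionally and accept Burstable as the trade-off.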
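The QoS rules are simple enough to write down. The sketch below approximates how Kubernetes assigns a QoS class from the resources block of each container; it is an illustration, not the kubelet's code. One labeled simplification: quantities are compared as strings, so "500m" and "0.5" count as different here even though Kubernetes treats them as equal.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Approximate the Kubernetes QoS class for a Pod."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The resources block from the Deployment above.
EXAMPLE = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}
```

Under node memory pressure, BestEffort pods are evicted first and Guaranteed pods last, which is why the class matters more than it looks.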
For depth on probes, scheduling, and resource governance, see the Kubernetes track in this site — this section is a deliberate skim.
Real-World Examples
Google — Borg. Containers at Google predate Docker by close to a decade. Borg, the internal cluster manager Google has run since the mid-2000s, was the model that Kubernetes was extracted from. Every major architectural choice in Kubernetes — declarative specs, controllers, the Pod abstraction, the way scheduling and bin-packing work — comes from operational lessons Google paid for in production.
Netflix — Titus. Netflix runs the majority of its workload on Titus, an internal container platform on top of EC2. Titus predates Kubernetes maturity at the scale Netflix needed and was tuned for the realities of running containers as the unit of deployment in a giant streaming estate — integrated with Spinnaker for delivery, Atlas for telemetry, and Eureka for service discovery. The lesson is not “build your own” — it’s that the container, not the VM, became the right granularity for everyone above a certain scale.
Stripe — Buildpacks. Stripe (and Heroku, where the model originated) uses Cloud Native Buildpacks instead of hand-written Dockerfiles for many internal services. A buildpack inspects your repo, picks the right base image, installs dependencies, and produces an OCI image with consistent labels, healthchecks, and security defaults — without anyone needing to write a Dockerfile. For platform teams supporting hundreds of small services this trades flexibility for uniformity, which is almost always the right trade.
Shopify — conventions over creativity. Shopify documents and enforces strict Dockerfile conventions across teams: pinned bases, multi-stage, non-root, tini as init, standardized labels (org.opencontainers.image.source, revision, etc.). The point isn’t the conventions themselves — it’s that everyone follows the same ones, so the platform team can scan, sign, audit, and re-base images as a fleet.
Best Practices
The short list
- One process per container. If you find yourself reaching for supervisord, you probably want a sidecar Pod instead.
- Multi-stage builds, always. The runtime image should not contain compilers, package managers, or test fixtures.
- Pin base images by digest. Tags float; digests are immutable. Renovate or Dependabot can keep them current.
- Run as non-root. Set a real UID with USER; enforce it with Pod Security Standards or OPA Gatekeeper.
- No secrets in layers. Use BuildKit secret mounts at build time; mount Kubernetes Secrets at runtime.
- Order layers by change frequency. Lockfiles before source. The cache pays for itself within a week.
- Smallest viable base. Distroless for static binaries; slim for everything else; alpine only when you mean it.
- Healthchecks at the image level and probes at the platform level. They are not the same.
- Forward signals correctly. Exec-form ENTRYPOINT, or use tini/dumb-init when the process spawns children.
- Scan every build. Trivy, Grype, or Docker Scout in CI; fail on HIGH/CRITICAL with a fix available.
- Generate and store an SBOM. Syft for the artifact; cosign to attach it to the image.
- Sign images. cosign with keyless OIDC; verify in admission control before pods schedule.
- Read-only root filesystem. Mount emptyDir volumes for the few writable paths the app actually needs.
- Set resource requests and limits. Memory limit = memory request for predictable scheduling.
- Never :latest in production. Pin to git SHA or digest. Same image in staging and prod, different config.
The two failures that will get you
Almost every container-related production incident is one of two things. One: the container ran fine in dev but the image is bloated, slow to pull, and chokes node startup during a scale-up event — trace it back to a missing multi-stage build. Two: a vulnerability is disclosed in a base image and you have no idea which of your 200 services are affected because nobody scanned and nobody kept SBOMs — trace it back to a CI pipeline that treats image hygiene as optional.
Both are preventable in an afternoon. Neither is forgiven by the platform.
The single most useful sentence about containers
The container is the contract between your code and the platform. Everything inside the image is your problem; everything outside — scheduling, networking, secrets, restarts — is the platform’s. Get the boundary clean and the rest of microservices gets dramatically simpler.