Containerization | LIZIU Microservices

Why Containers Matter for Microservices

Why Containerization Matters

The Problem: A monolith ships as one artifact onto one (well-understood) machine. A microservices estate ships dozens or hundreds of artifacts — written in different languages, with different runtime versions, different system libraries, different OS expectations — onto pools of identical machines. Without a uniform packaging format, “works on my laptop” goes from annoying to fatal at 100 services.

The Solution: A container bundles a service’s code, runtime, and OS dependencies into a single immutable image. The host kernel runs it via Linux namespaces and cgroups. Every environment — dev laptop, CI runner, staging, production — runs the exact same bytes.

Real Impact: Google, Netflix, Stripe, Shopify, and effectively every modern platform ship containers. Kubernetes, the de facto orchestrator, only knows how to run containers. If your service can’t be packaged as an OCI image, it can’t participate in the platform.

Real-World Analogy

Think of a shipping container. Before standardization, freight was loaded loose: barrels, crates, sacks. Each ship and each port had its own handling process. Loading took weeks. Damage was constant.

The standard 20-foot intermodal container changed everything. Same hull, any cargo — coffee beans, washing machines, cars. Cranes, ships, trucks, and trains all speak one interface. The container doesn’t care what’s inside; the port doesn’t care what’s inside; only the shipper and the receiver care.

An OCI image is the same idea for software. Your runtime, language version, and OS libraries live inside the box. Kubernetes, the registry, and the host don’t care — they just move the box.

Before containers, deploying a microservice meant a mix of configuration management (Chef, Puppet, Ansible), language-specific package managers (gem, pip, npm), and a fervent hope that the staging machine matched the production machine matched the laptop where the bug was last reproduced. Snowflakes were everywhere. A new service required a new playbook, a new base AMI, a new on-call rotation that knew which build of OpenSSL was on which subnet.

Containers collapsed that complexity into a single artifact: an OCI image. Build once, run anywhere a kernel and a container runtime exist. The blast radius of “the host environment” shrank from “everything outside your code” to almost zero.

What containers actually buy you

Reproducibility. The image you tested is the image you ship. Bit-for-bit.
Density. Containers share the host kernel, so you can pack 50 of them on a node where you might fit 5 VMs.
Speed. A container starts in hundreds of milliseconds, not minutes. This is what makes autoscaling and rolling deploys feel instant.
Isolation that’s good enough. Not VM-grade, but enough for trusted multi-tenant workloads on a shared cluster.
An interface for the platform. Kubernetes, Nomad, ECS — they all consume the same OCI image. You decouple the workload from the runner.

The Container Model

Why You Should Understand the Primitives

The Problem: Treat containers as “tiny VMs” and you’ll be surprised by every networking quirk, every signal-handling bug, and every “why is my PID 1 a shell script” outage.

The Solution: Internalize three Linux primitives — namespaces, cgroups, and the OCI image — and the rest of the runtime stack stops being magic.

A container is not a thing. It’s a process (or a small group of processes) that the kernel pretends is alone. Three Linux features do all the work:

Namespaces isolate what the process can see: its own PID space, its own mount table, its own network interfaces, its own user IDs, its own hostname. From inside, PID 1 looks like the only process on the system.
cgroups isolate what the process can use: CPU shares, memory limits, IO bandwidth, PIDs. The kernel enforces these; if you exceed memory you get OOM-killed, regardless of whether the host has free RAM.
The OCI image is the on-disk format: a JSON manifest plus a stack of tarball layers. The runtime unpacks the layers, sets up namespaces and cgroups, and execs your entrypoint inside that environment.
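To make the cgroup point concrete, here is a minimal Python sketch of a process reading its own memory limit. It assumes a cgroup v2 host, where the limit appears inside the container as /sys/fs/cgroup/memory.max; the function names are illustrative, not part of any library.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2, where the container sees its own limit at this path.
CGROUP_V2_MEMORY_MAX = Path("/sys/fs/cgroup/memory.max")


def parse_memory_max(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the cgroup is unlimited ("max")."""
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def container_memory_limit(path: Path = CGROUP_V2_MEMORY_MAX) -> Optional[int]:
    """Read this process's memory limit; None if unlimited or the file is absent."""
    try:
        return parse_memory_max(path.read_text())
    except (FileNotFoundError, PermissionError):
        return None


if __name__ == "__main__":
    limit = container_memory_limit()
    print("unlimited or unknown" if limit is None else f"{limit / 2**20:.0f} MiB")
```

Run inside a container started with a memory limit, this should print that limit rather than the host's total RAM, which is the number the OOM killer actually enforces.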

The OCI image format

An image is a list of layers, each layer is a tarball of filesystem changes, and a manifest describes the order. When you build, each Dockerfile instruction that touches the filesystem produces a new layer. When you pull, the runtime fetches only the layers you don’t already have. This is why sharing a base image across services is cheap — the base layers are downloaded once and shared on disk.

# Inspect the layer structure of an image
docker image inspect nginx:1.27 --format='{{json .RootFS.Layers}}' | jq

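# Each sha256:... is a layer diff ID: the digest of the uncompressed layer tarball.
# The registry stores each layer as a compressed blob under a different digest,
# which is listed in the image manifest.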
[
  "sha256:9853575bc4f955c5892dd64187538a6cd02dba6968b32f9b410d8a3fdd2e9d70",
  "sha256:577a23b5f0bdfae5ec8c7c1bf5beb4b0e98ce0cb79f6fdc7c8b09f24bfb0bcd1",
  "sha256:1f8c4e4a5e23c2d4cf28a0f37c0fbe7a8b69d3b53b1cdda43f4b25e89ea5e0b1"
]
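The "fetch only the layers you don't already have" behaviour follows from content addressing. A simplified Python sketch, assuming layers are identified purely by the SHA-256 of their bytes (the helper names and sample blobs are hypothetical):

```python
import hashlib


def layer_digest(blob: bytes) -> str:
    """Content address of a layer: the SHA-256 of its bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def layers_to_pull(manifest_layers: list[str], local_store: set[str]) -> list[str]:
    """Digests listed in the manifest that are not already on disk, in order."""
    return [digest for digest in manifest_layers if digest not in local_store]


# Two services built on the same base share the base layer's digest.
base = layer_digest(b"base image filesystem")
api = [base, layer_digest(b"api binary")]
worker = [base, layer_digest(b"worker binary")]

local_store: set[str] = set()
first_pull = layers_to_pull(api, local_store)       # both layers
local_store.update(first_pull)
second_pull = layers_to_pull(worker, local_store)   # only the worker layer
```

Because identical bytes always hash to the same digest, the second pull skips the base layer without the runtime needing to know the two images are related.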

Layer caching is the single most important performance property of the build. Order your Dockerfile so that the layers that change least often come first — system packages, then dependencies, then your code. Get this wrong and every commit invalidates a 200 MB layer; get it right and most builds reuse 99% of bytes.
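Why ordering matters can be shown with a toy model of the build cache. This sketch assumes each step's cache key depends on its own text (instruction plus a fingerprint of the files it copies) and on the key of the step before it. Real builders are more elaborate, but the chaining is the point.

```python
import hashlib


def cache_keys(steps: list[str]) -> list[str]:
    """Chain each step's key to its parent's, so one change invalidates all later steps."""
    keys = []
    parent = ""
    for step in steps:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old: list[str], new: list[str]) -> int:
    """Number of leading steps whose cache keys match between two builds."""
    count = 0
    for a, b in zip(cache_keys(old), cache_keys(new)):
        if a != b:
            break
        count += 1
    return count


def good_order(commit: str) -> list[str]:
    # Dependencies before source: a code-only commit leaves the first three steps cached.
    return ["FROM golang", "COPY go.mod go.sum [deps-v1]", "RUN go mod download",
            f"COPY . . [{commit}]", "RUN go build"]


def bad_order(commit: str) -> list[str]:
    # Source first: every commit invalidates the dependency download too.
    return ["FROM golang", f"COPY . . [{commit}]", "RUN go mod download", "RUN go build"]


print(reused_layers(good_order("commit-a"), good_order("commit-b")))  # 3
print(reused_layers(bad_order("commit-a"), bad_order("commit-b")))    # 1
```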

Container vs VM

Property	Container	Virtual Machine
Isolation boundary	Kernel namespaces & cgroups	Hardware virtualization
Kernel	Shared with host	Each VM has its own
Boot time	~100 ms	Tens of seconds
Image size	10s of MB typical	Hundreds of MB to GB
Density	50–200 per node	~10 per node
Security boundary	Adequate for trusted code	Hardware-level; stronger
Use when	App-level isolation, scale-out microservices	OS-level isolation, untrusted multi-tenancy

Modern setups blur the line: Kata Containers, and Firecracker (the microVM monitor behind AWS Lambda) when paired with a containerd integration, run an OCI image inside a microVM, getting VM-grade isolation with startup times close to a container's. For the vast majority of microservices on a private cluster, plain containers are correct.

Writing a Production Dockerfile

Why the Dockerfile Is Production Code

The Problem: A first-draft Dockerfile usually works. It also leaks 800 MB, runs as root, has no healthcheck, doesn’t forward signals, and rebuilds dependencies on every commit. None of those failures are visible in dev.

The Solution: Treat the Dockerfile as a deliverable, not as glue. Multi-stage build. Pinned base. Non-root user. Cache-friendly layer order. Small final image.

A multi-stage Go Dockerfile

# syntax=docker/dockerfile:1.7

# ---- build stage ----
FROM golang:1.22-alpine@sha256:0d3653dd6f35159ec6e3d10263a42372f6f194c3dea0b35235d72aabde86486e AS build

WORKDIR /src

# Copy module files first so dependency download is cached
# across code-only changes.
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    go mod download

# Now copy source. This layer rebuilds on any code change,
# but the dependency layer above stays cached.
COPY . .

RUN --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
    go build -trimpath -ldflags="-s -w" -o /out/api ./cmd/api

# ---- runtime stage: distroless, ~20 MB total ----
FROM gcr.io/distroless/static-debian12:nonroot@sha256:6ec5aa99dc335666e79dc64e4a6c8b89d8a0a2bcfe9a8cfce4a1f3f3a5b3e7e1

COPY --from=build /out/api /api

# distroless includes a 'nonroot' user (uid 65532) by default
USER nonroot:nonroot

EXPOSE 8080

# Use exec form so PID 1 is the binary, not a shell.
# SIGTERM goes directly to the process; graceful shutdown works.
ENTRYPOINT ["/api"]

Roughly 20 MB final image, no shell, no package manager, no setuid binaries, runs as a non-root user. There is almost nothing inside for an attacker to pivot to. The build stage may be a 400 MB Go toolchain image, but it’s discarded — only the runtime stage ships.
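Exec-form ENTRYPOINT only delivers SIGTERM to the process; the process still has to act on it. A minimal sketch of that half, written in Python for brevity (the Go service above would use os/signal instead); the function and variable names are illustrative:

```python
import os
import signal
import threading

# Set by the signal handler; the serving loop checks it between units of work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the request and return quickly; cleanup happens in the main loop."""
    shutdown_requested.set()


def install_handlers():
    # SIGTERM is what the container runtime sends on stop; SIGINT covers Ctrl-C in dev.
    signal.signal(signal.SIGTERM, handle_sigterm)
    signal.signal(signal.SIGINT, handle_sigterm)


def serve(poll_interval: float = 0.5):
    """Run until a shutdown signal arrives, then drain and exit."""
    install_handlers()
    print(f"pid {os.getpid()} serving")
    while not shutdown_requested.wait(poll_interval):
        pass  # handle one unit of work here
    print("draining in-flight work, then exiting")
```

If a shell sits at PID 1 instead, the handler never fires: the shell receives the signal, the application does not, and the runtime falls back to SIGKILL once the grace period expires.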

A multi-stage Python Dockerfile

# syntax=docker/dockerfile:1.7

FROM python:3.12-slim@sha256:4c0a9b2c3a8e84e6f3a8c54f1f0c4b6d2a9b8e7c3d4e5f6a7b8c9d0e1f2a3b4c AS build

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /app

# Install dependencies into an isolated prefix so the runtime stage can copy
# only what's needed. PIP_NO_CACHE_DIR is deliberately not set here: it would
# disable pip's cache and defeat the cache mount below.
COPY requirements.txt ./
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --prefix=/install -r requirements.txt

# ---- runtime: slim, non-root, no build tools ----
FROM python:3.12-slim@sha256:4c0a9b2c3a8e84e6f3a8c54f1f0c4b6d2a9b8e7c3d4e5f6a7b8c9d0e1f2a3b4c

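# /install/bin is on PATH for console scripts (gunicorn); PYTHONPATH makes the
# packages installed under the /install prefix importable.
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH=/install/bin:$PATH \
    PYTHONPATH=/install/lib/python3.12/site-packages

# tini is not part of python:3.12-slim; install it for the ENTRYPOINT below.
RUN apt-get update \
 && apt-get install -y --no-install-recommends tini \
 && rm -rf /var/lib/apt/lists/*

# Create a real, non-login system user.
RUN groupadd --system --gid 1001 app \
 && useradd  --system --uid 1001 --gid app --no-create-home app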

WORKDIR /app

COPY --from=build /install /install
COPY --chown=app:app . .

USER app

EXPOSE 8000

# tini (or dumb-init) handles signal forwarding + zombie reaping
# when your process spawns children. Skip if you're already PID 1-safe.
ENTRYPOINT ["tini", "--"]
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", \
     "-b", "0.0.0.0:8000", "app.main:app"]

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8000/healthz', timeout=2).status==200 else 1)"

What every production Dockerfile gets right

Multi-stage. Build tools never reach the runtime image.
Pinned digest, not tag. FROM image:tag@sha256:... — tags can move under you; digests cannot.
Layer order: least-changing first. Lockfile copy + install before source copy.
Non-root user. A real UID, set with USER. Default-root is the most common production gotcha.
Exec-form ENTRYPOINT. Brackets, not strings. Otherwise a shell sits in front of your process and signals get eaten.
COPY, not ADD. ADD auto-extracts archives and fetches URLs — both surprise behaviors.
Healthcheck. Defines what “running” means to Docker and Compose. Kubernetes ignores the image-level HEALTHCHECK and relies on its own liveness, readiness, and startup probes.
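Several items on this list can be checked mechanically. The following is a deliberately naive Python sketch of such a check, not a replacement for a real linter: it reads one line at a time, does not join backslash continuations, and the function name and messages are made up for illustration.

```python
def lint_dockerfile(text: str) -> list[str]:
    """Flag unpinned bases, ADD, shell-form ENTRYPOINT/CMD, and a root final stage."""
    findings = []
    stages = set()   # names introduced by "FROM ... AS name"
    user = None      # USER in effect for the current (last) stage
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, rest = line.partition(" ")
        instruction = instruction.upper()
        rest = rest.strip()
        if instruction == "FROM":
            tokens = [t for t in rest.split() if not t.startswith("--")]
            image = tokens[0]
            if image not in stages and image != "scratch" and "@sha256:" not in image:
                findings.append(f"FROM {image}: base image is not pinned by digest")
            if len(tokens) >= 3 and tokens[1].upper() == "AS":
                stages.add(tokens[2])
            user = None  # a new stage starts over
        elif instruction == "USER":
            user = rest
        elif instruction == "ADD":
            findings.append("ADD used: prefer COPY")
        elif instruction in ("ENTRYPOINT", "CMD") and not rest.startswith("["):
            findings.append(f"{instruction} uses shell form: prefer exec form")
    if user is None or user.split(":")[0] in ("root", "0"):
        findings.append("final stage does not set a non-root USER")
    return findings


good = """\
FROM golang:1.22@sha256:abc AS build
COPY . .
FROM gcr.io/distroless/static@sha256:def
COPY --from=build /out/api /api
USER 65532
ENTRYPOINT ["/api"]
"""

bad = """\
FROM python:latest
ADD app.tar.gz /app
CMD python app.py
"""

print(lint_dockerfile(good))       # []
print(len(lint_dockerfile(bad)))   # 4
```

One blind spot worth knowing: a base image that already sets a non-root user (as the distroless :nonroot tags do) is still flagged unless the Dockerfile repeats USER explicitly.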

Base image choice

Base	Size	Has shell?	Has package manager?	When to use
`distroless/static`	~2 MB	No	No	Static binaries (Go, Rust). Smallest attack surface.
`distroless/base`	~20 MB	No	No	Binaries that need glibc + a CA bundle.
`distroless/python3`	~50 MB	No	No	Python apps that don’t need a shell at runtime.
`alpine`	~5 MB	Yes (ash)	Yes (apk)	Tiny, but musl libc breaks some wheels & native libs.
`debian-slim` / `python:3.12-slim`	~80 MB	Yes	Yes (apt)	Compatibility with anything glibc; sane default.
`ubuntu` / full debian	~200 MB+	Yes	Yes	Almost never needed for a service.

Default to slim. Reach for distroless once the service is stable and you want to harden it. Avoid alpine for Python and Node unless you’ve already paid the musl tax somewhere else.

Never run as root, never bake secrets, never use `:latest`

Three rules that will cover 80% of incidents:

Never run as root. A container that runs as UID 0 inside is one kernel CVE away from being UID 0 on the host. Use a non-root user; refuse to deploy anything else (Pod Security Admission with the Pod Security Standards, or OPA Gatekeeper, can enforce this; PodSecurityPolicy was removed in Kubernetes 1.25).
Never bake secrets into images. Anything in any layer is permanent and visible to anyone who can docker pull. Supply runtime secrets via Kubernetes Secrets or env vars from a vault; for secrets needed only during the build, use BuildKit’s --mount=type=secret, which keeps them out of the layers.
Never deploy :latest to production. Tags float. Two pods of the same Deployment can end up on different image versions. Pin to digests, or to immutable tags like a git SHA.
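The third rule is easy to enforce in a deploy script or admission hook. A small Python sketch, assuming references follow the usual registry/repo:tag@digest shape; the function name and the three category labels are illustrative:

```python
def classify_image_ref(ref: str) -> str:
    """Return 'digest', 'tag', or 'floating' for an image reference."""
    if "@sha256:" in ref:
        return "digest"
    # A tag colon can only appear after the last slash; earlier colons are registry ports.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "floating"  # no tag means an implicit :latest
    tag = last_segment.rsplit(":", 1)[1]
    return "floating" if tag == "latest" else "tag"


for ref in ["registry.example.com/api@sha256:8a7d", "myorg/api:1.4.2",
            "myorg/api:latest", "localhost:5000/api"]:
    print(ref, "->", classify_image_ref(ref))
```

A 'tag' result is only as immutable as your registry policy makes it; a digest is the only reference the runtime itself verifies.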

Image Hygiene

Why Image Hygiene Is a Production Concern

The Problem: Most images shipped from a typical CI pipeline contain hundreds of CVEs and dozens of unused packages. Each is an attack surface, a pull-time cost, and a future audit finding.

The Solution: Small images, pinned bases, vulnerability scans on every build, and an SBOM you can grep when the next Log4Shell hits.

A good rule: aim for <100 MB final images. Sub-50 MB is achievable for Go and Rust. Once you’re over a few hundred megabytes, your pull times dominate cold-start latency and your scan reports become unreadable.

Scanning with Trivy or Grype

# Trivy: by Aqua Security; the most widely deployed scanner
trivy image --severity HIGH,CRITICAL --ignore-unfixed myorg/api:1.4.2

# Fail the CI build if any HIGH or CRITICAL with a fix exists
trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed myorg/api:1.4.2

# Grype: by Anchore; very fast, pairs naturally with Syft for SBOM
grype myorg/api:1.4.2 --fail-on high

# Docker Scout: built into recent Docker Desktop / Docker CLI
docker scout cves myorg/api:1.4.2 --only-severity high,critical
docker scout recommendations myorg/api:1.4.2
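If the CI gate needs custom policy (allow-lists, per-team thresholds), the same decision can be made from the scanner's JSON report. This Python sketch assumes the report shape Trivy emits with --format json: a top-level Results list whose entries carry a Vulnerabilities list with VulnerabilityID, Severity, and FixedVersion fields. Verify that against the Trivy version you run; the CVE IDs below are placeholders.

```python
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list[str]:
    """IDs of HIGH/CRITICAL vulnerabilities that already have a fixed version."""
    ids = set()
    for result in report.get("Results") or []:
        # Assumed: targets with no findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING and vuln.get("FixedVersion"):
                ids.add(vuln["VulnerabilityID"])
    return sorted(ids)


# Hand-written sample in the assumed shape; not real scanner output.
sample_report = {
    "Results": [
        {"Target": "myorg/api:1.4.2 (debian 12)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "1.2.3"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH", "FixedVersion": ""},
            {"VulnerabilityID": "CVE-0000-0003", "Severity": "LOW", "FixedVersion": "2.0"},
        ]},
        {"Target": "app/requirements.txt"},
    ]
}

print(blocking_findings(sample_report))  # ['CVE-0000-0001']
```

In a pipeline the script would exit non-zero when the list is non-empty, mirroring what --exit-code 1 with --ignore-unfixed does on the command line.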

Generating an SBOM

# Syft: produces an SBOM in SPDX or CycloneDX format
syft myorg/api:1.4.2 -o spdx-json > api-1.4.2.sbom.json

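# Attach the SBOM to the image as an OCI artifact (cosign).
# Note: 'cosign attach sbom' is deprecated in cosign 2.x; a signed attestation
# (cosign attest --predicate api-1.4.2.sbom.json --type spdxjson <image>) is the suggested replacement.
cosign attach sbom --sbom api-1.4.2.sbom.json myorg/api:1.4.2

# Sign the image by digest with a keyless OIDC identity.
# This signs the image only; an attached SBOM needs its own signature or attestation.
cosign sign myorg/api@sha256:<digest>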

An SBOM is what lets you answer “which of our images contain library X version Y?” in seconds instead of weeks. The day a Heartbleed-class vulnerability drops, that question is the entire incident response.
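That "which images contain library X version Y?" query is a few lines once the SBOMs are stored somewhere readable. A Python sketch, assuming SPDX JSON documents as produced by the syft command above (a top-level packages list with name and versionInfo fields) already loaded into a dict keyed by image reference; the image names and package versions are made up:

```python
from typing import Optional


def images_containing(sboms: dict[str, dict], name: str,
                      version: Optional[str] = None) -> list[str]:
    """Image refs whose SBOM lists the package, optionally at an exact version."""
    hits = []
    for image, sbom in sboms.items():
        for pkg in sbom.get("packages") or []:
            if pkg.get("name") == name and version in (None, pkg.get("versionInfo")):
                hits.append(image)
                break  # one match per image is enough
    return sorted(hits)


# Hypothetical fleet: in practice each value would be json.load() of a stored SBOM file.
fleet = {
    "myorg/api:1.4.2": {"packages": [{"name": "openssl", "versionInfo": "3.0.11"}]},
    "myorg/worker:2.0.0": {"packages": [{"name": "openssl", "versionInfo": "3.0.13"}]},
    "myorg/web:0.9.1": {"packages": [{"name": "zlib", "versionInfo": "1.3"}]},
}

print(images_containing(fleet, "openssl"))            # ['myorg/api:1.4.2', 'myorg/worker:2.0.0']
print(images_containing(fleet, "openssl", "3.0.11"))  # ['myorg/api:1.4.2']
```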

The hygiene checklist

Pin base images by digest — FROM image:tag@sha256:.... Renovate / Dependabot can keep digests fresh in PRs.
Use distroless or slim — if there’s no shell, RCE is much harder to weaponize.
Drop unused packages. If you installed build-essential only to compile a wheel, do it in a build stage, not in runtime.
No secrets, no .env, no SSH keys in any layer. Use .dockerignore to keep them out of the build context entirely.
Scan in CI with Trivy, Grype, or Docker Scout. Block merge on HIGH/CRITICAL with available fixes.
Generate an SBOM with Syft. Store it next to the image.
Sign the image with cosign. Verify the signature in admission control.
Re-scan periodically. A clean image today can have new CVEs tomorrow as new vulnerabilities are disclosed.

Local Development with Containers

Why Dev Should Look Like Prod

The Problem: Developers run services with npm run dev on a Mac, then ship to a Linux container. Subtle differences (filesystem case sensitivity, signal handling, networking) create “works locally, breaks in staging” bugs that are expensive to chase.

The Solution: Run dev inside the same container image, with bind mounts for hot reload. The dev loop stays fast; the runtime matches production.

docker-compose for a multi-service stack

# compose.yaml — api + postgres + redis, plus hot reload
services:
  api:
    build:
      context: ./api
      target: build           # use the build stage for dev; full toolchain present
    command: ["go", "run", "./cmd/api"]
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app?sslmode=disable
      REDIS_URL: redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    develop:                  # compose 'develop' watch mode
      watch:
        - action: sync+restart    # go run does not notice synced files; restart so it recompiles
          path: ./api
          target: /src
          ignore:
            - "**/*_test.go"
        - action: rebuild
          path: ./api/go.mod

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app"]
      interval: 2s
      timeout: 2s
      retries: 10
    volumes:
      - db-data:/var/lib/postgresql/data

  cache:
    image: redis:7-alpine
    command: ["redis-server", "--save", "", "--appendonly", "no"]

volumes:
  db-data:

docker compose up brings the entire stack online with one command. docker compose watch syncs source changes into the container without a rebuild. Newcomers are productive in minutes; CI can stand the same stack up for integration tests.

Tilt and Skaffold for Kubernetes-native dev

Once the stack outgrows compose — or you want to develop against the same Kubernetes manifests that ship to prod — reach for Tilt or Skaffold. Both watch your source, rebuild images, and apply manifests to a local cluster (kind, k3d, minikube) automatically. Tilt’s live-update can patch files inside a running pod without rebuilding the image, keeping the dev loop sub-second even with a real Kubernetes runtime.

# Tiltfile
docker_build(
    'myorg/api', './api',
    live_update=[
        sync('./api', '/src'),
        run('go build -o /api ./cmd/api', trigger=['./api']),
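        # Note: restart_container() only works when the cluster's container runtime is Docker;
        # on containerd-based clusters, see Tilt's restart_process extension.
        restart_container(),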
    ],
)

k8s_yaml('k8s/api.yaml')
k8s_resource('api', port_forwards='8080:8080')

Container Runtimes and Orchestration

Why You Should Care About the Runtime

The Problem: “Docker” is shorthand for at least four different things: a CLI, a daemon, a build engine, and a runtime. Most production Kubernetes clusters do not actually run Docker.

The Solution: Know that the OCI image format is the contract. The runtime that actually starts your process is interchangeable: containerd, CRI-O, or runc directly. The image you build with docker build runs on all of them.

The stack, top to bottom:

Layer	Examples	What it does
High-level orchestrator	Kubernetes, Nomad, ECS	Decides which node runs what; restarts; rolling deploys.
CRI shim	containerd CRI plugin, CRI-O	Implements the Kubernetes Container Runtime Interface.
High-level runtime	containerd, CRI-O	Pulls images, manages container lifecycles.
Low-level runtime	runc, crun, youki	Sets up namespaces / cgroups, execs the entrypoint.
Build engine	BuildKit, Buildpacks, kaniko	Turns source into an OCI image.

Docker the company donated containerd to the CNCF; Kubernetes deprecated the dockershim in 1.20, removed it in 1.24, and now talks to containerd (or CRI-O) directly. The Docker CLI you use locally is now mostly a friendly wrapper that calls into BuildKit and containerd. Nothing about your image changes.

Why Kubernetes won

Three reasons. One: it had Google’s Borg lineage, which gave it an opinionated model for declarative state, controllers, and reconciliation that competitors lacked. Two: it standardized the contract (Pods, Deployments, Services, the CRI), so the ecosystem could build around it without forking. Three: it’s pluggable everywhere — runtime, networking, storage, ingress — which let cloud vendors and platform teams customize without leaving the Kubernetes surface area.

Resource limits and probes (Kubernetes Deployment)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels: { app: api }
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        fsGroup: 65532
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: registry.example.com/api@sha256:8a7d6c5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8c7b6a5f4e3d2c1b0a9f8e7d
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:               # scheduler reserves this much
              cpu: "100m"
              memory: "128Mi"
            limits:                 # kernel kills you if you exceed memory
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:           # restart if this fails
            httpGet: { path: /healthz, port: http }
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:          # take out of service rotation if this fails
            httpGet: { path: /readyz, port: http }
            periodSeconds: 5
          startupProbe:            # gives slow starters time before liveness kicks in
            httpGet: { path: /healthz, port: http }
            failureThreshold: 30
            periodSeconds: 2
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }

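Requests drive scheduling; limits drive enforcement. Set the memory limit equal to the memory request when you can: the scheduler then reserves exactly what the container is allowed to use, so the node is not overcommitted on your behalf and the pod is a less likely eviction target under memory pressure. The Guaranteed QoS class goes further and requires request = limit for both CPU and memory on every container; the manifest above, with limits higher than requests, is Burstable. CPU limits can throttle hot paths; many teams set requests but skip CPU limits intentionally, accepting Burstable in exchange.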
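How requests and limits map to a QoS class can be sketched in a few lines of Python. This is a simplification under stated assumptions: quantities are compared as plain strings (so "500m" and "0.5" would wrongly differ), and only cpu and memory are considered. The function name is illustrative.

```python
def qos_class(containers: list[dict]) -> str:
    """Approximate the Pod QoS class from each container's resources block."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # an unset request defaults to the limit
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The values from the Deployment above: limits exceed requests.
deployment_example = [{"resources": {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}}]
# Memory pinned, CPU limit skipped on purpose.
memory_pinned = [{"resources": {
    "requests": {"cpu": "100m", "memory": "256Mi"},
    "limits": {"memory": "256Mi"},
}}]
# Both resources pinned.
fully_pinned = [{"resources": {
    "requests": {"cpu": "500m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}}]

print(qos_class(deployment_example))  # Burstable
print(qos_class(memory_pinned))       # Burstable
print(qos_class(fully_pinned))        # Guaranteed
```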

For depth on probes, scheduling, and resource governance, see the Kubernetes track in this site — this section is a deliberate skim.
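For orientation, the two endpoints the Deployment's probes point at (/healthz and /readyz) could be served like this. It is a standard-library Python sketch, not a production server: the ready flag, handler class, and helper names are all illustrative, and a real service would mount these routes on its existing HTTP framework.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once dependencies (database, cache) are reachable; cleared when draining.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200  # liveness: the process is up and able to answer
        elif self.path == "/readyz":
            status = 200 if ready.is_set() else 503  # readiness: safe to receive traffic
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def start_probe_server(port: int = 8080, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Serve probes on a background thread and return the server handle."""
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe_status(port: int, path: str) -> int:
    """Roughly what an httpGet probe does: fetch the path, report the status code."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}", timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Keeping liveness cheap and dependency-free, while readiness reflects downstream health, avoids the failure mode where a database blip causes every pod to be restarted at once.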

Real-World Examples

Google — Borg. Containers at Google predate Docker by close to a decade. Borg, the internal cluster manager Google has run since the mid-2000s, was the model that Kubernetes was extracted from. Every major architectural choice in Kubernetes — declarative specs, controllers, the Pod abstraction, the way scheduling and bin-packing work — comes from operational lessons Google paid for in production.

Netflix — Titus. Netflix runs the majority of its workload on Titus, an internal container platform on top of EC2. Titus predates Kubernetes maturity at the scale Netflix needed and was tuned for the realities of running containers as the unit of deployment in a giant streaming estate — integrated with Spinnaker for delivery, Atlas for telemetry, and Eureka for service discovery. The lesson is not “build your own” — it’s that the container, not the VM, became the right granularity for everyone above a certain scale.

Stripe — Buildpacks. Stripe (and Heroku, where the model originated) uses Cloud Native Buildpacks instead of hand-written Dockerfiles for many internal services. A buildpack inspects your repo, picks the right base image, installs dependencies, and produces an OCI image with consistent labels, healthchecks, and security defaults — without anyone needing to write a Dockerfile. For platform teams supporting hundreds of small services this trades flexibility for uniformity, which is almost always the right trade.

Shopify — conventions over creativity. Shopify documents and enforces strict Dockerfile conventions across teams: pinned bases, multi-stage, non-root, tini as init, standardized labels (org.opencontainers.image.source, revision, etc.). The point isn’t the conventions themselves — it’s that everyone follows the same ones, so the platform team can scan, sign, audit, and re-base images as a fleet.

Best Practices

The short list

One process per container. If you find yourself reaching for supervisord, you probably want a sidecar container in the same Pod instead.
Multi-stage builds, always. The runtime image should not contain compilers, package managers, or test fixtures.
Pin base images by digest. Tags float; digests are immutable. Renovate or Dependabot can keep them current.
Run as non-root. Set a real UID with USER; enforce it with Pod Security Standards (via Pod Security Admission) or OPA Gatekeeper.
No secrets in layers. Use BuildKit secret mounts at build time; mount Kubernetes Secrets at runtime.
Order layers by change frequency. Lockfiles before source. The cache pays for itself within a week.
Smallest viable base. Distroless for static binaries; slim for everything else; alpine only when you mean it.
Healthchecks at the image level and probes at the platform level. They are not the same.
Forward signals correctly. Exec-form ENTRYPOINT, or use tini/dumb-init when the process spawns children.
Scan every build. Trivy, Grype, or Docker Scout in CI; fail on HIGH/CRITICAL with a fix available.
Generate and store an SBOM. Syft for the artifact; cosign to attach it to the image.
Sign images. cosign with keyless OIDC; verify in admission control before pods schedule.
Read-only root filesystem. Mount emptyDir volumes for the few writable paths the app actually needs.
Set resource requests and limits. Memory limit = memory request for predictable scheduling.
Never :latest in production. Pin to git SHA or digest. Same image in staging and prod, different config.

The two failures that will get you

Almost every container-related production incident is one of two things. One: the container ran fine in dev but the image is bloated, slow to pull, and chokes node startup during a scale-up event — trace it back to a missing multi-stage build. Two: a vulnerability is disclosed in a base image and you have no idea which of your 200 services are affected because nobody scanned and nobody kept SBOMs — trace it back to a CI pipeline that treats image hygiene as optional.

Both are preventable in an afternoon. Neither is forgiven by the platform.

The single most useful sentence about containers

The container is the contract between your code and the platform. Everything inside the image is your problem; everything outside — scheduling, networking, secrets, restarts — is the platform’s. Get the boundary clean and the rest of microservices gets dramatically simpler.