Why Deployment Strategy Matters
The Problem: Most production incidents are not caused by hardware, hackers, or load; they’re caused by a deploy. Commonly cited industry estimates put the share at around 70%, though the exact figure depends on the survey. The release is the moment your tested code meets the untested combination of real traffic, real data, and real downstreams.
The Solution: Treat the deployment itself as a controllable risk. Pick a strategy — rolling, blue-green, canary, shadow — that limits blast radius and lets you roll back faster than the bad code can hurt you.
Real Impact: A canary that catches a regression at 1% of traffic costs you 1% of customer pain. The same regression deployed all-at-once costs you 100%. The math never gets less brutal than that.
Real-World Analogy
Think of a heart transplant. A surgeon does not stop the patient’s heart, throw it out, and hope the new one works on the first beat. They:
- Graft the new vessels alongside the old ones first
- Verify perfusion — blood flowing, pressure stable — before committing
- Switch over gradually, ready to revert at every step
- Keep the bypass machine running as the rollback plan
Your deploy is the same. The new version runs alongside the old one, you verify it carries traffic safely, you cut over progressively, and you keep the old version warm in case you have to back out. Anything else is asking the patient to bet their life on your unit tests.
In a monolith, deploys were rare and risky. In microservices, they’re hourly and routine, which only works if the strategy itself contains the blast radius. The conversation has shifted from “how do we ship safely?” to “how do we ship continuously without anyone noticing?” That second question is what this tutorial answers.
What a Deployment Strategy Actually Controls
Every strategy is a different set of trade-offs across four axes. The job is to know which axis matters most for the service you’re shipping.
| Strategy | Rollback Speed | Infra Cost | Blast Radius | Complexity |
|---|---|---|---|---|
| Recreate | Slow (full redeploy) | 1x | 100% during downtime | Trivial |
| Rolling | Slow (roll back over many pods) | 1x + surge | Gradual but unbounded | Low |
| Blue-Green | Instant (flip LB) | 2x | 100% per cutover | Medium |
| Canary | Fast (drain canary) | 1x + small surge | Bounded by canary % | High |
| Shadow | N/A (no user traffic) | 2x for shadowed paths | Zero (read-only) | Highest |
The rest of this tutorial walks each strategy in production-flavored detail — not the textbook version, the version that survives an on-call rotation.
Recreate & Rolling Deploys
Why Most Teams Start Here
The Problem: You need to ship and you don’t have a service mesh, an LB scripting layer, or a progressive-delivery controller yet.
The Solution: Kubernetes’ native RollingUpdate strategy is the floor. It’s built in, it’s safe-by-default, and it gets you to zero-downtime deploys before you reach for anything fancier.
Recreate tears down all the old pods, then starts the new ones. There is downtime by design — usually seconds, sometimes minutes. Use it for stateful workloads that cannot run two versions side-by-side (e.g., a single-writer batch job, or a schema migration that’s incompatible with the old code). Never use it for a customer-facing service.
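Rolling replaces pods a few at a time. Kubernetes may run up to maxSurge pods above the desired replica count and allows at most maxUnavailable pods to be unavailable at any moment; within those limits it scales the new ReplicaSet up and the old one down until every pod is replaced. The two knobs control how aggressive you are.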
# A production-shaped rolling deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  labels:
    app: orders-api
spec:
  replicas: 12
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%              # at most 3 extra pods during the roll
      maxUnavailable: 0          # never drop capacity below replicas
  minReadySeconds: 15            # let metrics & warm-up settle before counting healthy
  progressDeadlineSeconds: 600   # report the rollout as failed after 10 min without progress (Kubernetes does not roll back on its own)
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:v2.4.1
          readinessProbe:
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 2
          livenessProbe:
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10 && /app/drain"]
The four knobs that decide if rolling actually works
- readinessProbe: the single most important field. A pod that says “ready” before it can serve traffic will black-hole every request that lands on it during the roll.
- minReadySeconds: require pods to stay healthy for N seconds before counting them as available. Catches crash-on-warmup bugs.
- maxUnavailable: 0: for stateless web services, never reduce capacity. Surge upward instead. Capacity drops during rolls cause cascading queue buildup downstream.
- preStop + terminationGracePeriodSeconds: sleep, then drain. Without this, in-flight requests get RST’d when the pod terminates.
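To make the surge arithmetic concrete, here is a minimal sketch of how the two percentages resolve to pod counts. It assumes the Kubernetes rounding rule (maxSurge rounds up, maxUnavailable rounds down); the function name `rolling_bounds` is made up for this illustration.

```python
import math
from typing import Tuple, Union

def rolling_bounds(replicas: int,
                   max_surge: Union[int, str],
                   max_unavailable: Union[int, str]) -> Tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    def resolve(value: Union[int, str], round_up: bool) -> int:
        # Percentages are taken against the desired replica count.
        if isinstance(value, str) and value.endswith("%"):
            raw = replicas * int(value[:-1]) / 100
            return math.ceil(raw) if round_up else math.floor(raw)
        return int(value)

    surge = resolve(max_surge, round_up=True)                # surge rounds up
    unavailable = resolve(max_unavailable, round_up=False)   # unavailable rounds down
    return replicas + surge, replicas - unavailable

print(rolling_bounds(12, "25%", 0))      # (15, 12): the orders-api settings above
print(rolling_bounds(10, "25%", "25%"))  # (13, 8)
```

With maxUnavailable at 0 the floor never drops below the desired count, which is the point of surging upward instead of draining first.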
Rolling is a slow rollback
Rolling deploys give you slow forward and slow back. If the new version is bad and you’ve already rolled out to 8 of 12 pods, rolling back means another full roll. Meanwhile every new request has a 67% chance of hitting the broken version. Rolling is the right starting point, but the day you find yourself staring at a half-rolled bad release is the day you graduate to canary.
Blue-Green Deployment
Why Blue-Green
The Problem: Rolling deploys leak the new version into production gradually, which means rollback is gradual too. For high-stakes releases (payment, auth, schema-coupled services) you want an atomic switch you can reverse in one click.
The Solution: Run two complete environments — blue (live) and green (new) — identical in every way. Cut traffic from blue to green by flipping the load balancer. If anything goes wrong, flip it back. Time-to-rollback is seconds, not minutes.
The classic shape: blue is serving 100% of traffic on v1.4. You stand up green on v1.5, run smoke tests against it directly (no real traffic), then switch the load balancer’s target group from blue to green. New requests hit green; in-flight requests on blue drain. Blue stays warm for at least one cycle (often a full day) so you can flip back instantly if something rotted in the first hour.
On Kubernetes, blue-green is just two Deployments and a Service whose selector you change:
# Two Deployments, distinguished by the slot label
apiVersion: apps/v1
kind: Deployment
metadata: { name: checkout-blue }
spec:
  replicas: 10
  selector: { matchLabels: { app: checkout, slot: blue } }
  template:
    metadata: { labels: { app: checkout, slot: blue, version: v1.4 } }
    spec: { containers: [{ name: app, image: checkout:v1.4 }] }
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: checkout-green }
spec:
  replicas: 10
  selector: { matchLabels: { app: checkout, slot: green } }
  template:
    metadata: { labels: { app: checkout, slot: green, version: v1.5 } }
    spec: { containers: [{ name: app, image: checkout:v1.5 }] }
---
# The Service is the switch. Change `slot: blue` to `slot: green` to cut over.
apiVersion: v1
kind: Service
metadata: { name: checkout }
spec:
  selector:
    app: checkout
    slot: blue   # <-- one-line cutover
  ports: [{ port: 80, targetPort: 8080 }]
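The one-line cutover can also be scripted so that the flip and the rollback are the same code path. The sketch below only builds the `kubectl patch` command for the checkout Service; the helper name `cutover_command` is hypothetical, and actually running the command (for example with `subprocess.run`) is left to the caller.

```python
import json
from typing import List

def cutover_command(service: str, slot: str) -> List[str]:
    """Build a kubectl command that points the Service selector at one slot."""
    # A merge patch only touches the keys it names, so `app: checkout` stays in place.
    patch = {"spec": {"selector": {"slot": slot}}}
    return ["kubectl", "patch", "service", service,
            "--type", "merge", "-p", json.dumps(patch)]

cut_over = cutover_command("checkout", "green")   # blue -> green
roll_back = cutover_command("checkout", "blue")   # the same call, reversed
print(" ".join(cut_over))
```

Because rollback is the same function with a different argument, practicing the cutover also exercises the rollback.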
Where blue-green hurts
- Double the infra cost while both environments are warm. For a 10-pod service that’s mild; for a fleet of 10,000 it’s a real bill.
- Database schemas must be backwards-compatible across blue and green. The cutover happens in seconds; the schema migration cannot.
- Stateful caches and connection pools are cold on green at cutover. Pre-warm them or expect a P99 spike for 30–90 seconds.
- Long-lived connections (gRPC streaming, WebSockets) don’t cut over — they pin to blue until the client reconnects.
Blue-green’s killer feature is time to rollback. The killer constraint is no partial validation — you go from 0% to 100% on the new version with one button. If green has a bug, every customer sees it. That’s why production teams that can run canary usually do, and reserve blue-green for releases where coupling forces an atomic switch.
Canary Deployment
Why Canary Wins for High-Volume Services
The Problem: You want to test in production — the only environment that has production traffic — but without exposing every customer to a bad release.
The Solution: Send a small fraction of real traffic (1%, 5%, 10%) to the new version, watch the metrics that matter, and either promote or roll back automatically. Most outages either show up in the canary or never show up at all.
A canary deployment is named after the bird in the coal mine. The new version sees real traffic but only enough to die quietly if it’s going to die. Tooling has caught up: Argo Rollouts, Flagger, LaunchDarkly, Linkerd’s SMI traffic split, and managed offerings from AWS CodeDeploy all do canary as a first-class concept. You shouldn’t hand-roll it in 2026.
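To show what "a small fraction of real traffic" means mechanically, here is a rough sketch of percentage-based assignment. It is an illustration only: meshes such as Istio split by route weight per request, while this variant hashes a request key so the same caller always lands on the same side. The function name and key format are assumptions.

```python
import hashlib

def routes_to_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically place a key in the canary for roughly canary_percent of keys."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # stable bucket in 0..9999
    return bucket < canary_percent * 100                  # 5% -> buckets 0..499

keys = [f"user-{i}" for i in range(100_000)]
share = sum(routes_to_canary(k, 5) for k in keys) / len(keys)
print(f"canary share at 5%: {share:.3f}")
```

A useful property of bucketing is that raising the weight only adds keys: anyone in the canary at 5% is still in it at 25%.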
Argo Rollouts canary spec
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 20
  strategy:
    canary:
      canaryService: payments-canary   # gets canary traffic only
      stableService: payments-stable   # gets the rest
      trafficRouting:
        istio:
          virtualService:
            name: payments-vs
            routes: [primary]
      steps:
        - setWeight: 5                 # 5% canary
        - pause: { duration: 10m }
        - analysis:
            templates: [{ templateName: success-rate-and-latency }]
        - setWeight: 25
        - pause: { duration: 15m }
        - analysis:
            templates: [{ templateName: success-rate-and-latency }]
        - setWeight: 50
        - pause: { duration: 30m }
        - setWeight: 100
  selector: { matchLabels: { app: payments-api } }
  template:
    metadata: { labels: { app: payments-api } }
    spec:
      containers:
        - name: app
          image: registry.example.com/payments-api:v3.7.0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate-and-latency }
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                         # assumed sample count; with an interval and no count, an inline analysis step would not finish on its own
      successCondition: result[0] >= 0.995
      failureLimit: 2                  # tolerate 2 failed samples; a 3rd fails the analysis and aborts the rollout
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="payments-api",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="payments-api"}[2m]))
    - name: p99-latency
      successCondition: result[0] <= 300   # ms
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_ms_bucket{service="payments-api"}[2m])))
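The failureLimit field is easy to misread, so here is a simplified model of how the success-rate metric reaches a verdict. It assumes the documented meaning of failureLimit (the maximum number of failed measurements that are still tolerated) and ignores inconclusive and error states; it is not the controller's actual code.

```python
from typing import Iterable

def analysis_verdict(samples: Iterable[float],
                     threshold: float = 0.995,
                     failure_limit: int = 2) -> str:
    """Return 'Failed' once failed samples exceed failure_limit, else 'Successful'."""
    failures = 0
    for value in samples:
        if value < threshold:            # mirrors successCondition: result[0] >= 0.995
            failures += 1
            if failures > failure_limit:
                return "Failed"          # the rollout aborts back to stable
    return "Successful"

print(analysis_verdict([0.999, 0.990, 0.999, 0.980, 0.999]))  # Successful (2 failures tolerated)
print(analysis_verdict([0.990, 0.980, 0.970]))                # Failed (3rd failure exceeds the limit)
```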
Flagger Canary CR (alternative)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: orders
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  service:
    port: 80
    targetPort: 8080
    gateways: [public-gateway]
    hosts: [orders.example.com]
  analysis:
    interval: 1m
    threshold: 5          # max failed checks before rollback
    maxWeight: 50         # cap canary at 50% before promoting
    stepWeight: 5         # +5% per successful interval
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }    # >= 99% non-5xx
        interval: 1m
      - name: request-duration
        thresholdRange: { max: 500 }   # P99 ms
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          # the load tester runs in the test namespace, so the target needs the production namespace suffix
          cmd: "hey -z 1m -q 10 -c 2 http://orders-canary.production/"
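The interval, stepWeight, maxWeight, and threshold settings together fix how long a promotion or a rollback takes. The sketch below works that out using the formulas given in the Flagger documentation (minimum promotion time is interval × maxWeight / stepWeight; rollback time is interval × threshold). The function name is made up for this example.

```python
from typing import List, Tuple

def canary_schedule(step_weight: int, max_weight: int,
                    interval_min: int, threshold: int) -> Tuple[List[int], int, int]:
    """Return (weight steps, minimum minutes to promote, minutes until rollback)."""
    weights = list(range(step_weight, max_weight + 1, step_weight))
    min_promote = interval_min * (max_weight // step_weight)
    rollback = interval_min * threshold
    return weights, min_promote, rollback

weights, promote_min, rollback_min = canary_schedule(5, 50, 1, 5)
print(weights)       # [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
print(promote_min)   # 10
print(rollback_min)  # 5
```

For the orders Canary above, a clean release needs at least ten minutes of analysis before promotion, and a failing one is rolled back after about five.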
Istio VirtualService for traffic splitting
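# Subset-based split. For Argo Rollouts to manage these subsets, the Rollout must also
# reference the DestinationRule under trafficRouting.istio.destinationRule
# (name, canarySubsetName, stableSubsetName). With only canaryService/stableService set,
# it expects the two route destinations to be those Services instead.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
spec:
  hosts: [payments]
  http:
    - name: primary
      route:
        - destination: { host: payments, subset: stable }
          weight: 95
        - destination: { host: payments, subset: canary }
          weight: 5   # Argo Rollouts updates this on each step
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  subsets:
    - name: stable
      labels: { app: payments-api, role: stable }
    - name: canary
      labels: { app: payments-api, role: canary }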
Your canary metrics must be the metrics customers actually care about
It is tempting to use whatever Prometheus already exposes — CPU, memory, GC time. Don’t. A canary that promotes because CPU is normal will happily ship a release that returns wrong answers at 200 ms. The right metrics are HTTP success rate, P99 latency at the customer-visible endpoint, and business KPIs (orders/minute, login success rate, search-to-click). The metric that matters is the customer-visible one. Everything else is a proxy that will eventually lie to you.
Shadow / Mirror Traffic
Why Shadow Traffic Exists
The Problem: Some changes are too risky to expose to even 1% of users — a rewritten ranker, a new pricing engine, a swapped database driver. You need production traffic to validate, but you cannot risk production responses.
The Solution: Mirror a copy of every request to the new version. Compare its responses to the old version’s. The user only sees the old version’s answer. The new version sees real load and real data with zero customer impact.
Shadow traffic is the highest-confidence, highest-cost strategy. You’re running a second copy of the service that does the work and throws the answer away — or, more usefully, persists it for offline diff against the production answer. Istio supports this natively:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: search-vs }
spec:
  hosts: [search]
  http:
    - route:
        - destination: { host: search, subset: stable }
          weight: 100
      mirror:
        host: search
        subset: candidate   # new version sees a copy of every request
      mirrorPercentage:
        value: 100.0        # mirror everything, or set to 10.0 to cut cost
What shadowing is and is not safe for
- Safe: read-only paths (search, ranking, recommendations), idempotent reads, latency benchmarking, regression hunting via response diff.
- Not safe: any path that mutates state. A mirrored POST /orders creates a duplicate order. A mirrored POST /charge double-charges customers. Mirror only after you’ve confirmed the path is read-only or has been wired through an idempotency key, and even then, scrub the side-effecting downstream calls (e.g., point the shadow at a separate Kafka topic, a sandbox payment processor, a /dev/null email sink).
- Watch the bill: mirroring doubles compute, doubles downstream load, doubles database read traffic. For high-QPS services this is non-trivial.
Shadow traffic shines when paired with response diffing. Twitter’s Diffy and LinkedIn’s internal tooling compare old vs. new responses field-by-field, flagging cases where the candidate disagrees with stable. You catch correctness regressions you would never notice from success-rate metrics alone.
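As a rough illustration of field-by-field response diffing, the sketch below walks two JSON-like responses and reports every path where the candidate disagrees with stable. It is not Diffy's algorithm (it does no noise filtering and compares lists as whole values); the function name and sample payloads are invented for the example.

```python
from typing import Any, List, Tuple

MISSING = "<missing>"

def diff_responses(stable: Any, candidate: Any, path: str = "") -> List[Tuple[str, Any, Any]]:
    """Return (path, stable_value, candidate_value) for every disagreement."""
    diffs: List[Tuple[str, Any, Any]] = []
    if isinstance(stable, dict) and isinstance(candidate, dict):
        for key in sorted(set(stable) | set(candidate)):
            child = f"{path}.{key}" if path else key
            if key not in stable:
                diffs.append((child, MISSING, candidate[key]))
            elif key not in candidate:
                diffs.append((child, stable[key], MISSING))
            else:
                diffs.extend(diff_responses(stable[key], candidate[key], child))
    elif stable != candidate:
        diffs.append((path, stable, candidate))   # leaf values and lists compared whole
    return diffs

stable = {"total": 3, "items": [{"id": 1}], "meta": {"ranker": "v1", "took_ms": 12}}
candidate = {"total": 3, "items": [{"id": 2}], "meta": {"ranker": "v2", "took_ms": 12}, "debug": True}
for path, old, new in diff_responses(stable, candidate):
    print(f"{path}: {old!r} -> {new!r}")
```

In practice you would exclude fields that legitimately differ between runs (timestamps, request IDs) before counting a disagreement as a regression.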
Feature Flags vs. Deployment Strategies
Two Different Axes, Both Required
The Problem: Teams conflate “deploy” with “release.” They are different events. A canary controls which servers run the new code. A flag controls which users see the new behavior. Coupling them means every release is also a deploy and every deploy is also a release — which is exactly the rigidity you’re trying to escape.
The Solution: Deploy continuously with rolling/canary; release independently with feature flags. The new code can sit in production for weeks before any user sees it, and you can dark-launch behavior to internal users, beta cohorts, or a single customer for diagnosis — all without a deploy.
The two axes:
| | Deployment strategy | Feature flag |
|---|---|---|
| What it controls | Which binary runs where | Which behavior is enabled for whom |
| Granularity | Per pod, per region, per cell | Per user, per cohort, per request attribute |
| Rollback speed | Seconds to minutes | Milliseconds (config push) |
| Tools | Argo Rollouts, Flagger, Spinnaker, AWS CodeDeploy | LaunchDarkly, Unleash, OpenFeature, Statsig |
| Best for | Infra changes, rewrites, dependency upgrades | UX changes, business rules, A/B tests, kill switches |
Together they enable dark launches — ship the code, leave the flag off, then enable it for 1% of users next Tuesday with no deploy. And they enable ring-based rollouts, the model Microsoft has used for Windows and Office for over a decade:
The Microsoft ring model
- Ring 0 — Canary: the team that wrote it. Hours to days.
- Ring 1 — Internal: the rest of the company. Days to a week.
- Ring 2 — Insider/Beta: opted-in external users. One to two weeks.
- Ring 3 — Production wave 1: small geographies or low-risk cohorts. Days.
- Ring 4 — General availability: the entire user base.
Each ring is a feature-flag cohort, not a separate deploy. The same binary is in every ring; the flag decides who gets the new behavior. A bug found in Ring 1 never reaches Ring 4.
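A ring-gated flag can be sketched in a few lines. The ring names below are shorthand for the five rings listed above, and `flag_enabled` is a hypothetical helper, not the API of any flag vendor.

```python
# Rings ordered from earliest exposure to latest.
RINGS = ["team", "internal", "beta", "wave1", "ga"]

def flag_enabled(user_ring: str, rollout_ring: str) -> bool:
    """A user sees the new behavior once the rollout has reached their ring."""
    return RINGS.index(user_ring) <= RINGS.index(rollout_ring)

# Same binary everywhere; only the flag's current ring changes over time.
print(flag_enabled("team", "internal"))  # True: Ring 0 users already have it
print(flag_enabled("ga", "internal"))    # False: general users do not see it yet
```

Pulling the rollout back from Ring 1 to Ring 0 is a config change to one value, which is why a bug found in an early ring does not have to reach a later one.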
Flags are debt
Every feature flag is an if statement that someone will forget to delete. Long-lived flags become invisible coupling between modules and silent landmines (“why does this break only when flag X is on and flag Y is off?”). Treat flags like temporary scaffolding: open a deletion ticket the day you create the flag, and clean up within 90 days of the rollout completing. LaunchDarkly even has “technical debt” reporting for stale flags — use it.
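The 90-day rule is simple enough to enforce with a scheduled check. This sketch assumes you can export each flag's name and the date its rollout completed (None while still rolling out); the data shape and function name are assumptions, not a vendor API.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

def stale_flags(flags: Dict[str, Optional[date]], today: date,
                max_age_days: int = 90) -> List[str]:
    """Return flags whose rollout completed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, completed in flags.items()
                  if completed is not None and completed < cutoff)

flags = {
    "new-checkout-flow": date(2025, 1, 10),   # finished long ago: should be deleted
    "ranker-v2": date(2025, 5, 1),            # finished recently: still within 90 days
    "pricing-engine": None,                   # rollout still in progress
}
print(stale_flags(flags, today=date(2025, 6, 1)))  # ['new-checkout-flow']
```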
Progressive Delivery and SLO-Driven Promotion
Why Humans Should Not Press the Promote Button
The Problem: A human watching a Grafana dashboard at 3 AM during a canary roll is going to miss the 2-minute spike that mattered, or panic at the 30-second blip that didn’t. The decision “promote or roll back” is too important to be made by tired humans pattern-matching on graphs.
The Solution: Progressive delivery means the SLO is the gate. The controller queries Prometheus, compares to a baseline, and either advances to the next weight or aborts. The human is in the loop only when the math is ambiguous.
Progressive delivery is what canary becomes when you take it seriously. The promotion logic is code. The metrics are SLOs. The rollback is automated. The human is a reviewer of choices made by the controller, not the operator pressing buttons.
What makes a good promotion metric:
- Customer-visible. Success rate at the edge, not at the database. Latency at the user, not at the load balancer.
- Statistically valid at canary scale. A 1% canary on a service doing 100 rps gives you 1 rps of canary traffic — not enough signal to detect a P99 regression in 5 minutes. Either pump up the canary weight, lengthen the wait, or pick metrics with bigger denominators.
- Compared to the baseline, not an absolute threshold. “P99 must be under 300 ms” is brittle — legitimate traffic shifts move it. “P99 must be no more than 10% worse than the stable subset” survives reality.
- Symmetric. If the canary is better than baseline by 50%, that’s also suspicious — it usually means the canary isn’t actually getting the requests you think it is.
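Two of the points above reduce to arithmetic: how many tail samples a small canary actually yields, and what a symmetric comparison looks like. The sketch below works through both; the function names and the 10% / 50% bands are taken from the examples in the list, not from any tool's defaults.

```python
from typing import Tuple

def canary_tail_samples(total_rps: float, canary_fraction: float,
                        window_s: float, quantile: float = 0.99) -> Tuple[float, float]:
    """Return (canary requests in the window, requests beyond the quantile)."""
    requests = total_rps * canary_fraction * window_s
    return requests, requests * (1 - quantile)

def compare_to_baseline(canary_p99: float, stable_p99: float,
                        worse: float = 0.10, better: float = 0.50) -> str:
    """Flag a canary that is too slow, or implausibly fast, relative to stable."""
    ratio = canary_p99 / stable_p99
    if ratio > 1 + worse:
        return "worse"
    if ratio < 1 - better:
        return "suspiciously better"
    return "ok"

requests, tail = canary_tail_samples(total_rps=100, canary_fraction=0.01, window_s=300)
print(round(requests), round(tail))   # 300 3: a P99 estimate built on about 3 requests
print(compare_to_baseline(340, 300))  # worse
print(compare_to_baseline(140, 300))  # suspiciously better
```

Three tail requests in five minutes is why a 1% canary on a 100 rps service cannot support a P99 gate without more weight or a longer window.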
# A good Argo Rollouts analysis compares canary to stable, not to a constant.
# Argo Rollouts metrics have no separate baseline field, so the comparison lives in the
# query: it returns canary success rate divided by stable success rate.
metrics:
  - name: success-rate-vs-baseline
    successCondition: result[0] >= 0.99   # canary within 1% of stable
    failureCondition: result[0] < 0.95    # anything in between is inconclusive
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          (
            sum(rate(http_requests_total{service="payments",role="canary",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="payments",role="canary"}[2m]))
          )
          /
          (
            sum(rate(http_requests_total{service="payments",role="stable",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="payments",role="stable"}[2m]))
          )
Automated rollback triggers worth wiring up
- Error-rate ratio — canary 5xx rate > 1.5x stable 5xx rate.
- Latency ratio — canary P99 > 1.2x stable P99 over a 5-minute window.
- Saturation — CPU or memory on canary pods sustained > 90% (real bug or cardinality leak).
- Crashloop — any pod restart count > 0 in canary set.
- Business metric guardrail — orders/minute, signups/minute, payment-success-rate cannot drop below their seasonal baseline.
Wire each one to halt the rollout (not just alert). The whole point is to remove humans from the speed-of-decision path.
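The first four triggers can be expressed as one pure function over a canary snapshot and a stable snapshot, which makes them easy to unit-test before wiring them to a controller. This is a sketch: the `Snapshot` fields and function name are invented here, and the business-metric guardrail is left out because it needs a seasonal baseline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Snapshot:
    error_rate: float        # fraction of requests returning 5xx
    p99_ms: float            # P99 latency over the comparison window
    cpu_utilization: float   # 0.0 to 1.0, sustained
    restarts: int            # pod restarts in this set

def tripped_triggers(canary: Snapshot, stable: Snapshot) -> List[str]:
    """Return the names of every rollback trigger the canary has tripped."""
    tripped = []
    if canary.error_rate > 1.5 * stable.error_rate:
        tripped.append("error-rate ratio")
    if canary.p99_ms > 1.2 * stable.p99_ms:
        tripped.append("latency ratio")
    if canary.cpu_utilization > 0.90:
        tripped.append("saturation")
    if canary.restarts > 0:
        tripped.append("crashloop")
    return tripped

stable = Snapshot(error_rate=0.002, p99_ms=200, cpu_utilization=0.50, restarts=0)
canary = Snapshot(error_rate=0.004, p99_ms=210, cpu_utilization=0.60, restarts=0)
print(tripped_triggers(canary, stable))  # ['error-rate ratio']
```

Any non-empty result should halt the rollout. Note that when stable has a 5xx rate of exactly zero, a single canary error trips the ratio check, so a minimum request count is a sensible extra guard.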
Real-World Examples
Netflix — Spinnaker pipelines. Netflix open-sourced Spinnaker after building it for their own needs. A typical Spinnaker pipeline bakes an AMI, runs it through judgment-day testing, deploys to a canary cluster alongside a baseline cluster (same code, same traffic split, no new version — the baseline exists to detect noise), runs Kayenta canary analysis comparing canary vs. baseline metrics, and either promotes or rolls back automatically. The deploy itself is 50+ stages long, and humans rarely touch it once it starts.
Facebook (Meta) — mobile rings. Facebook’s mobile apps roll through alpha (employees), beta (opted-in users via Play Store/TestFlight), and production (general availability). Each ring is days to weeks long. A regression caught in alpha never reaches a billion users. The same binary moves through all three; what changes is the cohort the build is published to.
Google — canary analysis service. Google’s SRE book describes a canary analysis service that statistically compares canary and control populations across hundreds of metrics, accounts for natural variance, and emits a single “promote / hold / rollback” verdict. The output is so trusted that it gates deploys with no human review by default. Kayenta (open-sourced via Spinnaker) is the publicly available analog.
Amazon — deployment automation philosophy. Amazon’s public talks describe a culture where the deployment system is owned harder than the application. Every change goes through a pipeline (CodePipeline / CodeDeploy) that does region-by-region, AZ-by-AZ, cell-by-cell rollout, with bake times between stages and automated rollback on metric regression. The blast radius of any single deploy is bounded to one cell of one AZ in one region. A bad deploy can hurt <0.1% of customers before the system catches it.
Helm + Argo Rollouts — the open-source default. For most teams not at FAANG scale, the working stack is Helm to template the manifests, Argo CD to apply them, and Argo Rollouts (or Flagger) to drive the canary. Plus Prometheus for metrics and a service mesh (Istio or Linkerd) for traffic splitting. Every piece is open source. None of it requires a custom team of platform engineers.
Best Practices
The short list
- Pick the cheapest strategy that bounds your blast radius. Rolling for low-risk services, canary for everything customer-facing, blue-green when you must cut over atomically, shadow only when you cannot afford even 1% exposure.
- Decouple deploy from release. Ship code with feature flags off. Releasing later is a config change, not a deploy.
- Make rollback faster than rollout. The asymmetry matters. If forward takes 10 minutes and rollback takes 30, you are gambling. Blue-green and canary both invert this.
- Database migrations are backwards-compatible or they’re not done. Two versions of the app must run against the same schema during the rollout. Expand-then-contract: add the new column, deploy code that writes both, deploy code that reads new, drop the old column — four releases, never one.
- Canary metrics are customer-visible metrics. Success rate at the edge, P99 at the user. Resource metrics are signals, not gates.
- Compare canary to a concurrent baseline, not to a fixed threshold. Traffic patterns drift; the relative number is honest in a way the absolute number is not.
- Bake time is non-negotiable. Five minutes catches crash-loops. Thirty minutes catches memory leaks. A few hours catches the bug that only fires when the cron job runs. Don’t skip bake time to ship faster.
- Automate the rollback. The on-call human is for diagnosing the bad release, not for being the trigger that pulls it.
- Practice rolling back. Rollback is a code path. Untested code paths break. Quarterly game days that explicitly roll back a clean release keep the muscle alive.
- Delete feature flags within 90 days of full rollout. Long-lived flags are the slow-burning technical debt every codebase regrets.
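The expand-then-contract sequence from the list above can be walked through with an in-memory SQLite database. The table, column names, and the backfill step are assumptions made for this illustration; the final drop is shown only as a comment because it is safe only once no running version still reads the old column.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.execute("INSERT INTO users (email) VALUES ('a@example.com')")   # row written by the old code

# Release 1 (expand): add the new column. Old code is unaffected by an extra nullable column.
db.execute("ALTER TABLE users ADD COLUMN contact_email TEXT")

# Release 2: new code writes both columns; a one-off backfill (assumed here) copies old rows.
db.execute("INSERT INTO users (email, contact_email) VALUES (?, ?)",
           ("b@example.com", "b@example.com"))
db.execute("UPDATE users SET contact_email = email WHERE contact_email IS NULL")

# Release 3: new code reads only the new column, while old pods can still read the old one.
new_view = [row[0] for row in db.execute("SELECT contact_email FROM users ORDER BY id")]
old_view = [row[0] for row in db.execute("SELECT email FROM users ORDER BY id")]
print(new_view == old_view)  # True: both versions see consistent data mid-rollout

# Release 4 (contract): once nothing reads `email`, drop it in its own release, e.g.
#   ALTER TABLE users DROP COLUMN email
```

At every step, both the old and the new version of the app can run against the same schema, which is what makes the deploy itself reversible.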
The single most useful sentence about deployments
The deploy is not the moment your new code starts working. The deploy is the moment your old code stops being your safety net. Pick the strategy that lets you put the safety net back the fastest.