Microservices Cheatsheet

Dense reference card for microservice decisions, tradeoffs, and tools.
For when you need an answer fast, not a tutorial.


Service Decomposition Heuristics

Map a real-world signal to a decomposition decision. If two signals point opposite ways, the louder one usually wins.

| Signal | Observation | Decision |
| --- | --- | --- |
| Team size | One team owns it end-to-end (5–9 people) | One service per team; split when team splits |
| Change frequency | Two areas change at very different rates | Split: the slow part shouldn’t block the fast part |
| Scaling needs | One workload is 10x heavier than the rest | Split: scale the hot path independently |
| Data ownership | One bounded context owns the entity | That context owns the service and its DB |
| Failure isolation | Outage in area X must not affect area Y | Split: separate process, separate deploy |
| Tech stack | Workload demands a different language/runtime | Split, but only if the cost is justified |
| Compliance | PCI / HIPAA / GDPR scope is local | Split: shrink the audit blast radius |
| Transactional coupling | Two entities require strong ACID across them | Keep together: don’t fight the database |
| Latency budget | Caller needs <5 ms and you cross the network | Keep together or co-locate |
| Reuse | Many services need the same logic | Library first; service only if it has its own data |

Communication Patterns

Pick the pattern by the answer the caller needs and how soon it needs it.

| Pattern | Use Case | Tradeoff |
| --- | --- | --- |
| Sync REST / JSON | External API, simple CRUD, browser clients | Easy to debug; chatty; tight coupling |
| Sync gRPC | Internal service-to-service, low latency, polyglot | Faster + typed; harder to debug; needs HTTP/2 |
| GraphQL (BFF) | Multiple clients with different field needs | Flexible queries; N+1, caching, and auth get harder |
| Async event (pub/sub) | Broadcast facts: order placed, user signed up | Loose coupling; eventual consistency; harder to trace |
| Async queue (work) | Hand off a task to be processed later | Smooths spikes; needs idempotency + DLQ |
| Request/reply over queue | Async with a correlation ID for the response | Decouples uptime; latency is unpredictable |
| Streaming (Kafka, Kinesis) | Ordered, replayable log of events | Powerful; ops-heavy; consumer state is your problem |
| WebSocket / SSE | Server pushes to browser/client | Real-time; stateful connections complicate scaling |
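
The "needs idempotency + DLQ" tradeoff of a work queue can be sketched in a few lines. This is a minimal in-memory illustration, not a broker client: the message shape (`id`, `attempts`), the `MAX_ATTEMPTS` budget, and the `drain` helper are all assumptions made for the example.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


def drain(queue, handler, processed_ids, dead_letters):
    """Consume at-least-once deliveries: skip duplicates, park poison messages."""
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: already handled, safe to drop
        try:
            handler(msg)
            processed_ids.add(msg["id"])  # remember the ID so redeliveries are no-ops
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)  # give up: move it to the dead-letter queue
            else:
                queue.append(msg)  # requeue for another try
```

In a real system the processed-ID set would live in durable storage next to the handler's side effects, so a crash between the two cannot lose the dedupe record.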

Quick rule

Resilience Patterns Cheatsheet

Compose them in order: bulkhead → circuit breaker → timeout → retry → fallback. See Circuit Breaker & Resilience for full implementations.

| Pattern | What It Does | Typical Config | More |
| --- | --- | --- | --- |
| Timeout | Caps how long a single call can run | P99 of dependency × 1.5 (e.g. 500 ms–3 s) | composition |
| Retry + backoff | Re-attempts transient failures with jitter | 3–5 attempts, base 1 s, full jitter | retries |
| Circuit breaker | Fails fast when a dependency is sick | 50% failure rate over last 20 calls, 30 s cooldown | breakers |
| Bulkhead | Caps concurrent calls per dependency | 5–20 in-flight + small queue, then shed | bulkheads |
| Fallback | Returns degraded response on failure | Cached value, default, or 503-fast | composition |
| Rate limit | Caps inbound calls per client/route | Token bucket; per-key quotas | API gateway |
| Load shedding | Rejects new work when overloaded | Drop low-priority requests first | resilience |
| Idempotency key | Lets clients safely retry writes | Server dedupes by key for 24 h | retries |
| Chaos test | Verifies the patterns actually work | Latency + error injection in staging, then prod | chaos |
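
As a rough illustration of two rows above, here is a sketch of retry with full-jitter backoff and a count-based circuit breaker using the table's example numbers (base 1 s; 50% failures over the last 20 calls; 30 s cooldown). The class and function names, the 30 s delay `cap`, and the single-probe half-open behaviour are assumptions for this example, and the code is not thread-safe.

```python
import random
import time
from collections import deque


def full_jitter_delay(attempt, base=1.0, cap=30.0, rand=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rand() * min(cap, base * 2 ** attempt)


def retry(call, attempts=4, base=1.0, sleep=time.sleep):
    """Run call() up to `attempts` times with a jittered sleep between tries.

    Catches every Exception for brevity; real code should retry only
    errors known to be transient.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget spent: surface the last error
            sleep(full_jitter_delay(attempt, base))


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, window=20, failure_rate=0.5, cooldown=30.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True marks a failed call
        self.failure_rate = failure_rate
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        trial = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            trial = True  # cooldown elapsed: let one probe call through (half-open)
        try:
            result = fn()
        except Exception:
            self._record(True, trial)
            raise
        self._record(False, trial)
        return result

    def _record(self, failed, trial):
        if trial:
            if failed:
                self.opened_at = self.clock()  # probe failed: restart the cooldown
            else:
                self.opened_at = None  # probe succeeded: close and start fresh
                self.results.clear()
            return
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.failure_rate:
            self.opened_at = self.clock()  # trip: too many recent failures
```

A caller might wrap a dependency as `retry(lambda: breaker.call(fetch_profile))`, where `fetch_profile` is a hypothetical client function; in that arrangement the retry loop should stop on `CircuitOpenError` rather than keep hammering an open breaker.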

Data Patterns

The hard problem in microservices is data. Pick the lightest pattern that solves your case.

| Pattern | Use Case | Cost |
| --- | --- | --- |
| Database-per-service | Default. Each service owns its schema and storage. | No cross-service joins; needs API/event for shared data |
| Shared database (avoid) | Legacy or short-term migration only | Tight coupling; one schema change breaks N services |
| Saga (choreography) | Multi-service workflow via events | Hard to visualize; emergent flow |
| Saga (orchestration) | Multi-service workflow with a coordinator | Single point to debug; coordinator is critical |
| Outbox | Atomic DB write + reliable event publish | Extra table + relay; at-least-once delivery, so consumers must dedupe |
| CQRS | Read load >> write load, or different shapes | Two models to maintain; eventual consistency |
| Event sourcing | Audit, time-travel, replayable history | Schema evolution is hard; rebuilding projections is expensive |
| Materialized view | Read-only join of data from many services | Stale by design; needs invalidation strategy |
| Change Data Capture | Stream DB changes to other systems | Couples consumers to internal schema |
| 2-phase commit (avoid) | Strong cross-service ACID | Blocking; brittle; almost always the wrong answer |
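
The outbox row is easier to picture with a small sketch. The version below uses Python's built-in `sqlite3` as a stand-in for the service database; the table layout, the `order.placed` topic, and the `place_order` / `relay_once` helpers are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT, "
        "sent INTEGER DEFAULT 0)"
    )
    return conn


def place_order(conn, order_id, total):
    """Write the order and its event in one local transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_once(conn, publish):
    """Publish unsent rows in insertion order; mark each sent only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes after `publish` but before the `UPDATE`, the same event goes out again on the next pass, which is why consumers of an outbox feed need to be idempotent.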

API Versioning Quick Reference

Pick one strategy and apply it everywhere — mixing versioning styles within an org is the real failure mode.

| Strategy | Example | Pros | Cons | Used By |
| --- | --- | --- | --- | --- |
| URL path | /v2/orders | Visible, cache-friendly, easy to route | URL changes per version; clutter | Twilio, GitHub |
| Query param | /orders?v=2 | Trivial to add | Easy to forget; bad for caching | Rare in production |
| Header | API-Version: 2 | Clean URLs; client opts in | Invisible; tooling support uneven | Azure, some AWS APIs |
| Media type | Accept: application/vnd.x.v2+json | RESTful; versions per representation | Verbose; clients forget the header | GitHub (historic) |
| Date-based header | Stripe-Version: 2024-06-20 | Explicit pinning; gradual migration | Need a version registry | Stripe |
| Schema evolution (no version) | Add fields, never remove; tolerant readers | No breakage; cheapest to operate | Schema bloat; hard to deprecate | gRPC + protobuf shops, Kafka |
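
A tolerant reader, as named in the schema-evolution row, can be sketched like this. The `Order` fields and the `parse_order` helper are hypothetical; the point is only that unknown fields are ignored and newly added fields carry defaults.

```python
from dataclasses import dataclass, fields


@dataclass
class Order:
    id: str
    total: float
    currency: str = "USD"  # assumed later addition; the default keeps old payloads valid


def parse_order(payload: dict) -> Order:
    """Build an Order from a payload, ignoring any fields this reader does not know."""
    known = {f.name for f in fields(Order)}
    return Order(**{k: v for k, v in payload.items() if k in known})
```

An older producer that omits `currency` and a newer one that adds, say, `gift_wrap` both parse without error, which is what lets producers and consumers deploy independently.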

Rules of thumb

Deployment Strategies Quick Reference

| Strategy | How It Works | Rollback Speed | Infra Cost | Blast Radius |
| --- | --- | --- | --- | --- |
| Recreate | Stop old, start new | Slow (redeploy old) | 1x | 100% (downtime) |
| Rolling | Replace pods in batches | Minutes (rolling back) | ~1x (small surge) | Gradual; mixed versions briefly |
| Blue-green | Deploy new color, switch traffic | Seconds (flip router) | 2x during cutover | 0% until full cutover |
| Canary | Small slice of traffic to new version | Seconds (drop canary) | ~1.05x | 1–5% of users |
| Progressive delivery | Canary + auto-promotion on SLO | Seconds (auto-rollback) | ~1.05x + tooling | Bounded by SLO gates |
| Shadow / mirror | Duplicate prod traffic to new version, ignore response | Instant (no users) | 2x (briefly) | 0% (users untouched) |
| Feature flag | New code shipped dark, toggled per cohort | Instant (flip flag) | 1x | Whatever the flag scope is |
| A/B test | Two variants split for measurement | Seconds (kill variant) | ~1.05x | Defined experiment cohort |
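
Canary and per-cohort feature-flag rollouts both need a stable way to put a user in or out of the slice. One common approach, sketched here under assumptions (the `salt` value and function names are made up for the example), is to hash the user ID into a fixed number of buckets.

```python
import hashlib


def bucket(user_id: str, salt: str = "checkout-v2") -> int:
    """Map a user to a stable bucket in 0..9999 (0.01% steps)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(user_id: str, canary_percent: float) -> str:
    """Send roughly `canary_percent` of users to the canary, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent * 100 else "stable"
```

Because the bucket depends only on the salt and the user ID, a user stays on the same side as the percentage grows, and rolling back is just setting `canary_percent` to 0.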

Observability Pillars

Three signals; each answers a different question. You need all three.

|  | Logs | Metrics | Traces |
| --- | --- | --- | --- |
| Question | What happened in this request? | How is the system behaving over time? | Where did this request spend its time? |
| Cardinality | High: raw event stream | Low: aggregated counters/gauges | High: one tree per request (sampled) |
| Format | Structured JSON, one line per event | OpenMetrics / Prometheus | OpenTelemetry / W3C tracecontext |
| Tools | Loki, ELK, Splunk, Datadog Logs | Prometheus, Datadog, CloudWatch, Grafana | Jaeger, Tempo, Honeycomb, Lightstep |
| Alert on | Error patterns, security events | SLO burn, saturation, error rate | Latency spikes in a specific span |
| Cost driver | Volume (GB/day) | Cardinality (label combos) | Sample rate × span count |
| Retention | 7–30 days hot | 13–15 months | 3–7 days hot, longer for slow traces |
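
Logs and traces become far more useful when each log line carries the trace ID. The sketch below parses a W3C `traceparent` header (version `00` layout only) and emits one structured JSON line per event; the `log_line` helper and its field names are assumptions for this example, not a standard logging schema.

```python
import json
import re

# version "00" layout: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return trace context from a traceparent header, or None if it is invalid."""
    match = TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # reserved version or all-zero IDs are invalid
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}


def log_line(level, message, traceparent=None, **extra):
    """Render one event as a single JSON line, tagged with trace IDs when available."""
    record = {"level": level, "msg": message, **extra}
    ctx = parse_traceparent(traceparent) if traceparent else None
    if ctx:
        record["trace_id"] = ctx["trace_id"]
        record["span_id"] = ctx["span_id"]
    return json.dumps(record, sort_keys=True)
```

With the trace ID in every line, a latency spike found in a trace view can be joined back to the exact log events for that request.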

Golden signals to alert on (Google SRE)

Latency, traffic, errors, and saturation.

Security Quick Reference

| Concern | Standard | Tool / Implementation |
| --- | --- | --- |
| End-user auth | OIDC, OAuth 2.1 | Auth0, Okta, Cognito, Keycloak |
| Service-to-service auth | mTLS, SPIFFE / SPIRE, signed JWT | Istio, Linkerd, cert-manager |
| Authorization | RBAC, ABAC, ReBAC | OPA / Rego, Cedar, Zanzibar / SpiceDB |
| API keys / tokens | Short-lived JWTs, rotated keys | Gateway-issued tokens, KMS-signed |
| Secrets at rest | Sealed secrets, envelope encryption | Vault, AWS Secrets Manager, SOPS, Sealed Secrets |
| Transport encryption | TLS 1.3 everywhere | cert-manager + Let’s Encrypt; mesh-managed certs |
| Network segmentation | Zero trust, default-deny | NetworkPolicy, Cilium, mesh authz |
| Supply chain | SLSA, SBOM, signed artifacts | Sigstore (cosign), Syft, Grype, Trivy |
| Image scanning | CVE scan in CI | Trivy, Grype, Snyk, Wiz |
| Runtime threat detection | eBPF-based runtime security | Falco, Tetragon, Aqua |
| Audit log | Immutable, append-only | SIEM (Splunk, Datadog, Sumo) |
| WAF / bot defense | L7 inspection at edge | Cloudflare, AWS WAF, Akamai |

Three rules

Common Anti-Patterns

If you recognize one of these in your stack, fix it before adding more services.

Tooling Cheatsheet

Pick one per category and standardize. Two tools competing for the same job is the most expensive choice.

| Category | Tool | One-Liner |
| --- | --- | --- |
| Service mesh | Istio | Feature-rich mesh; complex; Envoy data plane |
| Service mesh | Linkerd | Lightweight Rust mesh; simpler ops |
| Service mesh | Cilium Service Mesh | eBPF-based; fast; merges network + mesh |
| Message broker | Apache Kafka | Durable log; replay; high throughput |
| Message broker | RabbitMQ | Classic AMQP queue; rich routing |
| Message broker | NATS / JetStream | Lightweight pub/sub with optional durability |
| Message broker | AWS SNS + SQS | Managed pub/sub + queue combo |
| Observability | Prometheus + Grafana | Metrics + dashboards; OSS standard |
| Observability | OpenTelemetry | Vendor-neutral instrumentation SDK |
| Observability | Jaeger / Tempo | Distributed tracing backends |
| Observability | Loki / ELK | Log aggregation |
| IaC | Terraform / OpenTofu | Declarative cloud infra; HCL |
| IaC | Pulumi | Cloud infra in your real programming language |
| IaC | Crossplane | Manage cloud resources via Kubernetes APIs |
| CI/CD | GitHub Actions | Built-in CI for GitHub repos |
| CI/CD | GitLab CI | Pipelines tied to GitLab |
| CI/CD | Argo CD | GitOps for Kubernetes deploys |
| CI/CD | Flux | Lightweight GitOps controller |
| Secrets | HashiCorp Vault | Centralized secrets, dynamic credentials |
| Secrets | AWS / GCP / Azure Secrets Manager | Cloud-native secret store |
| Secrets | External Secrets Operator | Sync external stores into Kubernetes |
| API gateway | Kong / Envoy / Traefik | L7 ingress, auth, rate limit |
| API gateway | AWS API Gateway | Managed gateway; integrates with Lambda |
| API gateway | Apigee / MuleSoft | Enterprise API management |
| Schema / contracts | Protobuf + Buf | Schema-first; lint and breaking-change check |
| Schema / contracts | OpenAPI + Spectral | HTTP API spec + linting |
| Chaos | LitmusChaos / Chaos Mesh | Kubernetes-native chaos experiments |
| Chaos | Gremlin | SaaS chaos engineering platform |

When NOT to Use Microservices

Microservices are an answer to a problem of scale — team scale, traffic scale, change scale. Without that problem, they are pure cost.

Start here instead