Microservices Cheatsheet | LIZIU Microservices

Service Decomposition Heuristics

Map a real-world signal to a decomposition decision. If two signals point in opposite directions, the stronger one usually wins.

Signal	Observation	Decision
Team size	One team owns it end-to-end (5–9 people)	One service per team; split when team splits
Change frequency	Two areas change at very different rates	Split — the slow part shouldn’t block the fast part
Scaling needs	One workload is 10x heavier than the rest	Split — scale the hot path independently
Data ownership	One bounded context owns the entity	That context owns the service and its DB
Failure isolation	Outage in area X must not affect area Y	Split — separate process, separate deploy
Tech stack	Workload demands a different language/runtime	Split, but only if the cost is justified
Compliance	PCI / HIPAA / GDPR scope is local	Split — shrink the audit blast radius
Transactional coupling	Two entities require strong ACID across them	Keep together — don’t fight the database
Latency budget	Caller needs <5 ms and you cross the network	Keep together or co-locate
Reuse	Many services need the same logic	Library first; service only if it has its own data

Communication Patterns

Pick the pattern by the answer the caller needs and how soon it needs it.

Pattern	Use Case	Tradeoff
Sync REST / JSON	External API, simple CRUD, browser clients	Easy to debug; chatty; tight coupling
Sync gRPC	Internal service-to-service, low latency, polyglot	Faster + typed; harder to debug; needs HTTP/2
GraphQL (BFF)	Multiple clients with different field needs	Flexible queries; N+1, caching, and auth get harder
Async event (pub/sub)	Broadcast facts: order placed, user signed up	Loose coupling; eventual consistency; harder to trace
Async queue (work)	Hand off a task to be processed later	Smooths spikes; needs idempotency + DLQ
Request/reply over queue	Async with a correlation ID for the response	Decouples uptime; latency is unpredictable
Streaming (Kafka, Kinesis)	Ordered, replayable log of events	Powerful; ops-heavy; consumer state is your problem
WebSocket / SSE	Server pushes to browser/client	Real-time; stateful connections complicate scaling

Quick rule

If the caller needs an answer to continue → sync.
If the caller is announcing a fact → async event.
If the caller is requesting work → async queue with idempotency key.
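
The third rule can be sketched as below. This is a minimal illustration, not a production design: it assumes an in-memory dict as the dedupe store, where a real service would more likely use a durable store with a TTL. The names `IdempotentWorker`, `charge_card`, and the key `order-42` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentWorker:
    """Runs the handler once per idempotency key and replays the stored result afterwards."""
    handler: Callable[[dict], str]
    # Assumption: in-memory store for illustration only; it is lost on restart.
    _results: Dict[str, str] = field(default_factory=dict)

    def process(self, idempotency_key: str, payload: dict) -> str:
        # A redelivered or retried message gets the stored result back
        # instead of triggering the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self.handler(payload)
        self._results[idempotency_key] = result
        return result


charges = []  # records every real side effect, so duplicates would be visible


def charge_card(payload: dict) -> str:
    charges.append(payload["amount"])
    return f"charged {payload['amount']}"


worker = IdempotentWorker(handler=charge_card)
first = worker.process("order-42", {"amount": 100})
second = worker.process("order-42", {"amount": 100})  # duplicate delivery of the same task
```

Both calls return the same result, and `charges` holds a single entry, which is the property a queue consumer needs when the broker may deliver a message more than once.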

Resilience Patterns Cheatsheet

Compose them in order: bulkhead → circuit breaker → timeout → retry → fallback. See Circuit Breaker & Resilience for full implementations.

Pattern	What It Does	Typical Config	More
Timeout	Caps how long a single call can run	P99 of dependency × 1.5 (e.g. 500 ms–3 s)	composition
Retry + backoff	Re-attempts transient failures with jitter	3–5 attempts, base 1 s, full jitter	retries
Circuit breaker	Fails fast when a dependency is sick	50% failure rate over last 20 calls, 30 s cooldown	breakers
Bulkhead	Caps concurrent calls per dependency	5–20 in-flight + small queue, then shed	bulkheads
Fallback	Returns degraded response on failure	Cached value, default, or 503-fast	composition
Rate limit	Caps inbound calls per client/route	Token bucket; per-key quotas	API gateway
Load shedding	Rejects new work when overloaded	Drop low-priority requests first	resilience
Idempotency key	Lets clients safely retry writes	Server dedupes by key for 24 h	retries
Chaos test	Verifies the patterns actually work	Latency + error injection in staging, then prod	chaos
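
Two rows of the table can be sketched in a few lines: retry delays with full jitter, and a count-based circuit breaker using the example configuration (50% failures over the last 20 calls, 30 s cooldown). This is a simplified sketch under stated assumptions: the names `full_jitter_delay` and `CircuitBreaker` are hypothetical, the 30 s delay cap is an assumption not taken from the table, the breaker is not thread-safe, and it does not limit how many trial calls pass once the cooldown has elapsed.

```python
import random
import time
from collections import deque


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Opens when the failure rate over the last `window` calls reaches `threshold`."""

    def __init__(self, window=20, threshold=0.5, cooldown=30.0, clock=time.monotonic):
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self.opened_at = None       # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let trial calls through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, failed: bool) -> None:
        if self.opened_at is not None:
            # Result of a trial call: close on success, restart the cooldown on failure.
            if failed:
                self.opened_at = self.clock()
            else:
                self.opened_at = None
                self.outcomes.clear()
            return
        self.outcomes.append(failed)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and sum(self.outcomes) / len(self.outcomes) >= self.threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each call, fail fast when it returns `False`, and pass the outcome to `record()` afterwards. Waiting for a full window before evaluating the rate avoids opening the breaker on the first one or two failures.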

Data Patterns

The hard problem in microservices is data. Pick the lightest pattern that solves your case.

Pattern	Use Case	Cost
Database-per-service	Default. Each service owns its schema and storage.	No cross-service joins; needs API/event for shared data
Shared database (avoid)	Legacy or short-term migration only	Tight coupling; one schema change breaks N services
Saga (choreography)	Multi-service workflow via events	Hard to visualize; emergent flow
Saga (orchestration)	Multi-service workflow with a coordinator	Single point to debug; coordinator is critical
Outbox	Atomic DB write + reliable event publish	Extra table + relay; at-least-once delivery, so consumers must dedupe
CQRS	Read load >> write load, or different shapes	Two models to maintain; eventual consistency
Event sourcing	Audit, time-travel, replayable history	Schema evolution is hard; rebuilding projections is expensive
Materialized view	Read-only join of data from many services	Stale by design; needs invalidation strategy
Change Data Capture	Stream DB changes to other systems	Couples consumers to internal schema
2-phase commit (avoid)	Strong cross-service ACID	Blocking; brittle; almost always the wrong answer
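
The outbox row is the least obvious one in the table, so here is a minimal sketch of it using the standard-library `sqlite3` module. The table names, the `order.placed` topic, and the functions `place_order` and `relay_once` are hypothetical, and a real relay would normally run as a separate polling process or be driven by change data capture.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: int, total: int) -> None:
    # One local transaction: the business row and the event row
    # commit together or roll back together.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_once(publish) -> int:
    """Publish pending events in insertion order; mark each row only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []  # stand-in for a message broker
place_order(1, 250)
relay_once(lambda topic, event: sent.append((topic, event)))
```

If the relay crashes after `publish` but before the `UPDATE`, the event is sent again on the next pass. The pattern therefore gives at-least-once delivery, and consumers are expected to deduplicate.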

API Versioning Quick Reference

Pick one strategy and apply it everywhere — mixing versioning styles within an org is the real failure mode.

Strategy	Example	Pros	Cons	Used By
URL path	`/v2/orders`	Visible, cache-friendly, easy to route	URL changes per version; clutter	Twilio, GitHub
Query param	`/orders?v=2`	Trivial to add	Easy to forget; bad for caching	Rare in production
Header	`API-Version: 2`	Clean URLs; client opts in	Invisible; tooling support uneven	Azure, some AWS APIs
Media type	`Accept: application/vnd.x.v2+json`	RESTful; versions per representation	Verbose; clients forget the header	GitHub (historic)
Date-based header	`Stripe-Version: 2024-06-20`	Explicit pinning; gradual migration	Need a version registry	Stripe
Schema evolution (no version)	Add fields, never remove; tolerant readers	No breakage; cheapest to operate	Schema bloat; hard to deprecate	gRPC + protobuf shops, Kafka

Rules of thumb

Additive changes (new optional field) → no version bump.
Removing or renaming a field → new version + deprecation window (6–12 months).
Send `Sunset` and `Deprecation` response headers on versions being retired, document them, and have clients log or alert when they see them.
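
The first rule (additive changes need no version bump) only holds if clients are tolerant readers. Here is a minimal sketch of one, assuming a hypothetical `Order` payload whose `currency` and `coupon` fields were added after the first release. None of these field names come from a real API.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Order:
    id: str
    total_cents: int
    currency: str = "USD"          # added later; the default keeps old payloads valid
    coupon: Optional[str] = None   # added later; optional


def parse_order(raw: str) -> Order:
    data = json.loads(raw)
    known_names = {f.name for f in fields(Order)}
    # Keep only fields this client knows about; unknown fields are ignored, not rejected.
    known = {name: value for name, value in data.items() if name in known_names}
    return Order(**known)  # a missing required field raises TypeError


old_payload = parse_order('{"id": "o1", "total_cents": 500}')
new_payload = parse_order(
    '{"id": "o2", "total_cents": 700, "currency": "EUR", "gift_wrap": true}'
)
```

The old payload parses with defaults filled in, and the newer payload parses even though this client has never heard of `gift_wrap`. Removing or renaming `total_cents` would break the reader, which is why that kind of change needs a new version.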

Deployment Strategies Quick Reference

Strategy	How It Works	Rollback Speed	Infra Cost	Blast Radius
Recreate	Stop old, start new	Slow (redeploy old)	1x	100% — downtime
Rolling	Replace pods in batches	Minutes (rolling back)	~1x (small surge)	Gradual; mixed versions briefly
Blue-green	Deploy new color, switch traffic	Seconds (flip router)	2x during cutover	0% until cutover, then 100% at once
Canary	Small slice of traffic to new version	Seconds (drop canary)	~1.05x	1–5% of users
Progressive delivery	Canary + auto-promotion on SLO	Seconds (auto-rollback)	~1.05x + tooling	Bounded by SLO gates
Shadow / mirror	Duplicate prod traffic to new version, ignore response	Instant (no users)	2x (briefly)	0% — users untouched
Feature flag	New code shipped dark, toggled per cohort	Instant (flip flag)	1x	Whatever the flag scope is
A/B test	Two variants split for measurement	Seconds (kill variant)	~1.05x	Defined experiment cohort
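
Canary, progressive delivery, and cohort-based flags all need a stable way to decide which users see the new version. One common approach, sketched here under assumptions, is to hash a user ID into a fixed number of buckets. The function names and the choice of 10,000 buckets are hypothetical, and real routers usually make this decision in the mesh or the gateway rather than in application code.

```python
import hashlib


def bucket(user_id: str, buckets: int = 10_000) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: float) -> str:
    """Return 'canary' for roughly canary_percent of users, 'stable' for the rest."""
    # 10,000 buckets means one bucket is 0.01%, so 5% covers buckets 0..499.
    return "canary" if bucket(user_id) < canary_percent * 100 else "stable"
```

Because the bucket depends only on the user ID, a user stays on the same version across requests, and raising the percentage from 5 to 20 only adds users to the canary without moving anyone back to stable.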

Observability Pillars

Three signals; each answers a different question. You need all three.

	Logs	Metrics	Traces
Question	What happened in this request?	How is the system behaving over time?	Where did this request spend its time?
Cardinality	High — raw event stream	Low — aggregated counters/gauges	High — one tree per request (sampled)
Format	Structured JSON, one line per event	OpenMetrics / Prometheus	OpenTelemetry / W3C Trace Context
Tools	Loki, ELK, Splunk, Datadog Logs	Prometheus, Datadog, CloudWatch, Grafana	Jaeger, Tempo, Honeycomb, Lightstep
Alert on	Error patterns, security events	SLO burn, saturation, error rate	Latency spikes in a specific span
Cost driver	Volume (GB/day)	Cardinality (label combos)	Sample rate × span count
Retention	7–30 days hot	13–15 months	3–7 days hot, longer for slow traces

Golden signals to alert on (Google SRE)

Latency — P50, P95, P99 of successful requests.
Traffic — requests per second by route.
Errors — rate of 5xx and explicit failures.
Saturation — how close to capacity (CPU, memory, queue depth, pool usage).
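
To make the first three signals concrete, the sketch below computes them from a window of raw request records. It is illustrative only: in practice these numbers usually come from histograms in the metrics system rather than from raw samples. The `summarize` function and its input shape are assumptions, and "successful" is taken here to mean any status below 500.

```python
def percentile(values, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(requests, window_seconds: float) -> dict:
    """requests: list of (latency_ms, status_code) tuples observed during the window."""
    ok_latencies = [ms for ms, status in requests if status < 500]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "p50_ms": percentile(ok_latencies, 50),
        "p95_ms": percentile(ok_latencies, 95),
        "p99_ms": percentile(ok_latencies, 99),
        "rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


# 100 successful requests with latencies 1..100 ms, plus 25 slow failures, over 25 s.
sample = [(ms, 200) for ms in range(1, 101)] + [(5000, 500)] * 25
signals = summarize(sample, window_seconds=25)
```

Excluding the failed requests from the latency percentiles keeps the slow 5xx responses from distorting the picture of what successful callers experience. The error rate captures the failures separately.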

Security Quick Reference

Concern	Standard	Tool / Implementation
End-user auth	OIDC, OAuth 2.1	Auth0, Okta, Cognito, Keycloak
Service-to-service auth	mTLS, SPIFFE / SPIRE, signed JWT	Istio, Linkerd, cert-manager
Authorization	RBAC, ABAC, ReBAC	OPA / Rego, Cedar, Zanzibar / SpiceDB
API keys / tokens	Short-lived JWTs, rotated keys	Gateway-issued tokens, KMS-signed
Secrets at rest	Sealed secrets, envelope encryption	Vault, AWS Secrets Manager, SOPS, Sealed Secrets
Transport encryption	TLS 1.3 everywhere	cert-manager + Let’s Encrypt; mesh-managed certs
Network segmentation	Zero trust, default-deny	NetworkPolicy, Cilium, mesh authz
Supply chain	SLSA, SBOM, signed artifacts	Sigstore (cosign), Syft, Grype, Trivy
Image scanning	CVE scan in CI	Trivy, Grype, Snyk, Wiz
Runtime threat detection	eBPF-based runtime security	Falco, Tetragon, Aqua
Audit log	Immutable, append-only	SIEM (Splunk, Datadog, Sumo)
WAF / bot defense	L7 inspection at edge	Cloudflare, AWS WAF, Akamai

Three rules

Rotate everything: keys, certs, tokens, passwords. Automate it.
Default deny at the network and the policy layer. Allow narrowly.
The build pipeline is part of the attack surface — sign artifacts and verify on deploy.
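
The second rule can be reduced to one property: anything not explicitly listed is rejected. Below is a minimal sketch of that idea at the policy layer, with hypothetical service names and a hard-coded allow-list standing in for what would normally live in a policy engine or mesh authorization policy.

```python
from typing import FrozenSet, Tuple

# Hypothetical allow-list of (caller service, target service, HTTP method).
ALLOWED: FrozenSet[Tuple[str, str, str]] = frozenset({
    ("checkout", "payments", "POST"),
    ("checkout", "inventory", "GET"),
})


def is_allowed(caller: str, target: str, method: str) -> bool:
    # Default deny: there is no "else allow" branch, so an unknown
    # caller, target, or method is rejected without any extra rule.
    return (caller, target, method.upper()) in ALLOWED
```

A new service starts with no access at all, and every grant is a visible, reviewable line in the list.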

Common Anti-Patterns

If you recognize one of these in your stack, fix it before adding more services.

Distributed monolith — many services that must deploy together. You got the network calls without the independence.
Shared database — multiple services hitting the same schema. Schema changes cascade; ownership is undefined.
Chatty services — one user action triggers 30 internal hops. Latency dies of a thousand cuts.
Synchronous chains — A calls B calls C calls D, all blocking. Availability is the product of all four.
No idempotency — retries cause double-charges, double-shipments, double-emails.
Hand-rolled service discovery — YAML lists of hosts. Use DNS, the platform, or a mesh.
Hand-rolled retries — ad-hoc loops with no jitter, no budget, no breaker. Causes retry storms.
God service / data hub — one service that knows everything; everyone depends on it.
Nano-services — a service per function. More glue than logic; ops cost dwarfs feature work.
Versionless events — producers change payloads silently; consumers break.
Cross-service transactions — trying to make a saga look like one ACID transaction. Embrace the asynchrony.
Logs without correlation IDs — you cannot debug a distributed request from one service’s logs alone.
Snowflake environments — staging differs from prod “just a little.” That little is where outages live.
Manual deploys — humans clicking through pipelines. Automate or accept the outages.
One repo, no boundaries — monorepo with no module ownership; every change touches every service.
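
The "product of all four" claim in the synchronous-chains item is easy to check with a short calculation. The sketch assumes the services fail independently and each offers 99.9% availability; both numbers and the function names are illustrative.

```python
from math import prod


def chain_availability(availabilities) -> float:
    """Availability of a blocking call chain, assuming independent failures."""
    return prod(availabilities)


def downtime_minutes_per_30_days(availability: float) -> float:
    return (1 - availability) * 30 * 24 * 60


single = 0.999
chain = chain_availability([single] * 4)          # about 0.996
single_downtime = downtime_minutes_per_30_days(single)  # about 43 minutes
chain_downtime = downtime_minutes_per_30_days(chain)    # about 173 minutes
```

Four services at "three nines" each yield a chain at roughly 99.6%, or about four times the monthly downtime of any single service.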

Tooling Cheatsheet

Pick one per category and standardize. Two tools competing for the same job is the most expensive choice.

Category	Tool	One-Liner
Service mesh	Istio	Feature-rich mesh; complex; Envoy data plane
	Linkerd	Lightweight mesh with a Rust data-plane proxy; simpler ops
	Cilium Service Mesh	eBPF-based; fast; merges network + mesh
Message broker	Apache Kafka	Durable log; replay; high throughput
	RabbitMQ	Classic AMQP queue; rich routing
	NATS / JetStream	Lightweight pub/sub with optional durability
	AWS SNS + SQS	Managed pub/sub + queue combo
Observability	Prometheus + Grafana	Metrics + dashboards; OSS standard
	OpenTelemetry	Vendor-neutral instrumentation SDK
	Jaeger / Tempo	Distributed tracing backends
	Loki / ELK	Log aggregation
IaC	Terraform / OpenTofu	Declarative cloud infra; HCL
	Pulumi	Cloud infra in your real programming language
	Crossplane	Manage cloud resources via Kubernetes APIs
CI/CD	GitHub Actions	Built-in CI for GitHub repos
	GitLab CI	Pipelines tied to GitLab
	Argo CD	GitOps for Kubernetes deploys
	Flux	Lightweight GitOps controller
Secrets	HashiCorp Vault	Centralized secrets, dynamic credentials
	AWS / GCP / Azure Secrets Manager	Cloud-native secret store
	External Secrets Operator	Sync external stores into Kubernetes
API gateway	Kong / Envoy / Traefik	L7 ingress, auth, rate limit
	AWS API Gateway	Managed gateway; integrates with Lambda
	Apigee / MuleSoft	Enterprise API management
Schema / contracts	Protobuf + Buf	Schema-first; lint and breaking-change check
	OpenAPI + Spectral	HTTP API spec + linting
Chaos	LitmusChaos / Chaos Mesh	Kubernetes-native chaos experiments
	Gremlin	SaaS chaos engineering platform

When NOT to Use Microservices

Microservices are an answer to a problem of scale — team scale, traffic scale, change scale. Without that problem, they are pure cost.

Small team (<10 engineers). You don’t have enough people to staff the platform work. Stay monolithic.
Greenfield product, unclear domain. You don’t know your bounded contexts yet. Splitting too early calcifies the wrong seams.
No production observability. Running microservices without tracing and centralized logs is debugging blindfolded.
No CI/CD. If deploys are manual, multiplying services multiplies pain.
Strong cross-entity transactional needs. If everything must be ACID across N entities, the database wants to keep them together.
Latency-critical inner loop. Sub-5 ms paths can’t afford a network hop per call.
Predictable, low-traffic workload. One box runs your monolith forever and that’s fine.
No platform team / no plan to build one. Without paved roads, every service team reinvents the wheel.
Org cannot tolerate eventual consistency. Some businesses really do need “the answer is correct right now” everywhere.
You’re trying to fix a code-quality problem. A bad monolith becomes a bad distributed monolith. Refactor first.

Start here instead

Modular monolith with clear internal module boundaries.
One database, multiple schemas owned by different modules.
Extract the first service only when a real signal (team split, scaling, compliance) forces it.