Service Decomposition Heuristics
Map a real-world signal to a decomposition decision. If two signals point opposite ways, the louder one usually wins.
| Signal | Observation | Decision |
|---|---|---|
| Team size | One team owns it end-to-end (5–9 people) | One service per team; split when team splits |
| Change frequency | Two areas change at very different rates | Split — the slow part shouldn’t block the fast part |
| Scaling needs | One workload is 10x heavier than the rest | Split — scale the hot path independently |
| Data ownership | One bounded context owns the entity | That context owns the service and its DB |
| Failure isolation | Outage in area X must not affect area Y | Split — separate process, separate deploy |
| Tech stack | Workload demands a different language/runtime | Split, but only if the cost is justified |
| Compliance | PCI / HIPAA / GDPR scope is local | Split — shrink the audit blast radius |
| Transactional coupling | Two entities require strong ACID across them | Keep together — don’t fight the database |
| Latency budget | Caller needs <5 ms and you cross the network | Keep together or co-locate |
| Reuse | Many services need the same logic | Library first; service only if it has its own data |
Communication Patterns
Pick the pattern by the answer the caller needs and how soon it needs it.
| Pattern | Use Case | Tradeoff |
|---|---|---|
| Sync REST / JSON | External API, simple CRUD, browser clients | Easy to debug; chatty; tight coupling |
| Sync gRPC | Internal service-to-service, low latency, polyglot | Faster + typed; harder to debug; needs HTTP/2 |
| GraphQL (BFF) | Multiple clients with different field needs | Flexible queries; N+1, caching, and auth get harder |
| Async event (pub/sub) | Broadcast facts: order placed, user signed up | Loose coupling; eventual consistency; harder to trace |
| Async queue (work) | Hand off a task to be processed later | Smooths spikes; needs idempotency + DLQ |
| Request/reply over queue | Async with a correlation ID for the response | Decouples uptime; latency is unpredictable |
| Streaming (Kafka, Kinesis) | Ordered, replayable log of events | Powerful; ops-heavy; consumer state is your problem |
| WebSocket / SSE | Server pushes to browser/client | Real-time; stateful connections complicate scaling |
Quick rule
- If the caller needs an answer to continue → sync.
- If the caller is announcing a fact → async event.
- If the caller is requesting work → async queue with idempotency key.
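The third rule can be sketched as a consumer that dedupes by idempotency key. This is a minimal in-memory sketch, not a production design: `IdempotentConsumer`, `charge_card`, and the message shape are hypothetical names chosen for illustration, and a real service would keep the key store in durable storage with a TTL.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IdempotentConsumer:
    """Runs the handler at most once per idempotency key (in-memory sketch)."""
    handler: Callable[[dict], str]
    # Hypothetical store: key -> result of the first successful run.
    _results: dict[str, str] = field(default_factory=dict)

    def handle(self, message: dict) -> str:
        key = message["idempotency_key"]
        if key in self._results:
            # Redelivery or client retry: return the stored result, do no work.
            return self._results[key]
        result = self.handler(message)
        self._results[key] = result
        return result


charges: list[int] = []


def charge_card(message: dict) -> str:
    """Hypothetical side effect that must not run twice."""
    charges.append(message["amount_cents"])
    return f"charged {message['amount_cents']}"


consumer = IdempotentConsumer(handler=charge_card)
msg = {"idempotency_key": "order-42", "amount_cents": 1999}
first = consumer.handle(msg)
second = consumer.handle(msg)  # same message delivered again
```

The second delivery returns the stored result and `charges` still holds a single entry.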
Resilience Patterns Cheatsheet
Compose them in order: bulkhead → circuit breaker → timeout → retry → fallback. See Circuit Breaker & Resilience for full implementations.
| Pattern | What It Does | Typical Config | More |
|---|---|---|---|
| Timeout | Caps how long a single call can run | P99 of dependency × 1.5 (e.g. 500 ms–3 s) | composition |
| Retry + backoff | Re-attempts transient failures with jitter | 3–5 attempts, base 1 s, full jitter | retries |
| Circuit breaker | Fails fast when a dependency is sick | 50% failure rate over last 20 calls, 30 s cooldown | breakers |
| Bulkhead | Caps concurrent calls per dependency | 5–20 in-flight + small queue, then shed | bulkheads |
| Fallback | Returns degraded response on failure | Cached value, default, or 503-fast | composition |
| Rate limit | Caps inbound calls per client/route | Token bucket; per-key quotas | API gateway |
| Load shedding | Rejects new work when overloaded | Drop low-priority requests first | resilience |
| Idempotency key | Lets clients safely retry writes | Server dedupes by key for 24 h | retries |
| Chaos test | Verifies the patterns actually work | Latency + error injection in staging, then prod | chaos |
Data Patterns
The hard problem in microservices is data. Pick the lightest pattern that solves your case.
| Pattern | Use Case | Cost |
|---|---|---|
| Database-per-service | Default. Each service owns its schema and storage. | No cross-service joins; needs API/event for shared data |
| Shared database (avoid) | Legacy or short-term migration only | Tight coupling; one schema change breaks N services |
| Saga (choreography) | Multi-service workflow via events | Hard to visualize; emergent flow |
| Saga (orchestration) | Multi-service workflow with a coordinator | Single point to debug; coordinator is critical |
| Outbox | Atomic DB write + reliable event publish | Extra table + relay; at-least-once delivery, so consumers must dedupe |
| CQRS | Read load >> write load, or different shapes | Two models to maintain; eventual consistency |
| Event sourcing | Audit, time-travel, replayable history | Schema evolution is hard; rebuilding projections is expensive |
| Materialized view | Read-only join of data from many services | Stale by design; needs invalidation strategy |
| Change Data Capture | Stream DB changes to other systems | Couples consumers to internal schema |
| 2-phase commit (avoid) | Strong cross-service ACID | Blocking; brittle; almost always the wrong answer |
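As an illustration of the Outbox row, here is a minimal sketch using SQLite from the standard library. The table names, the `order.placed` topic, and the `place_order` / `relay_once` helpers are assumptions for this example. A real relay would run as a separate process or be driven by CDC.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: int, total_cents: int) -> None:
    # One local transaction: the order row and the event row commit together
    # or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(publish) -> int:
    """Publish pending rows in insertion order; mark each only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent: list[tuple[str, dict]] = []  # stand-in for a real broker client
place_order(1, 1999)
relay_once(lambda topic, event: sent.append((topic, event)))
```

If the relay crashes after publishing but before marking the row, the event is published again on the next pass, so consumers should tolerate duplicates.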
API Versioning Quick Reference
Pick one strategy and apply it everywhere — mixing versioning styles within an org is the real failure mode.
| Strategy | Example | Pros | Cons | Used By |
|---|---|---|---|---|
| URL path | /v2/orders | Visible, cache-friendly, easy to route | URL changes per version; clutter | Twilio, GitHub |
| Query param | /orders?v=2 | Trivial to add | Easy to forget; bad for caching | Azure REST APIs (`api-version`) |
| Header | API-Version: 2 | Clean URLs; client opts in | Invisible; tooling support uneven | Azure, some AWS APIs |
| Media type | Accept: application/vnd.x.v2+json | RESTful; versions per representation | Verbose; clients forget the header | GitHub (historic) |
| Date-based header | Stripe-Version: 2024-06-20 | Explicit pinning; gradual migration | Need a version registry | Stripe |
| Schema evolution (no version) | Add fields, never remove; tolerant readers | No breakage; cheapest to operate | Schema bloat; hard to deprecate | gRPC + protobuf shops, Kafka |
Rules of thumb
- Additive changes (new optional field) → no version bump.
- Removing or renaming a field → new version + deprecation window (6–12 months).
- Document Sunset/Deprecation headers; honor them in clients.
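The first rule depends on consumers being tolerant readers. A minimal sketch, assuming a hypothetical `OrderV1` payload: unknown fields are ignored and a newly added optional field has a default, so old and new producers both parse.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderV1:
    """Hypothetical payload shape, used only for illustration."""
    id: str
    total_cents: int
    currency: str = "USD"  # optional field added later; default keeps old payloads valid


def parse_order(raw: str) -> OrderV1:
    data = json.loads(raw)
    known = {f.name for f in fields(OrderV1)}
    # Tolerant reader: drop fields this consumer does not know about.
    return OrderV1(**{k: v for k, v in data.items() if k in known})


old_payload = '{"id": "o-1", "total_cents": 1999}'
new_payload = '{"id": "o-2", "total_cents": 500, "currency": "EUR", "gift_note": "hi"}'

old_order = parse_order(old_payload)
new_order = parse_order(new_payload)
```

Removing or renaming `total_cents` would break this reader, which is why that kind of change needs a new version.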
Deployment Strategies Quick Reference
| Strategy | How It Works | Rollback Speed | Infra Cost | Blast Radius |
|---|---|---|---|---|
| Recreate | Stop old, start new | Slow (redeploy old) | 1x | 100% — downtime |
| Rolling | Replace pods in batches | Minutes (rolling back) | ~1x (small surge) | Gradual; mixed versions briefly |
| Blue-green | Deploy new color, switch traffic | Seconds (flip router) | 2x during cutover | 0% until the switch, then 100% at once |
| Canary | Small slice of traffic to new version | Seconds (drop canary) | ~1.05x | 1–5% of users |
| Progressive delivery | Canary + auto-promotion on SLO | Seconds (auto-rollback) | ~1.05x + tooling | Bounded by SLO gates |
| Shadow / mirror | Duplicate prod traffic to new version, ignore response | Instant (no users) | 2x (briefly) | 0% — users untouched |
| Feature flag | New code shipped dark, toggled per cohort | Instant (flip flag) | 1x | Whatever the flag scope is |
| A/B test | Two variants split for measurement | Seconds (kill variant) | ~1.05x | Defined experiment cohort |
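Canary, feature-flag, and A/B rows all need a stable way to decide which users land in the small slice. One common approach, sketched here under assumptions, is to hash the user ID into a bucket. The function name `in_canary` and the salt value are hypothetical.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "release-2024-06") -> bool:
    """Deterministically place a user in the canary slice.

    The salt (a hypothetical release identifier) reshuffles buckets per release
    so the same users are not always the first to see new code.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, i.e. 0.01% steps
    return bucket < percent * 100


users = [f"user-{i}" for i in range(10_000)]
canary_users = [u for u in users if in_canary(u, 5.0)]
```

Because the bucket depends only on salt and user ID, a user stays on the same side across requests, and raising the percentage only adds users to the slice.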
Observability Pillars
Three signals; each answers a different question. You need all three.
| | Logs | Metrics | Traces |
|---|---|---|---|
| Question | What happened in this request? | How is the system behaving over time? | Where did this request spend its time? |
| Cardinality | High — raw event stream | Low — aggregated counters/gauges | High — one tree per request (sampled) |
| Format | Structured JSON, one line per event | OpenMetrics / Prometheus | OpenTelemetry / W3C Trace Context |
| Tools | Loki, ELK, Splunk, Datadog Logs | Prometheus, Datadog, CloudWatch, Grafana | Jaeger, Tempo, Honeycomb, Lightstep |
| Alert on | Error patterns, security events | SLO burn, saturation, error rate | Latency spikes in a specific span |
| Cost driver | Volume (GB/day) | Cardinality (label combos) | Sample rate × span count |
| Retention | 7–30 days hot | 13–15 months | 3–7 days hot, longer for slow traces |
Golden signals to alert on (Google SRE)
- Latency — P50, P95, P99 of successful requests.
- Traffic — requests per second by route.
- Errors — rate of 5xx and explicit failures.
- Saturation — how close to capacity (CPU, memory, queue depth, pool usage).
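To make the latency, traffic, and error signals concrete, here is a small sketch that computes them from one window of raw request records. The window data is made up for illustration, and the nearest-rank percentile is one of several valid definitions. Metrics backends typically estimate percentiles from histograms instead.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical one-minute window: (latency_ms, http_status) per request.
window = [(float(ms), 200) for ms in range(1, 96)] + [
    (float(ms), 503) for ms in range(96, 101)
]

# Latency is computed over successful requests only, as in the list above.
ok_latencies = [ms for ms, status in window if status < 500]
p50, p95, p99 = (percentile(ok_latencies, p) for p in (50, 95, 99))

error_rate = sum(1 for _, status in window if status >= 500) / len(window)
traffic_rps = len(window) / 60
```

Keeping failed requests out of the latency percentiles stops a burst of slow or fast failures from skewing the picture for successful traffic.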
Security Quick Reference
| Concern | Standard | Tool / Implementation |
|---|---|---|
| End-user auth | OIDC, OAuth 2.0 (2.1 is still an IETF draft) | Auth0, Okta, Cognito, Keycloak |
| Service-to-service auth | mTLS, SPIFFE / SPIRE, signed JWT | Istio, Linkerd, cert-manager |
| Authorization | RBAC, ABAC, ReBAC | OPA / Rego, Cedar, Zanzibar / SpiceDB |
| API keys / tokens | Short-lived JWTs, rotated keys | Gateway-issued tokens, KMS-signed |
| Secrets at rest | Sealed secrets, envelope encryption | Vault, AWS Secrets Manager, SOPS, Sealed Secrets |
| Transport encryption | TLS 1.3 everywhere | cert-manager + Let’s Encrypt; mesh-managed certs |
| Network segmentation | Zero trust, default-deny | NetworkPolicy, Cilium, mesh authz |
| Supply chain | SLSA, SBOM, signed artifacts | Sigstore (cosign), Syft, Grype, Trivy |
| Image scanning | CVE scan in CI | Trivy, Grype, Snyk, Wiz |
| Runtime threat detection | eBPF-based runtime security | Falco, Tetragon, Aqua |
| Audit log | Immutable, append-only | SIEM (Splunk, Datadog, Sumo) |
| WAF / bot defense | L7 inspection at edge | Cloudflare, AWS WAF, Akamai |
Three rules
- Rotate everything: keys, certs, tokens, passwords. Automate it.
- Default deny at the network and the policy layer. Allow narrowly.
- The build pipeline is part of the attack surface — sign artifacts and verify on deploy.
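The default-deny rule can be sketched as an explicit allowlist where anything not listed is refused. The service names and the `Rule` type are hypothetical. In practice this logic lives in NetworkPolicy, mesh authorization, or a policy engine rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One narrowly scoped permission: who may call whom, with which method."""
    source: str
    destination: str
    method: str


# Hypothetical allowlist; anything absent from it is denied.
ALLOW = frozenset({
    Rule("checkout", "payments", "POST"),
    Rule("checkout", "inventory", "GET"),
})


def is_allowed(source: str, destination: str, method: str) -> bool:
    # Default deny: there is no wildcard and no fallback to "allow".
    return Rule(source, destination, method) in ALLOW


decisions = {
    "checkout POST payments": is_allowed("checkout", "payments", "POST"),
    "checkout DELETE payments": is_allowed("checkout", "payments", "DELETE"),
    "search POST payments": is_allowed("search", "payments", "POST"),
}
```

Only the first call is permitted; the other two fall through to deny because no rule names them.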
Common Anti-Patterns
If you recognize one of these in your stack, fix it before adding more services.
- Distributed monolith — many services that must deploy together. You got the network calls without the independence.
- Shared database — multiple services hitting the same schema. Schema changes cascade; ownership is undefined.
- Chatty services — one user action triggers 30 internal hops. Latency dies of a thousand cuts.
- Synchronous chains — A calls B calls C calls D, all blocking. Availability is the product of all four.
- No idempotency — retries cause double-charges, double-shipments, double-emails.
- Hand-rolled service discovery — YAML lists of hosts. Use DNS, the platform, or a mesh.
- Hand-rolled retries — ad-hoc loops with no jitter, no budget, no breaker. Causes retry storms.
- God service / data hub — one service that knows everything; everyone depends on it.
- Nano-services — a service per function. More glue than logic; ops cost dwarfs feature work.
- Versionless events — producers change payloads silently; consumers break.
- Cross-service transactions — trying to make a saga look like one ACID transaction. Embrace the asynchrony.
- Logs without correlation IDs — you cannot debug a distributed request from one service’s logs alone.
- Snowflake environments — staging differs from prod “just a little.” That little is where outages live.
- Manual deploys — humans clicking through pipelines. Automate or accept the outages.
- One repo, no boundaries — monorepo with no module ownership; every change touches every service.
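The "synchronous chains" item is worth working through with numbers. A quick sketch, assuming each service is 99.9% available and failures are independent (real failures are often correlated, so treat this as an approximation):

```python
import math


def chain_availability(availabilities: list[float]) -> float:
    """Availability of a blocking call chain: every hop must be up at once."""
    return math.prod(availabilities)


MINUTES_PER_30_DAYS = 30 * 24 * 60

single = 0.999                           # assumed availability of each service
chain = chain_availability([single] * 4)  # A calls B calls C calls D

downtime_single = (1 - single) * MINUTES_PER_30_DAYS
downtime_chain = (1 - chain) * MINUTES_PER_30_DAYS
```

Four services at 99.9% give a chain at about 99.6%: roughly 173 minutes of downtime per 30 days instead of about 43.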
Tooling Cheatsheet
Pick one per category and standardize. Two tools competing for the same job is the most expensive choice.
| Category | Tool | One-Liner |
|---|---|---|
| Service mesh | Istio | Feature-rich mesh; complex; Envoy data plane |
| Service mesh | Linkerd | Lightweight Rust mesh; simpler ops |
| Service mesh | Cilium Service Mesh | eBPF-based; fast; merges network + mesh |
| Message broker | Apache Kafka | Durable log; replay; high throughput |
| Message broker | RabbitMQ | Classic AMQP queue; rich routing |
| Message broker | NATS / JetStream | Lightweight pub/sub with optional durability |
| Message broker | AWS SNS + SQS | Managed pub/sub + queue combo |
| Observability | Prometheus + Grafana | Metrics + dashboards; OSS standard |
| Observability | OpenTelemetry | Vendor-neutral instrumentation SDK |
| Observability | Jaeger / Tempo | Distributed tracing backends |
| Observability | Loki / ELK | Log aggregation |
| IaC | Terraform / OpenTofu | Declarative cloud infra; HCL |
| IaC | Pulumi | Cloud infra in your real programming language |
| IaC | Crossplane | Manage cloud resources via Kubernetes APIs |
| CI/CD | GitHub Actions | Built-in CI for GitHub repos |
| CI/CD | GitLab CI | Pipelines tied to GitLab |
| CI/CD | Argo CD | GitOps for Kubernetes deploys |
| CI/CD | Flux | Lightweight GitOps controller |
| Secrets | HashiCorp Vault | Centralized secrets, dynamic credentials |
| Secrets | AWS / GCP / Azure Secrets Manager | Cloud-native secret store |
| Secrets | External Secrets Operator | Sync external stores into Kubernetes |
| API gateway | Kong / Envoy / Traefik | L7 ingress, auth, rate limit |
| API gateway | AWS API Gateway | Managed gateway; integrates with Lambda |
| API gateway | Apigee / MuleSoft | Enterprise API management |
| Schema / contracts | Protobuf + Buf | Schema-first; lint and breaking-change check |
| Schema / contracts | OpenAPI + Spectral | HTTP API spec + linting |
| Chaos | LitmusChaos / Chaos Mesh | Kubernetes-native chaos experiments |
| Chaos | Gremlin | SaaS chaos engineering platform |
When NOT to Use Microservices
Microservices are an answer to a problem of scale — team scale, traffic scale, change scale. Without that problem, they are pure cost.
- Small team (<10 engineers). You don’t have enough people to staff the platform work. Stay monolithic.
- Greenfield product, unclear domain. You don’t know your bounded contexts yet. Splitting too early calcifies the wrong seams.
- No production observability. Running microservices without tracing and centralized logs is debugging blindfolded.
- No CI/CD. If deploys are manual, multiplying services multiplies pain.
- Strong cross-entity transactional needs. If everything must be ACID across N entities, the database wants to keep them together.
- Latency-critical inner loop. Sub-5 ms paths can’t afford a network hop per call.
- Predictable, low-traffic workload. One box runs your monolith forever and that’s fine.
- No platform team / no plan to build one. Without paved roads, every service team reinvents the wheel.
- Org cannot tolerate eventual consistency. Some businesses really do need “the answer is correct right now” everywhere.
- You’re trying to fix a code-quality problem. A bad monolith becomes a bad distributed monolith. Refactor first.
Start here instead
- Modular monolith with clear internal module boundaries.
- One database, multiple schemas owned by different modules.
- Extract the first service only when a real signal (team split, scaling, compliance) forces it.