Why Communication Patterns Matter
Why Service Communication Matters
The Problem: In a monolith, “calling another module” is a function call — nanoseconds, no failure modes you didn’t already model. The moment you split into services, every internal call becomes a network round-trip with its own latency, its own failure modes, and its own contract version skew.
The Solution: Pick the protocol intentionally per call. REST when humans and partners read your API. gRPC when latency and strict typing matter. GraphQL when many client shapes hit the same data. Async messaging when the caller doesn’t need (or shouldn’t need) a synchronous answer.
Real Impact: The wrong choice metastasizes. A REST call where an event would have been right turns into a sync cascade that takes the whole site down when one downstream stalls. An event where a REST call would have been right turns into mysterious eventual-consistency bugs that QA can never reproduce.
Real-World Analogy
Think about how a restaurant kitchen actually communicates:
- Server → line cook = a synchronous call. The server stands at the pass and waits for the plate.
- Server → bar = a queued message. The drink ticket goes up; the server walks away and comes back.
- “Fire table 12” over the headset = a broadcast event. Everyone who needs to act on it hears it.
- Hot line shouting back “heard!” = an explicit acknowledgement. Without it, the order is lost.
A microservice mesh is the same: every interaction is one of these shapes. Picking the wrong shape doesn’t just feel awkward — it produces real, expensive failures during a Saturday-night rush.
This tutorial walks the four shapes that cover essentially everything you’ll build: REST, gRPC, GraphQL, and asynchronous messaging. For each one we’ll cover when to reach for it, what it costs, and the production gotchas that hurt the most.
Synchronous vs Asynchronous
Before the protocol choice, the bigger choice: do you need an answer now, or do you need the work to eventually happen? That’s the only question that matters at this layer.
| Aspect | Synchronous (REST, gRPC, GraphQL) | Asynchronous (Queues, Streams, Events) |
|---|---|---|
| Coupling | Caller and callee both have to be up | Producer publishes; consumer can be down or slow |
| Latency contract | End-to-end latency = sum of every hop | Best-effort; consumer drains the queue at its pace |
| Failure mode | One slow downstream blocks the whole chain | Producer keeps emitting; consumer backlog grows |
| Backpressure | Manual — you build it (timeouts, breakers) | Free — the queue is the buffer |
| Debugging | One trace, one stack — easy | Distributed trace + correlation IDs required |
| Right for | User-facing reads, anything the caller needs an answer to | Background work, fan-out, pipelines, event sourcing |
Reach for synchronous when
- The caller cannot continue without the answer (login, search, cart price).
- The data is small and the latency is tight.
- You need to surface a clean error to the end user (“card declined”).
- The interaction is naturally one-shot — not a pipeline.
Reach for asynchronous when
- The work is “fire and forget” from the caller’s point of view (send email, write to analytics).
- The producer’s rate and the consumer’s rate are different.
- Many services need to react to the same business fact.
- Failure of the consumer must not take down the producer.
The most common mistake is treating these as a strict dichotomy. Most production paths are both: a synchronous request returns “accepted, here’s your tracking ID,” and the heavy work happens asynchronously behind it.
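The "accepted, here’s your tracking ID" shape can be sketched with nothing but the standard library. This is a minimal illustration, assuming an in-process queue stands in for a real broker; the names submit_order, worker, and status are hypothetical.

```python
import queue
import threading
import uuid

jobs: queue.Queue = queue.Queue()   # stand-in for a real broker
status: dict[str, str] = {}         # tracking ID -> state (in-memory, demo only)

def submit_order(payload: dict) -> tuple[dict, int]:
    """Synchronous part: enqueue the work and answer immediately with 202."""
    tracking_id = str(uuid.uuid4())
    status[tracking_id] = "accepted"
    jobs.put((tracking_id, payload))
    return {"tracking_id": tracking_id, "status": "accepted"}, 202

def worker() -> None:
    """Asynchronous part: drain the queue at the consumer's own pace."""
    while True:
        tracking_id, payload = jobs.get()
        # The heavy work (charging, reserving stock, ...) would happen here.
        status[tracking_id] = "done"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

body, code = submit_order({"sku": "SKU-1234", "qty": 2})
jobs.join()  # demo only: block until the worker has finished the job
```

The caller gets its answer as soon as the job is enqueued; a separate status lookup by tracking ID would tell it when the work has actually finished.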
REST: The Workhorse
Why REST Wins by Default
The Problem: Every team has different languages, different tooling, and external partners who’ll never read your .proto file.
The Solution: REST is the lowest-common-denominator protocol that every HTTP client on Earth speaks. Plain JSON over HTTP/1.1 or HTTP/2. Every CDN, every proxy, every browser, every curl works. The cost is verbosity and weak typing — usually a fair trade.
REST is what you should choose when in doubt. It is also what you should choose when partners or third-party clients will use the API — OpenAPI is the lingua franca for documentation, mock servers, and SDK generation.
Resource design that actually scales
- Nouns, not verbs. POST /orders, not POST /createOrder. The HTTP verb is the verb.
- Plural collection, singular item. /orders for the list, /orders/{id} for one.
- Idempotent verbs do what they say. GET, PUT, DELETE are safe to retry; POST is not unless you ship an idempotency key.
- Versioning lives in the URL or the Accept header. /v1/orders is uglier but unambiguous; partners always get it right.
- Pagination is mandatory. Cursor-based for streams, offset-based for tabular; never “just return everything.”
# Flask example: a small but well-shaped REST resource.
from flask import Flask, jsonify, request, abort

app = Flask(__name__)

products = {
    1: {"id": 1, "name": "Laptop", "price": 999.99, "stock": 50},
    2: {"id": 2, "name": "Mouse", "price": 29.99, "stock": 200},
}
# Idempotency-Key -> (body, status). In-memory for this example only;
# a real service would use a shared store with an expiry.
idempotency_store = {}

@app.route("/v1/products", methods=["GET"])
def list_products():
    # Pagination: never return unbounded lists.
    limit = min(request.args.get("limit", 50, type=int), 200)
    cursor = request.args.get("cursor", 0, type=int)
    items = [p for p in products.values() if p["id"] > cursor][:limit]
    next_cursor = items[-1]["id"] if items else None
    return jsonify({"items": items, "next_cursor": next_cursor})

@app.route("/v1/products/<int:product_id>", methods=["GET"])
def get_product(product_id):
    product = products.get(product_id)
    if not product:
        # RFC 7807 problem+json: see the error handling section below.
        return jsonify({
            "type": "https://errors.example.com/not-found",
            "title": "Product not found",
            "status": 404,
            "instance": f"/v1/products/{product_id}",
        }), 404
    return jsonify(product)

@app.route("/v1/products", methods=["POST"])
def create_product():
    # Idempotency key makes POST safe to retry.
    idem = request.headers.get("Idempotency-Key")
    if not idem:
        abort(400, "Idempotency-Key header required")
    if idem in idempotency_store:
        body, status = idempotency_store[idem]
        return jsonify(body), status  # replay the exact same response
    data = request.get_json()
    new_id = max(products, default=0) + 1
    products[new_id] = {**data, "id": new_id}  # server-assigned id wins
    idempotency_store[idem] = (products[new_id], 201)
    return jsonify(products[new_id]), 201
Errors are types, not just status codes
HTTP status codes carry the category (4xx vs 5xx, retryable vs not). The body carries the specifics: which field, which constraint, what to do next. RFC 7807 (application/problem+json), since superseded by RFC 9457 with the same media type, is the de facto standard shape:
{
  "type": "https://errors.example.com/insufficient-stock",
  "title": "Insufficient stock",
  "status": 409,
  "detail": "Requested 50 units of SKU-1234; only 12 available.",
  "instance": "/v1/orders/abc-123",
  "available_stock": 12
}
Stripe’s API is the canonical example to imitate: every error has a stable type, a human-readable message, and structured fields the client can branch on without parsing English.
The OpenAPI contract is your real API
Hand-written API docs lie within the week. An OpenAPI (formerly Swagger) spec is the only documentation that stays honest, because every other artifact — mock servers, SDKs, contract tests, gateway routing — is generated from it. Treat the spec as code: review it in PRs, version it, and break the build when handlers don’t match it.
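One way to “break the build when handlers don’t match” is a small drift check in CI. The sketch below assumes the spec has already been parsed into a dict and that the implemented routes have been collected from the web framework; both inputs are hypothetical stand-ins here.

```python
# Hypothetical inputs: in a real build, `spec` would be loaded from the
# OpenAPI file and `implemented` collected from the framework's router.
spec = {
    "paths": {
        "/v1/products": {"get": {}, "post": {}},
        "/v1/products/{product_id}": {"get": {}},
    }
}
implemented = {("GET", "/v1/products"), ("GET", "/v1/products/{product_id}")}

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

def spec_operations(spec: dict) -> set[tuple[str, str]]:
    """Flatten the spec's paths object into (METHOD, path) pairs."""
    ops = set()
    for path, item in spec.get("paths", {}).items():
        for method in item:
            if method in HTTP_METHODS:  # skip non-operation keys like "parameters"
                ops.add((method.upper(), path))
    return ops

def contract_drift(spec: dict, implemented: set) -> dict:
    declared = spec_operations(spec)
    return {
        "missing_handlers": declared - implemented,       # in spec, not in code
        "undocumented_handlers": implemented - declared,  # in code, not in spec
    }

drift = contract_drift(spec, implemented)
```

A CI step could fail whenever either set is non-empty. This only compares routes; checking request and response bodies against the schemas needs a contract-testing tool.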
gRPC and Protobuf
Why gRPC for Internal Calls
The Problem: JSON over HTTP/1.1 is fine for one call to your server. It is wasteful for a service mesh making millions of internal calls per second — serialization is slow, payloads are large, and there’s no native streaming.
The Solution: gRPC ships binary Protobuf payloads over HTTP/2 multiplexed streams, with code generation in 11+ languages. The contract is the .proto file — the server and client are both generated from it, so you can’t accidentally drift.
gRPC pays off when you have many internal services calling each other a lot. The wins are concrete: smaller payloads (binary encoding, no field names), lower CPU (fast codegen-based serialization), real streaming (server, client, and bidirectional), and a typed contract that catches mismatches at compile time instead of in production.
| Feature | REST + JSON | gRPC + Protobuf |
|---|---|---|
| Transport | HTTP/1.1 (mostly) | HTTP/2 streams |
| Payload | JSON text | Protobuf binary |
| Schema | Optional (OpenAPI) | Mandatory (.proto) |
| Streaming | Server-sent events / hacks | Native: server, client, and bidirectional |
| Browser | Native | gRPC-Web proxy required |
| Debug-with-curl | Yes | No (need grpcurl) |
| Throughput (typical) | Baseline | Up to 5–10x baseline, depending on payload and workload |
The contract: a .proto file
// product.proto: this file IS the API.
syntax = "proto3";

package ecommerce.v1;

service ProductService {
  rpc GetProduct (ProductRequest) returns (ProductResponse);
  rpc ListProducts (ListProductsRequest) returns (stream ProductResponse);
  rpc CreateProduct (CreateProductRequest) returns (ProductResponse);
}

message ProductRequest {
  int32 product_id = 1;
}

message ListProductsRequest {
  int32 page_size = 1;
  string page_token = 2;
}

message CreateProductRequest {
  string name = 1;
  double price = 2;
  int32 stock = 3;
}

message ProductResponse {
  int32 id = 1;
  string name = 2;
  double price = 3;
  int32 stock = 4;
}
From this one file you generate server stubs and client libraries in Go, Python, Java, Kotlin, Swift, TypeScript, C#, Rust — whatever you need. The fields are wire-tagged by number (= 1, = 2), which is what makes Protobuf forward- and backward-compatible: never reuse a field number, never change its type, and your old clients keep working forever.
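To see why numbered tags give that compatibility, here is a toy encoder and decoder for two Protobuf wire types. It is a sketch for illustration, not the protobuf library: it handles only non-negative integers (varint) and strings (length-delimited), and the extra field 5 ("currency") is a hypothetical later addition to the schema.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint, least-significant group first (non-negative ints only)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_field(number: int, value) -> bytes:
    if isinstance(value, int):   # wire type 0: varint
        return encode_varint(number << 3 | 0) + encode_varint(value)
    data = value.encode()        # wire type 2: length-delimited
    return encode_varint(number << 3 | 2) + encode_varint(len(data)) + data

def decode(buf: bytes, known: dict[int, str]) -> dict:
    """Decode using only the field numbers this reader knows about."""
    msg, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode(), pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in known:      # unknown field numbers are skipped, not errors
            msg[known[number]] = value
    return msg

# A newer writer sends field 5; an older reader only knows fields 1, 2 and 4.
wire = (encode_field(1, 1) + encode_field(2, "Laptop")
        + encode_field(4, 50) + encode_field(5, "USD"))
old_view = decode(wire, {1: "id", 2: "name", 4: "stock"})
```

The old reader never sees a field name, only a number and a wire type, so it can step over field 5 without knowing what it means. Reusing a number or changing its type breaks exactly this mechanism.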
Server in Python
from concurrent import futures

import grpc

# Modules generated from product.proto with grpcio-tools.
import product_pb2
import product_pb2_grpc

class ProductServicer(product_pb2_grpc.ProductServiceServicer):
    def __init__(self):
        self.products = {1: {"id": 1, "name": "Laptop", "price": 999.99, "stock": 50}}

    def GetProduct(self, request, context):
        product = self.products.get(request.product_id)
        if not product:
            # gRPC status code plus details; the empty message is ignored by clients.
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details("Product not found")
            return product_pb2.ProductResponse()
        return product_pb2.ProductResponse(**product)

    def ListProducts(self, request, context):
        # Server-side streaming: yield each product as it becomes available.
        for p in self.products.values():
            yield product_pb2.ProductResponse(**p)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    product_pb2_grpc.add_ProductServiceServicer_to_server(ProductServicer(), server)
    server.add_insecure_port("[::]:50051")  # plaintext; fine for local demos only
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
gRPC has sharp edges
- Browser pain: Browsers cannot speak gRPC directly — you need a gRPC-Web proxy (Envoy, grpc-web). For public APIs, this alone is usually a deal-breaker.
- Debugging tools: No curl, no Postman without plugins. Use grpcurl and BloomRPC, but the discoverability tax is real.
- Load balancing: HTTP/2 keeps a single long-lived connection per server. A naive L4 load balancer pins all traffic to one backend. You need an L7 LB (Envoy, Linkerd) or client-side load balancing.
- Status codes: gRPC has its own status enum (OK, NOT_FOUND, UNAVAILABLE, etc.), not HTTP codes. Map them carefully when bridging to REST gateways.
GraphQL
Why GraphQL Exists
The Problem: A mobile screen needs some fields from the user, some from their orders, and some from the product catalog. With REST that’s 3 round-trips and over-fetching every time. With many client apps (iOS, Android, web), the per-screen endpoint sprawl gets out of hand.
The Solution: GraphQL exposes one schema across many backing services. The client writes a query that describes the exact shape of the response it wants — one request, no over-fetching, no under-fetching.
GraphQL solves the over-fetching and under-fetching problem that REST has when many client form factors share the same backend. It is also a strong fit for API aggregation — one query that pulls together data from N microservices — via federation (Apollo Federation, GraphQL Mesh).
The schema
type Product {
  id: ID!
  name: String!
  price: Float!
  stock: Int!
  reviews: [Review!]!
}

type Review {
  id: ID!
  rating: Int!
  comment: String
  user: User!
}

type User {
  id: ID!
  name: String!
  email: String!
}

type Query {
  product(id: ID!): Product
  products(limit: Int = 20): [Product!]!
}

type Mutation {
  createProduct(name: String!, price: Float!, stock: Int!): Product!
}
The query and the response have the same shape
# Client query: ask for exactly the fields you need.
query {
  product(id: "1") {
    name
    price
    reviews {
      rating
      user { name }
    }
  }
}

# Response: mirror image of the query.
{
  "data": {
    "product": {
      "name": "Laptop",
      "price": 999.99,
      "reviews": [
        {"rating": 5, "user": {"name": "Jane"}}
      ]
    }
  }
}
GraphQL’s sharp edges
- The N+1 problem. A naive resolver for Product.reviews fires one DB query per product. Use DataLoader (or your stack’s equivalent) to batch within a single request.
- Caching. HTTP caches don’t help: everything is a POST to /graphql. You either build query-aware caching at the gateway (persisted queries + a CDN) or live with cache misses.
- Query cost. A malicious or careless client can ask for deeply nested data and DoS your DB. Enforce query depth limits, field-cost analysis, and timeouts.
- Schema is a single source of truth and a single point of failure. Federation helps, but the discipline overhead is real: teams must agree on entity ownership.
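The N+1 problem from the first bullet is easiest to see by counting queries. The sketch below uses a fake in-memory reviews table and a counter in place of a database; it shows the batching idea behind DataLoader, not DataLoader’s actual API, and all names are hypothetical.

```python
from collections import defaultdict

REVIEWS = [  # stand-in for a reviews table; made-up sample rows
    {"id": 1, "product_id": 1, "rating": 5},
    {"id": 2, "product_id": 1, "rating": 4},
    {"id": 3, "product_id": 2, "rating": 3},
]
query_count = 0

def fetch_reviews(product_ids: list[int]) -> list[dict]:
    """One round-trip to the (fake) database, however many IDs are asked for."""
    global query_count
    query_count += 1
    wanted = set(product_ids)
    return [r for r in REVIEWS if r["product_id"] in wanted]

def reviews_naive(product_ids: list[int]) -> dict:
    # N+1 shape: one query per product.
    return {pid: fetch_reviews([pid]) for pid in product_ids}

def reviews_batched(product_ids: list[int]) -> dict:
    # Batched shape: one query for all products, then group in memory.
    grouped = defaultdict(list)
    for review in fetch_reviews(product_ids):
        grouped[review["product_id"]].append(review)
    return {pid: grouped[pid] for pid in product_ids}

naive = reviews_naive([1, 2, 3])
naive_queries = query_count

query_count = 0
batched = reviews_batched([1, 2, 3])
batched_queries = query_count
```

Both functions return the same data; the naive version issues one query per product while the batched version issues one in total. A real loader additionally collects keys across resolvers within a request before firing the batch.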
If you have one or two clients hitting one backend, GraphQL is overkill. If you have a dozen client apps consuming a hundred microservices, GraphQL (federated) is often the only sane way to keep the API surface coherent.
Asynchronous Messaging
Why Async at All
The Problem: Synchronous calls fail synchronously. If three services downstream of your checkout each take 200 ms, your checkout takes 600 ms minimum — and one of them being down means checkout is down.
The Solution: Push work that doesn’t need an immediate answer onto a queue or event stream. The producer is decoupled from consumer health, traffic spikes are absorbed, and adding a new consumer doesn’t require touching the producer.
Two broad shapes dominate, and they’re not interchangeable:
| Shape | Tools | Semantics | Use For |
|---|---|---|---|
| Message queue (work distribution) | RabbitMQ, AWS SQS, Google Pub/Sub | One message, one consumer. Acked & deleted on success. | Background jobs, email, billing, retries. |
| Event stream (broadcast log) | Apache Kafka, AWS Kinesis, Redpanda | One message, many consumer groups. Retained for days/weeks. | Event sourcing, analytics, fan-out, replay. |
For deep coverage of broker selection, queue patterns (work queue, pub/sub, routing, topics), and consumer group semantics, see Messaging Patterns. For event-driven architecture as a system shape (event sourcing, CQRS, sagas), see Event-Driven Architecture. The summary below is the part you need to choose between sync and async at the call-site level.
RabbitMQ producer & consumer (the “work queue” shape)
import json

import pika

def process_order(order):
    # Placeholder for the real business logic (hypothetical).
    print("processing", order["order_id"])

# Producer: publish a durable message and walk away.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="order_processing", durable=True)
ch.basic_publish(
    exchange="",
    routing_key="order_processing",
    body=json.dumps({"order_id": "ORD-12345", "total": 1999.98}),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent
)

# Consumer: one job at a time, ack on success, requeue on failure.
def handle(ch, method, props, body):
    order = json.loads(body)
    try:
        process_order(order)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=True retries a poison message forever; in production,
        # cap the attempts and route failures to a dead-letter queue.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

ch.basic_qos(prefetch_count=1)  # fair dispatch
ch.basic_consume(queue="order_processing", on_message_callback=handle)
ch.start_consuming()
Kafka producer & consumer (the “event log” shape)
import json

from kafka import KafkaProducer, KafkaConsumer

def handle_event(event):
    # Placeholder for the real event handler (hypothetical).
    print("received", event["event_type"], event["order_id"])

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode(),
    key_serializer=lambda k: k.encode() if k else None,
)

# Key determines partition: ordering is per-key, not global.
producer.send(
    "order-events",
    key="ORD-12345",
    value={"event_type": "OrderCreated", "order_id": "ORD-12345", "total": 1999.98},
)
producer.flush()

# A consumer group reads the topic; many groups can read the same topic independently.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers=["localhost:9092"],
    group_id="billing-service",
    value_deserializer=lambda m: json.loads(m.decode()),
    auto_offset_reset="earliest",
)
for msg in consumer:
    handle_event(msg.value)
Quick decision rule
- One consumer group, work distribution, low retention → RabbitMQ or SQS.
- Many consumer groups, replay, high throughput, ordered per key → Kafka.
- You don’t know yet? Start with the simplest queue your platform offers (SQS on AWS, Pub/Sub on GCP). Migrate to Kafka the day you actually need replay.
Communication Reliability
Why Reliability Is a First-Class Concern
The Problem: Every network call has three failure modes: it doesn’t arrive, it arrives slowly, or it arrives twice. Naive code assumes none of those happen and is therefore wrong in production.
The Solution: Combine timeouts, retries with backoff, idempotency, and circuit breakers. None of these is optional once you have more than two services.
This section is a fast tour. For the full treatment of circuit breakers, retries, bulkheads, and chaos engineering, see Circuit Breaker & Resilience.
Timeouts: pick numbers and write them down
Every outbound call needs an explicit timeout. The default in most HTTP clients is “forever,” which is exactly the timeout that turns a slow downstream into a cascading outage.
| Call type | Reasonable timeout | Why |
|---|---|---|
| Internal microservice (in-region) | 200 ms – 2 s | Same-DC latency is sub-ms; anything above 2s is a sick service. |
| Database query | 1 s – 10 s | Most reads are <100 ms; long tail covers locks and slow scans. |
| External SaaS API | 5 s – 30 s | You don’t control their P99; budget for it but cap it. |
| Async job processing | 30 s – 5 min | Per-message visibility timeout in the queue. |
The timeout for any service should be shorter than the timeout of whoever calls it. Otherwise the upstream gives up first and the work it asked for keeps running anyway — pure waste.
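One way to keep each hop’s timeout shorter than its caller’s is to carry a deadline through the request and derive every downstream timeout from what is left. A minimal sketch, assuming a single process and a hypothetical helper named remaining_budget; the clock is injectable so the example is deterministic.

```python
import time

def remaining_budget(deadline: float, margin: float = 0.05, now=time.monotonic) -> float:
    """Seconds this service may still spend on a downstream call.

    `deadline` is an absolute time.monotonic() value fixed when the request
    arrived; `margin` reserves time to build and send our own response.
    """
    left = deadline - now() - margin
    if left <= 0:
        # Nothing left to spend: fail fast instead of starting doomed work.
        raise TimeoutError("deadline already exceeded")
    return left

# Request arrived at t=100.0 with a 1-second budget; 0.3 s already spent.
arrived = 100.0
deadline = arrived + 1.0
budget = remaining_budget(deadline, now=lambda: 100.3)  # roughly 0.65 s left
```

The returned value is what you would pass as the timeout of the next outbound call. Monotonic clock values are only meaningful inside one process, so across services you would send the remaining duration rather than the absolute deadline.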
Retry with exponential backoff and jitter
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    retryable: tuple = (TimeoutError, ConnectionError),
) -> T:
    """Call fn(), retrying only the exception types listed in `retryable`."""
    last = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as e:
            last = e
            if attempt == max_attempts - 1:
                break  # out of attempts: re-raise below without sleeping
            # Full jitter: pick a random delay in [0, exp_backoff)
            backoff = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))
    raise last
Never retry a non-idempotent POST without an idempotency key
A retried POST /charge can charge the customer twice. The HTTP-level retry happens because the network swallowed the response, not because the work didn’t happen. Either:
- Only retry idempotent verbs (GET, PUT, DELETE).
- Require an Idempotency-Key header on every write so the server deduplicates.
- Move the work to an async queue and let the broker’s at-least-once semantics push the deduplication problem onto the consumer.
Stripe’s idempotency-key model is the industry reference — it stores the response keyed by the client-supplied UUID for 24 hours, so retrying the exact same request returns the exact same response.
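The store-and-replay idea can be sketched in a few lines. This is an illustration of the pattern, not Stripe’s implementation: the IdempotencyStore class and its run_once method are hypothetical, the storage is in-memory, and there is no locking for concurrent requests.

```python
import time

class IdempotencyStore:
    """Remember the response for each key and replay it within the TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600, now=time.time):
        self._ttl = ttl_seconds
        self._now = now
        self._entries: dict[str, tuple[float, object]] = {}  # key -> (stored_at, response)

    def run_once(self, key: str, action):
        entry = self._entries.get(key)
        if entry and self._now() - entry[0] < self._ttl:
            return entry[1]        # replay: the side effect is NOT repeated
        response = action()        # first time (or expired key): do the work
        self._entries[key] = (self._now(), response)
        return response

charges = []

def charge_card():
    charges.append(50)             # the side effect we must not duplicate
    return {"charge_id": len(charges), "amount": 50}

store = IdempotencyStore()
first = store.run_once("key-1", charge_card)
retry = store.run_once("key-1", charge_card)   # same key: replayed
```

The retry returns the stored response and the card is charged once. A production version would also need to handle two requests with the same key arriving at the same time.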
Circuit breakers
Wrap every outbound call to an external dependency in a circuit breaker. After a configured failure rate is exceeded the breaker trips OPEN, and subsequent calls fail immediately instead of hanging on the timeout. This is what prevents one slow downstream from eating all of your service’s threads. See Circuit Breaker & Resilience for implementation, tuning, and observability of breakers, retries, and bulkheads.
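A breaker’s state machine is small enough to sketch here. Note the simplification: this version trips on consecutive failures rather than a failure rate, and it allows a single trial call after the reset timeout. The class and parameter names are hypothetical, and the clock is injectable so the demo is deterministic.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._now = now
        self._failures = 0
        self._opened_at = None  # None means CLOSED

    def call(self, fn):
        if self._opened_at is not None:
            if self._now() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            # Reset timeout elapsed: HALF-OPEN, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._now()  # trip (or re-trip) OPEN
            raise
        self._failures = 0          # success closes the breaker
        self._opened_at = None
        return result

clock = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, now=lambda: clock[0])

def down():
    raise ConnectionError("downstream unavailable")

outcomes = []
for _ in range(3):
    try:
        breaker.call(down)
    except CircuitOpenError:
        outcomes.append("rejected")   # no call was attempted
    except ConnectionError:
        outcomes.append("failed")     # call attempted and failed
clock[0] = 11.0                       # reset timeout has passed
outcomes.append(breaker.call(lambda: "ok"))
```

After two failures the third call is rejected immediately instead of waiting on a timeout, and a successful trial call after the reset timeout closes the breaker again.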
Error Handling Across Services
Why Errors Need a Schema
The Problem: “500 Internal Server Error” tells the caller nothing actionable. They retry, they fail again, they page someone. Worse: when a 4xx error becomes a 5xx (or vice versa) at a gateway, callers do the wrong thing.
The Solution: Treat errors as data. Every error has a stable type, a category, and structured fields the caller can branch on without parsing prose.
Classify errors at the source
| Class | HTTP | gRPC | Caller should |
|---|---|---|---|
| Bad request from caller | 400, 422 | INVALID_ARGUMENT | Fix the request. Do not retry. |
| Unauthorized / forbidden | 401, 403 | UNAUTHENTICATED, PERMISSION_DENIED | Re-auth or escalate. Do not retry. |
| Not found | 404 | NOT_FOUND | Treat as legitimate empty result. |
| Conflict / business rule | 409, 422 | FAILED_PRECONDITION | Show the user; don’t retry. |
| Rate-limited | 429 | RESOURCE_EXHAUSTED | Backoff and retry; honor Retry-After. |
| Server bug | 500 | INTERNAL | Retry once; log; alert. |
| Dependency timeout / down | 503, 504 | UNAVAILABLE, DEADLINE_EXCEEDED | Retry with backoff; trip breaker if persistent. |
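The HTTP column of the table translates into a small decision function on the caller side. This is a sketch under the table’s own rules; the function name, the returned labels, and the handling of statuses the table does not list are assumptions.

```python
RETRYABLE_WITH_BACKOFF = {429, 503, 504}

def caller_action(status: int, attempt: int = 0) -> str:
    """Map an HTTP status to what the caller should do next.

    `attempt` is the zero-based number of tries already made.
    """
    if 200 <= status < 300:
        return "ok"
    if status in RETRYABLE_WITH_BACKOFF:
        return "retry_with_backoff"   # for 429, also honor Retry-After
    if status == 500:
        return "retry_once" if attempt == 0 else "give_up"
    if status == 404:
        return "treat_as_empty"
    if 400 <= status < 500:
        return "do_not_retry"         # caller must fix the request or re-auth
    return "give_up"                  # not covered by the table: an assumption

decisions = {s: caller_action(s) for s in (200, 400, 404, 409, 429, 500, 503)}
```

Keeping this mapping in one shared function means every client in the fleet reacts to the same status in the same way.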
RFC 7807 problem details — the structured error body
# Content-Type: application/problem+json
{
  "type": "https://errors.example.com/insufficient-stock",
  "title": "Insufficient stock",
  "status": 409,
  "detail": "Requested 50 units of SKU-1234; only 12 available.",
  "instance": "/v1/orders/abc-123",
  "sku": "SKU-1234",
  "requested": 50,
  "available": 12,
  "correlation_id": "01HXYZ..."
}
Three rules for cross-service errors
- Don’t leak internal errors. A raw database failure shouldn’t surface verbatim in your public API; map it to a generic 503 with a stable type so callers can react without seeing your internals.
- Carry a correlation ID through every hop. Errors without a correlation ID are unsolvable. Inject X-Correlation-ID at the edge, propagate it on every outbound call, and log it on every line.
- Error budgets are a real budget. Your SLO defines how many errors are acceptable. Resilience patterns (retry, fallback, circuit breaker) spend that budget, so track them as carefully as you track success rate.
Real-World Examples
Stripe: REST done right
Stripe’s public API is REST + JSON, with a few opinionated extensions that the industry has steadily copied:
- Idempotency keys on every write. Send Idempotency-Key: <uuid>; the server stores the response for 24 hours and replays it on retry.
- Versioning by date. The Stripe-Version: 2023-10-16 header pins each merchant to a specific API snapshot. Stripe can ship breaking changes without breaking anyone.
- Errors as types. Every error has a stable type and a code. Client SDKs branch on these, not on HTTP status alone.
- Cursor-based pagination. Every list endpoint has has_more and a starting_after cursor, consistent across the whole API.
Google: gRPC at planet scale
Google designed gRPC as the open-source successor to Stubby, an internal RPC framework it had run for over a decade across thousands of services. The model carried over: Protobuf contracts checked into a single monorepo, with the .proto files as the API surface and clients in every language generated by the same toolchain. This is why gRPC is opinionated about things like deadlines (context.Deadline) and metadata propagation: those are Google’s production lessons turned into protocol features.
Shopify: GraphQL Admin API
Shopify’s public Admin API is GraphQL. The reason is the surface area: tens of thousands of third-party apps, each needing different slices of merchant data — orders, customers, products, fulfillment, inventory. With REST that’s either hundreds of endpoints (each one a backwards-compat liability) or massive over-fetching. With GraphQL each app asks for exactly what it needs. Shopify enforces a query-cost limit (calculated from the query AST) so a careless app can’t blow up the database; the cost is published as part of the schema so app authors can budget against it.
Uber: a mixed RPC stack
Uber’s service mesh is a deliberate mix:
- Internal service-to-service: gRPC, with Protobuf contracts. Mostly internal, low-latency, polyglot fleets — the gRPC sweet spot.
- Mobile and partner APIs: REST + JSON, with a thin GraphQL layer for client aggregation.
- Event backbone: Apache Kafka, for everything from trip-state changes to ML training pipelines — trillions of messages a day.
- Background jobs: Cadence (Uber’s open-source workflow engine) for long-running multi-step orchestrations like driver onboarding and payment reconciliation.
The lesson is that real production stacks are polyglot at the protocol layer. There is no “one true API style” — pick the right shape per call.
Best Practices
The short list
- Default to REST. Reach for gRPC or GraphQL when you have a concrete reason — not because it’s fashionable.
- Every outbound call has a timeout. Without exception. The default timeout in your HTTP client library is wrong.
- Every write has an idempotency key. Either client-supplied (REST/Stripe model) or broker-mediated (queue + dedup).
- Errors are types, not strings. Adopt RFC 7807 (or your stack’s equivalent) and stop returning bare 500s.
- Correlation IDs everywhere. Generate at the edge, propagate on every hop, log on every line. Without this, distributed debugging is guesswork.
- Pick async aggressively. If the caller doesn’t need an answer, don’t make them wait. Async absorbs spikes, isolates failures, and lets you add subscribers without redeploying the producer.
- Contract-test the boundaries. Pact, Spring Cloud Contract, or schema diffing in CI. The contract is the only artifact two teams share.
- Bound your fan-out. A single user request that explodes into 50 internal calls is a tail-latency disaster. Aggregate at the gateway (see API Gateway) or use GraphQL field-level batching.
Common anti-patterns
| Anti-pattern | Why it hurts | What to do instead |
|---|---|---|
| Synchronous cascade (A → B → C → D) | Latency adds; failures multiply; one slow node blocks everyone | Async after the first hop where possible; aggregate at gateway |
| No timeout on outbound calls | Slow downstream eats the caller’s threads — cascading outage | Explicit per-call timeout shorter than upstream’s |
| Retrying non-idempotent POST | Duplicate orders, double charges | Idempotency keys, or move to async with broker dedup |
| Sharing a JSON model across teams without versioning | One team’s rename breaks every consumer | OpenAPI / Protobuf contracts, semantic versioning, contract tests |
| Using REST for high-throughput internal RPC | JSON parsing alone burns 10–30% CPU at scale | gRPC + Protobuf for hot internal paths |
| Using GraphQL on a single-client backend | Adds N+1 risk, query-cost overhead, tooling tax | Plain REST — come back when you have many clients |
| Treating queue messages as fire-and-forget “notifications” | Lost work, no audit trail, no replay | Durable broker, explicit acks, dead-letter queue |
The single most useful sentence about service communication
Your protocol choice is a contract between the two teams that own the two services — and it will outlive both of them. Choose for the next five years of operations, not for this sprint’s feature.