Why Architecture Principles Matter
The Problem: Splitting a monolith into 30 services does not give you microservices. It gives you a distributed monolith — code that has to be deployed in lockstep, fails together, and is harder to debug than the original. The wire cost is paid; the benefits are not collected.
The Solution: A small, opinionated set of principles — bounded contexts, loose coupling, autonomy, design for failure, observability — that decide what belongs in a service and how services relate. Frameworks change. These principles do not.
Real Impact: Amazon, Netflix, Spotify, Shopify, LinkedIn — every shop that runs hundreds of services in production runs the same handful of principles underneath. The languages, queues, and orchestrators are interchangeable. The principles are not.
Real-World Analogy
Think of a city, not a single building.
- Each neighborhood has its own purpose — residential, financial, industrial — with clear borders (bounded contexts).
- Neighborhoods are connected by roads and a common postal address scheme (well-defined APIs), not by knocking through everyone’s walls.
- If one neighborhood loses power, the others stay lit (design for failure).
- The city plans for traffic, sewage, garbage, and 911 from day one (observability and resilience).
- You do not need a city-wide vote to repaint a coffee shop (autonomy and independent deployment).
A monolith is one giant building. A microservices system is a city. The principles in this tutorial are the zoning code.
Most teams do not fail at microservices because they pick the wrong queue or the wrong language. They fail because they skip the principles and build services around technical layers (a “DB service,” a “CRUD service,” a “helper service”) instead of around business capability. The result is services that all change together, all release together, and all break together — a monolith with a network in the middle.
The principles below are the canon. Most predate “microservices” as a buzzword: SOLID (Robert C. Martin), Domain-Driven Design (Eric Evans, 2003), Conway’s Law (1968), and the Twelve-Factor App (Heroku, 2011) all came first, and Sam Newman’s Building Microservices codifies them for services. If a tutorial argues against one of these, be suspicious.
| Principle | What it forces | What it prevents |
|---|---|---|
| Single Responsibility / Bounded Context | One service, one business capability | God services that own everything |
| Loose Coupling | Change one service without touching others | Lock-step deploys |
| High Cohesion | Related things live together | Logic scattered across 5 services |
| Autonomy | Teams ship on their own schedule | Release-train coordination meetings |
| Smart Endpoints, Dumb Pipes | Logic in services, not the bus | ESB-style hidden coupling |
| Design for Failure | Assume every dependency will break | Cascading outages |
| Statelessness | Any instance can serve any request | Sticky sessions, scaling pain |
| Observability First | You can answer “is it working?” | 3 AM mystery outages |
Single Responsibility & Bounded Contexts
Why Bounded Contexts Matter
The Problem: “User” means something different to Sales, Support, Shipping, and Billing. Try to model one universal “User” class and every team ends up fighting in the same file. The model collapses under contradictory requirements.
The Solution: A bounded context (Eric Evans, Domain-Driven Design) is an explicit boundary inside which a model has a single, consistent meaning. Different contexts can use the same word for different things — and that is fine.
The Single Responsibility Principle from SOLID, applied to a service, becomes: each service owns one bounded context and one business capability. Not one endpoint. Not one database table. One coherent piece of the business.
The same word means different things
The classic example: a “Customer” in Sales is not the same as a “Customer” in Support or Shipping. Forcing one model on all three is what kills monoliths.
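# Each class below lives in a different service's codebase; they are shown
# together only for comparison. Order, PaymentMethod, Ticket, and Address
# stand for each context's own types.
from __future__ import annotations  # annotations are not evaluated at import time
from decimal import Decimal

# Sales context: Customer as buyer
class Customer:
    customer_id: str
    credit_limit: Decimal
    purchase_history: list[Order]
    loyalty_points: int
    payment_methods: list[PaymentMethod]

# Support context: Customer as case
class Customer:
    customer_id: str
    support_tier: str  # bronze, silver, gold
    open_tickets: list[Ticket]
    satisfaction_score: float
    contact_preferences: dict

# Shipping context: Customer as recipient
class Customer:
    customer_id: str
    shipping_addresses: list[Address]
    delivery_preferences: dict
    delivery_instructions: str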
Three classes, all called Customer, all correct. They share an ID — that is the integration point — and nothing else. Each lives in a different service.
How to find a bounded context
You do not find bounded contexts at a whiteboard with a UML editor. You find them by listening to the business. The technique is Event Storming (Alberto Brandolini): get domain experts in a room and write every event the business cares about on sticky notes — OrderPlaced, PaymentAuthorized, ItemShipped, TicketEscalated. Group events that always travel together. Each cluster is a candidate bounded context.
Four questions that find a service boundary
- Can this functionality change independently? If the business rule changes for “pricing,” do you also have to change “shipping”? If yes, they are not separate.
- Does it own its own data? A real service owns its tables. If two services JOIN across the same DB, they are one service in two pods.
- Can one team own this completely? Conway’s Law (1968): the system mirrors your org chart. If five teams all touch the same service, that service is split wrong.
- Does it represent a distinct business capability? “Checkout” is a capability. “DatabaseHelper” is not.
Service decomposition by capability
A useful exercise: list everything a monolith does, then split by what the business calls each thing.
| Monolith Function | Microservice | Owns |
|---|---|---|
| User Management | Auth Service | Login, logout, tokens |
| User Management | Profile Service | User data, preferences, settings |
| User Management | Permissions Service | Roles, access control |
| E-Commerce | Catalog Service | Product listings, search |
| E-Commerce | Cart Service | Shopping cart state |
| E-Commerce | Checkout Service | Order placement, validation |
| E-Commerce | Payment Service | Payment processing, refunds |
What a single-responsibility service looks like in code
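// User Service: ONLY manages user profiles.
// Does NOT handle authentication or payments.
const express = require('express');
// Assumption: './db' is this service's own data-access module, exposing a
// `users` collection with insert / findOne / update. It is not shown here.
const db = require('./db');

const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

class UserService {
  async createProfile(userId, profileData) {
    const profile = {
      userId,
      firstName: profileData.firstName,
      lastName: profileData.lastName,
      email: profileData.email,
      preferences: profileData.preferences || {},
      createdAt: new Date()
    };
    await db.users.insert(profile);
    return profile;
  }

  async getProfile(userId) {
    const profile = await db.users.findOne({ userId });
    if (!profile) throw new Error('Profile not found');
    return profile;
  }

  async updateProfile(userId, updates) {
    const profile = await this.getProfile(userId);
    const updated = { ...profile, ...updates, updatedAt: new Date() };
    await db.users.update({ userId }, updated);
    return updated;
  }
}

const userService = new UserService();

// Express 4 does not catch rejected promises from async handlers, so wrap them.
const handle = (fn) => (req, res) =>
  fn(req, res).catch((err) => {
    const status = err.message === 'Profile not found' ? 404 : 500;
    res.status(status).json({ error: err.message });
  });

app.get('/users/:id', handle(async (req, res) => {
  res.json(await userService.getProfile(req.params.id));
}));
app.put('/users/:id', handle(async (req, res) => {
  res.json(await userService.updateProfile(req.params.id, req.body));
}));
app.listen(3001);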
Bad service names that signal trouble
If you find yourself naming a service any of these, stop. The boundary is wrong.
- DatabaseService, DataService: technical layer, not a capability.
- HelperService, UtilService, CommonService: vague responsibility, magnet for unrelated code.
- BusinessLogicService: a synonym for “the monolith.”
- OrchestratorService that calls 8 others: usually a sign that one of the 8 should own the workflow.
Loose Coupling, High Cohesion
Why Coupling and Cohesion Are the Whole Game
The Problem: Two services with shared types, shared databases, or synchronous chains of five calls are coupled. They have to deploy together, scale together, and fail together. You paid for a network and got a monolith back.
The Solution: Couple loosely (talk only over well-defined contracts), cohere tightly (everything one service does is closely related). These two terms — coined by Larry Constantine in the 1970s — are still the best diagnostic for any service split.
Loose coupling: what it actually means
Loose coupling does not mean “no calls between services.” It means: if Service B changes its internals — language, database, deploy schedule — Service A does not have to change. That is achieved by:
- Owning your own data. No service reads another’s tables. The boundary is the API, not the schema.
- Stable, versioned contracts. Communicate over HTTP/JSON, gRPC, or events — never via a shared library that drags you into the other team’s release cadence.
- Async where possible. Events and queues decouple in time as well as in code.
- Tolerant readers. Ignore fields you don’t understand; never explode on an extra key.
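The tolerant-reader bullet is the easiest of these to show in code. The sketch below assumes a hypothetical address payload; the field names and the `read_address` helper are illustrative, not part of any real contract. It reads only the fields this consumer needs and ignores the rest, so the producer can add fields without breaking it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShippingAddress:
    """The only fields this consumer actually needs from the payload."""
    customer_id: str
    city: str
    postal_code: str = ""  # optional in this hypothetical contract


def read_address(payload: dict) -> ShippingAddress:
    """Tolerant reader: pick the known fields, ignore everything else."""
    try:
        return ShippingAddress(
            customer_id=str(payload["customer_id"]),
            city=str(payload["city"]),
            postal_code=str(payload.get("postal_code", "")),
        )
    except KeyError as missing:
        # A missing required field is a contract violation worth failing on.
        raise ValueError(f"payload missing required field: {missing}") from None


# The producer added "geo" and "loyalty_tier" in a later release.
v2_payload = {
    "customer_id": "cus_19",
    "city": "Lisbon",
    "geo": {"lat": 38.72, "lon": -9.14},
    "loyalty_tier": "gold",
}
address = read_address(v2_payload)  # unknown keys are simply not read
```

Tolerance applies to additions only. A missing required field still fails loudly, because that is a broken contract rather than an extended one.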
High cohesion: things that change together live together
The flip side of loose coupling. If a single business change forces edits in three services, those three responsibilities probably belong in one service. The classic test: write down the next ten user stories. Color-code each one by which service it touches. If most stories paint multiple services, your boundaries are wrong.
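The color-coding test can be run as a small script once the stories are written down. This is a minimal sketch with a made-up backlog; the story titles, service names, and both helper functions are hypothetical. It reports the share of stories that touch more than one service and which pairs of services change together most often.

```python
from collections import Counter
from itertools import combinations

# Hypothetical backlog: each story maps to the services it would force you to edit.
stories = {
    "Gift wrapping option": {"checkout", "pricing"},
    "Show delivery estimate": {"shipping"},
    "Regional tax rules": {"pricing", "checkout"},
    "Saved payment methods": {"payment"},
    "Bulk discount codes": {"pricing", "checkout", "catalog"},
}


def multi_service_ratio(stories: dict[str, set[str]]) -> float:
    """Fraction of stories that touch more than one service."""
    touching_many = sum(1 for services in stories.values() if len(services) > 1)
    return touching_many / len(stories)


def co_change_pairs(stories: dict[str, set[str]]) -> Counter:
    """Count how often each pair of services appears in the same story."""
    pairs = Counter()
    for services in stories.values():
        for pair in combinations(sorted(services), 2):
            pairs[pair] += 1
    return pairs


ratio = multi_service_ratio(stories)
most_coupled = co_change_pairs(stories).most_common(1)
```

In this made-up backlog, three of five stories span services and `checkout` and `pricing` change together every time, which would make that pair the first candidate for merging or re-cutting.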
The Anti-Corruption Layer
One of the most useful coupling tools from DDD: when you must integrate with a messy or external model, do not let it leak into your domain. Build a thin translator at the boundary.
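from dataclasses import dataclass

# Your clean domain model
@dataclass
class OrderItem:
    sku: str

class Order:
    def __init__(self, order_id, customer, items):
        self.order_id = order_id
        self.customer = customer
        self.items = items

# Legacy system has a messy model
@dataclass
class LegacyOrderData:
    ORD_NUM: str
    CUST_CODE: str
    LINE_ITEMS: str  # comma-separated SKUs!

# Anti-Corruption Layer translates
class LegacyOrderAdapter:
    def __init__(self, customer_service):
        # Any client that can resolve a legacy customer code to a domain customer.
        self.customer_service = customer_service

    def to_domain(self, legacy: LegacyOrderData) -> Order:
        customer = self.customer_service.get_by_code(legacy.CUST_CODE)
        items = self._parse_line_items(legacy.LINE_ITEMS)
        return Order(order_id=legacy.ORD_NUM, customer=customer, items=items)

    def to_legacy(self, order: Order) -> LegacyOrderData:
        return LegacyOrderData(
            ORD_NUM=order.order_id,
            CUST_CODE=order.customer.code,
            LINE_ITEMS=",".join(i.sku for i in order.items),
        )

    def _parse_line_items(self, raw: str) -> list[OrderItem]:
        # Split the comma-separated SKU string, skipping empty entries.
        return [OrderItem(sku=s.strip()) for s in raw.split(",") if s.strip()]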
Without the adapter, every consumer of the legacy system bends to its shape. With it, the legacy system is contained. The pattern works equally well for third-party SaaS APIs you cannot change.
Context mapping: how services relate
| Pattern | Relationship | When to use |
|---|---|---|
| Shared Kernel | Two contexts share a small piece of model code | Same team owns both; rare and risky |
| Customer-Supplier | Upstream provides, downstream depends | The downstream team can influence upstream priorities |
| Conformist | Downstream just accepts upstream’s model | Upstream is external and won’t change for you |
| Anti-Corruption Layer | Translation layer protects your model | Legacy systems, third-party APIs |
| Published Language | Stable, versioned, well-documented contract | Public APIs, integration platforms |
Service Autonomy & Decentralization
Why Autonomy Is the Real Win
The Problem: A team that needs to coordinate with three other teams to ship a one-line change is not getting microservices’ benefit. The org has split the code but kept the coupling — meetings instead of imports.
The Solution: Autonomous services own their schema, their deploy pipeline, their on-call rotation, and their tech choices. Decentralize. Resist the urge to mandate a single language, single database, or single framework.
Sam Newman calls this independent deployability. Martin Fowler calls it decentralized governance. The principle is the same: push decisions out to the team that owns the service. The two-pizza team (Amazon’s phrase) does not ask permission to ship.
What autonomy requires
- Database per service. No shared schemas. The service’s data is its own; the API is the only way in.
- Independent CI/CD. One pipeline per service. No release trains. No quarterly “big bang” deploys.
- Backward-compatible contracts. If you must coordinate releases, the contract is wrong. Use additive changes and deprecation windows.
- Local tech choice. Within reason. Polyglot is fine; chaos is not. Most shops settle on 2–3 sanctioned stacks.
Conway’s Law works both directions
“Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.”
Melvin Conway, 1968. Read it twice. The shape of your services will mirror your org chart, whether you plan it or not.
The corollary — the “Inverse Conway Maneuver” — is to design the org you want, then let the architecture follow. Spotify’s squad / tribe / chapter / guild model is the famous example: small autonomous squads, each owning a slice of product, each owning the services for that slice.
Decentralized data
The most painful coupling in any distributed system is shared state. Two services that write to the same table are not two services — they are two clients of one database, with all of the lock contention, schema-change drama, and consistency questions that implies.
The principle: each service owns its database. If another service needs the data, it asks via API or subscribes to events. Yes, this means duplication. Yes, this means eventual consistency in places. That is the price of decoupling, and on a system at any real scale, it is worth paying.
# BAD: two services share a database
[order-service] ---> [orders DB] <--- [reporting-service]
       ^                                       ^
           schema change breaks both deploys

# GOOD: each service owns its data, events flow between them
[order-service] ---> [orders DB]
       |
       | publishes OrderPlaced
       v
  [event bus] ---> [reporting-service] ---> [reporting DB]
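On the consuming side, the GOOD layout could look roughly like the sketch below. It assumes a hypothetical OrderPlaced payload with `event_id`, `customer_id`, and `total_cents` fields, and uses an in-memory class in place of the reporting database. It also assumes the bus delivers at least once, so the handler skips event IDs it has already applied.

```python
class ReportingStore:
    """Stands in for the reporting service's own database (hypothetical)."""

    def __init__(self):
        self.revenue_by_customer: dict[str, int] = {}
        self._seen_event_ids: set[str] = set()

    def apply_order_placed(self, event: dict) -> bool:
        """Apply one OrderPlaced event; return False if it was a duplicate."""
        if event["event_id"] in self._seen_event_ids:
            return False  # redelivery: already counted, do nothing
        self._seen_event_ids.add(event["event_id"])
        customer = event["customer_id"]
        self.revenue_by_customer[customer] = (
            self.revenue_by_customer.get(customer, 0) + event["total_cents"]
        )
        return True


store = ReportingStore()
deliveries = [
    {"event_id": "evt_1", "customer_id": "cus_19", "total_cents": 4200},
    {"event_id": "evt_2", "customer_id": "cus_19", "total_cents": 800},
    {"event_id": "evt_1", "customer_id": "cus_19", "total_cents": 4200},  # redelivered
]
applied = [store.apply_order_placed(event) for event in deliveries]
```

The reporting service never reads the orders database. Its copy of the data is duplicated and slightly behind, which is the eventual consistency the paragraph above accepts as the price of decoupling.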
Smart Endpoints, Dumb Pipes
Why the Bus Should Stay Stupid
The Problem: Enterprise Service Buses (ESBs) tried to solve integration by putting business logic — routing, transformation, orchestration — into the message bus itself. The result: a giant central system that every team had to coordinate on. The bus became the new monolith.
The Solution: Put the smarts in the services (the endpoints). Keep the bus dumb: it just moves bytes. This is the phrase Martin Fowler and James Lewis used in the original microservices essay, and it is still the cleanest summary of how services should communicate.
A “dumb pipe” is HTTP, gRPC, Kafka, or RabbitMQ used purely as a transport. The transport routes; it does not interpret. A “smart endpoint” is a service that owns its own validation, business rules, and translation. If the endpoint is smart enough, the pipe can be gloriously simple.
Domain events are how you keep the pipe dumb
Instead of one service calling another, services announce things they did. Other services subscribe to what they care about. The bus does not know about “order” or “payment” — it just delivers messages.
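// Domain event: past tense, immutable, named for the business
public class OrderPlacedEvent implements DomainEvent {
    private final String orderId;
    private final String customerId;
    private final Money totalAmount;
    private final Instant occurredAt;

    public OrderPlacedEvent(String orderId, String customerId, Money totalAmount) {
        this.orderId = orderId;
        this.customerId = customerId;
        this.totalAmount = totalAmount;
        this.occurredAt = Instant.now();
    }
}

// Aggregate raises the event as part of state change
public class Order {
    // id, customerId, totalAmount, orderLines, and status are the aggregate's
    // own fields; their declarations are omitted here.
    private final List<DomainEvent> domainEvents = new ArrayList<>();

    public void place() {
        if (orderLines.isEmpty()) {
            throw new BusinessException("Cannot place empty order");
        }
        this.status = OrderStatus.PLACED;
        this.domainEvents.add(new OrderPlacedEvent(id, customerId, totalAmount));
    }

    public List<DomainEvent> getDomainEvents() {
        return List.copyOf(domainEvents);
    }

    public void clearDomainEvents() {
        domainEvents.clear();
    }
}

// Service publishes after a successful save
@Service
public class OrderService {
    // orderRepository and eventPublisher are injected collaborators (not shown).
    @Transactional
    public void placeOrder(Order order) {
        order.place();
        orderRepository.save(order);
        // Caution: this runs before the transaction commits. If the publisher is
        // not transactional, publish after commit or go through an outbox table.
        order.getDomainEvents().forEach(eventPublisher::publish);
        order.clearDomainEvents();
    }
}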
The shipping service subscribes to OrderPlaced. So does the recommendation service. So does the analytics pipeline. None of them know about each other. The order service doesn’t know they exist. That is loose coupling delivered by a dumb pipe.
Benefits of the event-first style
- Loose coupling: services don’t need each other’s addresses, only the event contract.
- Audit trail: the event log is the truth of what happened.
- Replayable: rebuild a downstream view from history.
- Open extension: new subscribers don’t require changes upstream.
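To make the “dumb pipe” concrete, here is a toy in-memory bus. It is a sketch under stated assumptions, not how Kafka or RabbitMQ work internally: the `DumbPipe` class, the topic name, and the payload fields are all hypothetical. The pipe only stores and forwards bytes by topic; decoding and business meaning live entirely in the subscribers.

```python
import json
from collections import defaultdict
from typing import Callable

Handler = Callable[[bytes], None]


class DumbPipe:
    """Moves opaque bytes by topic name. Knows nothing about orders or payments."""

    def __init__(self):
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self._log: list[tuple[str, bytes]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: bytes) -> None:
        self._log.append((topic, message))  # keep history for later replay
        for handler in self._subscribers[topic]:
            handler(message)

    def replay(self, topic: str, handler: Handler) -> None:
        """Feed the stored history of one topic to a (possibly new) handler."""
        for logged_topic, message in self._log:
            if logged_topic == topic:
                handler(message)


bus = DumbPipe()
to_ship: list[str] = []
orders_seen: list[str] = []

# Smart endpoints: each subscriber decodes and interprets the message itself.
bus.subscribe("OrderPlaced", lambda raw: to_ship.append(json.loads(raw)["order_id"]))
bus.subscribe("OrderPlaced", lambda raw: orders_seen.append(json.loads(raw)["order_id"]))

bus.publish("OrderPlaced", json.dumps({"order_id": "ord_551"}).encode())

# A subscriber added later rebuilds its view from history; the publisher is untouched.
recommendations: list[str] = []
bus.replay("OrderPlaced", lambda raw: recommendations.append(json.loads(raw)["order_id"]))
```

The late subscriber illustrates both the replayable and open-extension points: it catches up from the log, and nothing upstream had to change to admit it.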
The Published Language
Whatever flows over the dumb pipe — JSON schema, Protobuf, Avro — is the Published Language between services. Treat it like a public API: versioned, additive-only changes, deprecation announcements. Internal model changes do not change the contract; the contract is what your consumers depend on.
Design for Failure
Why Failure Is the Default
The Problem: In a monolith, a function call either returns or throws. In a distributed system, a remote call can succeed, fail, time out, partially succeed, succeed but lose the response, or take 30 seconds. Code written assuming the monolith model breaks under the distributed one.
The Solution: Treat every cross-service call as “will fail eventually.” Build with timeouts, retries with jitter, circuit breakers, bulkheads, and graceful fallbacks. Werner Vogels’ rule: everything fails, all the time.
Cascading failures are the signature outage of a microservices system. Service B slows down; Service A’s threads pile up waiting on it; Service A starts rejecting requests; A’s callers retry and add load to an already sick system; and the blast radius grows with every hop. Within minutes the whole mesh is down. Designing for failure is how you break the chain at step two, so that the caller notices a slow dependency and fails fast instead of waiting on it.
The non-negotiable patterns
| Pattern | What it does | Without it |
|---|---|---|
| Timeout | Caps how long a call can hang | Threads pin forever on a sick downstream |
| Retry with backoff + jitter | Handles transient failure without thundering-herd | 1,000 callers retry at the same instant |
| Circuit breaker | Fails fast when a dependency is sick | Slow death by thread pool exhaustion |
| Bulkhead | Isolates resources per dependency | One sick downstream starves all the others |
| Fallback | Graceful degradation (cache, defaults) | One service’s failure becomes the user’s 500 |
| Idempotency keys | Safe retries for writes | Charging the customer twice |
Each is covered in depth in the Circuit Breaker & Resilience tutorial. The principle here is: build them in from day one. Bolting them on after the first incident is twice as expensive and half as effective.
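As a taste of what “built in” means, here is a minimal circuit breaker sketch. It is illustrative only: the class, its parameter names, and the thresholds are assumptions, and production code would normally use a maintained resilience library and add thread safety. The clock is injectable so the behavior can be shown without real waiting.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the example does not sleep
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])


def sick_downstream():
    raise TimeoutError("downstream timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(sick_downstream)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has passed; the next call is the half-open trial
recovered = breaker.call(lambda: "ok")
```

The third call never reaches the downstream: the breaker rejects it immediately, which is what keeps the caller’s threads from piling up on a sick dependency.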
The retry that becomes an outage
A naive retry: 3 means that when 1,000 callers hit a flaky downstream, 1,000 requests can immediately become 4,000: each original attempt plus three retries. The downstream stays sick longer, more callers retry, and the system spirals. Always pair retries with exponential backoff, jitter, and a budget. Never retry a non-idempotent write without an idempotency key.
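A retry helper that follows this advice might look like the sketch below. The names (`TransientError`, `RetryBudget`, `call_with_retries`) and the default numbers are assumptions made for illustration. Delays use exponential backoff with full jitter, and a simple process-wide budget caps the total number of retries; real retry budgets are often expressed as a ratio of retries to requests instead.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


class RetryBudget:
    """A fixed allowance of retries shared by every caller in this process."""

    def __init__(self, max_retries_total: int):
        self.remaining = max_retries_total

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_retries(fn, budget, max_retries=3, base_s=0.1, cap_s=5.0,
                      sleep=time.sleep, rng=random.random):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            # Give up when out of attempts or when the shared budget is spent.
            if attempt == max_retries or not budget.try_spend():
                raise
            # Full jitter: wait a random time in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("try again")
    return "ok"


slept = []  # record delays instead of sleeping; rng fixed at 1.0 to show the upper bound
result = call_with_retries(flaky, RetryBudget(10), sleep=slept.append, rng=lambda: 1.0)
```

Only `TransientError` is retried here, so anything else propagates on the first failure. For writes, the same helper is safe only if `fn` sends an idempotency key with every attempt.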
Chaos engineering: prove it works
Resilience patterns that have not fired in production are theoretical. Netflix invented Chaos Monkey to kill random instances during business hours; the discipline is now standard at any shop running real-scale services. The point: practice the failure on your terms, with monitoring, in business hours, with a rollback plan, before the failure picks the time itself.
Statelessness vs Managed State
Why Stateless Services Scale
The Problem: A service that holds session state in memory cannot be killed without losing the user’s session. It cannot be horizontally scaled without sticky sessions. Rolling deploys are dangerous. Autoscaling is dangerous. Spot instances are out of the question.
The Solution: Push state out of the request-handling tier. The Twelve-Factor App states it directly: processes are stateless and share-nothing. Any instance can serve any request. Killing a pod has no business consequence.
The Twelve-Factor App (Heroku, 2011) codified the modern stateless-service style. Factor VI: Execute the app as one or more stateless processes. State that must persist goes to a backing service — a database, a cache, a session store, an object store. The application tier is replaceable.
What state can live where
| State | Where it goes | Why |
|---|---|---|
| Per-request data | The request itself | Lives only as long as the call |
| User session | Redis, Memcached, signed JWT | Any instance can read it; no sticky sessions needed |
| Business data | The service’s own database | Durable, transactional |
| Hot reads / computed views | Cache (Redis, CDN) | Performance; rebuildable from source of truth |
| Files / blobs | S3, GCS, blob storage | Durable, cheap, separate scaling |
| Long-lived workflows | Workflow engine (Temporal, Step Functions) | Survives pod restarts |
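The “User session” row is the one that most often goes wrong, so here is a small sketch of it. `SessionStore` is a hypothetical in-memory stand-in for a backing service such as Redis, and `CartHandler` stands for one replica of the app tier; neither name comes from a real library. The handler keeps nothing between calls, so a second replica can pick up where the first left off.

```python
class SessionStore:
    """In-memory stand-in for an external session store (hypothetical)."""

    def __init__(self):
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = dict(session)


class CartHandler:
    """One app instance. It holds no per-user state of its own."""

    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self._store = store

    def add_item(self, session_id: str, sku: str) -> list[str]:
        session = self._store.get(session_id)  # load state for this request
        cart = session.get("cart", []) + [sku]
        session["cart"] = cart
        self._store.put(session_id, session)   # write it back before replying
        return cart


store = SessionStore()
instance_a = CartHandler("pod-a", store)
instance_b = CartHandler("pod-b", store)

instance_a.add_item("sess_1", "sku_mug")
del instance_a  # the pod is killed mid-session
cart = instance_b.add_item("sess_1", "sku_tea")
```

Because the cart lives in the store, losing `pod-a` costs the user nothing, and the load balancer is free to send the next request anywhere.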
Stateful services exist — and that is fine
Databases. Stream processors. Cache nodes. Workflow engines. These are supposed to be stateful and they have their own scaling story (sharding, replication, consensus). The principle is not “eliminate state.” The principle is: be deliberate about which services hold state and which do not. The vast majority of your business services should be stateless replicas behind a load balancer. The handful that are not should be operated by people who know what they signed up for.
Observability as a First-Class Concern
Why You Cannot Bolt Observability On
The Problem: In a monolith, a stack trace and a log file get you 80% of the way to a root cause. In a distributed system, a single user request might touch 12 services, and the bug is in the 7th. Without instrumentation, debugging is archaeology.
The Solution: Treat metrics, logs, and traces as part of the service contract — not as something the SRE team adds later. Every service emits structured logs with a request ID, exposes Prometheus-style metrics, and propagates trace context.
The three pillars are Logs, Metrics, and Traces. Each answers a different question:
| Pillar | Answers | Tooling (representative) |
|---|---|---|
| Logs | What exactly happened on this instance? | JSON to stdout, Loki, Elasticsearch, Datadog |
| Metrics | How is the system behaving in aggregate? | Prometheus, OpenTelemetry, Grafana |
| Traces | Where did this one request spend its time? | Jaeger, Tempo, Zipkin, OpenTelemetry |
The minimum bar for any service
- Structured logs to stdout with request_id, trace_id, service, level. The platform ships them; the app does not own log files.
- The four golden signals as metrics: latency, traffic, errors, saturation (Google SRE).
- Distributed tracing via OpenTelemetry. Propagate traceparent on every outbound call.
- Health and readiness endpoints (/healthz, /readyz) that the orchestrator can probe.
- SLOs with error budgets. Pick the few user-visible numbers you commit to. Alert on burning the budget, not on every stack trace.
# Structured log line — one JSON object per event
{
  "ts": "2026-05-12T14:22:01.337Z",
  "level": "info",
  "service": "checkout",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_8f3kd9",
  "customer_id": "cus_19",
  "event": "order.placed",
  "order_id": "ord_551",
  "latency_ms": 182
}
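A log line shaped like the one above can be produced with the Python standard library alone. The sketch below is one possible setup, not a prescribed one: `JsonFormatter`, `build_logger`, and the convention of passing context under a `fields` key in `extra` are assumptions made for this example. It writes to an in-memory stream so the output can be inspected; a real service would pass `sys.stdout`.

```python
import io
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc)
                          .isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Context passed via extra={"fields": {...}} lands on the record.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


stream = io.StringIO()  # a real service would log to sys.stdout
log = build_logger("checkout", stream)
log.info("order.placed", extra={"fields": {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "request_id": "req_8f3kd9",
    "order_id": "ord_551",
    "latency_ms": 182,
}})
record = json.loads(stream.getvalue())
```

Keeping the event name short and stable (`order.placed`) and putting everything variable into fields is what makes these lines easy to query later.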
The single useful question
For any service, ask: at 3 AM during a P0 incident, can the on-call engineer answer “is this service working?” in under 60 seconds, from a dashboard? If no, observability is not done.
Real-World Examples
The principles above sound abstract until you see them at scale. Here is how five well-known shops apply them.
Amazon: cells, two-pizza teams, and “you build it, you run it”
Amazon’s service architecture predates the term “microservices.” The doctrine: every team owns a service end-to-end — code, deploy, on-call. Teams are small (the “two-pizza team”), services are owned, and inter-team communication happens over APIs, not Slack threads. Amazon’s cell-based architecture takes design-for-failure further: each service is split into independent cells, customers are assigned to cells, and a bug in one cell affects ~1/N of users instead of 100%.
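The idea of pinning each customer to one cell can be sketched in a few lines. This is not Amazon’s implementation; the function name, the hashing scheme, and the cell count are assumptions chosen only to show why a single bad cell reaches roughly 1/N of customers.

```python
import hashlib


def assign_cell(customer_id: str, cell_count: int) -> int:
    """Map a customer to a cell deterministically via a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


CELL_COUNT = 10
customers = [f"cus_{n}" for n in range(10_000)]

# Suppose a bad deploy takes down cell 3: who is affected?
affected = [c for c in customers if assign_cell(c, CELL_COUNT) == 3]
share_affected = len(affected) / len(customers)  # close to 1 / CELL_COUNT
```

Plain modulo hashing reshuffles most customers whenever the number of cells changes, so a real system would more likely keep an explicit customer-to-cell mapping or use consistent hashing.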
Netflix: bounded contexts and the Simian Army
Netflix runs thousands of services on AWS. Every cross-service call goes through a circuit breaker. Chaos Monkey kills random instances during business hours. The Simian Army extends that to latency injection, AZ failure, and region failure. The result: the blast radius of any single service failure is bounded by design, and it is rare for a team to discover a weakness for the first time during a real outage.
Spotify: the squad model and the Inverse Conway Maneuver
Spotify designed the org to get the architecture they wanted. Small autonomous squads own a slice of product end-to-end. Squads with related missions group into tribes. Specialists across squads form chapters and guilds for cross-cutting concerns. Each squad picks its tech, ships its services, and runs its on-call. Conway’s Law, working on purpose.
Shopify: services around merchant workflows
Shopify serves 2M+ merchants. Service boundaries follow how a merchant thinks — Products, Orders, Checkout, Shipping — not how the database is structured. Multi-tenancy is enforced at the data layer (partition by shop_id) and resource limits prevent noisy neighbors. Webhooks are domain events all the way down — billions per day — letting third-party apps integrate without coupling to Shopify’s internals.
LinkedIn: from a Java monolith to 500+ services
LinkedIn evolved from a single monolithic application (a Java app known internally as Leo) to a microservices estate over several years using the Strangler Fig pattern (gradually extract services from a monolith, route traffic away, retire the old code). Identity was extracted first because it was the most critical. Comprehensive monitoring went in before decomposition started. Backward compatibility was maintained throughout. The result: hundreds of teams, thousands of deploys per day, four 9s of uptime.
Subdomain classification: where to invest engineering effort
Not every service deserves the same investment. DDD splits the world into three:
| Subdomain Type | Definition | Strategy | Examples |
|---|---|---|---|
| Core Domain | Your competitive advantage | Build in-house, invest heavily, best engineers | Amazon recommendations, Netflix streaming algorithm |
| Supporting Domain | Necessary, not differentiating | Build in-house, simpler implementations | Inventory, order processing |
| Generic Domain | Common to every business | Buy or use open-source | Auth (Auth0), payments (Stripe), email (SendGrid) |
A useful rule of thumb: spend 60–70% of engineering effort on the Core, 20–30% on Supporting, and as little as possible on Generic. The Core is what your competitors cannot copy; the Generic is what your competitors are also paying Stripe for.
Best Practices
The short list
- One bounded context per service. Not one endpoint, not one table — one coherent piece of the business.
- Database per service. No shared schemas. The API is the only door in.
- Stable, versioned contracts. Additive changes only. Deprecation windows in months, not days.
- Async by default, sync where required. Events decouple time as well as code.
- Stateless app tier. State lives in databases, caches, and object storage — not in pod memory.
- Resilience built in from day one. Timeouts, retries with jitter, circuit breakers, bulkheads — not added after the first outage.
- Observability is part of the contract. Structured logs, golden-signal metrics, distributed traces.
- One team, one service, one on-call. If five teams touch one service, the boundary is wrong.
- Resist orchestrator services. A service that calls eight others usually has stolen logic that belongs in one of them.
- Run game days. Quarterly is plenty. The team that practiced the failure recovers faster than the team that hasn’t.
Final checklist for a service boundary
Eight questions before you cut the boundary
- Business alignment: does it map to a clear business capability?
- Single responsibility: can you describe what it does in one sentence?
- Data ownership: does it own and manage its own data?
- Team ownership: can one small team (5–10 people) own it completely?
- Independent deployment: can you deploy without coordinating with other teams?
- Clear interface: is the API published, versioned, and documented?
- Loose coupling: does it depend on fewer than ~5 other services synchronously?
- Bounded context: do the names inside have one consistent meaning?
7–8 yeses: good boundary. 5–6: acceptable, refine. Below 5: rethink.
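If you want to apply the checklist consistently across many candidate services, the scoring is simple enough to script. The question keys and the `rate_boundary` helper below are hypothetical names; the thresholds are the ones stated above.

```python
CHECKLIST = (
    "business_alignment",
    "single_responsibility",
    "data_ownership",
    "team_ownership",
    "independent_deployment",
    "clear_interface",
    "loose_coupling",
    "bounded_context",
)


def rate_boundary(answers: dict[str, bool]) -> tuple[int, str]:
    """Count the yes answers and map the total to the verdicts above."""
    yeses = sum(1 for question in CHECKLIST if answers.get(question, False))
    if yeses >= 7:
        verdict = "good boundary"
    elif yeses >= 5:
        verdict = "acceptable, refine"
    else:
        verdict = "rethink"
    return yeses, verdict


# A hypothetical candidate that shares a table and needs coordinated deploys.
candidate = {question: True for question in CHECKLIST}
candidate["data_ownership"] = False
candidate["independent_deployment"] = False
candidate["loose_coupling"] = False

score, verdict = rate_boundary(candidate)
```

Unanswered questions count as “no” here, which errs on the side of rethinking a boundary rather than approving it by default.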
Signs you’ve cut the boundary wrong
- You can’t deploy Service A without redeploying Service B.
- One bug fix opens PRs in three repos.
- Two services share a database table.
- The standup spends more time on cross-team coordination than on shipping.
- A request traces through 8+ synchronous service calls.
- The integration tests are longer than the unit tests in any single service.
If two or more apply, you have built a distributed monolith. Merging the offending services back together is usually cheaper than continuing to split them.
The single most useful sentence about microservices architecture
Microservices are not a free lunch — they are a deliberate trade. You give up the simplicity of one process for the freedom of independent deployment, fault isolation, and team autonomy. If you are not collecting that freedom, you are paying the wire cost for nothing. The principles in this tutorial are how you collect it.
Canonical references
- Eric Evans — Domain-Driven Design (2003). Bounded contexts, aggregates, ubiquitous language, anti-corruption layers.
- Sam Newman — Building Microservices. The practitioner’s guide; covers nearly every principle here in depth.
- Robert C. Martin — SOLID. Single Responsibility applies just as well to services as to classes.
- Adam Wiggins / Heroku — The Twelve-Factor App. The canonical statelessness, config, and process model for cloud services.
- Martin Fowler & James Lewis — Microservices (2014). The original essay, where “smart endpoints, dumb pipes” comes from.
- Melvin Conway — How Do Committees Invent? (1968). Conway’s Law.
- Google SRE Book. Error budgets, golden signals, and the discipline of running production at scale.