Architecture Principles

A microservices system is held together by a handful of principles, not a framework. Get them right and the system stays maintainable at 50 services. Get them wrong and you build a distributed monolith — all of the cost, none of the benefits.

Medium · 30 min read

Why Architecture Principles Matter

The Problem: Splitting a monolith into 30 services does not give you microservices. It gives you a distributed monolith — code that has to be deployed in lockstep, fails together, and is harder to debug than the original. The wire cost is paid; the benefits are not collected.

The Solution: A small, opinionated set of principles — bounded contexts, loose coupling, autonomy, design for failure, observability — that decide what belongs in a service and how services relate. Frameworks change. These principles do not.

Real Impact: Amazon, Netflix, Spotify, Shopify, LinkedIn — every shop that runs hundreds of services in production runs the same handful of principles underneath. The languages, queues, and orchestrators are interchangeable. The principles are not.

Real-World Analogy

Think of a city, not a single building.

  • Each neighborhood has its own purpose — residential, financial, industrial — with clear borders (bounded contexts).
  • Neighborhoods are connected by roads and a common postal address scheme (well-defined APIs), not by knocking through everyone’s walls.
  • If one neighborhood loses power, the others stay lit (design for failure).
  • The city plans for traffic, sewage, garbage, and 911 from day one (observability and resilience).
  • You do not need a city-wide vote to repaint a coffee shop (autonomy and independent deployment).

A monolith is one giant building. A microservices system is a city. The principles in this tutorial are the zoning code.

Most teams do not fail at microservices because they pick the wrong queue or the wrong language. They fail because they skip the principles and build services around technical layers (a “DB service,” a “CRUD service,” a “helper service”) instead of around business capability. The result is services that all change together, all release together, and all break together — a monolith with a network in the middle.

The principles below are the canon. Most predate “microservices” as a buzzword: SOLID (Robert C. Martin), Domain-Driven Design (Eric Evans, 2003), Conway’s Law (1968), and the Twelve-Factor App (Heroku, 2011). Sam Newman’s Building Microservices later codified them for service architectures. If a tutorial argues against one of these, be suspicious.

Principle | What it forces | What it prevents
Single Responsibility / Bounded Context | One service, one business capability | God services that own everything
Loose Coupling | Change one service without touching others | Lock-step deploys
High Cohesion | Related things live together | Logic scattered across 5 services
Autonomy | Teams ship on their own schedule | Release-train coordination meetings
Smart Endpoints, Dumb Pipes | Logic in services, not the bus | ESB-style hidden coupling
Design for Failure | Assume every dependency will break | Cascading outages
Statelessness | Any instance can serve any request | Sticky sessions, scaling pain
Observability First | You can answer “is it working?” | 3 AM mystery outages

Single Responsibility & Bounded Contexts

Why Bounded Contexts Matter

The Problem: “User” means something different to Sales, Support, Shipping, and Billing. Try to model one universal “User” class and every team ends up fighting in the same file. The model collapses under contradictory requirements.

The Solution: A bounded context (Eric Evans, Domain-Driven Design) is an explicit boundary inside which a model has a single, consistent meaning. Different contexts can use the same word for different things — and that is fine.

The Single Responsibility Principle from SOLID, applied to a service, becomes: each service owns one bounded context and one business capability. Not one endpoint. Not one database table. One coherent piece of the business.

The same word means different things

The classic example: a “Customer” in Sales is not the same as a “Customer” in Support or Shipping. Forcing one model on all three is what kills monoliths.

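# Each Customer class below lives in a different service; they appear together only for comparison.
from __future__ import annotations  # Order, Ticket, Address, PaymentMethod are defined elsewhere
from dataclasses import dataclass
from decimal import Decimal

# Sales context: Customer as buyer
@dataclass
class Customer:
    customer_id: str
    credit_limit: Decimal
    purchase_history: list[Order]
    loyalty_points: int
    payment_methods: list[PaymentMethod]

# Support context: Customer as case
@dataclass
class Customer:
    customer_id: str
    support_tier: str          # bronze, silver, gold
    open_tickets: list[Ticket]
    satisfaction_score: float
    contact_preferences: dict

# Shipping context: Customer as recipient
@dataclass
class Customer:
    customer_id: str
    shipping_addresses: list[Address]
    delivery_preferences: dict
    delivery_instructions: str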
Three classes, all called Customer, all correct. They share an ID — that is the integration point — and nothing else. Each lives in a different service.

How to find a bounded context

You do not find bounded contexts at a whiteboard with a UML editor. You find them by listening to the business. The technique is Event Storming (Alberto Brandolini): get domain experts in a room and write every event the business cares about on sticky notes — OrderPlaced, PaymentAuthorized, ItemShipped, TicketEscalated. Group events that always travel together. Each cluster is a candidate bounded context.

Four questions that find a service boundary

  1. Can this functionality change independently? If the business rule changes for “pricing,” do you also have to change “shipping”? If yes, they are not separate.
  2. Does it own its own data? A real service owns its tables. If two services JOIN across the same DB, they are one service in two pods.
  3. Can one team own this completely? Conway’s Law (1968): the system mirrors your org chart. If five teams all touch the same service, that service is split wrong.
  4. Does it represent a distinct business capability? “Checkout” is a capability. “DatabaseHelper” is not.

Service decomposition by capability

A useful exercise: list everything a monolith does, then split by what the business calls each thing.

Monolith Function | Microservice | Owns
User Management | Auth Service | Login, logout, tokens
User Management | Profile Service | User data, preferences, settings
User Management | Permissions Service | Roles, access control
E-Commerce | Catalog Service | Product listings, search
E-Commerce | Cart Service | Shopping cart state
E-Commerce | Checkout Service | Order placement, validation
E-Commerce | Payment Service | Payment processing, refunds

What a single-responsibility service looks like in code

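// User Service: ONLY manages user profiles.
// Does NOT handle authentication or payments.
const express = require('express');
const db = require('./db'); // assumed data-access module exposing db.users

const app = express();
app.use(express.json()); // populate req.body from JSON request bodies

class UserService {
    async createProfile(userId, profileData) {
        const profile = {
            userId,
            firstName: profileData.firstName,
            lastName: profileData.lastName,
            email: profileData.email,
            preferences: profileData.preferences || {},
            createdAt: new Date()
        };
        await db.users.insert(profile);
        return profile;
    }

    async getProfile(userId) {
        const profile = await db.users.findOne({ userId });
        if (!profile) throw new Error('Profile not found');
        return profile;
    }

    async updateProfile(userId, updates) {
        const profile = await this.getProfile(userId);
        // Updates win over stored fields, but the userId itself cannot be overwritten
        const updated = { ...profile, ...updates, userId, updatedAt: new Date() };
        await db.users.update({ userId }, updated);
        return updated;
    }
}

const userService = new UserService();

// Wrap async handlers so a rejected promise becomes an HTTP error instead of hanging
const handle = (fn) => async (req, res) => {
    try {
        res.json(await fn(req));
    } catch (err) {
        const status = err.message === 'Profile not found' ? 404 : 500;
        res.status(status).json({ error: err.message });
    }
};

app.get('/users/:id', handle((req) => userService.getProfile(req.params.id)));
app.put('/users/:id', handle((req) => userService.updateProfile(req.params.id, req.body)));

app.listen(3001);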

Bad service names that signal trouble

If you find yourself naming a service any of these, stop. The boundary is wrong.

  • DatabaseService, DataService — technical layer, not a capability.
  • HelperService, UtilService, CommonService — vague responsibility, magnet for unrelated code.
  • BusinessLogicService — a synonym for “the monolith.”
  • OrchestratorService that calls 8 others — usually a sign that one of the 8 should own the workflow.

Loose Coupling, High Cohesion

Why Coupling and Cohesion Are the Whole Game

The Problem: Two services with shared types, shared databases, or synchronous chains of five calls are coupled. They have to deploy together, scale together, and fail together. You paid for a network and got a monolith back.

The Solution: Couple loosely (talk only over well-defined contracts), cohere tightly (everything one service does is closely related). These two terms, introduced by Larry Constantine in the late 1960s and popularized by the structured design literature of the 1970s, are still the best diagnostic for any service split.

Loose coupling: what it actually means

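Loose coupling does not mean “no calls between services.” It means: if Service B changes its internals (language, database, deploy schedule), Service A does not have to change. That is achieved by keeping every interaction on a well-defined contract and by avoiding the shared types, shared databases, and long synchronous call chains described above.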

High cohesion: things that change together live together

The flip side of loose coupling. If a single business change forces edits in three services, those three responsibilities probably belong in one service. The classic test: write down the next ten user stories. Color-code each one by which service it touches. If most stories paint multiple services, your boundaries are wrong.
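The color-coding test can be approximated in a few lines. This is a minimal sketch, assuming a hypothetical backlog that maps each story to the services it would change; the story names, service names, and the function name are illustrative, not taken from any real system.

```python
# Hypothetical backlog: each user story mapped to the services it would change.
stories = {
    "change tax rounding": {"pricing"},
    "add gift wrapping": {"checkout", "shipping", "catalog"},
    "new loyalty tier": {"pricing"},
    "express delivery option": {"shipping"},
}

def multi_service_ratio(story_map: dict[str, set[str]]) -> float:
    """Return the fraction of stories that touch more than one service."""
    if not story_map:
        return 0.0
    spread = sum(1 for services in story_map.values() if len(services) > 1)
    return spread / len(story_map)

ratio = multi_service_ratio(stories)
print(f"{ratio:.0%} of stories touch more than one service")
```

A low ratio suggests the boundaries follow how the business changes; a ratio that stays high across several planning cycles is a hint to revisit the split.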

The Anti-Corruption Layer

One of the most useful coupling tools from DDD: when you must integrate with a messy or external model, do not let it leak into your domain. Build a thin translator at the boundary.

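from dataclasses import dataclass

# Your clean domain model
class Order:
    def __init__(self, order_id, customer, items):
        self.order_id = order_id
        self.customer = customer
        self.items = items

# Legacy system has a messy model
@dataclass
class LegacyOrderData:
    ORD_NUM: str
    CUST_CODE: str
    LINE_ITEMS: str      # comma-separated SKUs!

# Anti-Corruption Layer translates
class LegacyOrderAdapter:
    def __init__(self, customer_service, catalog):
        # Both collaborators are assumed here: customer_service resolves legacy
        # customer codes, catalog maps a SKU to a domain item with a .sku attribute.
        self.customer_service = customer_service
        self.catalog = catalog

    def to_domain(self, legacy: LegacyOrderData) -> Order:
        customer = self.customer_service.get_by_code(legacy.CUST_CODE)
        items = self._parse_line_items(legacy.LINE_ITEMS)
        return Order(order_id=legacy.ORD_NUM, customer=customer, items=items)

    def to_legacy(self, order: Order) -> LegacyOrderData:
        return LegacyOrderData(
            ORD_NUM=order.order_id,
            CUST_CODE=order.customer.code,
            LINE_ITEMS=",".join(i.sku for i in order.items),
        )

    def _parse_line_items(self, raw: str) -> list:
        # Split the comma-separated SKU string and skip empty entries
        skus = [sku.strip() for sku in raw.split(",") if sku.strip()]
        return [self.catalog.get_by_sku(sku) for sku in skus]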

Without the adapter, every consumer of the legacy system bends to its shape. With it, the legacy system is contained. The pattern works equally well for third-party SaaS APIs you cannot change.

Context mapping: how services relate

Pattern | Relationship | When to use
Shared Kernel | Two contexts share a small piece of model code | Same team owns both; rare and risky
Customer-Supplier | Upstream provides, downstream depends | The downstream team can influence upstream priorities
Conformist | Downstream just accepts upstream’s model | Upstream is external and won’t change for you
Anti-Corruption Layer | Translation layer protects your model | Legacy systems, third-party APIs
Published Language | Stable, versioned, well-documented contract | Public APIs, integration platforms

Service Autonomy & Decentralization

Why Autonomy Is the Real Win

The Problem: A team that needs to coordinate with three other teams to ship a one-line change is not getting microservices’ benefit. The org has split the code but kept the coupling — meetings instead of imports.

The Solution: Autonomous services own their schema, their deploy pipeline, their on-call rotation, and their tech choices. Decentralize. Resist the urge to mandate a single language, single database, or single framework.

Sam Newman calls this independent deployability. Martin Fowler calls it decentralized governance. The principle is the same: push decisions out to the team that owns the service. The two-pizza team (Amazon’s phrase) does not ask permission to ship.

What autonomy requires

Conway’s Law works both directions

“Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.”

Melvin Conway, 1968. Read it twice. The shape of your services will mirror your org chart, whether you plan it or not.

The corollary — the “Inverse Conway Maneuver” — is to design the org you want, then let the architecture follow. Spotify’s squad / tribe / chapter / guild model is the famous example: small autonomous squads, each owning a slice of product, each owning the services for that slice.

Decentralized data

The most painful coupling in any distributed system is shared state. Two services that write to the same table are not two services — they are two clients of one database, with all of the lock contention, schema-change drama, and consistency questions that implies.

The principle: each service owns its database. If another service needs the data, it asks via API or subscribes to events. Yes, this means duplication. Yes, this means eventual consistency in places. That is the price of decoupling, and on a system at any real scale, it is worth paying.

# BAD: two services share a database
[order-service] ---> [orders DB] <--- [reporting-service]
                       ^                  ^
                  schema change           breaks both deploys

# GOOD: each service owns its data, events flow between them
[order-service] ---> [orders DB]
       |
       | publishes OrderPlaced
       v
   [event bus] ---> [reporting-service] ---> [reporting DB]
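The GOOD layout in the diagram can be sketched in Python. This is a minimal illustration under stated assumptions: `EventBus` is an in-memory stand-in for a real broker, the dictionaries stand in for each service's private database, and all class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """In-memory stand-in for a broker; it only delivers, it never interprets."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

class OrderService:
    def __init__(self, bus: EventBus) -> None:
        self._orders: dict[str, dict] = {}   # private "orders DB"
        self._bus = bus

    def place_order(self, order_id: str, total: int) -> None:
        self._orders[order_id] = {"total": total}
        self._bus.publish("OrderPlaced", {"order_id": order_id, "total": total})

class ReportingService:
    def __init__(self, bus: EventBus) -> None:
        self.revenue = 0                     # private "reporting DB"
        self.order_count = 0
        bus.subscribe("OrderPlaced", self._on_order_placed)

    def _on_order_placed(self, event: dict) -> None:
        # Build a local copy of the data reporting needs; never read the orders DB
        self.revenue += event["total"]
        self.order_count += 1

bus = EventBus()
orders = OrderService(bus)
reporting = ReportingService(bus)
orders.place_order("ord_1", 1200)
orders.place_order("ord_2", 800)
print(reporting.order_count, reporting.revenue)  # prints: 2 2000
```

Reporting never touches the order tables, so the order service can change its schema freely as long as the `OrderPlaced` payload stays stable. With a real broker, delivery is asynchronous, which is where the eventual consistency mentioned above comes from.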

Smart Endpoints, Dumb Pipes

Why the Bus Should Stay Stupid

The Problem: Enterprise Service Buses (ESBs) tried to solve integration by putting business logic — routing, transformation, orchestration — into the message bus itself. The result: a giant central system that every team had to coordinate on. The bus became the new monolith.

The Solution: Put the smarts in the services (the endpoints). Keep the bus dumb: it just moves bytes. “Smart endpoints and dumb pipes” is the phrase James Lewis and Martin Fowler used in their 2014 microservices essay, and it is still the cleanest summary of how services should communicate.

A “dumb pipe” is HTTP, gRPC, Kafka, RabbitMQ used as a transport. The transport routes; it does not interpret. A “smart endpoint” is a service that owns its own validation, business rules, and translation. If the endpoint is smart enough, the pipe can be gloriously simple.

Domain events are how you keep the pipe dumb

Instead of one service calling another, services announce things they did. Other services subscribe to what they care about. The bus does not know about “order” or “payment” — it just delivers messages.

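// Domain event: past tense, immutable, named for the business
public final class OrderPlacedEvent implements DomainEvent {
    private final String orderId;
    private final String customerId;
    private final Money totalAmount;
    private final Instant occurredAt;

    public OrderPlacedEvent(String orderId, String customerId, Money totalAmount) {
        this.orderId = orderId;
        this.customerId = customerId;
        this.totalAmount = totalAmount;
        this.occurredAt = Instant.now();   // stamped when the event is raised
    }
    // getters omitted for brevity
}

// Aggregate raises the event as part of state change
public class Order {
    // id, customerId, totalAmount, orderLines, and status fields omitted for brevity
    private final List<DomainEvent> domainEvents = new ArrayList<>();

    public void place() {
        if (orderLines.isEmpty()) {
            throw new BusinessException("Cannot place empty order");
        }
        this.status = OrderStatus.PLACED;
        this.domainEvents.add(new OrderPlacedEvent(id, customerId, totalAmount));
    }

    public List<DomainEvent> getDomainEvents() { return List.copyOf(domainEvents); }
    public void clearDomainEvents() { domainEvents.clear(); }
}

// Service publishes after a successful save
@Service
public class OrderService {
    // orderRepository and eventPublisher are injected collaborators (omitted)
    @Transactional
    public void placeOrder(Order order) {
        order.place();
        orderRepository.save(order);
        // Caution: this runs before the transaction commits. If the commit later
        // fails, the event is already out; an outbox table is a common remedy.
        order.getDomainEvents().forEach(eventPublisher::publish);
        order.clearDomainEvents();
    }
}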

The shipping service subscribes to OrderPlaced. So does the recommendation service. So does the analytics pipeline. None of them know about each other. The order service doesn’t know they exist. That is loose coupling delivered by a dumb pipe.

Benefits of the event-first style

  • Loose coupling: services don’t need each other’s addresses, only the event contract.
  • Audit trail: the event log is the truth of what happened.
  • Replayable: rebuild a downstream view from history.
  • Open extension: new subscribers don’t require changes upstream.
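The "replayable" benefit can be made concrete with a short sketch. It assumes a hypothetical append-only event log held in a list and two illustrative event types; a real system would read the history from the broker or an event store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    type: str
    order_id: str
    amount: int = 0

# Hypothetical append-only log, oldest event first.
event_log = [
    Event("OrderPlaced", "ord_1", 1200),
    Event("OrderPlaced", "ord_2", 800),
    Event("OrderCancelled", "ord_1"),
]

def rebuild_open_orders(log: list[Event]) -> dict[str, int]:
    """Fold the full history into a fresh view of open orders and their totals."""
    view: dict[str, int] = {}
    for event in log:
        if event.type == "OrderPlaced":
            view[event.order_id] = event.amount
        elif event.type == "OrderCancelled":
            view.pop(event.order_id, None)
    return view

print(rebuild_open_orders(event_log))  # prints: {'ord_2': 800}
```

Because the view is derived purely from history, a downstream service can drop its table, change the shape of the view, and rebuild it without asking the upstream service for anything.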

The Published Language

Whatever flows over the dumb pipe — JSON schema, Protobuf, Avro — is the Published Language between services. Treat it like a public API: versioned, additive-only changes, deprecation announcements. Internal model changes do not change the contract; the contract is what your consumers depend on.
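An additive-only rule is easy to check mechanically. The sketch below is a simplified illustration, assuming a schema is represented as a plain mapping from field name to type name; it ignores required versus optional fields and nested types, which real schema registries also check. The function name and sample schemas are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two field->type maps and list changes that would break consumers."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"changed type of {field}: {old_type} -> {new[field]}")
    return problems  # fields that exist only in `new` are additive and allowed

v1 = {"order_id": "string", "total": "int"}
v2 = {"order_id": "string", "total": "int", "currency": "string"}  # additive
v3 = {"order_id": "string", "total": "string"}                     # type change

print(breaking_changes(v1, v2))  # prints: []
print(breaking_changes(v1, v3))  # prints: ['changed type of total: int -> string']
```

Running a check like this in the producer's build is one way to catch a contract break before any consumer sees it.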

Design for Failure

Why Failure Is the Default

The Problem: In a monolith, a function call either returns or throws. In a distributed system, a remote call can succeed, fail, time out, partially succeed, succeed but lose the response, or take 30 seconds. Code written assuming the monolith model breaks under the distributed one.

The Solution: Treat every cross-service call as “will fail eventually.” Build with timeouts, retries with jitter, circuit breakers, bulkheads, and graceful fallbacks. Werner Vogels’ rule: everything fails, all the time.

Cascading failures are the signature outage of a microservices system. Service B slows down. Service A’s threads pile up waiting on it. Service A starts rejecting requests. A’s callers retry and add load to an already sick system, and the blast radius grows with every hop. Within minutes the whole mesh is down. Designing for failure breaks the chain at step two: the caller notices the slow dependency, stops waiting, and fails fast or falls back instead of queuing behind it.

The non-negotiable patterns

Pattern | What it does | Without it
Timeout | Caps how long a call can hang | Threads pin forever on a sick downstream
Retry with backoff + jitter | Handles transient failure without thundering-herd | 1,000 callers retry at the same instant
Circuit breaker | Fails fast when a dependency is sick | Slow death by thread pool exhaustion
Bulkhead | Isolates resources per dependency | One sick downstream starves all the others
Fallback | Graceful degradation (cache, defaults) | One service’s failure becomes the user’s 500
Idempotency keys | Safe retries for writes | Charging the customer twice

Each is covered in depth in the Circuit Breaker & Resilience tutorial. The principle here is: build them in from day one. Bolting them on after the first incident is twice as expensive and half as effective.
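To make the circuit breaker row concrete, here is a minimal single-threaded sketch. It is an illustration under stated assumptions, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are hypothetical, and a real breaker would also need locking and per-dependency metrics.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0        # success closes the circuit again
        self._opened_at = None
        return result

def flaky():
    raise TimeoutError("downstream timed out")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(2):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass

try:
    breaker.call(flaky)
    outcome = "called"
except CircuitOpenError:
    outcome = "fallback"          # e.g. serve cached data instead
print(outcome)  # prints: fallback
```

The third call never reaches the downstream at all, which is the point: the caller's threads are freed immediately and the sick dependency gets room to recover.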

The retry that becomes an outage

A naive retry: 3 setting means that when 1,000 callers hit a flaky downstream, 1,000 requests become 3,000 attempts (4,000 if the three retries come on top of the original call). The downstream stays sick longer, more callers retry, and the system spirals. Always pair retries with exponential backoff, jitter, and a retry budget. Never retry a non-idempotent write without an idempotency key.
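A sketch of retries with exponential backoff and full jitter follows. The function names, default delays, and the set of retryable exceptions are assumptions chosen for illustration; a shared retry budget across callers is left out to keep the example short, and the wrapped call is assumed to be idempotent.

```python
import random
import time

def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random) -> list[float]:
    """Full jitter: each wait is a random fraction of min(cap, base * 2**attempt)."""
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def call_with_retries(func, max_retries: int = 3, sleep=time.sleep, **backoff):
    """Call func once, then retry up to max_retries times on transient errors."""
    for delay in backoff_delays(max_retries, **backoff):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            sleep(delay)          # wait a randomized, growing interval
    return func()                 # final attempt: let the exception propagate

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retries(flaky, max_retries=3, sleep=lambda s: None)
print(result, attempts["n"])  # prints: ok 3
```

The randomization matters as much as the growth: without it, every caller that failed at the same moment retries at the same moment, and the herd simply arrives later.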

Chaos engineering: prove it works

Resilience patterns that have not fired in production are theoretical. Netflix invented Chaos Monkey to kill random instances during business hours; the discipline is now standard at any shop running real-scale services. The point: practice the failure on your terms, with monitoring, in business hours, with a rollback plan, before the failure picks the time itself.

Statelessness vs Managed State

Why Stateless Services Scale

The Problem: A service that holds session state in memory cannot be killed without losing the user’s session. It cannot be horizontally scaled without sticky sessions. Rolling deploys are dangerous. Autoscaling is dangerous. Spot instances are out of the question.

The Solution: Push state out of the request-handling tier. The Twelve-Factor App puts it this way: processes are stateless and share-nothing. Any instance can serve any request. Killing a pod has no business consequence.

The Twelve-Factor App (Heroku, 2011) codified the modern stateless-service style. Factor VI: Execute the app as one or more stateless processes. State that must persist goes to a backing service — a database, a cache, a session store, an object store. The application tier is replaceable.

What state can live where

State | Where it goes | Why
Per-request data | The request itself | Lives only as long as the call
User session | Redis, Memcached, signed JWT | Any instance can read it; no sticky sessions needed
Business data | The service’s own database | Durable, transactional
Hot reads / computed views | Cache (Redis, CDN) | Performance; rebuildable from source of truth
Files / blobs | S3, GCS, blob storage | Durable, cheap, separate scaling
Long-lived workflows | Workflow engine (Temporal, Step Functions) | Survives pod restarts
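The "user session" row can be sketched as follows. This is an illustration only: `SessionStore` is an in-memory stand-in for an external store such as Redis, and `CartHandler` and its method names are hypothetical.

```python
class SessionStore:
    """In-memory stand-in for an external session store (an assumption)."""
    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def load(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def save(self, session_id: str, session: dict) -> None:
        self._data[session_id] = dict(session)

class CartHandler:
    """One app instance. It keeps no per-user state between requests."""
    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def add_item(self, session_id: str, sku: str) -> list[str]:
        session = self._store.load(session_id)          # read state from outside
        items = session.get("items", []) + [sku]
        self._store.save(session_id, {**session, "items": items})
        return items

store = SessionStore()
instance_a = CartHandler(store)
instance_b = CartHandler(store)      # a second replica behind the load balancer

instance_a.add_item("sess_1", "sku-1")
del instance_a                        # simulate the pod being killed
print(instance_b.add_item("sess_1", "sku-2"))  # prints: ['sku-1', 'sku-2']
```

The second replica picks up the cart exactly where the first left off, so the load balancer needs no session affinity and a rolling deploy loses nothing.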

Stateful services exist — and that is fine

Databases. Stream processors. Cache nodes. Workflow engines. These are supposed to be stateful and they have their own scaling story (sharding, replication, consensus). The principle is not “eliminate state.” The principle is: be deliberate about which services hold state and which do not. The vast majority of your business services should be stateless replicas behind a load balancer. The handful that are not should be operated by people who know what they signed up for.

Observability as a First-Class Concern

Why You Cannot Bolt Observability On

The Problem: In a monolith, a stack trace and a log file get you 80% of the way to a root cause. In a distributed system, a single user request might touch 12 services, and the bug is in the 7th. Without instrumentation, debugging is archaeology.

The Solution: Treat metrics, logs, and traces as part of the service contract — not as something the SRE team adds later. Every service emits structured logs with a request ID, exposes Prometheus-style metrics, and propagates trace context.

The three pillars are Logs, Metrics, and Traces. Each answers a different question:

Pillar | Answers | Tooling (representative)
Logs | What exactly happened on this instance? | JSON to stdout, Loki, Elasticsearch, Datadog
Metrics | How is the system behaving in aggregate? | Prometheus, OpenTelemetry, Grafana
Traces | Where did this one request spend its time? | Jaeger, Tempo, Zipkin, OpenTelemetry

The minimum bar for any service

# Structured log line — one JSON object per event
{
  "ts": "2026-05-12T14:22:01.337Z",
  "level": "info",
  "service": "checkout",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_8f3kd9",
  "customer_id": "cus_19",
  "event": "order.placed",
  "order_id": "ord_551",
  "latency_ms": 182
}
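A log line of this shape can be produced with the Python standard library alone. The sketch below is one possible approach, assuming context fields are passed through the `extra` argument under a hypothetical `fields` key; the formatter class name is illustrative, and in a real service the trace and span IDs would come from the tracing library rather than being typed in by hand.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""
    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Values passed via extra={"fields": {...}} become record.fields
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="checkout"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order.placed", extra={"fields": {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "order_id": "ord_551",
    "latency_ms": 182,
}})
```

Writing one JSON object per line to stdout keeps the service ignorant of where logs end up; the platform's log shipper decides that.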

The single useful question

For any service, ask: at 3 AM during a P0 incident, can the on-call engineer answer “is this service working?” in under 60 seconds, from a dashboard? If no, observability is not done.

Real-World Examples

The principles above sound abstract until you see them at scale. Here is how five well-known shops apply them.

Amazon: cells, two-pizza teams, and “you build it, you run it”

Amazon’s service architecture predates the term “microservices.” The doctrine: every team owns a service end-to-end — code, deploy, on-call. Teams are small (the “two-pizza team”), services are owned, and inter-team communication happens over APIs, not Slack threads. Amazon’s cell-based architecture takes design-for-failure further: each service is split into independent cells, customers are assigned to cells, and a bug in one cell affects ~1/N of users instead of 100%.
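The 1/N effect of cells can be illustrated with a deterministic customer-to-cell mapping. This is a generic sketch, not Amazon's actual assignment scheme: the hash-and-modulo approach, the cell count, and the customer ID format are all assumptions, and plain modulo hashing reshuffles customers whenever the number of cells changes.

```python
import hashlib
from collections import Counter

def cell_for(customer_id: str, cell_count: int) -> int:
    """Deterministically map a customer to one of `cell_count` cells."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count

CELLS = 8
TOTAL = 10_000
load = Counter(cell_for(f"cus_{n}", CELLS) for n in range(TOTAL))
worst_share = max(load.values()) / TOTAL
print(f"largest cell holds {worst_share:.1%} of customers")
```

With an even spread, a bad deploy confined to one cell reaches roughly one customer in eight here, instead of all of them.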

Netflix: bounded contexts and the Simian Army

Netflix runs thousands of services on AWS. Every cross-service call goes through a circuit breaker. Chaos Monkey kills random instances during business hours. The Simian Army extends that to latency injection, AZ failure, and region failure. The result: the blast radius of any single service failure is bounded by design, and it is rare for a team to discover a weakness for the first time during a real outage.

Spotify: the squad model and the Inverse Conway Maneuver

Spotify designed the org to get the architecture they wanted. Small autonomous squads own a slice of product end-to-end. Squads with related missions group into tribes. Specialists across squads form chapters and guilds for cross-cutting concerns. Each squad picks its tech, ships its services, and runs its on-call. Conway’s Law, working on purpose.

Shopify: services around merchant workflows

Shopify serves 2M+ merchants. Service boundaries follow how a merchant thinks — Products, Orders, Checkout, Shipping — not how the database is structured. Multi-tenancy is enforced at the data layer (partition by shop_id) and resource limits prevent noisy neighbors. Webhooks are domain events all the way down — billions per day — letting third-party apps integrate without coupling to Shopify’s internals.

LinkedIn: from a monolith to 500+ services

LinkedIn evolved from a single monolithic application (a Java app known internally as Leo) to a microservices estate over several years using the Strangler Fig pattern (gradually extract services from a monolith, route traffic away, retire the old code). Identity was extracted first because it was the most critical. Comprehensive monitoring went in before decomposition started. Backward compatibility was maintained throughout. The result: hundreds of teams, thousands of deploys per day, four 9s of uptime.

Subdomain classification: where to invest engineering effort

Not every service deserves the same investment. DDD splits the world into three:

Subdomain Type | Definition | Strategy | Examples
Core Domain | Your competitive advantage | Build in-house, invest heavily, best engineers | Amazon recommendations, Netflix streaming algorithm
Supporting Domain | Necessary, not differentiating | Build in-house, simpler implementations | Inventory, order processing
Generic Domain | Common to every business | Buy or use open-source | Auth (Auth0), payments (Stripe), email (SendGrid)

A useful rule of thumb: spend 60–70% of engineering effort on the Core, 20–30% on Supporting, and as little as possible on Generic. The Core is what your competitors cannot copy; the Generic is what your competitors are also paying Stripe for.

Best Practices

The short list

  • One bounded context per service. Not one endpoint, not one table — one coherent piece of the business.
  • Database per service. No shared schemas. The API is the only door in.
  • Stable, versioned contracts. Additive changes only. Deprecation windows in months, not days.
  • Async by default, sync where required. Events decouple time as well as code.
  • Stateless app tier. State lives in databases, caches, and object storage — not in pod memory.
  • Resilience built in from day one. Timeouts, retries with jitter, circuit breakers, bulkheads — not added after the first outage.
  • Observability is part of the contract. Structured logs, golden-signal metrics, distributed traces.
  • One team, one service, one on-call. If five teams touch one service, the boundary is wrong.
  • Resist orchestrator services. A service that calls eight others usually has stolen logic that belongs in one of them.
  • Run game days. Quarterly is plenty. The team that practiced the failure recovers faster than the team that hasn’t.

Final checklist for a service boundary

Eight questions before you cut the boundary

  1. Business alignment: does it map to a clear business capability?
  2. Single responsibility: can you describe what it does in one sentence?
  3. Data ownership: does it own and manage its own data?
  4. Team ownership: can one small team (5–10 people) own it completely?
  5. Independent deployment: can you deploy without coordinating with other teams?
  6. Clear interface: is the API published, versioned, and documented?
  7. Loose coupling: does it depend on fewer than ~5 other services synchronously?
  8. Bounded context: do the names inside have one consistent meaning?

7–8 yeses: good boundary. 5–6: acceptable, refine. Below 5: rethink.
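The scoring bands can be written down as a small helper, which is handy if the checklist is run as part of a design review template. The question keys and function name below are hypothetical labels for the eight questions above.

```python
CHECKLIST = (
    "business_alignment", "single_responsibility", "data_ownership",
    "team_ownership", "independent_deployment", "clear_interface",
    "loose_coupling", "bounded_context",
)

def assess_boundary(answers: dict[str, bool]) -> str:
    """Map the number of yes answers to the verdict bands given above."""
    missing = set(CHECKLIST) - answers.keys()
    if missing:
        raise ValueError(f"unanswered questions: {sorted(missing)}")
    yeses = sum(answers[q] for q in CHECKLIST)
    if yeses >= 7:
        return "good boundary"
    if yeses >= 5:
        return "acceptable, refine"
    return "rethink"

answers = dict.fromkeys(CHECKLIST, True)
answers["independent_deployment"] = False
answers["loose_coupling"] = False
print(assess_boundary(answers))  # prints: acceptable, refine
```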

Signs you’ve cut the boundary wrong

  • You can’t deploy Service A without redeploying Service B.
  • One bug fix opens PRs in three repos.
  • Two services share a database table.
  • The standup spends more time on cross-team coordination than on shipping.
  • A request traces through 8+ synchronous service calls.
  • The integration tests are longer than the unit tests in any single service.

If two or more apply, you have built a distributed monolith. Merging the offending services back together is usually cheaper than continuing to split them.

The single most useful sentence about microservices architecture

Microservices are not a free lunch — they are a deliberate trade. You give up the simplicity of one process for the freedom of independent deployment, fault isolation, and team autonomy. If you are not collecting that freedom, you are paying the wire cost for nothing. The principles in this tutorial are how you collect it.

Canonical references