Why Migration Strategy Matters
The Problem: Most monolith-to-microservices projects that aim for a “big bang” rewrite either get cancelled, ship a worse system than they replaced, or stretch to three years and burn out the team. The instinct — “the codebase is a mess, let’s start fresh” — is the most expensive instinct in our industry.
The Solution: Treat the migration as an incremental refactor. Strangle the monolith one route, one bounded context, one table at a time. Keep both systems running. Cut over traffic gradually. Roll back at any moment.
Real Impact: Amazon, Shopify, Etsy, Stack Overflow, and Uber all have public stories about this. The shape is the same: years, not months; incremental, not big-bang; led by product needs, not architecture purity.
Real-World Analogy
Migrating a monolith is like renovating a house while you live in it. You do not demolish the kitchen before the new one works. You build the new kitchen in the corner of the dining room, plumb it in, switch over for one meal, then come back and tear out the old kitchen. Then the bathroom. Then the bedroom.
The teams that “just want to rewrite the whole house” are the teams sleeping in a tent in the yard 18 months later, watching the architect bicker with the contractor.
The honest framing: a microservices migration is not a technical project. It is an organizational change that uses code as its medium. The architecture diagram is the easy part. The hard parts are deciding which team owns which service, how to migrate data without downtime, how to ship product features during the migration, and how to know when to stop.
Why big-bang rewrites fail
- The monolith’s real specification lives in its code. Nobody can write it down before the rewrite starts — they discover it as the new system fails to handle case after case.
- The product does not stop. Features ship to the monolith during the rewrite. The new system is always behind, never catching up.
- There is no intermediate value. You spend 12 months and have nothing in production. Leadership patience runs out before code does.
- Teams lose context. The engineers who knew the monolith leave. The new team rebuilds bugs because they never understood why the old code was “weird.”
- Cutover is terrifying. Everything-or-nothing means a single bad day can lose customers, data, or both.
Joel Spolsky called this the single worst strategic mistake that any software company can make. He was not exaggerating. Every successful migration in this tutorial is incremental.
When NOT to Migrate
The Honest Case for Keeping the Monolith
The Problem: Microservices have become a default answer in tech interviews and architecture decks. Most teams that adopt them do not need them, and pay the operational tax for years.
The Solution: Default to a monolith. Migrate only when a concrete business problem — scaling, team coordination, deploy frequency — is clearly bottlenecked on the monolith’s structure.
The cost of microservices is real and is often underestimated. You will need: a service mesh or equivalent, distributed tracing, centralized logging, a CI/CD pipeline per service, on-call rotations per team, contract testing, schema registries, secret management, blue/green or canary deploys, and the discipline to not share databases. None of that is free.
Here is a blunt heuristic table. If most rows are on the left, you do not have a microservices problem — you have a code problem. Fix that first.
| Stay on the monolith if… | Consider migrating if… |
|---|---|
| Team is one squad (5–10 engineers) | You have 5+ teams blocked on each other’s deploys |
| Domain is simple (one core noun: orders, posts, leads) | You have several distinct bounded contexts with different lifecycles |
| Traffic fits comfortably on a few boxes | One subsystem must scale 10× while the rest stays flat |
| You deploy less than once a week and that is fine | Different parts of the system need very different release cadences |
| Reliability story is “we restart it” | You need real isolation: one component’s failure must not kill checkout |
| You cannot afford an SRE team | You have or are hiring platform engineers |
Conway’s Law check
Conway’s Law: any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure. If your org chart has one squad, microservices will create services that all the same people own — which gives you the operational cost of microservices and the coordination cost of a monolith. The worst of both worlds.
Migrate when the team boundaries genuinely require service boundaries. Not before.
In practice, most teams should not migrate. They should clean up the monolith. A modular monolith with clear internal boundaries gives you 80% of what microservices offer for 20% of the cost. Shopify, GitHub, Basecamp, and Stack Overflow all run very large modular monoliths in production. There is no shame in that.
Strangler Fig Pattern
Why the Strangler Fig
The Problem: You cannot freeze the monolith for 12 months while you build its replacement. The product needs to ship. Customers need uptime. The migration must happen under load.
The Solution: Put a routing layer in front of the monolith. Gradually move routes (one URL, one feature, one bounded context at a time) to new services living behind the same router. The monolith shrinks until nothing is left to call.
Martin Fowler named this pattern after the strangler fig tree, which grows around an existing tree, slowly takes over its space, and eventually leaves a hollow shell where the host tree used to stand. The metaphor is exact: the new architecture surrounds the old one and the old one withers.
The four moves
- Insert the proxy. Put nginx, Envoy, an API gateway, or a load balancer in front of the monolith. At this stage it only passes requests through: 100% of traffic still goes to the monolith. Behavior does not change, so the risk is limited to the extra network hop, which you can verify before any route moves.
- Build the replacement. Pick one route or capability. Build it as a new service. It can read the monolith’s database, call the monolith’s APIs, or own its own data — depending on the slice.
- Cut traffic over gradually. 1% → 10% → 50% → 100%, with monitoring at each step. Often you mirror traffic first — send the same request to both systems and compare responses, without using the new system’s response.
- Delete the old code. Once 100% of traffic is on the new path and you have run that way for a couple of weeks, remove the monolith’s implementation. If you do not delete, you will have two systems to maintain forever.
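The gradual cutover step can be sketched as a deterministic percentage rollout. This is a minimal illustration, not a prescribed implementation: the function names, the salt string, and the idea of keying on a user ID are all assumptions made for the example. Hashing the user ID gives each user a stable bucket, so raising the percentage only ever adds users to the new path and never flips existing ones back.

```python
import hashlib


def bucket(user_id: str, salt: str = "search-cutover") -> int:
    """Map a user to a stable bucket in [0, 100). The salt is a hypothetical
    per-rollout name so different rollouts do not pick the same users."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(user_id: str, rollout_percent: int) -> str:
    """Return "new" for users whose bucket falls below the rollout percentage."""
    return "new" if bucket(user_id) < rollout_percent else "monolith"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        on_new = sum(choose_backend(u, percent) == "new" for u in users)
        print(f"rollout={percent:>3}%  users on new path: {on_new}")
```

In a real deployment the percentage would live in the proxy or flag configuration, so that moving from 1% to 10% stays a config change.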
Routing by path with nginx
# nginx in front of a strangling monolith.
# Old monolith on :8080, new services on their own ports.
upstream monolith     { server 10.0.0.10:8080; }
upstream catalog_svc  { server 10.0.1.20:9001; }
upstream checkout_svc { server 10.0.1.21:9002; }
# The search_svc address is illustrative.
upstream search_svc   { server 10.0.1.22:9003; }

server {
    listen 443 ssl;
    server_name shop.example.com;
    # ssl_certificate and ssl_certificate_key directives omitted for brevity.

    # Already extracted: send straight to the new service.
    location /products/ { proxy_pass http://catalog_svc; }
    location /checkout/ { proxy_pass http://checkout_svc; }

    # Canary: cookie-based. Requests carrying canary=new go to the new
    # service; whatever sets that cookie decides the share (for example 10%).
    location /search/ {
        set $backend monolith;
        if ($cookie_canary = "new") { set $backend search_svc; }
        proxy_pass http://$backend;
    }

    # Everything else still goes to the monolith.
    location / { proxy_pass http://monolith; }
}
Envoy and most service meshes give you the same shape with richer matching: route by header, by JWT claim, by request shadowing percentage. The mechanism does not matter. What matters is that traffic shifting is a config change, not a deploy.
Traffic mirroring during cutover
Before flipping the switch, run the request against both systems for read endpoints. Compare responses asynchronously and log diffs. This is how GitHub validated their Scientist refactors, how Twitter migrated user timelines, and how anyone sane validates a new search service. The new service’s response is discarded; the monolith’s response is what the user sees. After two weeks of zero diffs, you flip the switch with high confidence.
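A rough sketch of the mirroring idea, under stated assumptions: the `mirrored` wrapper, its parameter names, and the dict-shaped request are hypothetical, and a production setup would more likely mirror at the proxy. The point it illustrates is the ordering. The monolith's response is computed and returned first, while the comparison against the new service runs in the background and can only log, never affect the user.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable

Handler = Callable[[dict], Any]
DiffSink = Callable[[dict, Any, Any], None]


def mirrored(primary: Handler, candidate: Handler,
             on_diff: DiffSink, executor: Executor) -> Handler:
    """Wrap a read endpoint so every request also runs against the candidate."""

    def handle(request: dict) -> Any:
        expected = primary(request)  # the user always gets this response

        def compare() -> None:
            try:
                actual = candidate(request)
            except Exception as exc:  # a crash in the new service counts as a diff
                on_diff(request, expected, exc)
                return
            if actual != expected:
                on_diff(request, expected, actual)

        executor.submit(compare)  # off the request path; result is discarded
        return expected

    return handle


if __name__ == "__main__":
    diffs = []
    pool = ThreadPoolExecutor(max_workers=2)
    search = mirrored(
        primary=lambda req: {"hits": [1, 2, 3]},    # stand-in for the monolith
        candidate=lambda req: {"hits": [1, 2]},     # stand-in for the new service
        on_diff=lambda req, old, new: diffs.append((req, old, new)),
        executor=pool,
    )
    print(search({"q": "socks"}))
    pool.shutdown(wait=True)  # wait for background comparisons before reading diffs
    print(diffs)
```

Only mirror endpoints that are safe to call twice; mirroring a write would apply it in both systems.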
Never extract authentication first
Tempting target — everything depends on it, so it “feels” central. Wrong move. Auth is a horizontal concern that touches every other service. If you extract it first you have to migrate every other surface to the new auth in parallel with extracting them. Extract auth last, or leave it in a shared library, or buy it (Auth0, Cognito, Clerk). Start with a leaf bounded context that has clean inputs and outputs.
Branch by Abstraction
Why Refactor Inside First
The Problem: The monolith does not have clean seams. Calls into the “billing” logic are scattered across 40 controllers, three batch jobs, and a cron. You cannot extract billing because there is no single thing called billing.
The Solution: Branch by abstraction. Inside the monolith, create an interface, route every call site through it, then provide a second implementation behind that interface — one that calls the new service. Toggle implementations with a feature flag.
This pattern was named by Paul Hammant and is the workhorse refactor of every successful migration. It looks like over-engineering until you have done it once and shipped a clean cutover with no downtime. Then it looks like the only sane way.
The shape is always the same:
- Find every call site of the code you want to extract.
- Introduce an interface (abstract class, protocol, function pointer — whatever your language gives you).
- Move existing code behind the interface as OldImpl. Ship. Verify nothing changed.
- Build NewImpl that calls the new service. Ship dark: behind a feature flag, default off.
- Flip the flag for 1%, 10%, 100% of traffic. Watch metrics.
- Delete OldImpl and the flag. The interface remains; the monolith now calls the new service for this capability.
from typing import Protocol

# Money, PricingClient, flags, pricing_client, and _legacy_price_calc
# already exist in the monolith; they are not defined here.

# Step 1: define the seam inside the monolith.
class PricingService(Protocol):
    def price_for(self, sku: str, qty: int, country: str) -> Money: ...


# Step 2: wrap the existing tangled code in OldPricing.
class OldPricing:
    def price_for(self, sku, qty, country):
        # existing 400-line method, now reachable only via this seam
        return _legacy_price_calc(sku, qty, country)


# Step 3: new implementation calls the extracted service.
class NewPricing:
    def __init__(self, client: PricingClient):
        self.client = client

    def price_for(self, sku, qty, country):
        return self.client.quote(sku=sku, qty=qty, country=country)


# Step 4: feature-flagged factory. Every caller in the monolith uses this.
def pricing_for_request(request) -> PricingService:
    if flags.enabled("pricing.new_service", user=request.user):
        return NewPricing(client=pricing_client)
    return OldPricing()
Notice what this buys you: you can roll back instantly. If the new service misbehaves at 10% traffic, flip the flag off. No deploy, no panic. The monolith still has OldImpl — it is the safety net. You only delete it after the new service has been at 100% for long enough that you trust it.
Anti-Corruption Layer
Why Translate at the Boundary
The Problem: The monolith’s data model has 15 years of legacy crust — columns named flag_3, polymorphic associations, business logic encoded in nullability. If your new service speaks that vocabulary, the new service is just a new wrapper around old mess.
The Solution: Build an Anti-Corruption Layer (ACL) at the boundary. The new service has a clean domain model. The ACL translates between that clean model and the monolith’s ugly one. The corruption stays in the ACL.
This is a Domain-Driven Design term from Eric Evans. The point is to refuse to let legacy concepts leak into your new bounded context. If you do not draw this line, the “new” service ends up modeling the same things the monolith did, with the same names, the same nullability quirks, and the same surprising behaviors — and you have built nothing.
An ACL typically has three responsibilities:
- Schema translation. Map customers.cust_typ_cd = 'PR' to Customer.kind = "premium". The new service never sees the cryptic code.
- Identity mapping. The monolith uses integer IDs assigned by an Oracle sequence. The new service uses ULIDs. The ACL keeps a translation table.
- Behavior reconciliation. The monolith treats null as “not yet decided”; the new service models that as an explicit state. The ACL converts.
# Anti-corruption layer between new OrderService and legacy DB.
class LegacyOrderTranslator:
    def to_domain(self, row: dict) -> Order:
        return Order(
            id=ULID.from_legacy(row["ord_id"]),
            status=self._status(row["st_cd"], row["cancel_dt"]),
            total=Money.cents(row["tot_amt_cents"], row["curr_cd"]),
            customer_id=ULID.from_legacy(row["cust_id"]),
        )

    def _status(self, code: str, cancel_dt) -> OrderStatus:
        if cancel_dt is not None: return OrderStatus.CANCELLED
        if code == "P": return OrderStatus.PENDING
        if code == "S": return OrderStatus.SHIPPED
        if code == "D": return OrderStatus.DELIVERED
        raise ValueError(f"unknown legacy status {code}")
The translator is ugly. That is the entire point. The ugliness is contained in one file, with tests. Everything downstream consumes Order objects with sensible names. When the legacy system finally goes away, you delete this file and lose nothing.
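The identity-mapping responsibility can be sketched as a small translation table. Everything here is an assumption for illustration: the `IdentityMap` class, the `id_map` table, and the use of SQLite are hypothetical, and `uuid4().hex` stands in for a ULID because the standard library has no ULID generator. The sketch also assumes a single writer; concurrent writers would need an upsert or a retry on the primary-key conflict.

```python
import sqlite3
import uuid


class IdentityMap:
    """Translation table between legacy integer IDs and new string IDs."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS id_map ("
            " kind TEXT NOT NULL,"
            " legacy_id INTEGER NOT NULL,"
            " new_id TEXT NOT NULL UNIQUE,"
            " PRIMARY KEY (kind, legacy_id))"
        )

    def to_new(self, kind: str, legacy_id: int) -> str:
        """Return the new ID for a legacy row, minting one on first sight."""
        row = self.conn.execute(
            "SELECT new_id FROM id_map WHERE kind = ? AND legacy_id = ?",
            (kind, legacy_id),
        ).fetchone()
        if row is not None:
            return row[0]
        new_id = uuid.uuid4().hex  # stand-in for a ULID
        with self.conn:  # commit the mapping before handing the ID out
            self.conn.execute(
                "INSERT INTO id_map (kind, legacy_id, new_id) VALUES (?, ?, ?)",
                (kind, legacy_id, new_id),
            )
        return new_id

    def to_legacy(self, kind: str, new_id: str) -> int:
        """Return the legacy ID, e.g. when writing back to the monolith."""
        row = self.conn.execute(
            "SELECT legacy_id FROM id_map WHERE kind = ? AND new_id = ?",
            (kind, new_id),
        ).fetchone()
        if row is None:
            raise KeyError(new_id)
        return row[0]


if __name__ == "__main__":
    ids = IdentityMap(sqlite3.connect(":memory:"))
    order_id = ids.to_new("order", 12345)
    print(order_id, ids.to_legacy("order", order_id))
```

Keying on `(kind, legacy_id)` matters because legacy sequences often reuse the same integers across tables.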
Decomposition Heuristics
Why Boundaries Are Hard
The Problem: The most damaging decision in a microservices migration is drawing the wrong service boundaries. Wrong boundaries produce a distributed monolith — all the cost of microservices, none of the benefit.
The Solution: Multiple lenses. No single heuristic is sufficient. Look at bounded contexts, team boundaries, data ownership, change frequency, and scaling needs. Where the lenses agree, you have a service. Where they disagree, you do not yet have enough information.
| Heuristic | Question to Ask | If Yes |
|---|---|---|
| Bounded context (DDD) | Does this area have its own ubiquitous language — words that mean something different here than elsewhere? | Candidate boundary. “Customer” in billing is not the same noun as “customer” in marketing. |
| Team boundary (Conway) | Is one team going to own this end-to-end? | Service per team is sustainable. Service shared across teams is not. |
| Data ownership | Can this service own a set of tables that no one else writes to? | Yes → clean extract. No → you have not found the boundary yet. |
| Change frequency | Does this code change weekly while the rest changes quarterly? | Extract it to deploy independently. |
| Scale axis | Does this part need 10× the resources of the rest? | Extract so you can scale it independently. |
| Failure isolation | If this fails, must the rest stay up? | Extract for blast-radius reasons. |
The order I recommend extracting in
- A leaf with high change frequency. Cheap to extract, the team feels the deploy-frequency benefit immediately.
- A read-heavy capability you need to scale. Search, recommendations, product listings. The new service can read from the monolith’s DB while you stand it up.
- A capability owned end-to-end by one team. Conway aligns; ownership is unambiguous.
- A bounded context that wants its own data model. This is where the anti-corruption layer earns its keep.
- Cross-cutting concerns last. Auth, notifications, audit. They touch everything — extract them once everything else has stabilized.
Bounded contexts come from Domain-Driven Design and are the most reliable lens. If two parts of the monolith use the same word to mean different things — order, customer, product — you have found a context boundary. Force-fitting them into one shared model is what produced the mess in the first place.
Data Migration
The Hardest Part
The Problem: Splitting code is hours. Splitting a database is months. The monolith’s schema has joins and foreign keys across what you now want to call separate services. Cross-service joins do not exist; cross-service foreign keys do not exist. Something has to give.
The Solution: Multiple techniques layered over time: dual writes, change data capture (CDC), shadow reads, eventual consistency. Accept that the consistency model will get weaker. Engineer for that explicitly.
Splitting a database is where most migrations either succeed or quietly become a distributed monolith. The cardinal rule: each service owns its data, and no other service touches that data except through the owner’s API or events. The moment a second service starts reading the “orders” tables directly, you have lost. You can never change the schema again without coordinating across services.
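One way to keep the ownership rule honest is to check it mechanically. The sketch below is only an illustration: the `owners` and `access` maps are hypothetical inputs you would have to assemble yourself (for example from database grants or query logs), and the function name is made up. It reports every service that touches a table it does not own.

```python
from typing import Dict, List, Optional, Set, Tuple

Violation = Tuple[str, str, Optional[str]]  # (service, table, actual owner)


def ownership_violations(owners: Dict[str, str],
                         access: Dict[str, Set[str]]) -> List[Violation]:
    """List (service, table, owner) for every direct access to a table
    that the accessing service does not own. Owner is None if unassigned."""
    found = []
    for service, tables in access.items():
        for table in tables:
            owner = owners.get(table)
            if owner != service:
                found.append((service, table, owner))
    return sorted(found, key=lambda v: (v[0], v[1]))


if __name__ == "__main__":
    owners = {"orders": "orders-service", "order_items": "orders-service",
              "customers": "monolith"}
    access = {
        "orders-service": {"orders", "order_items"},
        "monolith": {"customers", "orders"},        # still reads orders directly
        "reporting": {"orders", "audit_log"},       # audit_log has no owner yet
    }
    for service, table, owner in ownership_violations(owners, access):
        print(f"{service} touches {table} (owner: {owner})")
```

Run as a CI check, a list like this turns "no shared tables" from a convention into something that fails a build.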
The four phases of splitting a table
- Shared database, separate schemas. Both services hit the same physical DB but use different schemas. Forces ownership clarity without yet paying the operational cost.
- Dual write. When the monolith updates the table, it also calls the new service’s API. Or vice versa. Useful for a brief window; dangerous as a steady state because it has no atomicity guarantee.
- Change data capture (CDC). Run Debezium or similar against the monolith’s WAL/binlog. Stream every change to Kafka. The new service consumes the stream and builds its own copy of the data. No application code in the monolith changes.
- Cutover. Switch writes from the monolith to the new service. The CDC stream now flows the other way (or stops). Eventually the monolith’s tables are decommissioned.
Outbox pattern for safe dual writes
A naive dual write — write to the DB, then call the other service — can leave the two systems out of sync if the call fails after the DB commit. The outbox pattern solves this with a single local transaction that writes both the business row and the “message I owe to the other service.” A separate process publishes from the outbox.
-- The outbox lives in the same database as the business data,
-- so a single ACID transaction covers both writes.
CREATE TABLE outbox (
    id           BIGSERIAL PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    payload      JSONB NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at TIMESTAMPTZ
);

-- Application code writes business + outbox in one transaction.
BEGIN;
UPDATE orders SET status = 'shipped' WHERE id = 12345;
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES ('order:12345', 'OrderShipped',
        '{"order_id":12345,"shipped_at":"2026-05-12T18:00:00Z"}');
COMMIT;
A small relay process polls the outbox (or, better, watches the WAL via Debezium) and publishes each row to Kafka. Once published, it stamps published_at. If the relay crashes, it resumes — consumers must be idempotent on event ID. This pattern is the backbone of every reliable monolith-to-services migration that involves event-driven communication.
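The polling variant of the relay, plus an idempotent consumer, might look roughly like this. It is a sketch under assumptions: SQLite stands in for PostgreSQL (so the column types differ from the SQL above), the `publish` callback stands in for a Kafka producer, and the names `relay_once`, `ship_order`, and `IdempotentConsumer` are hypothetical. A row is stamped only after `publish` returns, so a crash in between causes a redelivery, which the consumer absorbs by tracking event IDs.

```python
import json
import sqlite3
from typing import Callable

Publish = Callable[[int, str, dict], None]


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            published_at TEXT
        );
    """)


def ship_order(conn: sqlite3.Connection, order_id: int) -> None:
    with conn:  # one local transaction covers the business row and the outbox row
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (f"order:{order_id}", "OrderShipped", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Publish, batch_size: int = 100) -> int:
    """Publish pending outbox rows in order; return how many were published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, json.loads(payload))  # may raise; row stays pending
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
    return len(rows)


class IdempotentConsumer:
    """Ignores event IDs it has already processed (kept in memory here)."""

    def __init__(self) -> None:
        self.seen = set()
        self.shipped = []

    def handle(self, event_id: int, event_type: str, payload: dict) -> None:
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        if event_type == "OrderShipped":
            self.shipped.append(payload["order_id"])
```

A real consumer would persist the seen IDs in the same transaction as its own state change; the in-memory set here only demonstrates the check.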
Debezium CDC connector example
Debezium connector for PostgreSQL: stream every change in the monolith’s public.orders and public.order_items tables to Kafka topics. With snapshot.mode set to initial, the connector backfills once and then tails the WAL; heartbeat.interval.ms keeps the WAL position fresh. JSON has no comment syntax, so these notes sit outside the object.
{
  "name": "monolith-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "monolith-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/cdc.pw}",
    "database.dbname": "shop",
    "topic.prefix": "monolith",
    "table.include.list": "public.orders,public.order_items",
    "snapshot.mode": "initial",
    "publication.autocreate.mode": "filtered",
    "heartbeat.interval.ms": "10000",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
The new orders-service consumes monolith.public.orders and builds its own materialized view. While dual reads are running, you can shadow read — query both, compare, log diffs. After a few weeks of zero diffs you flip writes to the new service and reverse the CDC direction so the monolith stays in sync long enough for read paths to migrate.
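On the consuming side, building the local copy and shadow-reading it could be sketched as follows. The assumptions are worth spelling out: events are taken to be the bare Debezium change envelope (`op`, `before`, `after`) with the converter's schema wrapper disabled, the primary-key column is assumed to be `id`, the view is an in-memory dict rather than a real table, and `apply_change` and `diff_views` are hypothetical names. Upserting by primary key makes a replayed event harmless.

```python
from typing import Dict, List, Optional

View = Dict[int, dict]


def apply_change(view: View, event: Optional[dict]) -> None:
    """Apply one change event to the local copy of the orders table."""
    if event is None:  # tombstone record that can follow a delete
        return
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert by key
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":  # delete: "before" carries at least the primary key
        view.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unhandled op {op!r}")


def diff_views(view: View, legacy_rows: View) -> List[int]:
    """Shadow read: return the IDs whose rows differ between the two sides."""
    ids = set(view) | set(legacy_rows)
    return sorted(i for i in ids if view.get(i) != legacy_rows.get(i))


if __name__ == "__main__":
    view: View = {}
    events = [
        {"op": "r", "before": None, "after": {"id": 1, "status": "pending"}},
        {"op": "u", "before": None, "after": {"id": 1, "status": "shipped"}},
        {"op": "c", "before": None, "after": {"id": 2, "status": "pending"}},
        {"op": "d", "before": {"id": 2}, "after": None},
        None,
    ]
    for event in events:
        apply_change(view, event)
    print(view)
    print(diff_views(view, {1: {"id": 1, "status": "shipped"}}))
```

Logging the output of `diff_views` over time gives you the zero-diff evidence needed before flipping writes.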
Never share a database between services
Two services writing to the same physical tables is the single most common reason microservices migrations fail. It looks like a shortcut. It is not. The moment two deploy schedules touch the same schema, you have lost the ability to change either independently. You have a distributed monolith with extra network hops. If you take only one rule away from this tutorial, take this one.
Common Anti-Patterns
Most migrations fail in predictable ways. Recognize them early, refuse them by name.
| Anti-Pattern | What It Looks Like | What to Do Instead |
|---|---|---|
| Distributed monolith | Services that must deploy together; chained synchronous calls 5 deep | Asynchronous events; redraw boundaries by data ownership |
| Shared database | Two services writing the same table; cross-service joins | One owner per table; access via API or events only |
| Premature decomposition | 50 services for a 10-engineer team | Start with a modular monolith; extract only when boundaries are obvious |
| Big-bang rewrite | “v2 is in a separate repo, we’ll cut over in Q4” | Strangler fig; ship value every sprint |
| Microservices envy | Adopting because Netflix did, not because of a real bottleneck | Identify the actual constraint; pick the smallest fix that removes it |
| Service per entity | Customer service, address service, email service… | Service per bounded context, not per noun |
| Synchronous everywhere | HTTP call chains, no event spine, every failure cascades | Default to async events; keep sync only for read-after-write scenarios |
| Frozen monolith | “Stop shipping features so we can rewrite” | Migration must run alongside product work or it dies |
The big-bang rewrite is how teams die
If your migration plan involves a date on which everything switches over, redraw the plan. I know of no monolith-to-services migration in the public record that succeeded that way. The successful ones are years of small steps with traffic shifting in single-digit percentages. Promise “parallel running for months” in your design doc and mean it.
Real-World Examples
Amazon (~2002). Jeff Bezos’s famous internal mandate: every team must expose its data and functionality through service interfaces, no other communication is permitted, all interfaces must be designed to be externalizable. The migration took years and was driven by a coordination problem — teams could not deploy independently of each other against a shared monolith. The result was the architecture that eventually became AWS.
Etsy. Famously stayed on a PHP monolith well into eight-figure user counts. They invested in continuous deployment (50+ deploys a day to one codebase), staff-plus engineering on infrastructure, and eventually carved out a small number of strategic services. The lesson: a disciplined monolith with great deploy infrastructure beats undisciplined microservices.
Shopify. Runs a Ruby on Rails “modular monolith” — one repo, one process, but with strict internal package boundaries enforced by tooling (Packwerk). They have extracted services where it was justified (storefront rendering, payments) but the core stays modular monolith. This is the direction most successful large Rails shops have gone.
Uber. Went from monolith to roughly 2,200 microservices, then publicly acknowledged they had over-decomposed. They consolidated some areas back into “domain-oriented microservice architecture” — bigger services aligned to business domains rather than tiny ones aligned to entities. The lesson: more services is not better; right-sized services aligned to teams is better.
Stack Overflow. Runs a tiny number of servers and a small number of large services. Their public architecture write-ups, which describe serving more than a billion page views per month from nine on-prem web servers, are mandatory reading before any architect proposes 50 microservices. Often, you do not have the problem you think you have.
The pattern across all of them
Successful migrations are driven by specific organizational pain: too many teams stepping on each other, one subsystem unable to scale, deploys taking hours and blocking everyone. They are not driven by “the new architecture is more modern.” If you cannot name the concrete pain in one sentence, do not migrate.
Best Practices
The short list
- Default to a monolith. Migrate only when a concrete bottleneck demands it. “It will be cleaner” is not a bottleneck.
- Never freeze the product. The migration must run alongside ongoing feature work. If leadership has to choose, they will always choose features — and the migration will die quietly.
- Strangle, do not rewrite. Insert a routing layer; move routes one at a time; delete the old code after cutover.
- Branch by abstraction inside the monolith first. Create the seam before you try to extract through it.
- One service owns its data. No exceptions. No “just for now” shared tables. No cross-service joins.
- Outbox + CDC for cross-system data. Dual writes without an outbox will eventually drift. CDC via Debezium is the boring, correct answer.
- Anti-corruption layer at every boundary. Refuse to let legacy concepts leak into new contexts.
- Extract leaves first, cross-cutting concerns last. Auth, notifications, audit: those touch everything; do them after the easy wins.
- Mirror traffic before cutting traffic. Compare responses for weeks. Flip the switch only after diffs go to zero.
- Delete the old code. A migration that leaves both systems alive is a migration that doubled your maintenance load.
- Measure organizational outcomes, not service counts. Did deploy frequency go up? Did cross-team blocking go down? Service count is a vanity metric.
In practice, most teams should not migrate — they should clean up. The teams that should migrate should plan for years, not quarters, and should expect that the boundaries they draw on day one will be wrong. That is fine. Strangler fig, branch by abstraction, and anti-corruption layers exist precisely so you can change your mind safely.
The single most useful sentence about migration
If your plan ever requires a date by which everything is switched over, the plan is wrong. Migrations that work are migrations that can pause for two months in the middle and still leave the system in a shippable state.