Why Idempotency Matters
The Problem: The network always lies. A request that “timed out” may have succeeded on the server, succeeded but lost the response on the wire, or never arrived at all. The client cannot tell. So clients retry — and the server, helpfully, processes the request a second time. The customer gets charged twice. The order ships twice. The email goes out twice.
The Solution: Make every write operation safe to repeat. Either the operation is naturally idempotent (it produces the same result no matter how many times you apply it), or the server uses an idempotency key to detect duplicates and return the original response.
Real Impact: Stripe’s entire write API is idempotency-key driven. PayPal, Square, Adyen, and every major payment processor follow the same pattern. It is the only thing that lets a customer’s phone retry on a flaky LTE connection without your accountant getting paged.
Real-World Analogy
Think about pressing the elevator call button:
- You press it once — the elevator is summoned.
- You press it five more times because you’re impatient — same outcome. One elevator, one trip.
- The button is idempotent. Repeated presses do not summon five elevators or charge you five times.
Now imagine an elevator that did come five times because you pressed five times. That is what a non-idempotent POST /charge looks like to your customer when their phone retries on a dropped connection. Idempotency is the property that turns “press again, just in case” from a bug into a feature.
Distributed systems make “did this happen?” an unanswerable question. The TCP connection drops at the wrong moment, the load balancer times out and gives up, the API gateway returns 504, the mobile app loses signal between request and response — and the client has no way to tell whether the server processed the write or not. Its only safe option is to retry. Your only safe option is to be idempotent.
The shape of a duplicate-charge bug
The exact failure mode you are defending against:
- Mobile client sends POST /payments with the order details.
- Server receives it, charges the card via Stripe, writes a payment row to Postgres, and sends back 201 Created.
- The 201 never reaches the client: LTE handover, TLS reset, captive portal, anything.
- The client’s HTTP library times out and retries. It has no way to know the first attempt succeeded.
- Server receives the second request. It has no idea this is a retry. It charges the card again.
- Customer sees two charges on their statement, your support inbox lights up, your refund process kicks in. None of this had to happen.
The fix is one HTTP header and a tiny bit of server-side bookkeeping. The cost of not doing it is paid in chargebacks, refunds, and customer trust.
Definition and Math
An operation f is idempotent if applying it more than once produces the same result as applying it once:
# Mathematical definition
f(f(x)) == f(x)
f(f(f(x))) == f(x)
f(...f(x)...) == f(x) # for any number of applications
The state after n applications is identical to the state after one application. The network can lose, duplicate, or reorder requests as much as it likes; the system converges to the same place.
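As a quick illustration of the definition, the sketch below spot-checks f(f(x)) == f(x) on a few sample inputs. The helper and the two example functions are hypothetical names chosen for this illustration, and a spot check on samples is not a proof of idempotency.

```python
def is_idempotent(f, samples):
    """Spot-check f(f(x)) == f(x) on sample inputs (an illustration, not a proof)."""
    return all(f(f(x)) == f(x) for x in samples)


def clamp_to_zero(balance):
    # "Set"-style operation: a negative balance is floored at 0.
    # Applying it a second time changes nothing.
    return max(balance, 0)


def add_ten(balance):
    # "Delta"-style operation: every application moves the state further.
    return balance + 10


samples = [-5, 0, 7, 100]
clamp_ok = is_idempotent(clamp_to_zero, samples)
add_ok = is_idempotent(add_ten, samples)
print(clamp_ok, add_ok)
```

The first check passes and the second fails, which is the difference between an operation the network may safely duplicate and one it may not.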
HTTP verbs and idempotency
RFC 9110 (the current HTTP semantics spec) is explicit about which verbs are idempotent:
| Verb | Idempotent? | Why |
|---|---|---|
| GET | Yes | Reads do not mutate state. |
| HEAD | Yes | Like GET, no body. |
| PUT | Yes | “Set the resource at this URL to this value.” Doing it twice sets it to the same value. |
| DELETE | Yes | “Resource gone” is the same after one or ten deletes. |
| OPTIONS | Yes | Metadata only. |
| POST | No | “Create a new resource” — doing it twice creates two resources. |
| PATCH | No (usually) | Partial update. {"balance": "+10"} applied twice adds 20. |
The verbs you have to defend
POST and PATCH are the dangerous ones. Every retried POST is a potential duplicate. Every PATCH that performs a delta (rather than a set) is non-idempotent by construction. These are the operations where you reach for an idempotency key.
Operations that look idempotent but aren’t
The most common mistake is assuming an operation is idempotent because the verb is PUT or because the code “feels” safe. A few traps:
- Counter increments. UPDATE accounts SET balance = balance + 100 is the canonical non-idempotent operation. Two executions mean +200.
- Append-only writes. INSERT INTO audit_log ... appends a new row every time. Naturally non-idempotent.
- External side effects. Sending an email, dispatching a webhook, publishing to Kafka. The DB write may be idempotent; the email is not.
- Time-relative updates. SET expires_at = NOW() + interval '1 day': the second call extends the expiry further.
- UUID generation server-side. If the server picks the new ID, two retries create two rows with two different IDs.
The fix in each case is to rephrase the operation in absolute, deterministic terms: set the balance to a target, use a client-supplied ID, base time on the request payload not the wall clock.
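One way to picture the "base time on the request payload" move is the sketch below. It assumes the client sends a `requested_at` timestamp in the payload; that field name and the `extend_expiry` helper are illustrative, not part of any particular API.

```python
from datetime import datetime, timedelta, timezone


def extend_expiry(record, requested_at):
    # The new expiry is a pure function of the request payload, so a retry
    # carrying the same requested_at computes exactly the same value.
    updated = dict(record)
    updated["expires_at"] = requested_at + timedelta(days=1)
    return updated


# Hypothetical timestamp supplied by the client in the request body.
requested_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

first = extend_expiry({"id": "sub_1"}, requested_at)
retry = extend_expiry(first, requested_at)  # same request, delivered again
```

Because the wall clock never enters the calculation, the retry leaves the record exactly where the first attempt put it.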
The Retry + Idempotency Marriage
Why “Exactly-Once” Is a Lie
The Problem: True exactly-once delivery between two machines over an unreliable network is impossible. The Two Generals problem shows that neither side can ever be certain its last message arrived, and the FLP impossibility result establishes a related limit for consensus in asynchronous systems. This has been settled distributed-systems theory for decades. You cannot have it.
The Solution: Settle for at-least-once delivery + idempotent processing. The combination produces what looks indistinguishable from exactly-once to anyone observing the system. This is what every modern stack actually does.
The equation that runs every reliable system in production:
at-least-once delivery + idempotent receiver = effectively-once
The sender retries until acknowledged. The receiver sees duplicates but discards them. The end-to-end behaviour is “the operation happened exactly once” from the user’s perspective — without any of the costly distributed-coordination protocols (Paxos rounds across the request path, two-phase commit, transactional outbox synchronously coupled to the wire) that “real” exactly-once would require.
This is the loop you are designing for: the second request finds the first’s footprint and short-circuits. Card not charged twice. Order not placed twice. The customer experience is identical to a clean single-attempt success.
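The equation can be simulated in a few lines. This is a toy model under stated assumptions: requests always arrive, only acknowledgements are lost, and the `Receiver` and `send_until_acked` names are invented for the illustration.

```python
class Receiver:
    def __init__(self):
        self.seen = {}     # message_id -> result returned the first time
        self.charges = []  # side effects that were actually performed

    def handle(self, message_id, amount):
        if message_id in self.seen:
            # Duplicate delivery: replay the stored result, do not re-execute.
            return self.seen[message_id]
        self.charges.append(amount)  # the real side effect happens once
        result = {"status": "charged", "amount": amount}
        self.seen[message_id] = result
        return result


def send_until_acked(receiver, message_id, amount, ack_lost_on):
    """Retry until an ack gets through. ack_lost_on holds the attempt
    numbers whose acknowledgement is dropped by the simulated network."""
    attempt = 0
    while True:
        attempt += 1
        result = receiver.handle(message_id, amount)
        if attempt not in ack_lost_on:
            return result, attempt


receiver = Receiver()
result, attempts = send_until_acked(receiver, "msg-1", 5000, ack_lost_on={1, 2})
```

The sender needed three attempts, the receiver saw three deliveries, and the card was charged once.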
Idempotency Keys
Why a Client-Supplied Key
The Problem: The server cannot distinguish “the user clicked Pay twice” from “the network ate the response and the SDK retried.” Hashing the request body is not enough — the user might genuinely want to charge $50 twice in a row.
The Solution: The client generates a unique key per logical operation — a UUID v4 or ULID — and includes it as the Idempotency-Key header on every retry. Same key means “same logical request”; new key means “new operation.”
This is the pattern that Stripe popularized and that an in-progress IETF draft (draft-ietf-httpapi-idempotency-key-header) is trying to standardize. The contract is small and worth memorizing:
The Idempotency-Key contract
- Client generates a unique key (UUID v4, ULID, or any opaque random string ≤ 255 chars) for each logical operation.
- Client sends the key in the Idempotency-Key HTTP header on the request.
- If the request fails (timeout, 5xx, connection drop), the client retries with the same key.
- If the user takes a new action (a second purchase), the client generates a new key.
- Server stores the key + the response for a TTL window (24 hours is Stripe’s window).
- Within the TTL, repeated requests with the same key return the original response without re-executing.
Generating a key on the client
# Python client. uuid4 is fine; ULID is nicer because it sorts by time.
import time
import uuid

import requests

def charge_with_retry(amount, customer_id, max_attempts=3):
    # Generated ONCE per logical operation, inside the call, so that two
    # separate purchases never share a key.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            r = requests.post(
                "https://api.example.com/charges",
                json={"amount": amount, "customer": customer_id},
                headers={"Idempotency-Key": key},  # SAME key on every retry
                timeout=10,
            )
            if r.status_code < 500:
                return r.json()  # 2xx and 4xx are final; only 5xx is retried
        except (requests.Timeout, requests.ConnectionError):
            pass  # outcome unknown: retry with the same key
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("charge failed after retries")
The same call from curl
# Stripe-style. Same header, any HTTP client. Note the key is stable across retries.
curl https://api.example.com/charges \
  -X POST \
  -H "Authorization: Bearer sk_live_..." \
  -H "Idempotency-Key: 01HMV8Q4Y6X9C3GZ8H1N7T2WPK" \
  -H "Content-Type: application/json" \
  -d '{"amount": 5000, "currency": "usd", "customer": "cus_123"}'
UUID v4 vs ULID
Both work. UUID v4 is universal but sorts randomly — bad for a B-tree index if you are storing keys in Postgres. ULID embeds a timestamp prefix, so keys are time-sortable and the index pages stay hot. For high-volume APIs, ULID (or UUID v7) makes the dedup lookup measurably cheaper.
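To make the time-sortable property concrete, here is a minimal ULID-style generator using only the standard library. It follows the usual ULID layout (48-bit millisecond timestamp, 80 random bits, Crockford base32) but is a sketch rather than a spec-complete implementation; the `ulid_like` name and the `now_ms` parameter are assumptions added so the ordering can be tested deterministically.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U). The characters are in ascending
# ASCII order, so string comparison matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_like(now_ms=None):
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (ts << 80) | rand                     # timestamp in the high bits
    chars = []
    for _ in range(26):                           # 26 chars * 5 bits = 130 bits
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = ulid_like(now_ms=1_000)
later = ulid_like(now_ms=2_000)
```

The first ten characters encode only the timestamp, so keys generated later compare greater and land at the right-hand edge of a B-tree index.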
Server-Side Storage
Why Storage Choice Matters
The Problem: The dedup store is on the hot path of every write. It has to be fast (sub-millisecond), correct under concurrent retries (no race-condition double-execution), and durable enough that you don’t lose dedup state during a deploy.
The Solution: Pick a backing store with an atomic “insert-if-absent” primitive. Redis with SET NX, Postgres with a unique constraint, or DynamoDB with a conditional write. Each has tradeoffs.
The three choices that actually ship
| Backing Store | Atomic Primitive | Pros | Cons |
|---|---|---|---|
| Redis | SET key value NX EX 86400 | Sub-ms latency, native TTL, simple | Memory-bound; not durable by default; one more failure domain |
| Postgres | UNIQUE (idempotency_key) + INSERT ... ON CONFLICT | Same DB as your business data — one transaction covers both | Slower than Redis; need a cron to GC expired rows |
| DynamoDB | PutItem with ConditionExpression: attribute_not_exists(pk) + TTL attribute | Auto-TTL, multi-region, no schema migrations | Eventual-consistent reads can confuse a fast retry; pay-per-request math |
Stripe-style middleware in Python with Redis
import hashlib
import json
from functools import wraps

import redis
from flask import Flask, Response, jsonify, make_response, request

app = Flask(__name__)
r = redis.Redis(host="redis", port=6379, decode_responses=True)
TTL_SECONDS = 24 * 60 * 60  # 24h, like Stripe

def idempotent(scope: str):
    """Wrap a write endpoint. Requires Idempotency-Key header."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            key = request.headers.get("Idempotency-Key")
            if not key:
                return jsonify({"error": "Idempotency-Key header required"}), 400
            # Scope by tenant + endpoint to avoid cross-customer key collisions.
            tenant = request.headers.get("X-Tenant-Id", "anon")
            redis_key = f"idem:{tenant}:{scope}:{key}"
            # Fingerprint the request body so "same key, different body" is rejected.
            body_hash = hashlib.sha256(request.get_data()).hexdigest()
            # Atomic claim: SET ... NX returns True only on first insert.
            claimed = r.set(
                f"{redis_key}:lock",
                body_hash,
                nx=True,
                ex=60,  # short lock; expires on its own if the request crashes
            )
            # Already-completed request: replay the cached response.
            cached = r.get(f"{redis_key}:resp")
            if cached:
                if claimed:
                    # We re-took a lock the original request already released.
                    r.delete(f"{redis_key}:lock")
                stored = json.loads(cached)
                if stored["body_hash"] != body_hash:
                    return jsonify({"error": "key reused with different body"}), 422
                return Response(stored["body"], status=stored["status"],
                                mimetype="application/json",
                                headers={"Idempotent-Replay": "true"})
            # Concurrent retry while the original is still in flight.
            if not claimed:
                return jsonify({"error": "request in progress, retry shortly"}), 409
            # Execute the wrapped handler. make_response normalizes a
            # (body, status) tuple into a Response object.
            response = make_response(fn(*args, **kwargs))
            # Cache the response so the next retry replays it.
            r.set(
                f"{redis_key}:resp",
                json.dumps({"status": response.status_code,
                            "body": response.get_data(as_text=True),
                            "body_hash": body_hash}),
                ex=TTL_SECONDS,
            )
            r.delete(f"{redis_key}:lock")
            return response
        return wrapper
    return decorator
# Usage (stripe_client is a placeholder for your own payment client)
@app.post("/charges")
@idempotent(scope="charges")
def create_charge():
    body = request.json
    charge = stripe_client.charge(body["amount"], body["customer"])
    return jsonify(charge), 201
Postgres unique-constraint pattern
If your business writes are already going to Postgres, putting the dedup record in the same transaction is the cleanest pattern. Either both happen or neither does — no “wrote the row, forgot to record the key” failure mode.
-- one-time migration
CREATE TABLE idempotency_keys (
    tenant_id   UUID        NOT NULL,
    scope       TEXT        NOT NULL,
    key         TEXT        NOT NULL,
    body_sha256 BYTEA       NOT NULL,
    response    JSONB       NOT NULL,
    status_code SMALLINT    NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at  TIMESTAMPTZ NOT NULL DEFAULT now() + '24 hours'::interval,
    PRIMARY KEY (tenant_id, scope, key)
);
CREATE INDEX ON idempotency_keys (expires_at); -- nightly GC
import hashlib
import json

import psycopg

def create_charge_idempotent(conn, tenant, key, payload):
    # Canonical JSON (sorted keys) so equivalent payloads hash identically.
    body_hash = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).digest()
    with conn.transaction():
        # Single statement: try to insert a placeholder. If the key already exists,
        # RETURNING is empty and we know it's a replay.
        row = conn.execute("""
            INSERT INTO idempotency_keys (tenant_id, scope, key, body_sha256, response, status_code)
            VALUES (%s, 'charges', %s, %s, '{}'::jsonb, 0)
            ON CONFLICT (tenant_id, scope, key) DO NOTHING
            RETURNING xmax
        """, (tenant, key, body_hash)).fetchone()
        if row is None:
            # Replay path: fetch what we returned the first time.
            existing = conn.execute("""
                SELECT body_sha256, response, status_code
                FROM idempotency_keys
                WHERE tenant_id = %s AND scope = 'charges' AND key = %s
            """, (tenant, key)).fetchone()
            if bytes(existing[0]) != body_hash:
                raise ValueError("idempotency key reused with different body")
            return existing[1], existing[2]
        # First-time path: do the real work *in the same transaction*.
        # stripe_client is a placeholder. The charge is an external side effect
        # that a rollback cannot undo, so pass the same key downstream as well.
        charge = stripe_client.charge(payload["amount"], payload["customer"])
        conn.execute("INSERT INTO payments (id, amount, ...) VALUES (...)")
        conn.execute("""
            UPDATE idempotency_keys
            SET response = %s, status_code = 201
            WHERE tenant_id = %s AND scope = 'charges' AND key = %s
        """, (json.dumps(charge), tenant, key))
        return charge, 201
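The expires_at index in the migration exists so that expired rows can be garbage-collected cheaply. A possible shape for that job is sketched below, using sqlite3 as a local stand-in so it runs without a Postgres server. The simplified two-column table, the integer epoch timestamps, and the `purge_expired` name are all assumptions of the sketch; a Postgres version would delete by primary key rather than rowid.

```python
import sqlite3


def purge_expired(conn, now_epoch, batch_size=1000):
    """Delete expired idempotency rows in small batches; returns rows removed."""
    removed = 0
    while True:
        with conn:  # each batch is its own short transaction
            cur = conn.execute(
                "DELETE FROM idempotency_keys WHERE rowid IN ("
                "  SELECT rowid FROM idempotency_keys"
                "  WHERE expires_at < ? LIMIT ?)",
                (now_epoch, batch_size),
            )
        if cur.rowcount == 0:
            return removed
        removed += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, expires_at INTEGER)")
conn.executemany(
    "INSERT INTO idempotency_keys VALUES (?, ?)",
    [("old-1", 100), ("old-2", 200), ("fresh", 10_000)],
)
conn.commit()

removed = purge_expired(conn, now_epoch=1_000, batch_size=1)
remaining = [row[0] for row in conn.execute("SELECT key FROM idempotency_keys")]
```

Deleting in batches keeps each transaction short, so the cleanup job does not hold locks that the hot-path inserts are waiting on.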
Race conditions during concurrent retries
Two retries arriving within milliseconds of each other is a real production case (the client’s SDK + your load balancer’s own retry policy can stack). Your dedup logic must use an atomic primitive — Redis SET NX, Postgres ON CONFLICT, DynamoDB conditional write. A read-then-write pair is a TOCTOU bug waiting to happen: both reads see “not present,” both writes execute the side effect, both insert their own dedup row.
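The difference an atomic primitive makes can be shown with threads standing in for concurrent retries. This is an in-process sketch only: the `KeyStore` class is hypothetical, and its lock plays the role that SET NX or a unique constraint plays in a real store.

```python
import threading


class KeyStore:
    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def claim(self, key):
        """Atomic insert-if-absent: the check and the write share one lock."""
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


store = KeyStore()
winners = []
barrier = threading.Barrier(8)


def retry_worker():
    barrier.wait()  # release all eight "retries" at the same moment
    if store.claim("key-123"):
        winners.append(threading.get_ident())  # only the winner runs the side effect


threads = [threading.Thread(target=retry_worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one thread wins the claim no matter how the scheduler interleaves them. Splitting `claim` into a separate read and write outside the lock would reintroduce the TOCTOU window described above.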
Naturally Idempotent Designs
Why API Shape Matters
The Problem: Idempotency keys add bookkeeping. They are a runtime fix for an API design that wasn’t safe to begin with.
The Solution: Where you can, design the operation so that it is naturally idempotent. No key required. The math takes care of itself.
Three design moves that eliminate the problem before you have to solve it:
1. Content-addressable IDs
Let the client provide the resource ID, and derive it deterministically from the request payload (or just have the client supply a UUID). Then POST /orders becomes PUT /orders/{client-supplied-id} — and PUT is idempotent by spec.
# Client picks the order ID. Server upserts.
PUT /orders/01HMV8Q4Y6X9C3GZ8H1N7T2WPK
Content-Type: application/json
{"customer": "cus_123", "items": [...]}
If the request is retried, the second PUT lands on a row that already exists. The server can either no-op (idempotent INSERT-IF-NOT-EXISTS) or overwrite with the same payload. Either way, no duplicate.
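A minimal sketch of the no-op variant, with a dictionary standing in for the orders table. The `put_order` function and the 201/200 status split are assumptions chosen for the illustration.

```python
import uuid

ORDERS = {}  # in-memory stand-in for the orders table


def put_order(order_id, payload):
    """PUT /orders/{order_id}: create on the first call, no-op on retries."""
    if order_id in ORDERS:
        return 200, ORDERS[order_id]  # already exists: return it unchanged
    ORDERS[order_id] = dict(payload)
    return 201, ORDERS[order_id]


order_id = str(uuid.uuid4())  # generated once by the client, reused on retry
payload = {"customer": "cus_123", "items": ["sku_1"]}

first = put_order(order_id, payload)
retry = put_order(order_id, payload)
```

The retry gets a 200 with the same order instead of a second row, and no idempotency-key table is involved.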
2. Conditional updates with If-Match / ETag / version
For mutations that do change state, use optimistic concurrency. The client sends a precondition; the server applies the update only if the precondition holds. A retry hits a precondition that has already advanced and gets a 412 Precondition Failed, so the update is never applied twice. The client then re-reads the resource to learn whether its own first attempt or someone else's write moved the version.
PATCH /orders/abc-123
If-Match: "v7"
Content-Type: application/json
{"status": "shipped"}
# Server logic:
# UPDATE orders SET status='shipped', version=8 WHERE id='abc-123' AND version=7;
# If 0 rows affected, return 412. The client now knows: either someone else moved it,
# or this is a retry of an already-applied update.
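The server logic in the comments above could look roughly like this in application code. It is an in-memory sketch: the `VERSIONED_ORDERS` dictionary and the `patch_order` function are hypothetical, and a real implementation would rely on the single UPDATE ... WHERE version = ... statement for atomicity.

```python
VERSIONED_ORDERS = {"abc-123": {"status": "paid", "version": 7}}


def patch_order(order_id, if_match, changes):
    order = VERSIONED_ORDERS[order_id]
    if order["version"] != if_match:
        # Precondition failed: someone (possibly our own first attempt)
        # already moved the resource past the version we expected.
        return 412, dict(order)
    order.update(changes)
    order["version"] += 1
    return 200, dict(order)


first = patch_order("abc-123", 7, {"status": "shipped"})
retry = patch_order("abc-123", 7, {"status": "shipped"})  # same If-Match value
```

The retry is rejected with 412 and the version stays at 8, so the update is applied once regardless of how many times the request is delivered.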
3. Set-based vs delta operations
The single most useful API-design rule for idempotency:
Prefer “set to X” over “add Y”
Compare SET balance = 500 (idempotent: apply it ten times and you are still at 500) versus ADD 50 to balance (non-idempotent: apply it ten times and you are at +500). Same outcome on the happy path. Wildly different on a retried request. Whenever a domain operation can be expressed as “target state” rather than “delta,” do it.
You cannot always do this — financial postings are intrinsically deltas, “add 1 to view count” is intrinsically a delta — but every time you can, you remove an entire class of bug.
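When the operation really is a delta, one common workaround is to give each posting its own ID and apply it at most once. The sketch below assumes a client-supplied `posting_id`; the `Account` class is invented for the illustration and keeps its state in memory.

```python
class Account:
    def __init__(self, balance=0):
        self.balance = balance
        self.applied = set()  # posting IDs that have already been applied

    def post(self, posting_id, delta):
        if posting_id in self.applied:
            return self.balance  # duplicate posting: leave the balance alone
        self.balance += delta
        self.applied.add(posting_id)
        return self.balance


account = Account(balance=450)
account.post("posting-1", 50)
for _ in range(10):
    account.post("posting-1", 50)  # ten retries of the same posting
```

The delta stays a delta, but the pair (posting ID, delta) behaves like a set operation: the balance ends at 500 however many times the posting arrives.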
Idempotency in Async Systems
Why Brokers Make This Worse, Not Better
The Problem: Message brokers (Kafka, RabbitMQ, SQS, NATS) deliver at least once. A consumer that crashes after processing a message but before committing the offset will see that message again on restart. Consumer-side dedup is not optional.
The Solution: Every message gets a stable ID. Consumers maintain a “processed messages” table and short-circuit on duplicates. Same idempotency principle, different transport.
The same exactly-once-is-a-lie equation applies, just one layer down. Kafka’s “Exactly-Once Semantics” (EOS, introduced in 0.11) is genuinely at-least-once + idempotency on the producer side + transactional offset commits on the consumer side. The marketing name oversells it; the underlying mechanism is exactly what you just learned.
Idempotent Kafka consumer with a processed-IDs table
import json

import psycopg
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "order-processor",
    "enable.auto.commit": False,  # commit only after success
})
consumer.subscribe(["orders.created"])
conn = psycopg.connect("dbname=orders")  # assumed DSN; adjust for your environment

def process(conn, msg):
    payload = json.loads(msg.value())
    msg_id = payload["event_id"]  # producer-supplied stable ID, NOT offset
    with conn.transaction():
        # Atomic claim: insert into processed table; conflict means already done.
        inserted = conn.execute("""
            INSERT INTO processed_events (event_id, topic, processed_at)
            VALUES (%s, %s, now())
            ON CONFLICT (event_id) DO NOTHING
            RETURNING event_id
        """, (msg_id, msg.topic())).fetchone()
        if inserted is None:
            # Already processed: this is a replay. Acknowledge and move on.
            return
        # Real work happens in the SAME transaction as the dedup insert.
        # If this commit fails, the dedup row also rolls back, so we retry cleanly.
        conn.execute("INSERT INTO orders (id, customer, total) VALUES (%s, %s, %s)",
                     (payload["order_id"], payload["customer"], payload["total"]))

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(conn, msg)
        # Commit the offset only after the DB transaction succeeds.
        consumer.commit(message=msg, asynchronous=False)
    except Exception:
        # DO NOT commit. Roll back and rewind so this message is polled again;
        # otherwise a later commit would skip past it. The dedup row protects us
        # if it was in fact processed. (Add a retry limit for poison messages.)
        conn.rollback()
        consumer.seek(TopicPartition(msg.topic(), msg.partition(), msg.offset()))
The two non-negotiables for at-least-once consumers
- Producer-supplied event ID, not Kafka offset. Offsets reset across topic re-creation; an event ID generated by the producer is stable forever.
- Dedup write + business write in one DB transaction. If the order insert fails, the dedup row must roll back too — otherwise the next retry will see “already processed” and silently drop a real event.
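The second rule can be exercised locally with sqlite3 standing in for Postgres. The table shapes and the deliberately invalid event are assumptions of this sketch; the point is only that a failed business write takes the dedup row down with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER NOT NULL)")
conn.commit()


def process(conn, event):
    """Returns True if the event was applied, False if it was a duplicate."""
    with conn:  # one transaction: commit on success, roll back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # dedup row already present: replay
        conn.execute(
            "INSERT INTO orders (id, total) VALUES (?, ?)",
            (event["order_id"], event["total"]),
        )
        return True


bad = {"event_id": "evt-1", "order_id": "ord-1", "total": None}  # violates NOT NULL
good = {"event_id": "evt-1", "order_id": "ord-1", "total": 4200}

try:
    process(conn, bad)
except sqlite3.IntegrityError:
    pass  # business write failed, so the dedup row rolled back with it

results = [process(conn, good), process(conn, good)]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the failed attempt left no dedup row behind, the redelivered event is processed normally, and only the third delivery is treated as a duplicate.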
Common Mistakes
Almost every production idempotency bug we have seen falls into one of these five buckets.
| Mistake | What Happens | Fix |
|---|---|---|
| TTL too short | Mobile client retries an hour later (queued offline) — key has expired, request executes again | 24h minimum; align with longest expected client retry window |
| Wrong key scope | User A’s key collides with User B’s key — one customer’s response served to another | Always namespace by tenant + endpoint: idem:{tenant}:{scope}:{key} |
| Storing only the request | Replay returns a fresh 201 instead of the cached body — client sees a new resource ID it can’t correlate | Cache the full response body + status code, not just “we’ve seen this” |
| Same key, different body | Client SDK bug reuses a key for two different operations — second one silently no-ops | Hash the body, store the hash, reject mismatches with 422 |
| Side effects outside the transaction | DB write rolls back, but the email already shipped — idempotent in storage, not in reality | Use the transactional outbox pattern: write the “send email” intent in the same DB tx, dispatch later |
Never retry a non-idempotent operation without an idempotency key
This is the single rule that prevents the “single network blip becomes a duplicate charge” class of incident. If your HTTP client has retries enabled (and most do, by default) and your endpoint is a POST without idempotency-key support, you have a bug shipping right now — you just haven’t had the right TCP reset to find it. Either turn off retries on writes, or make the writes idempotent. Pick one.
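On the client side, the rule can be enforced with a small guard in front of the retry logic. This is a sketch of one possible policy; the `safe_to_retry` name and the way headers are passed in are assumptions, not the API of any particular HTTP library.

```python
# Methods that RFC 9110 defines as idempotent (subset used in this article).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def safe_to_retry(method, headers):
    """Allow a retry only if the verb is idempotent or a key is attached."""
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Header names are case-insensitive; an empty key does not count.
    return any(
        name.lower() == "idempotency-key" and bool(value)
        for name, value in headers.items()
    )
```

A retry wrapper would call this before re-sending and surface the original error to the caller when it returns False.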
Partial-success traps
The trickiest bugs are the ones where the operation is partly idempotent. The DB row is unique-constrained; the email is not. The order insert is in a transaction; the call to the warehouse API is not. A retry de-dupes the part you protected and re-fires the part you didn’t.
The pattern that fixes this is the transactional outbox: every external side effect is recorded as a row in an outbox table inside the same DB transaction as the business write. A separate worker drains the outbox and dispatches. Now the “send email” intent is part of the dedup boundary, and a retry that hits a duplicate key never produces a duplicate email.
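A compact version of the outbox, again with sqlite3 as a stand-in database. Table names, the `place_order` and `drain_outbox` functions, and the injected `send` callable are all hypothetical; a production worker would also need locking between competing workers and its own retry policy.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " kind TEXT, payload TEXT, dispatched INTEGER DEFAULT 0)")
conn.commit()


def place_order(conn, order_id, email):
    with conn:  # business write and email intent commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO orders VALUES (?, ?)", (order_id, email))
        if cur.rowcount == 0:
            return False  # retried request: no second outbox row
        conn.execute(
            "INSERT INTO outbox (kind, payload) VALUES (?, ?)",
            ("send_email", json.dumps({"to": email, "order": order_id})))
        return True


def drain_outbox(conn, send):
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE dispatched = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))  # the external side effect
        with conn:
            conn.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order(conn, "ord-1", "a@example.com")
place_order(conn, "ord-1", "a@example.com")  # retried request
drain_outbox(conn, sent.append)
drain_outbox(conn, sent.append)              # second drain finds nothing to send
```

Note that the worker itself is at-least-once: if it crashes between sending and marking the row dispatched, the email goes out again, so the downstream call should carry its own idempotency key where the provider supports one.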
Real-World Examples
Every payment processor and every “serious” API uses some flavour of this pattern. The header name is roughly standardized; the semantics are nearly identical.
| System | Header / Mechanism | TTL / Notes |
|---|---|---|
| Stripe | Idempotency-Key on every POST | 24h window. Canonical implementation; the IETF draft cites them. |
| Square | idempotency_key field in JSON body for payments, refunds, checkouts | Required for write operations. Replay returns the original response. |
| PayPal | PayPal-Request-Id header on Orders v2 / Payouts | ~6 hours typical; documented per endpoint. |
| Adyen | Idempotency-Key header on payments & modifications | Replay returns the original response within the retention window. |
| AWS SDK | ClientToken on EC2 RunInstances, ClientRequestToken on DynamoDB TransactWriteItems, etc. | Built into the AWS retry policy; the SDK auto-supplies tokens. |
| Apple Pay | Device-specific token (DPAN) plus a per-transaction cryptogram | Each tap produces a different cryptogram; replay is rejected at the network. |
| Kafka EOS | Producer ID + sequence number; transactional offset commits | “Exactly-once” = at-least-once + producer idempotency + read-process-write tx. |
The IETF draft
There is an in-progress IETF draft (draft-ietf-httpapi-idempotency-key-header) that codifies the Stripe-style header for the broader HTTP world. It defines the header name (Idempotency-Key), its syntax, guidance on key uniqueness and expiry, and the error responses for a missing key (400), a key reused with a different payload (422), and a retry that arrives while the original is still in flight (409). How to signal a replay is left to implementations; Stripe, for example, adds an Idempotent-Replayed: true response header. The draft is not yet a finalized RFC, but it is the de facto reference for new APIs.
How Stripe’s implementation actually behaves
- Keys are accepted on every POST. GET ignores them (GET is already idempotent).
- If the request body differs from the original, Stripe returns 400 with an explicit error. Same key + different body is treated as a client bug, not a replay.
- The TTL is 24 hours. Replays inside that window return the original response with full fidelity, plus an Idempotent-Replayed: true response header.
- Concurrent retries arriving while the first is still in flight return 409 Conflict with an instruction to back off and retry.
- The official Stripe SDKs auto-generate a key on every write; you opt out, not in. This is the “safe by default” posture.
Best Practices
The short list
- Require an Idempotency-Key on every write. Reject requests without one with a 400. Make it impossible to forget.
- Scope keys properly. {tenant}:{endpoint}:{key}. Never trust the raw client value as your storage key.
- Cache the full response. Status code, headers, body. A replay should be byte-identical to the original where possible.
- Hash the request body and reject mismatches. Same key + different body is a client bug; surface it loudly with 422.
- Use atomic primitives. Redis SET NX, Postgres ON CONFLICT, DynamoDB conditional writes. Never read-then-write.
- Pair the dedup write with the business write in one transaction. Otherwise partial failures will eat events.
- Set TTL to at least 24 hours. Mobile clients retry from offline queues hours later; short TTLs leak duplicates.
- Design for natural idempotency where possible. Client-supplied IDs (PUT not POST), set-based updates, conditional If-Match. The best dedup table is the one you don’t need.
- Wrap external side effects in a transactional outbox. Emails, webhooks, third-party API calls: treat them as data, not as direct calls.
- Return a replay signal. An Idempotent-Replay: true header on cached responses is invaluable when debugging.
- Monitor replay rate. A spike in replays means an upstream is shaky; a drop to zero means the SDK isn’t sending keys.
How the giants do it
Stripe built the modern playbook. Every write endpoint accepts Idempotency-Key; the official SDKs supply one automatically; the engineering blog post that introduced it (“Designing robust and predictable APIs with idempotency”) is required reading. Their entire payments business depends on the property that a retry never charges twice.
Amazon ships idempotency at the SDK layer. EC2 RunInstances, S3 multipart upload, DynamoDB transactions — all accept a ClientToken or equivalent, and the official SDKs add one to every request before retrying. You get idempotency without thinking about it, which is the only kind of idempotency that survives an on-call rotation.
Confluent / Apache Kafka spent years productionizing “exactly-once.” What shipped is producer-side idempotency (sequence numbers per producer ID prevent duplicate appends) plus transactional offset commits. End-to-end EOS for a Kafka pipeline is “at-least-once + dedup at every hop” — the same equation, scaled to streams.
The single most useful sentence about idempotency
The network does not get more reliable, your retry logic does not get smarter, and your customers do not get more patient. The only thing you control is whether the second attempt is safe. Make it safe, and most of the “impossible” bugs in distributed systems quietly disappear.