How a Database Deadlock Killed Our Production System (And How We Fixed It)
A deep dive into a critical production incident where database deadlocks brought down a high-traffic SaaS platform. Learn the root cause, our debugging process, and the architectural changes that prevented recurrence.
The Incident: 3 AM Wake-Up Call
It was 3:17 AM when PagerDuty woke me up. Our monitoring system detected that API response times had spiked from ~80ms to over 30 seconds. Within minutes, the entire platform became unresponsive. Over 15,000 active users were locked out, and our support channels exploded with complaints.
This is the story of how a subtle database deadlock pattern emerged under production load, how we diagnosed it under pressure, and the architectural changes we implemented to ensure it never happened again.
The Symptoms: What We Saw First
When I SSHed into our production Kubernetes cluster, the symptoms were clear but confusing:
- API latency: p99 > 30s (normal: ~100ms)
- Database connection pool: 100% exhausted (200/200 connections in use)
- CPU and memory: completely normal (no resource saturation)
- Error rate: 504 Gateway Timeouts spiking
The confusing part? CPU was at 15%, memory at 40%, disk I/O was minimal. This wasn't a resource exhaustion problem. Something else was blocking our database queries.
The Root Cause: A Hidden Deadlock Pattern
After connecting to our PostgreSQL instance and running SELECT * FROM pg_stat_activity WHERE state != 'idle';, I saw the smoking gun:
 pid  | state  | wait_event_type | wait_event
------+--------+-----------------+---------------
 1234 | active | Lock            | transactionid
 5678 | active | Lock            | transactionid
 9012 | active | Lock            | transactionid
(... 197 more rows ...)
Nearly all 200 connections were waiting on transaction locks. Checking pg_locks revealed a circular dependency (the sketch after this list runs the same check via pg_blocking_pids):
- Transaction A held a lock on users row ID 42 and was waiting for subscriptions row ID 89
- Transaction B held a lock on subscriptions row ID 89 and was waiting for users row ID 42
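The same check can be scripted rather than hand-typed into psql at 3 AM. Here's a sketch, assuming node-postgres (pg) and a DATABASE_URL connection string; pg_blocking_pids() requires PostgreSQL 9.6 or newer, and findBlockingPairs is an illustrative name, not code from our repo:
// Runbook-style sketch: who is blocking whom, straight from pg_stat_activity.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function findBlockingPairs() {
  // pg_blocking_pids() returns the PIDs holding the locks a backend is
  // waiting on, so a self-join on pg_stat_activity surfaces blocking pairs.
  const { rows } = await pool.query(`
    SELECT blocked.pid    AS blocked_pid,
           blocked.query  AS blocked_query,
           blocking.pid   AS blocking_pid,
           blocking.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN pg_stat_activity AS blocking
      ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
    WHERE blocked.wait_event_type = 'Lock'
  `);
  return rows;
}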
Classic deadlock. But why was it happening at 3 AM, and why so catastrophically?
Debugging Under Pressure: The Detective Work
Step 1: Immediate Mitigation
First priority: restore service. I killed the deadlocked transactions:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE wait_event = 'transactionid'
AND state = 'active';
Within 30 seconds, the connection pool drained, latency dropped to normal, and users could access the platform again. Total downtime: 12 minutes.
Step 2: Understanding the Pattern
I pulled database logs for the past hour and reconstructed what happened. Our application had two critical code paths that could execute concurrently:
Code Path A (User Update Endpoint):
BEGIN;
UPDATE users SET last_active = NOW() WHERE id = ?; -- Locks user row
UPDATE subscriptions SET usage_count = usage_count + 1
WHERE user_id = ?; -- Then locks subscription row
COMMIT;
Code Path B (Subscription Webhook Handler):
BEGIN;
UPDATE subscriptions SET status = 'active' WHERE id = ?; -- Locks subscription row
UPDATE users SET subscription_tier = 'premium'
WHERE id = ?; -- Then locks user row
COMMIT;
The deadlock occurred when:
- Transaction A locked users.id = 42
- Transaction B locked subscriptions.id = 89 (for the same user)
- Transaction A tried to lock subscriptions.id = 89 → blocked by B
- Transaction B tried to lock users.id = 42 → blocked by A
- Deadlock (a minimal reproduction sketch follows).
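The reproduction sketch: run both code paths concurrently against the same user and subscription rows and let PostgreSQL's deadlock detector do the rest. It assumes the same Drizzle-style db and sql helpers shown later in this post; the harness itself is illustrative, not our actual test code:
// Code Path A: user update endpoint (users first, then subscriptions).
async function pathA(userId: number) {
  await db.transaction(async (tx) => {
    await tx.execute(sql`UPDATE users SET last_active = NOW() WHERE id = ${userId}`);
    await tx.execute(sql`UPDATE subscriptions SET usage_count = usage_count + 1 WHERE user_id = ${userId}`);
  });
}

// Code Path B: subscription webhook (subscriptions first, then users).
async function pathB(userId: number, subscriptionId: number) {
  await db.transaction(async (tx) => {
    await tx.execute(sql`UPDATE subscriptions SET status = 'active' WHERE id = ${subscriptionId}`);
    await tx.execute(sql`UPDATE users SET subscription_tier = 'premium' WHERE id = ${userId}`);
  });
}

// Hammer both paths against the same user/subscription pair.
await Promise.allSettled(
  Array.from({ length: 50 }, () => Promise.all([pathA(42), pathB(42, 89)]))
);
Each colliding pair ends with PostgreSQL aborting one side with SQLSTATE 40P01 after deadlock_timeout (1 second by default), which is exactly the error code the Layer 3 retry logic below watches for.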
Step 3: Why Now? The Traffic Pattern
This code had been deployed for 3 months. Why did it suddenly break at 3 AM?
Checking our analytics, we discovered that a new marketing campaign had driven a surge of premium subscription signups (triggering Code Path B frequently). Simultaneously, those new users were highly active, triggering Code Path A repeatedly.
The higher concurrency made the race condition inevitable. With 15,000 active users and subscription webhooks firing every few seconds, the probability of these two transactions overlapping increased from ~0.1% to ~15%. That was enough to cascade into total deadlock.
The Fix: Three Layers of Defense
Layer 1: Consistent Lock Ordering
The immediate fix: ensure all transactions acquire locks in the same order. We standardized on a single rule: always lock users first, then subscriptions:
-- BEFORE (Code Path B)
BEGIN;
UPDATE subscriptions SET status = 'active' WHERE id = ?;
UPDATE users SET subscription_tier = 'premium' WHERE id = ?;
COMMIT;
-- AFTER (Code Path B)
BEGIN;
-- CHANGED: Lock users first, even if we don't update it yet
SELECT id FROM users WHERE id = ? FOR UPDATE;
UPDATE subscriptions SET status = 'active' WHERE id = ?;
UPDATE users SET subscription_tier = 'premium' WHERE id = ?;
COMMIT;
This eliminated the circular dependency. Whichever transaction grabs the users row first proceeds; the other waits on that lock before it ever touches subscriptions. No more cycles.
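To make the convention hard to violate at the application layer, the ordering can also be wrapped in a helper so a handler can't forget the users lock. A minimal sketch, assuming the same Drizzle-style db and sql helpers used in the next section; withUserLockFirst is an illustrative name, not our actual code:
// Sketch: every multi-table transaction goes through this helper, which
// acquires the users row lock first, before the callback touches anything else.
async function withUserLockFirst<T>(
  userId: number,
  fn: (tx: any) => Promise<T> // tx: Drizzle transaction; typed loosely for brevity
): Promise<T> {
  return db.transaction(async (tx) => {
    // Lock the users row up front, even if the caller doesn't update it yet.
    await tx.execute(sql`SELECT id FROM users WHERE id = ${userId} FOR UPDATE`);
    return fn(tx);
  });
}
Code Path B then becomes a withUserLockFirst(userId, ...) call whose callback updates subscriptions and users, and the ordering is enforced in one place instead of in every handler.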
Layer 2: Reduce Transaction Scope
We also realized that Code Path A didn't actually need to run in a transaction at all. The last_active timestamp update was purely informational and didn't require atomic consistency with the subscription update.
We split the operations:
// BEFORE: Single transaction
// (userId comes from the request context; eq() is drizzle-orm's equality helper)
await db.transaction(async (tx) => {
  await tx.update(users).set({ lastActive: new Date() }).where(eq(users.id, userId));
  await tx
    .update(subscriptions)
    .set({ usageCount: sql`usage_count + 1` })
    .where(eq(subscriptions.userId, userId));
});
// AFTER: Separate operations
await db.update(users).set({ lastActive: new Date() }).where(eq(users.id, userId)); // No transaction needed
await db
  .update(subscriptions)
  .set({ usageCount: sql`usage_count + 1` })
  .where(eq(subscriptions.userId, userId));
This reduced lock duration by 60% and further decreased deadlock probability.
Layer 3: Deadlock Detection and Retry Logic
As a final safety net, we added deadlock detection with exponential backoff retry:
async function executeWithDeadlockRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // PostgreSQL reports a detected deadlock as SQLSTATE 40P01
      const code = (error as { code?: string })?.code;
      if (code === '40P01' && attempt < maxRetries - 1) {
        const backoffMs = Math.pow(2, attempt) * 100; // 100ms, 200ms, 400ms
        await new Promise((resolve) => setTimeout(resolve, backoffMs));
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}
This catches the rare remaining deadlocks (PostgreSQL detects them and aborts one transaction) and retries with backoff. In practice, retries succeed on the first attempt 99.9% of the time.
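For context, here is roughly how the wrapper composes with the Layer 1 fix in the webhook handler. A sketch with the same assumed Drizzle-style setup; the variable names are illustrative:
// Wrap the handler's transaction so a detected deadlock (40P01) becomes a
// short, bounded retry instead of a 500 response.
await executeWithDeadlockRetry(() =>
  db.transaction(async (tx) => {
    // Layer 1: lock the users row first to preserve the canonical order.
    await tx.execute(sql`SELECT id FROM users WHERE id = ${userId} FOR UPDATE`);
    await tx.execute(sql`UPDATE subscriptions SET status = 'active' WHERE id = ${subscriptionId}`);
    await tx.execute(sql`UPDATE users SET subscription_tier = 'premium' WHERE id = ${userId}`);
  })
);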
Prevention: Monitoring and Alerting
To catch similar issues before they reach production, we added:
- Database lock monitoring: Alert if >10 queries are waiting on locks for >5 seconds (see the polling sketch after this list)
- Connection pool monitoring: Alert if pool utilization exceeds 80%
- Deadlock rate tracking: Log and graph deadlock occurrences (should be ~0 after our fix)
- Load testing with concurrency: Our CI now runs tests with 50 concurrent users to surface race conditions early
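The lock-wait alert is simple enough to prototype in a few lines. A sketch, assuming node-postgres (pg), a DATABASE_URL connection string, and a notify() stand-in for the real pager integration:
// Poll pg_stat_activity and page when too many queries sit in lock waits.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function checkLockWaits(): Promise<void> {
  const { rows } = await pool.query(`
    SELECT count(*)::int AS waiting
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
      AND state = 'active'
      AND now() - query_start > interval '5 seconds'
  `);
  if (rows[0].waiting > 10) {
    await notify(`Lock contention: ${rows[0].waiting} queries waiting >5s`);
  }
}

// Hypothetical alert hook — swap in PagerDuty, Slack, or your metrics pipeline.
async function notify(message: string): Promise<void> {
  console.error(message);
}

setInterval(() => void checkLockWaits().catch(console.error), 15_000);
In a real setup this belongs in the metrics pipeline rather than a setInterval loop, but the query is the part that matters.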
The Results: What Changed
After deploying these fixes to production:
- Zero deadlocks in 8 weeks of production traffic (previously: 2-5 per day)
- p99 latency reduced from ~120ms to ~75ms (lower lock contention)
- Connection pool utilization dropped from average 85% to 40%
- Database CPU reduced by 25% (fewer blocked queries)
Key Lessons Learned
1. Race Conditions Are Probability Functions
This bug existed for months but only surfaced when traffic patterns changed. Low-probability race conditions become certainties at scale. Assume that if something can happen, it will happen under production load.
2. Consistent Lock Ordering Is Non-Negotiable
If your application uses transactions that lock multiple tables, enforce lock ordering in code reviews and linting rules. Document the canonical order (e.g., "always lock users, then subscriptions, then audit_logs").
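One lightweight way to make that canonical order checkable rather than tribal knowledge is a shared constant plus an assertion you can call from integration tests or a custom lint rule. A sketch with illustrative names, not code from our repo:
// Canonical lock order, documented in code.
const LOCK_ORDER: readonly string[] = ['users', 'subscriptions', 'audit_logs'];

// Throws when a transaction declares its tables out of canonical order.
function assertLockOrder(tables: string[]): void {
  const ranks = tables.map((t) => {
    const rank = LOCK_ORDER.indexOf(t);
    if (rank === -1) throw new Error(`Unknown table in lock order check: ${t}`);
    return rank;
  });
  for (let i = 1; i < ranks.length; i++) {
    if (ranks[i] < ranks[i - 1]) {
      throw new Error(`Lock order violation: ${tables.join(' -> ')}`);
    }
  }
}

assertLockOrder(['users', 'subscriptions']);    // ok
// assertLockOrder(['subscriptions', 'users']); // throws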
3. Transaction Scope Should Be Minimal
Every line of code inside a transaction extends lock duration and increases deadlock risk. Aggressively minimize transaction scope. If an operation doesn't need atomicity, move it outside the transaction.
4. Monitoring Must Detect Lock Contention Early
By the time users report slow performance, you're already in crisis mode. Monitor pg_stat_activity and pg_locks proactively. Alert on abnormal lock wait times before they cascade into total failure.
5. Retries Are a Safety Net, Not a Solution
Deadlock retry logic is useful as a last line of defense, but if you're relying on it frequently, you haven't solved the underlying architectural problem. Fix the root cause first, then add retries for the remaining edge cases.
Conclusion: Production Reliability Is About Architecture
This incident reinforced a core principle: production reliability isn't about heroic debugging at 3 AM—it's about architectural discipline that prevents 3 AM wake-up calls in the first place.
The real fix wasn't the code changes. It was the systematic approach:
- Diagnose with data (logs, metrics, database introspection)
- Understand the root cause (not just the symptoms)
- Implement layered defenses (prevention + detection + mitigation)
- Add monitoring to catch recurrence early
- Load test to verify the fix under realistic concurrency
If you're building a SaaS platform that processes real-world traffic, database deadlocks aren't a matter of "if" but "when." The difference between a minor incident and a catastrophic outage is whether you've designed your system to handle them gracefully.
Need help ensuring your production systems can handle scale without breaking? Let's talk. We specialize in diagnosing, fixing, and preventing exactly these kinds of production incidents.