June 18, 2026 · reliability · systems

10 Production Failure Patterns I Keep Seeing (And Why They Never Really Go Away)

Dozens of post-incident reports, the same handful of failure shapes — each time wearing different clothes.

Over my career at high-scale commerce and fintech companies, I’ve read dozens of post-incident reports across different teams and systems. The services, languages, and years all differ, but when you line the reports up side by side, the same handful of failure shapes keeps showing up in different clothes.

This isn’t a “10 best practices” listicle. It’s the opposite: a look at why experienced engineers keep hitting the same walls, even when the specific bug is new. If you work on any backend system at scale, I’d bet at least six of these will feel uncomfortably familiar.

1. Missing Null / Empty Value Handling

This is, by far, the most common root cause I’ve come across. It’s popular precisely because it’s invisible until it isn’t — the compiler is happy, the tests are green, and then real-world data shows up with a field that “should always be there.”

What it looks like in practice:

A service assumes a user record always has a certain attribute. Some legacy accounts don’t. Crash.
Code calls .replace() or similar string methods on a value that’s null because an older record never had that field populated.
A config value is programmatically set to an empty string instead of being unset. strconv.Atoi("") (or your language’s equivalent) returns an error, not a panic — but the caller ignores the error, carries on with the zero value, and blows up a few lines downstream.
A boolean condition meant to be A && B is accidentally written as A || B, which quietly makes the check pass in cases it never should — skipping a fallback path nobody notices is gone.

Why it keeps happening: we test the happy path exhaustively and the “this field is missing” path almost never. Production is the only place with years of accumulated messy data — staging never has it.
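Here is a minimal Python sketch of the first three cases above. All names in it (display_name, parse_retry_limit, the nickname field) are hypothetical and only there for illustration. The point is to turn "missing", "null", and "empty" into explicit branches instead of surprises:

```python
from typing import Optional


def display_name(record: dict) -> str:
    """Return a cleaned nickname, tolerating legacy records without one."""
    # .get() covers a missing key; `or ""` also covers an explicit None.
    nickname = record.get("nickname") or ""
    return nickname.replace("_", " ").strip() or "(unknown)"


def parse_retry_limit(raw: Optional[str], default: int = 3) -> int:
    """Parse a numeric config value; unset and empty both mean 'use the default'."""
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        # Fail loudly here rather than carrying a zero value downstream.
        raise ValueError(f"retry limit is not an integer: {raw!r}") from exc
    if value < 0:
        raise ValueError(f"retry limit must be non-negative: {value}")
    return value
```

Whether an empty string should fall back to a default or be rejected outright is a judgment call per setting. What matters is that the decision is written down in code and covered by a test that feeds in the missing value.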

2. Shipping Without Testing the Actual Failure Scenario

This is less “no QA happened” and more “QA happened, but on the wrong thing.” There are usually four flavors:

No testing at all, because the change felt too small to matter.
Testing happened, but on a scenario that wasn’t the one that broke.
A debug/test value got left in the code and shipped to production.
The blast radius of the change was bigger than the person deploying it realized.

What it looks like in practice:

A queue consumer gets deployed without dev testing or sign-off, quietly stops processing messages, and nobody notices until a backlog alarm fires hours later.
QA validates one order type or user segment thoroughly, but the bug only manifests for a different segment that was never in the test plan — so the team ships with false confidence.
A hardcoded debug value (like a placeholder ID that should map to “no override”) never gets reverted before the production push.
A one-day “quick fix” adds a multi-table join inside a hot code path with no load or performance review, and it works fine — until real traffic hits it.

Why it keeps happening: the pressure to ship fast is real, and “it’s a small change” is a very persuasive (and very wrong) heuristic for risk.

3. Cache Design Failures

Caching bugs are disproportionately painful because a cache entry is shared by every request that reads it. A caching mistake rarely affects just one user; it affects everyone who touches the bad entry, all at once.

What it looks like in practice:

A diagnostic/monitoring command gets left running against a production cache cluster and slowly eats memory until the cluster falls over hours later.
A race condition between two writers means an empty object gets cached instead of real data. The code only checks whether the value is None, and an empty dict or empty list is not None, so the empty value counts as a valid hit and is served for hours.
A schema change means a new service version writes a different data shape under the same cache key that older instances still read — old code gets a shape it doesn’t expect and panics.
A filter that used to exclude a subset of records from caching gets accidentally removed, and suddenly the cache is asked to hold every record instead of a fraction of them.
A feature flag rollout unintentionally triggers a full keyspace scan (SCAN/KEYS-style operation) across the entire cache instead of a targeted lookup, pinning CPU at 100%.

Why it keeps happening: caches sit at the intersection of correctness and performance, and most people reason carefully about one but not both at the same time.
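The empty-value case above can be sketched in a few lines of Python. This assumes an empty payload is never a legitimate value for the key in question (that is not true for every cache), and uses a plain dict as a stand-in for the cache client. The names is_usable, get_profile, and load_from_source are hypothetical:

```python
_MISSING = object()  # sentinel: distinguishes "not cached" from any cached value


def is_usable(value: object) -> bool:
    """Reject None and empty containers; here an empty payload is never valid."""
    if value is None:
        return False
    if isinstance(value, (dict, list, tuple, set, str, bytes)) and len(value) == 0:
        return False
    return True


def get_profile(cache: dict, user_id: str, load_from_source):
    cached = cache.get(user_id, _MISSING)
    if cached is not _MISSING and is_usable(cached):
        return cached
    # Cache miss or a poisoned (empty) entry: go back to the source of truth.
    fresh = load_from_source(user_id)
    if is_usable(fresh):
        cache[user_id] = fresh  # only cache values that pass the same check
    return fresh
```

Applying the same check on the write side is the part that tends to get skipped. It is what stops the race from poisoning the entry in the first place.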

4. Shared Infrastructure Blast Radius

This is the “I only changed my service” trap. Infrastructure is often more shared than the org chart suggests, and a change scoped to “just my team’s ingress rule” or “just my team’s plugin” can quietly affect every other team riding the same shared component.

What it looks like in practice:

A team adds an SSL-redirect annotation to their own ingress config — not realizing the annotation applies at the level of the entire shared load balancer group, breaking TLS for every other service behind it.
A canary deployment clones all existing gateway plugin configs and then re-adds a plugin that already exists, creating a duplicate ID that causes the entire gateway to reject its whole config — every route goes down, not just the one being changed.
A new business rule gets added directly to a shared global configuration file with no feature flag, and it goes live everywhere at once instead of rolling out gradually.
A third-party integration gets enabled for 100% of traffic in one go instead of a staged rollout; request volume jumps 40x in minutes and the vendor’s rate limiter blocks the whole account.

Why it keeps happening: “shared infrastructure” is invisible in most mental models of a system. You can see your own service’s dependency graph; you usually can’t see who else depends on the same ALB, the same API gateway, or the same config file.
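For the last two cases above, the missing piece is usually a percentage gate. Below is a rough sketch of a deterministic one, not any particular feature-flag product's API. The function name in_rollout and the 10,000-bucket scheme are my own assumptions:

```python
import hashlib


def in_rollout(key: str, feature: str, percent: float) -> bool:
    """Deterministically place `key` in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{feature}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=0 admits nobody, percent=100 admits everybody.
    return bucket < percent * 100
```

Because the bucket depends only on the feature name and the key, anyone admitted at 5% is still admitted at 25%, so raising the percentage only ever adds traffic. That makes it possible to ramp a vendor integration in steps and watch the request volume at each one.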

5. Database Connection Pool Misconfiguration

This one tends to cluster — once a team hits it, a sibling team hits the exact same shape of bug weeks later, because the lesson didn’t generalize past the original incident.

What it looks like in practice:

A connection pool is sized for QA-level traffic and never revisited before a production launch; real load exhausts it immediately.
A default configuration value silently sets max connections to zero, so the very first request after a deploy fails.
DNS caching causes all database connections to resolve to a single replica instead of load-balancing across a cluster, spiking CPU on one node while others sit idle.

Why it keeps happening: connection pool settings get treated as “set and forget,” but traffic patterns change continuously. Nobody revisits a config that isn’t currently on fire.
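A startup check catches the first two cases above before any request does. The sketch below uses hypothetical names (PoolConfig, validate_pool) and sizes the pool with Little's law: connections in use is roughly arrival rate times the time each request holds a connection. The 1.5x headroom factor is an arbitrary assumption, not a recommendation:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    max_connections: int
    acquire_timeout_s: float


def required_connections(peak_rps: float, avg_hold_s: float, headroom: float = 1.5) -> int:
    """Little's law estimate: in-use connections ~= arrival rate x hold time."""
    return math.ceil(peak_rps * avg_hold_s * headroom)


def validate_pool(cfg: PoolConfig, peak_rps: float, avg_hold_s: float) -> None:
    """Raise at startup instead of failing on the first request."""
    if cfg.max_connections <= 0:
        raise ValueError(f"max_connections must be positive, got {cfg.max_connections}")
    if cfg.acquire_timeout_s <= 0:
        raise ValueError(f"acquire_timeout_s must be positive, got {cfg.acquire_timeout_s}")
    needed = required_connections(peak_rps, avg_hold_s)
    if cfg.max_connections < needed:
        raise ValueError(
            f"pool of {cfg.max_connections} is below the estimated need of {needed}"
        )
```

The estimate is only as good as the traffic numbers fed into it, which is the real lesson: the peak rate and hold time need to come from current production measurements, not from whatever QA saw.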

6. Kafka (or Any Message Queue) Reliability Assumptions

Message queues promise durability and ordering, but only if you actually configure them for it. Defaults, especially in older client libraries, often favor throughput over guaranteed delivery, and teams discover this the hard way.

What it looks like in practice:

An older client library defaults to a low-durability acknowledgment mode. During a routine broker upgrade, a partition leadership change causes in-flight messages to be silently dropped — no error, no retry, just gone.
A hardware issue on one broker increases publish latency just enough that a request timeout cancels the publish mid-flight, leaving records stuck in an intermediate state.
After a migration, a consumer resets to the earliest offset instead of the last committed one and replays months of historical messages, causing a latency spike that looks like an outage.
A consumer hits a malformed message, enters a crash loop, and nobody notices because lag alerting isn’t wired up — it’s caught by user complaints instead of monitoring.

Why it keeps happening: “the queue will handle it” is a comforting assumption that’s only true if someone actually configured the durability guarantees you’re relying on.
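The crash-loop case above is the easiest one to sketch without tying it to a specific broker or client library. What follows is a broker-agnostic Python outline, assuming JSON payloads. The consume function and the in-memory dead_letters list are hypothetical stand-ins for a real consumer loop and a real dead-letter topic:

```python
import json
from typing import Callable, Iterable


def consume(
    messages: Iterable[bytes],
    handle: Callable[[dict], None],
    dead_letters: list,
) -> int:
    """Process messages; park undecodable ones instead of crashing on them.

    Returns the number of messages handled successfully.
    """
    handled = 0
    for raw in messages:
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                raise ValueError("payload is not a JSON object")
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            # Malformed input will never succeed on retry, so set it aside
            # (with the reason) and keep the partition moving.
            dead_letters.append((raw, str(exc)))
            continue
        handle(payload)
        handled += 1
    return handled
```

Errors raised by handle itself are deliberately not caught here, since those may be transient and deserve a different policy. A dead-letter path only helps if its depth is alerted on, alongside consumer lag. Otherwise the failure has just moved somewhere quieter.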

7. Config and Secrets Management Failures

Config systems (Vault, environment variables, feature flag services) are meant to be the safe, auditable way to change behavior without a deploy. But they become a single point of failure the moment a bad value can reach production without going through the same review a code change would.

What it looks like in practice:

A secret gets rotated, but running services aren’t restarted, so they keep using the old value until something downstream starts rejecting it.
A trailing newline character sneaks into a config value that isn’t parsed until first use. Startup validation passes and the health check reports green — then the first request to touch that value hits the numeric parse and the runtime panics.
A gateway misconfiguration silently routes an entire category of requests to the wrong destination, and it takes a while to notice because the requests aren’t failing — they’re just going somewhere wrong.

Why it keeps happening: config changes feel lower-risk than code changes, so they often skip the review, staging, and rollback processes that code gets by default.
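The trailing-newline case above comes down to parsing lazily. A rough sketch of the alternative is below: parse every value eagerly at startup and refuse to report healthy until that passes. The setting names PORT and API_TOKEN and the load_settings function are hypothetical. Note that the whitespace check is explicit because Python's int() quietly accepts surrounding whitespace, while many other parsers do not:

```python
def load_settings(env: dict) -> dict:
    """Parse every config value at startup so bad values fail the deploy, not a request."""
    errors = []
    settings = {}

    for name, parse in (("PORT", int), ("API_TOKEN", str)):
        raw = env.get(name)
        if raw is None or raw == "":
            errors.append(f"{name}: missing or empty")
            continue
        if raw != raw.strip():
            # Catches the trailing-newline case regardless of how lenient the parser is.
            errors.append(f"{name}: surrounding whitespace in {raw!r}")
            continue
        try:
            settings[name] = parse(raw)
        except ValueError:
            errors.append(f"{name}: cannot parse {raw!r} as {parse.__name__}")

    if errors:
        # Report everything at once so a bad rollout is fixed in one pass.
        raise ValueError("invalid configuration: " + "; ".join(errors))
    return settings
```

Called once during startup (for example with dict(os.environ)), this turns a first-request panic into a failed deploy, which is the kind of failure a rollback process already knows how to handle.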

8. Logic Inversions During Migration

When business logic moves from an old service to a new one, it’s incredibly easy to invert a priority order or flip a condition without noticing — because the new code compiles fine and looks reasonable on its own. It only looks wrong next to the old code, and nobody does that comparison.

What it looks like in practice:

The old system says “system default overrides user preference.” The new implementation flips it to “user preference overrides system default.” Nobody wrote a test that would have caught the swap, because both versions are individually defensible.
Logic gets migrated to a new service, and the cleanup of the old logic ships before anyone verifies the new implementation actually matches the old behavior side by side.
A standard code path initializes an important internal flag; a legacy path that’s supposed to do the same thing never did. Everything works until a new feature is added to only the standard path, and the legacy path silently breaks.

Why it keeps happening: migrations optimize for “does the new code work,” not “does the new code match the old code’s exact semantics” — and those are very different bars.
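The side-by-side comparison that nobody does can be very small. Here is a sketch of a differential check in Python, using a made-up version of the priority swap described above. The functions resolve_old and resolve_new are hypothetical, not taken from any real system:

```python
def compare_implementations(old, new, cases):
    """Run both implementations on the same inputs and collect disagreements."""
    mismatches = []
    for case in cases:
        expected = old(*case)
        actual = new(*case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches


def resolve_old(system_default, user_preference):
    # Old behaviour: the system default wins whenever it is set.
    return system_default if system_default is not None else user_preference


def resolve_new(system_default, user_preference):
    # Migrated version with the priority accidentally flipped.
    return user_preference if user_preference is not None else system_default


cases = [(None, None), ("A", None), (None, "B"), ("A", "B")]
for case, expected, actual in compare_implementations(resolve_old, resolve_new, cases):
    print(f"{case}: old={expected!r} new={actual!r}")
```

Three of the four cases agree. Only the input where both values are set exposes the swap, which is exactly why each version passes its own tests and still disagrees with the other in production.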

9. Retry Amplification Without Circuit Breakers

Retries are supposed to make a system more resilient. But the reflex to “just retry it” quietly assumes the retry is cheap and the downstream can absorb it — and when that assumption is wrong, retries stop being a safety net and start being the thing that keeps a struggling dependency from ever recovering.

Most teams get this right for external, third-party API calls — timeouts, circuit breakers, backoff, all standard practice by now. The gap is everywhere else: databases, internal caches, other internal services. Because they’re “ours,” they don’t get the same defensive treatment, even though they fail in exactly the same ways.

What it looks like in practice:

A cache read times out under load. Instead of falling back to the source of truth and moving on, the code treats the timeout as “the value wasn’t there” and immediately issues a write back to the same cache to repopulate it — adding more load to a cache that’s struggling precisely because it’s overloaded. The read timeout and the resulting write compound each other, and the cache never gets the breathing room to recover.
A workflow step calling a degraded downstream is configured to retry with exponential backoff for up to 30 minutes. On paper this looks resilient — “we’ll keep trying, no data is lost.” In practice, with no circuit breaker in place, every one of those retries still lands on the downstream the entire time it’s trying to recover, and a dependency that might have healed in a couple of minutes under reduced load instead stays saturated for the full 30.

Why it keeps happening: retry logic is usually added to handle transient blips, not sustained degradation — nobody explicitly decides “let’s keep hammering this for 30 minutes,” it just falls out of a reasonable-looking backoff config applied without a circuit breaker to cut it off. And because circuit breakers are so associated with “external API hygiene,” it’s easy to forget that a database or an internal cache under load needs the exact same protection.
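To make the missing piece concrete, here is a deliberately small circuit breaker in Python. It is a single-threaded sketch under assumed defaults (five consecutive failures, a 30-second cool-down), not a replacement for a maintained library, and the class and parameter names are mine:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is refusing calls to protect the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast: the dependency gets no traffic while it recovers.
                raise CircuitOpenError("dependency is cooling down")
            # Cool-down elapsed: fall through and let this call probe it.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping the retried call in breaker.call means the backoff loop keeps running, but most of its attempts are rejected locally instead of landing on the degraded downstream. The same wrapper works for a database or an internal cache client; nothing about it is specific to external APIs.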

10. Observability Gaps That Delay Detection

This one is rarely the cause of an incident, but it’s almost always a multiplier — the difference between a five-minute blip and a five-hour outage. The pattern: the right metric either doesn’t exist, or its alert threshold was set for a traffic volume from a year ago.

What it looks like in practice:

Logs are fully redacted for compliance reasons, which is correct — except it also hides the exact error and line number needed to diagnose a crash quickly.
A downstream service swallows an error instead of propagating it, so the upstream service’s error rate dashboard looks perfectly healthy the entire time something is actually broken.
An alert threshold of 0.1% error rate sounds tight, until you realize that on a high-traffic endpoint, 0.1% is thousands of failed requests a day — and the issue isn’t caught until it’s ten or twenty times worse than the threshold implies.
Nobody is watching database-internal metrics (like which index the query planner is choosing), so a silent planner regression after a bulk data load goes unnoticed until CPU alarms fire.

Why it keeps happening: observability is an ongoing investment, not a one-time setup, and traffic patterns outgrow yesterday’s thresholds constantly.
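The threshold case above is easier to argue about once it is turned into absolute numbers. Here is a quick sketch of the arithmetic. The traffic figure of 5 million requests a day is an assumption for illustration, not a number from any of the incidents:

```python
def failures_per_day(requests_per_day: int, error_rate: float) -> int:
    """Absolute failures hiding behind a percentage threshold."""
    return round(requests_per_day * error_rate)


def threshold_for_budget(requests_per_day: int, max_failures_per_day: int) -> float:
    """Error-rate threshold that corresponds to a tolerable absolute failure count."""
    if requests_per_day <= 0:
        raise ValueError("requests_per_day must be positive")
    return max_failures_per_day / requests_per_day


# Assumed traffic figure, for illustration only.
daily_requests = 5_000_000
print(failures_per_day(daily_requests, 0.001))    # failures allowed by a 0.1% threshold
print(threshold_for_budget(daily_requests, 500))  # rate that allows 500 failures a day
```

At that assumed volume, a 0.1% threshold tolerates 5,000 failed requests a day before anyone is paged. Rerunning the calculation whenever traffic grows is a cheap way to notice that a threshold has gone stale.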