Systems April 3, 2026 5 min read

What to Do When You Inherit Infrastructure You Don't Trust

Every inherited system has a load-bearing piece that nobody documented. The question is whether you find it before or after it fails under pressure.

By Yury

Taking over a system built by someone else, particularly someone who is no longer available to answer questions, is one of the most common situations in engineering and one of the least discussed in any structured way. Most guidance on it amounts to “read the documentation,” which assumes documentation exists and is accurate, and “talk to whoever was responsible,” which assumes they are still around and remember the details.

In practice, inherited systems usually have gaps in documentation, undocumented assumptions baked into the code, and a handful of things that are quietly load-bearing in ways nobody realised when they were built.

Why inherited systems are actually risky

The risk in an inherited system is not primarily what you can see. A system that is visibly broken gets fixed. The risk is in what is invisible: the assumptions that were made under time pressure and never revisited, the manual steps that were added as temporary measures and became permanent, the configurations that were correct for a situation that no longer exists.

These risks are hard to find because the system is running. Running systems create a kind of false confidence. It worked yesterday; it will probably work today. The problem is that “worked yesterday” doesn’t mean “will work under higher load,” or “will handle an edge case that hasn’t occurred yet,” or “will survive the departure of the person who knew which flag to flip when the batch job fails.”

The failure mode is usually not gradual degradation. It is abrupt: a load spike, a security incident, a configuration change that turns out to have been load-bearing, a payment provider that changes its API behaviour. At that point you are trying to understand the system while it is failing, which is extremely difficult, and the cost of not understanding it is high.

The questions the previous team probably didn’t ask

There are four questions that experienced engineers learn to ask about any system they are taking over responsibility for. They are not complicated, but they require deliberately stepping back from the assumption that a running system is a safe system.

What fails first under stress? Load the system in staging, or trace through the architecture to find the component with the lowest headroom. This is usually a database connection pool, a queue that does not drain under peak conditions, or an external API that has its own rate limits. Knowing where the first failure point is means knowing what to watch and what to address before it becomes a production incident.
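One way to make the headroom comparison concrete is to tabulate observed peak usage against capacity for each component and sort by what is left. The sketch below is a minimal illustration: the component names and every number in it are hypothetical, and real values would come from your metrics or a staging load test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    peak_usage: float  # highest observed value, in the component's own unit
    capacity: float    # configured or documented limit, same unit

    @property
    def headroom(self) -> float:
        """Fraction of capacity still free at observed peak (0.0 = saturated)."""
        return 1.0 - self.peak_usage / self.capacity


def rank_by_headroom(components):
    """Return components ordered from least to most headroom."""
    return sorted(components, key=lambda c: c.headroom)


# Illustrative numbers only (assumptions, not measurements).
inventory = [
    Component("db_connection_pool", peak_usage=92, capacity=100),
    Component("order_queue_depth", peak_usage=4_000, capacity=10_000),
    Component("provider_api_requests_per_min", peak_usage=450, capacity=600),
]

for component in rank_by_headroom(inventory):
    print(f"{component.name}: {component.headroom:.0%} headroom")
```

The first entry in the ranking is the component to watch and, most likely, the first one to address.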

What is unencrypted that shouldn’t be? This is a search, not a guess. Look for credentials, tokens, customer data, and payment information: where they are stored, how they are transmitted, and by what mechanism. Systems built quickly often have encryption that was added as an afterthought and applied inconsistently. A regulated environment (finance, healthcare, government) will make this visible during an audit. It is better to find it first.
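A first pass at that search can be automated. The following is a rough sketch that flags lines in configuration and source files which look like plaintext secrets. The patterns and the file suffix list are assumptions chosen for illustration; they will produce false positives and miss formats specific to your stack, and they say nothing about data in transit or in a database.

```python
import re
from pathlib import Path

# Hypothetical patterns; extend them for the secret formats your stack uses.
PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "assigned_secret": re.compile(
        r"(?i)(password|passwd|secret|api_key|token)\w*\s*[:=]\s*['\"]?[^\s'\"]{8,}"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every suspicious line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


def scan_tree(root, suffixes=(".env", ".yml", ".yaml", ".ini", ".py", ".json")):
    """Scan files under root whose name ends with one of the given suffixes."""
    findings = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name.endswith(suffixes):
            hits = scan_text(path.read_text(errors="ignore"))
            if hits:
                findings[str(path)] = hits
    return findings


sample = 'host = "db.internal"\nDB_PASSWORD = "hunter2hunter2"\n'
print(scan_text(sample))
```

Treat the output as a list of places to read by hand, not as a verdict.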

What is manual that should be automated? Manual steps in critical paths are a risk category of their own. They introduce human error, depend on institutional knowledge, and scale badly. They are also usually the first thing to break under pressure because they require the right person to be available at the right time. The answer is not always to automate everything immediately, but it should be to document every manual step and assess which ones carry the most risk.

What is load-bearing that nobody knows is load-bearing? This is the hardest question because it requires inferring intent from code that may not express it. The usual candidates are: configuration values that look like defaults but were deliberately set, scheduled jobs that are not monitored but are critical to downstream processes, and integrations that are not in the primary documentation because they were added by someone who has since left.
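Unmonitored scheduled jobs are the easiest of these candidates to check mechanically. A minimal sketch, assuming you can collect each job's last successful run from logs or a scheduler and that you know roughly how often it is supposed to run; the job names, the timestamps, and the 1.5x grace factor are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def overdue_jobs(last_success, expected_interval, now, grace=1.5):
    """Return names of jobs whose last success is older than grace * interval.

    Jobs with no recorded success at all are reported too.
    """
    overdue = []
    for job, interval in expected_interval.items():
        last = last_success.get(job)
        if last is None or now - last > interval * grace:
            overdue.append(job)
    return sorted(overdue)


# Illustrative data only (assumptions, not taken from any real system).
now = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
expected_interval = {
    "nightly_reconciliation": timedelta(days=1),
    "hourly_export": timedelta(hours=1),
    "invoice_batch": timedelta(days=1),
}
last_success = {
    "nightly_reconciliation": now - timedelta(hours=20),
    "hourly_export": now - timedelta(hours=5),
}

print(overdue_jobs(last_success, expected_interval, now))
```

A job that appears here and has no alert attached is a candidate for the "nobody knows it is load-bearing" list.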

Payment systems as a specific case

Payment infrastructure deserves its own category because the failure modes are different and the consequences are more severe. A slow page load is a user experience problem. A payment that charges twice, or fails silently, or retries in a way the provider interprets as a new transaction, is a financial and legal problem.

The most common hidden risks in inherited payment systems are in the error-handling logic. A payment flow that was tested under normal conditions may behave incorrectly when the provider returns an unusual status code, when a network timeout occurs at a specific point in the transaction, or when a currency conversion edge case triggers a rounding error. These situations don’t occur in development, occur rarely in production, and are catastrophic when they do.

The right approach is to read the error-handling code as carefully as the happy-path code, trace through every branch that handles an unexpected provider response, and check whether the retry logic can produce duplicate charges. In more than twenty payment systems I have worked with, the majority had at least one case where the retry logic could cause a charge to appear twice under specific conditions.
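The duplicate-charge case is easiest to see in a small simulation. The sketch below is not any real provider's API: `FakeProvider` is a hypothetical stand-in that deduplicates on an idempotency key and loses the response to the first call for each key, which is the situation where naive retries go wrong. Many providers offer some form of idempotency key, but the exact mechanism varies, so check the documentation for the one you use.

```python
import uuid


class FakeProvider:
    """Hypothetical provider that deduplicates on an idempotency key.

    The first call for each key records the charge and then simulates a
    lost response by raising TimeoutError.
    """

    def __init__(self):
        self.charges = {}        # idempotency key -> amount in minor units
        self._timed_out = set()

    def charge(self, key, amount_minor):
        if key not in self.charges:
            self.charges[key] = amount_minor  # money moves once per key
        if key not in self._timed_out:
            self._timed_out.add(key)
            raise TimeoutError("response lost after the charge was recorded")
        return {"status": "succeeded", "amount": self.charges[key]}


def charge_with_retry(provider, amount_minor, attempts=3):
    """Retry on timeout, reusing one key so retries cannot add a charge."""
    key = str(uuid.uuid4())  # generated once, outside the retry loop
    for _ in range(attempts):
        try:
            return provider.charge(key, amount_minor)
        except TimeoutError:
            continue
    raise RuntimeError("charge outcome unknown; reconcile before retrying")


provider = FakeProvider()
result = charge_with_retry(provider, 1999)
print(result["status"], len(provider.charges))
```

If the key were generated inside the loop instead, every retry would look like a new transaction to this fake provider and each attempt would record its own charge. That is the pattern to look for when reading inherited retry code.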

What to do with what you find

The output of this kind of assessment is a risk map: a document that lists what was found, how severe it is, and what addressing it would cost. This is different from a list of problems, which tends to produce anxiety rather than prioritised action.

A risk map lets the business make informed decisions about which risks to address immediately, which to schedule, and which to accept. Not every risk needs to be fixed urgently. Some risks are low-probability even if high-consequence, and addressing them immediately would displace work on risks that are both high-probability and high-consequence.
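A risk map does not need special tooling. As a sketch, each finding can carry a probability, a consequence, and a rough cost, and the list can be sorted so that the high-probability, high-consequence items surface first. The 1 to 5 scales, the multiplication used as a score, and the example findings are assumptions made for illustration; use whatever scoring your organisation already trusts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    finding: str
    probability: int  # 1 (rare) to 5 (expected), an assumed scale
    consequence: int  # 1 (minor) to 5 (severe), an assumed scale
    cost_days: float  # rough estimate of the effort to address it

    @property
    def score(self) -> int:
        return self.probability * self.consequence


def prioritise(risks):
    """Highest score first; ties go to the cheaper fix."""
    return sorted(risks, key=lambda r: (-r.score, r.cost_days))


# Hypothetical findings with made-up numbers.
risks = [
    Risk("Retry can double-charge on timeout", 3, 5, 3),
    Risk("Connection pool saturates at peak", 4, 4, 2),
    Risk("Unmonitored reconciliation job", 2, 4, 1),
    Risk("Region-wide outage with no failover", 1, 5, 30),
]

for risk in prioritise(risks):
    print(f"{risk.score:>2}  {risk.cost_days:>4} days  {risk.finding}")
```

In this example the low-probability, high-consequence outage lands at the bottom, where it can be scheduled or consciously accepted rather than displacing the items above it.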

The goal of inheriting a system is not to rebuild it in your own image. It is to understand it well enough that you are not surprised by it, and to address the risks that are most likely to cause harm before they do.

That is a more modest goal than it sounds, and it is almost always the right one.
