I've been involved in more cloud migrations than I can count at this point. The technical work varies enormously — different stacks, different scales, different legacy constraints. What doesn't vary much is the pattern of how problems start.

The most expensive migration mistakes tend to happen before a single resource is provisioned. They're decisions made in week one, under time pressure, with incomplete information, that then constrain everything that follows for the next two years.

The questions that get skipped

There are three questions that should be answered in writing before a migration project begins. In my experience they're often either not asked at all or answered only superficially.

The first is: what are we actually moving? This sounds trivial and it isn't. A complete inventory of workloads, databases, file storage, third-party dependencies, scheduled jobs, and anything else that runs in the current environment takes real time to produce and is almost always incomplete the first time. The assumption that "we know what we have" tends to be wrong in small but consequential ways.

The second question is: what order does it need to move in? Applications that share databases, services that call each other, batches that depend on file outputs from upstream processes — these dependencies determine the migration sequence. Moving things in the wrong order creates gaps that are genuinely painful to fix mid-migration.
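One way to make the sequencing question concrete is to write the dependencies down as a graph and let a topological sort propose the order. The sketch below is a minimal illustration, not a migration tool: the workload names are hypothetical, and it assumes that a dependency (a database, a file share) should move before the things that rely on it, which won't hold for every migration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workloads. Each key maps to the set of things it depends on.
depends_on = {
    "orders-db": set(),
    "shared-files": set(),
    "orders-api": {"orders-db"},
    "billing-batch": {"orders-api", "shared-files"},
    "web-frontend": {"orders-api"},
}


def migration_waves(graph):
    """Group workloads into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies have moved
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(migration_waves(depends_on), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

A CycleError here is useful information in its own right: two workloads that depend on each other can't be sequenced and probably need to move together in the same cutover window.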

The third question is the one people most often skip entirely: what does success look like, and how will we know we've achieved it? Not just "it works" — specific, testable criteria for each workload that tell you whether the migration met the requirements. Without these, the definition of "done" becomes whatever the team agrees feels about right, which is a difficult place to be when something subtle goes wrong in production three weeks after cutover.
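"Specific, testable criteria" can be as simple as a list of thresholds per workload that gets checked against measurements after cutover. The following is a sketch under stated assumptions: the metric names, thresholds, and workload names are all invented for illustration, and every criterion is expressed as an upper bound to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    workload: str
    metric: str          # hypothetical metric name
    max_allowed: float   # the measured value must not exceed this


# Hypothetical criteria, agreed and written down before the migration starts.
criteria = [
    Criterion("orders-api", "p95_latency_ms", 250),
    Criterion("orders-api", "error_rate", 0.001),
    Criterion("billing-batch", "batch_duration_min", 120),
]


def failed_criteria(criteria, measurements):
    """Return {workload: [metric, ...]} for every criterion that was not met."""
    failures = {}
    for c in criteria:
        value = measurements.get(c.workload, {}).get(c.metric)
        # A metric nobody measured counts as a failure, not a pass.
        if value is None or value > c.max_allowed:
            failures.setdefault(c.workload, []).append(c.metric)
    return failures


# Hypothetical post-cutover measurements; billing-batch was never measured.
measured = {"orders-api": {"p95_latency_ms": 180, "error_rate": 0.004}}
print(failed_criteria(criteria, measured))
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: it stops "nobody checked" from quietly becoming "done".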

Lift-and-shift vs re-architecture: a real choice, not a default

There's a recurring conversation in migration projects about whether to lift and shift — move the existing setup more or less as-is to cloud infrastructure — or to use the migration as an opportunity to re-architect toward cloud-native services. Both are legitimate approaches and both have genuine costs and benefits. The problem is when the choice is made implicitly rather than explicitly.

Lift-and-shift is faster and lower risk for individual workloads. It's also usually more expensive in ongoing cloud costs, because on-premises architectures typically don't take advantage of cloud elasticity or managed services. You're essentially paying cloud prices for something that runs like it's still on a physical server.

Re-architecting takes longer, costs more in engineering time upfront, and introduces risk — you're changing the thing as well as moving it. But the resulting workload usually runs more efficiently and is easier to scale.

The practical question is whether you're trying to get off on-premises infrastructure quickly (in which case lift-and-shift with a plan to optimise later is usually correct) or whether the migration is part of a broader modernisation effort (in which case doing it once properly makes more sense than doing it twice).

The worst outcome is treating lift-and-shift as a temporary step and then never finding the time to revisit it. That's not a migration strategy, it's a deferral.

What the discovery phase is actually for

A discovery phase at the start of a migration project is often framed as "understanding the current environment." That's accurate but undersells it. The more important output is a list of things you didn't know before you started looking.

In most environments of any age, discovery surfaces applications that nobody is actively aware of but that turn out to be running something critical. Legacy scheduled jobs writing to shared directories. Services with hardcoded IPs that will break when infrastructure changes. Databases that are replicated in ways the documentation doesn't mention.
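Some of this can be found mechanically. As one example, hardcoded IPs in configuration files and scripts can be flagged with a simple scan. The function below is a rough sketch, not a discovery tool: it only handles IPv4, the sample config is invented, and it will produce false positives (a four-part version string such as 1.2.3.4 looks like an address), so the results are a list to review by hand.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_hardcoded_ips(text):
    """Return (line_number, address) pairs for literal IPv4 addresses in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group())  # rejects e.g. 999.1.1.1
            except ValueError:
                continue
            hits.append((line_number, match.group()))
    return hits


# Hypothetical config file contents.
sample_config = """\
db_host = 10.0.4.17
app_version = 1.2.3
callback_url = http://192.168.1.20:8080/api
"""

for line_number, address in find_hardcoded_ips(sample_config):
    print(f"line {line_number}: {address}")
```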

Two or three weeks of discovery before any migration work begins is rarely time wasted. The usual objection is cost and schedule pressure. The counter-argument is that the problems discovery finds are much cheaper to address before cutover than after.

Rollback is a deliverable, not an afterthought

Every migration project should have documented rollback procedures for each workload, tested before the cutover window. This sounds obvious. It's less commonly done than it should be.

Rollback procedures that exist only in someone's head, or that have never been rehearsed, are not rollback procedures. They're optimism. If the cutover goes badly and you need to reverse it quickly, a procedure that takes 40 minutes to improvise is a procedure that takes 40 minutes while something is broken in production.

Testing the rollback path during pre-migration validation adds time. It's the right kind of time to spend.
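Treating rollback as a deliverable means it can be checked like one. The sketch below shows one possible go/no-go check before the cutover window. The field names, workload names, runbook path, and time budget are hypothetical; the point is only that "written down", "rehearsed", and "fast enough" are three separate things to verify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RollbackPlan:
    workload: str
    runbook: Optional[str]              # where the written procedure lives, if anywhere
    rehearsed_minutes: Optional[float]  # duration of the last rehearsal, None if never run


def cutover_blockers(plans, max_minutes):
    """Return a human-readable reason for each workload that isn't ready to cut over."""
    blockers = []
    for plan in plans:
        if not plan.runbook:
            blockers.append(f"{plan.workload}: no written rollback procedure")
        elif plan.rehearsed_minutes is None:
            blockers.append(f"{plan.workload}: rollback never rehearsed")
        elif plan.rehearsed_minutes > max_minutes:
            blockers.append(
                f"{plan.workload}: rehearsal took {plan.rehearsed_minutes:g} min, "
                f"budget is {max_minutes:g} min"
            )
    return blockers


# Hypothetical plans for three workloads, checked against a 15-minute budget.
plans = [
    RollbackPlan("orders-api", "runbooks/orders-api-rollback.md", 12),
    RollbackPlan("billing-batch", "runbooks/billing-rollback.md", None),
    RollbackPlan("web-frontend", None, None),
]
for reason in cutover_blockers(plans, max_minutes=15):
    print(reason)
```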

After cutover

The period immediately after a migration cutover — the first two to four weeks — is when the things you didn't find in testing show up. Workloads behave differently under real traffic. Integrations that worked in staging have edge cases that only appear in production. Users do things that weren't in the test scripts.

This is the argument for having whoever manages the environment post-migration actively involved during cutover rather than brought in a week later. The handover shouldn't be a formal point-in-time event — it should be an overlap where the people who did the migration and the people who will maintain it are working together for long enough that the institutional knowledge transfers properly.

The alternative — a handover document and a sign-off call — is how tribal knowledge about "how it works in practice" gets lost.