How we structure a cloud migration

A walk-through of the methodology we've shipped across forty-plus migration programmes since 2010 — together with a concrete anonymised case study from a 2025 UK challenger-bank engagement. Co-written by the five engineers who staffed it.

[Diagram: phased migration from a legacy on-prem estate to a cloud-native target with shared platform services, using a strangler pattern with bidirectional data flow during the transition window. This is the topology used in the case study below.]

Every cloud migration is different. Every cloud migration that goes well is different in roughly the same ways. After forty-plus programmes, here's the structure we've settled on — and a concrete case study from a 2025 engagement that exercised every part of it.

Why have a methodology at all

Methodology gets a bad name because it's usually presented as a 60-slide framework that nobody on the delivery team actually reads. What we mean by methodology is much narrower: a set of standing decisions about how we approach the problem, so that we're not re-debating first principles on every engagement.

The point of a good methodology isn't rigidity. It's shared defaults, so that when a real surprise comes up (and one always does) the team has more energy to think about it. The bits that aren't surprising should be on autopilot.

The case study: a UK challenger bank, 2025

To make the next six sections concrete, we'll thread a real engagement through them. The client was a UK challenger bank: ~£280M GMV through their card products, ~420 staff, FCA-regulated. They came to us with a legacy estate hosted across two co-lo data centres in Slough and Manchester: Java monoliths backed by an Oracle Exadata, batch jobs scheduled through TWS (IBM Workload Scheduler), a modest Linux fleet, and on-prem Splunk for the SOC.

The brief was straightforward to write and hard to scope: move to public cloud, satisfy UK operational-resilience requirements (the FCA's rules and PRA SS1/21), reduce infrastructure spend, and have it done before the data-centre lease at Slough expired in eighteen months. We ran the diagnostic over two weeks, then the full programme over the thirty-six weeks that followed.

  • Sector: banking, FCA-regulated
  • Duration: 14-week pilot + 9-month programme
  • Team: 5 from us, 18 from the client
  • Target: AWS multi-account landing zone
  • Spend reduction: 41% steady-state
  • RTO improvement: 4h → 18min

The five of us each owned a workstream. Muffaddal ran the engagement and held the architecture record. Vijeet led the cloud landing zone and infrastructure-as-code. Taha ran the security and FCA-resilience workstream end-to-end. Jimmy owned the data-migration strategy, including the Oracle-to-Aurora cutover that turned out to be the hardest single piece of work. Ashok embedded with the bank's two Java teams to refactor and re-platform the monoliths a slice at a time.

1. Discovery and current-state

The first two weeks of every migration are spent understanding what you actually have — which is almost always different from what the architecture diagrams say.

Specifically, we want answers to three questions:

  • What's running? An inventory at the workload level, not the server level.
  • What does it cost? The current run-rate, broken down by workload, including the bits that aren't on the cloud bill: licensing, on-prem hardware refresh, people.
  • What does it depend on? The messy graph of which workloads talk to which, including the integrations nobody documented.
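
The output of discovery is less a diagram than a queryable dataset. A minimal sketch of the shape we aim for — names, numbers, and fields are entirely illustrative, not any client's real data:

```python
from collections import defaultdict

# Illustrative discovery output, normalised to workloads (not servers).
# "source" records how we found each one: the catalogue, or traffic analysis.
workloads = {
    "card-issuance-api": {"monthly_run_rate_gbp": 41_000, "source": "catalogue"},
    "eod-settlement":    {"monthly_run_rate_gbp": 3_200,  "source": "catalogue"},
    "desk-batch-fx":     {"monthly_run_rate_gbp": 0,      "source": "traffic-analysis"},
}

# Observed call edges from traffic analysis: (caller, callee).
edges = [
    ("card-issuance-api", "eod-settlement"),
    ("desk-batch-fx", "eod-settlement"),
]

# Shadow IT: anything traffic analysis found that the catalogue missed.
shadow = sorted(n for n, w in workloads.items() if w["source"] != "catalogue")

# Reverse dependency map: before moving a workload, you need to know
# everything that calls it, documented or not.
callers = defaultdict(list)
for caller, callee in edges:
    callers[callee].append(caller)

print("undocumented workloads:", shadow)
print("callers of eod-settlement:", callers["eod-settlement"])
```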

The architecture diagram is always a lie of omission. The current-state map is the truth.

Jimmy on the case: the bank had a documented system catalogue listing 47 applications. After two weeks of discovery interviews and traffic analysis we'd identified 71. The 24 the catalogue had missed were a mixture of shadow IT, vendor-managed appliances nobody had updated the catalogue for, and three batch jobs that had been running on a Windows desktop under a developer's desk since 2017. None of this is unusual.

2. Target-state architecture

With the current state mapped, we work backwards from the outcome. Not "we'll move to AWS" — that's a destination, not a target state. The target state we care about is the operating model: how the platform will be supported, who'll have access, how changes flow, what the SLOs are, how cost is tracked and attributed.
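
One way to keep the operating model honest is to hold it as data rather than slides: one record per service, kept current by the platform team. A minimal sketch, with field names that are ours for illustration rather than any standard:

```python
# A per-service operating-model record. Each field answers one of the
# questions in the paragraph above; all names here are illustrative.
service_record = {
    "service": "card-issuance-api",
    "supported_by": "payments-sre",            # who carries the pager
    "access": {"prod": ["payments-sre"],       # who'll have access, per env
               "staging": ["payments-dev"]},
    "change_path": "pull request -> CI -> progressive rollout",
    "slos": {"availability": "99.95%", "p99_latency_ms": 250},
    "cost": {"attribution_tag": "cost-centre:payments"},
}
```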

Three options at this stage, always. The recommendation, the alternative we'd be happy with, and the one we'd advise against. Even if the recommendation is obvious, naming the others forces clarity about what's being chosen and what's being given up.

Muffaddal on the case: we presented three options: a full AWS landing zone with multi-account separation, AWS plus a UK sovereign-cloud island for the most sensitive PII, or Azure with a Microsoft-led identity story. Recommendation: option one. The bank's CIO had run Azure estates in a previous role and pushed back. We wrote the alternative-options summary specifically so the board could see what they'd be choosing if they overruled us. They didn't.

3. Sequencing the move

The most common cause of migration regret isn't a technical failure. It's a sequencing failure — moving the workloads in the wrong order. The principle we use is straightforward: migrate the workloads that earn the most leverage soonest. Usually that means starting with the ones that are most painful in the current environment (slow to deploy, hard to scale, dependent on legacy capacity) — because the relief is immediate and visible, and you build credibility for the rest of the programme.

The opposite mistake is starting with the easiest workload because it's the easiest. That's the slowest path: nothing meaningful improves, sponsors get nervous, and the difficult workloads still have to be done at the end with less goodwill.
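
The leverage principle can be made mechanical enough to argue about in a steering meeting. A minimal scoring sketch; the weights and field names are illustrative assumptions that get tuned per engagement:

```python
def leverage_score(w: dict) -> float:
    """Pain relieved per week of migration effort. Weights are illustrative."""
    pain = (
        3 * w["pager_nights_per_month"]     # operational pain weighted hardest
        + 2 * w["deploy_lead_time_days"]    # slow delivery is pain too
        + 5 * w["capacity_risk_0_to_5"]     # stuck on legacy hardware?
    )
    return pain / max(w["effort_weeks"], 1)

candidates = [
    {"name": "card-issuance-api", "pager_nights_per_month": 12,
     "deploy_lead_time_days": 30, "capacity_risk_0_to_5": 4, "effort_weeks": 6},
    {"name": "intranet", "pager_nights_per_month": 0,
     "deploy_lead_time_days": 1, "capacity_risk_0_to_5": 0, "effort_weeks": 1},
]

for w in sorted(candidates, key=leverage_score, reverse=True):
    print(f"{w['name']}: {leverage_score(w):.1f}")
```

On these illustrative numbers the painful card-issuance API outranks the intranet even at six times the effort — which is the ordering the case below followed.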

Ashok on the case: the obvious "easy first" candidate was the company intranet, three pages of HTML and a search box. Boring, low-risk, with nobody depending on it. We deliberately skipped it. The first workload we moved was the card-issuance API, the slowest-deploying, most-paged service in the estate. Six weeks in, the on-call team's pager nights dropped from twelve to two. That's the credibility budget for everything that came after.

4. The data question

If the migration has any complexity, it lives here. Compute and code are relatively easy to move. Data — especially live transactional data — is where careers go wrong. There are really only three strategies, and you must pick one consciously:

  • Cutover migration: Maintenance window, dump-and-restore, downtime. Simple but only works for tolerant systems.
  • Dual-write: The application writes to both old and new during the transition. Excellent for safety, terrible for consistency unless carefully engineered; the sketch after this list shows why.
  • Change-data-capture (CDC): Streams from the legacy source into the new platform. Best of both worlds, most operationally complex.
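
Why "terrible for consistency unless carefully engineered"? Because there is a failure window between the two writes that the naive version never observes. A deliberately naive sketch (the Store class is a stand-in, not any real client library):

```python
class Store:
    """Stand-in for a database; illustrative only."""
    def __init__(self) -> None:
        self.rows: list[dict] = []

    def insert(self, txn: dict) -> None:
        self.rows.append(txn)

def dual_write(legacy: Store, new: Store, txn: dict) -> None:
    legacy.insert(txn)
    # If the process dies or the second write fails HERE, the two stores
    # have silently diverged and no caller ever finds out. "Carefully
    # engineered" means closing this window: an outbox table, or a
    # compensating reconciliation that detects the divergence.
    new.insert(txn)

ledger, replica = Store(), Store()
dual_write(ledger, replica, {"txn_id": "t1", "amount": 100})
```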

We pick based on the answer to one question: what's the worst acceptable outcome if data drifts for an hour? If the answer involves regulators or refunds, CDC. If the answer is "we just re-run a reconciliation overnight", cutover.

Jimmy on the case: banking transaction data sits squarely in the "regulators or refunds" category, so CDC was the only option. We used Debezium's Oracle connector to stream changes into Aurora PostgreSQL, with reconciliation jobs running every fifteen minutes across a deliberate two-week shadow period. The hardest single problem in the entire programme was clock drift between the Oracle source and the Aurora target, which introduced a 47-millisecond skew in transaction ordering. We caught it because the reconciliation jobs checked ordering as well as content. Without the shadow period, it would have gone live and been very difficult to unwind.
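
The jobs themselves are straightforward; what matters is checking ordering as well as content. A minimal sketch of the idea — not the bank's actual code, and the rows and timestamps are illustrative:

```python
import hashlib

def content_digest(rows: list[tuple]) -> str:
    """Order-insensitive digest over business content (txn_id, amount)."""
    h = hashlib.sha256()
    for txn_id, amount, _ts in sorted(rows):
        h.update(f"{txn_id}:{amount}".encode())
    return h.hexdigest()

def commit_order(rows: list[tuple]) -> list:
    """Transaction ids ordered by each system's own commit timestamp."""
    return [txn_id for txn_id, _amount, ts in sorted(rows, key=lambda r: r[2])]

def reconcile(source_rows: list[tuple], target_rows: list[tuple], window: str) -> None:
    if content_digest(source_rows) != content_digest(target_rows):
        raise RuntimeError(f"{window}: content drift between source and target")
    if commit_order(source_rows) != commit_order(target_rows):
        raise RuntimeError(f"{window}: same content, different ordering")

# Illustrative: identical business content, but a small clock skew on the
# target flips the apparent order of two nearly simultaneous transactions.
src = [("t1", 100, "12:00:00.000"), ("t2", 250, "12:00:00.040")]
tgt = [("t1", 100, "12:00:00.047"), ("t2", 250, "12:00:00.040")]
try:
    reconcile(src, tgt, "12:00-12:15")
except RuntimeError as e:
    print(e)   # 12:00-12:15: same content, different ordering
```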

5. Cutover and rollback

We've never run a migration cutover without a documented rollback path — even for clients who insisted they didn't need one. Twice in fifteen years we've actually had to use it. Both times the team that designed it didn't believe we'd ever need it, and both times it saved the engagement.

The rollback isn't just a technical procedure. It's a decision tree: at each stage of cutover, what would have to go wrong for us to abort, who has the authority to make that call, and what the recovery path looks like for everything that's already moved. Most teams write the technical procedure and skip the decision tree. The decision tree is what matters.
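
Prose leaves a decision tree vague; data doesn't. A minimal, illustrative sketch of the shape (gates, abort conditions, named authority, recovery path), not the bank's actual tree:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Gate:
    stage: str
    abort_if: Callable[[dict], bool]   # what would have to go wrong
    authority: str                     # who has the power to call the abort
    recovery: str                      # path for everything already moved

# Illustrative gates and thresholds; the real tree is agreed with the
# client and rehearsed before the cutover weekend.
GATES = [
    Gate("dns-shift",
         abort_if=lambda m: m["ingress_packet_loss"] > 0.01,
         authority="engagement lead",
         recovery="revert DNS, drain the new ingress"),
    Gate("writes-cutover",
         abort_if=lambda m: m["replication_lag_s"] > 5,
         authority="client COO",
         recovery="replay in-flight transactions to the legacy stack"),
]

def evaluate(metrics: dict) -> str:
    for gate in GATES:
        if gate.abort_if(metrics):
            return (f"ABORT at {gate.stage}: decision owner {gate.authority}; "
                    f"recovery: {gate.recovery}")
    return "proceed to next stage"

print(evaluate({"ingress_packet_loss": 0.03, "replication_lag_s": 1.0}))
```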

Taha on the case: on the final cutover weekend we had a four-stage rollback decision tree, with named authority at each gate: the bank's COO at the highest, our engagement lead (Muffaddal) at the lowest. At stage two we caught a packet-loss spike on the new ingress (it turned out to be a misconfigured route in the transit gateway). We were forty-three minutes from the point of no return. Muffaddal called the abort, we rolled the in-flight transactions back to the legacy stack, fixed the route, and restarted the cutover six hours later. The bank's regulator was watching in real time and signed off the cutover the following Tuesday, partly because the rollback worked cleanly.

6. After it's done

Two weeks of post-migration operation, then we leave. That's the design. We hand back runbooks, dashboards, an updated architecture record, and recorded training sessions. The team that takes over is the team you already had, augmented by whatever hiring we recommended during the engagement.

The temptation, on both sides, is to convert the engagement into ongoing managed services. Sometimes that's the right call — but it's a separate decision, scoped fresh, with the cost-benefit honestly examined. Migration energy and ongoing-operations energy are different things, and you don't want one bleeding into the other unconsciously.

Vijeet on the case: we left the bank with twenty-eight Terraform modules, a documented landing-zone account structure, runbooks for the seven operational scenarios that mattered, and a hiring brief for two SREs they recruited inside three months. The bank chose not to retain us for ongoing managed services — they had the team in place. Eighteen months later they came back for a separate FinOps engagement to consolidate three years of reserved-instance and savings-plan commitments. That's the relationship the methodology is designed to produce.


Where the numbers landed

For the case study above, at twelve-month steady state:

  • Infrastructure spend down 41% versus the legacy on-prem run-rate, including the headcount no longer needed for hardware refresh and OS patching.
  • RTO for the card-issuance service down from 4 hours to 18 minutes. RPO down from 60 minutes to under 30 seconds.
  • Deployment frequency across the migrated services up from monthly to multiple-times-daily.
  • FCA operational-resilience assessment passed first time, with the regulator citing the rollback discipline and reconciliation evidence as exemplary.
  • Two-year payback on the engagement cost, against the eighteen-month payback assumed in the original business case.

The shortest version of all this

Two weeks discovering what's really there. Two weeks designing what should replace it. Most of the rest is sequencing, data strategy, and cutover discipline. The methodology isn't magic — it's a set of standing decisions that free the team to think about the surprises that actually matter on your specific engagement. The case study walks through one of those engagements; we've done forty more like it.

Got a migration coming up? Email us.