Cloud Migration Checklist: Moving from On-Premise Without Downtime

Cloud migrations fail for predictable reasons: underestimated dependencies, no rollback plan, treating cutover as a single event. This checklist eliminates all three.

The promise of cloud migration is real: reduced infrastructure costs, elastic scaling, managed services that eliminate undifferentiated heavy lifting, better availability, and global reach. The gap between that promise and the actual project is where most teams get hurt.

The failure modes are well-documented: a dependency that wasn't in the inventory goes down during cutover, an application assumes a filesystem that doesn't exist in the cloud, the database has a replication lag that nobody tested under load, or the rollback plan turns out to require 4 hours to execute while the maintenance window was 2. These aren't edge cases; they're the norm for migrations attempted without a systematic approach.

This checklist is built around eliminating each of those failure modes. It covers every phase from discovery to post-migration cleanup, with the specific items that actually determine whether the migration succeeds without downtime.

Phase 0 — Discovery and Risk Assessment

This is the phase most teams skip or rush, and it's where migrations are won or lost before a single server is touched. The goal is to eliminate surprises. Every dependency, every compliance requirement, every integration, and every performance baseline must be documented before the migration design begins.

Application inventory. List every application, service, scheduled job, batch process, and background worker running in the on-premise environment. Don't rely on documentation: run discovery tooling against your network (AWS Application Discovery Service, Azure Migrate, or open source options like Nmap-based discovery) to find what's actually running versus what the docs say is running. The gap between these two lists is usually the first surprise.
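As a minimal sketch of comparing the two lists, assuming both have already been exported as sets of service names (the names below are hypothetical, not from a real environment):

```python
# Hypothetical inputs: names from the docs/CMDB and names found by network discovery.
documented = {"billing-api", "crm-web", "nightly-etl", "ldap"}
discovered = {"billing-api", "crm-web", "nightly-etl", "legacy-ftp", "report-cron"}

def inventory_gap(documented: set[str], discovered: set[str]) -> dict[str, list[str]]:
    """Return services missing from the docs and documented services not seen running."""
    return {
        "undocumented": sorted(discovered - documented),
        "not_running": sorted(documented - discovered),
    }

gap = inventory_gap(documented, discovered)
print(gap)
```

Both directions matter: undocumented services are migration scope you didn't know about, and documented services that aren't running are candidates for the Retire list.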

Dependency mapping. For each application, document: inbound connections (what calls it?), outbound connections (what does it call?), external services (payment gateways, email providers, third-party APIs), shared filesystems, shared databases (multiple applications reading the same database is one of the most common migration complications), and implicit timing dependencies (application A writes at midnight, application B reads at 12:05 — any replication lag breaks this).
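One way to surface shared databases and filesystems from a dependency map is sketched below. The application and resource names are hypothetical, and the map is assumed to have been assembled already from discovery output:

```python
from collections import defaultdict

# Hypothetical dependency map: application -> resources it connects to.
DEPENDENCIES = {
    "billing-api": ["orders-db", "payment-gateway"],
    "reporting": ["orders-db", "shared-nfs"],
    "crm-web": ["crm-db", "email-provider"],
    "nightly-etl": ["orders-db", "shared-nfs"],
}

def shared_resources(deps: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return resources used by more than one application, with their consumers."""
    consumers = defaultdict(set)
    for app, resources in deps.items():
        for resource in resources:
            consumers[resource].add(app)
    return {r: sorted(apps) for r, apps in sorted(consumers.items()) if len(apps) > 1}

shared = shared_resources(DEPENDENCIES)
for resource, apps in shared.items():
    print(f"{resource}: shared by {', '.join(apps)}")
```

Applications that share a resource usually need to move in the same wave, or need connectivity between on-premise and cloud for as long as they are split.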

Data classification. Classify all data by sensitivity level (public, internal, confidential, regulated). Regulated data (PII, PHI, payment card data, financial records) has compliance implications for where it can be stored, who can access it, how it must be encrypted, and what audit logging is required. Discover this in Phase 0, not during the security audit after go-live.

Compliance requirements. Identify applicable regulations: GDPR, HIPAA, PCI DSS, SOC 2, industry-specific regulations. Map each requirement to a technical control that must be implemented in the cloud environment. For GDPR, this includes data residency requirements that constrain which AWS/Azure/GCP regions you can use. For HIPAA, this includes Business Associate Agreements with your cloud provider.

Performance baselines. Instrument everything before migration: CPU, memory, disk I/O, network throughput, database query times, cache hit rates, API response latencies. These become your acceptance criteria post-migration. If you don't measure before, you can't prove you haven't degraded performance after.
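A baseline only works as an acceptance criterion if the comparison is mechanical. The sketch below assumes latency samples in milliseconds and a 10% tolerance; both the tolerance and the choice of p95 are illustrative assumptions, not recommendations from this checklist:

```python
import statistics

def p95(samples: list[float]) -> float:
    """95th percentile, using the 'inclusive' method of statistics.quantiles."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

def within_baseline(baseline: list[float], candidate: list[float],
                    tolerance: float = 0.10) -> bool:
    """True if the candidate p95 is at most `tolerance` above the baseline p95."""
    return p95(candidate) <= p95(baseline) * (1 + tolerance)

# Hypothetical samples: on-premise baseline vs. two cloud test runs.
on_prem_ms = [100.0] * 50
print(within_baseline(on_prem_ms, [105.0] * 50))  # 5% slower: within tolerance
print(within_baseline(on_prem_ms, [120.0] * 50))  # 20% slower: regression
```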

Choosing Your Migration Strategy — The 7 Rs

AWS formalized the "6 Rs" (building on Gartner's original 5 Rs) and later added a seventh, Relocate. Each strategy has a different cost-effort-risk trade-off:

Retire. The application is no longer needed and can be decommissioned. Discovery typically reveals 10–20% of the inventory falls here. Retiring these applications before migration reduces scope, cost, and risk.

Retain. Keep the application on-premise for now — either because it has a specific compliance requirement, because it's due for replacement, or because the migration ROI doesn't justify the effort. Retained applications still need to be documented because they often have dependencies with migrated applications that require network connectivity between on-premise and cloud during and after migration.

Rehost (lift and shift). Move the application to cloud VMs with minimal changes. Fastest execution time, lowest risk, but captures the least cloud benefit (you're paying cloud prices for a workload architected for on-premise). Right choice for: applications on a tight migration deadline, low-traffic applications, or applications scheduled for replacement within 18 months.

Relocate. Move containers or VMware VMs to cloud equivalents with no application changes (VMware Cloud on AWS, Azure VMware Solution). Similar to rehost but for virtualised environments. Right for organisations with significant VMware footprints.

Re-platform (lift, tinker, and shift). Make targeted optimisations without changing the core architecture. Examples: move from self-managed MySQL to RDS, move from on-premise object storage to S3, move background jobs to managed queues. Moderate effort, meaningful benefit. Right for most databases and storage layers.

Re-purchase. Move from a licensed on-premise application to a SaaS equivalent. Examples: move from an on-premise ERP to a cloud ERP, from on-premise email to Microsoft 365, from on-premise CRM to Salesforce. High disruption (requires data migration and user retraining) but eliminates infrastructure management entirely. Evaluate this option before committing to a lift-and-shift of legacy licensed software.

Re-architect. Redesign the application to be cloud-native — microservices, serverless, managed databases, event-driven architecture. Highest effort and risk, but the only path to elastic scaling, zero-downtime deployments, and full cloud economics. Reserve for business-critical applications where the performance or scalability benefits justify the investment.

Network Architecture Design

Network architecture must be designed before any compute resources are provisioned. Retrofitting network architecture onto an existing cloud deployment is expensive and disruptive.

VPC and subnet design. Design at least two tiers of subnets: public subnets (for load balancers and NAT gateways, which need internet-routable addresses) and private subnets (for application servers, databases, and cache layers, which should never be directly internet-accessible). For sensitive environments, add a third tier of isolated subnets for databases with no outbound internet access.

Plan subnet CIDR blocks to avoid conflicts with existing on-premise networks (you'll need both connected simultaneously during the hybrid phase) and to allow for future expansion. A common mistake is choosing a VPC range such as 10.0.0.0/16 when the on-premise network already uses addresses in that range. Overlapping ranges cannot be routed cleanly over VPN or Direct Connect, and working around the overlap with NAT adds complexity you don't want during a migration.
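An overlap check is easy to automate before any VPC is created. This sketch uses Python's standard `ipaddress` module; the CIDR values are hypothetical examples:

```python
import ipaddress

def find_overlaps(vpc_cidr: str, on_prem_cidrs: list[str]) -> list[str]:
    """Return the on-premise ranges that overlap the proposed VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [c for c in on_prem_cidrs if vpc.overlaps(ipaddress.ip_network(c))]

# Hypothetical on-premise ranges.
ON_PREM = ["10.0.0.0/16", "192.168.10.0/24"]

print(find_overlaps("10.0.0.0/16", ON_PREM))   # ['10.0.0.0/16']
print(find_overlaps("10.64.0.0/16", ON_PREM))  # []
```

Running a check like this against every proposed VPC and subnet range, including ranges reserved for future expansion, catches the conflict while it is still a one-line change.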

Security groups and NACLs. Security groups are stateful firewalls applied at the instance/service level. NACLs are stateless firewalls applied at the subnet level. The recommended pattern: use security groups as the primary access control mechanism (they're easier to reason about and audit), and use NACLs only for broad subnet-level restrictions (blocking specific IP ranges, preventing internet access to database subnets).

NAT gateways. Private subnet instances that need outbound internet access (for package downloads, external API calls, etc.) route through NAT gateways in the public subnet. Deploy NAT gateways in each availability zone to avoid a single point of failure. At high data transfer volumes, NAT gateway costs become significant — audit outbound traffic patterns as part of Phase 0.

VPN or Direct Connect for the hybrid phase. During migration, you need encrypted connectivity between on-premise and cloud: for data sync, for applications that span both environments, and for management access. AWS Site-to-Site VPN is the fastest to establish (hours) and adequate for moderate bandwidth requirements. AWS Direct Connect is a dedicated physical connection with predictable latency and dedicated bandwidth at the provisioned port speed, which makes it the better fit for high-volume data sync or latency-sensitive hybrid workloads. Direct Connect traffic is not encrypted by default, so layer a VPN or MACsec on top where encryption in transit is required. Plan a minimum of 4–8 weeks for Direct Connect provisioning.

DNS architecture. DNS is the glue of the migration. Plan how DNS resolves during the hybrid phase: private hosted zones in Route 53 for internal service discovery in the cloud, conditional forwarding from on-premise DNS to Route 53 Resolver inbound endpoints for cloud-hosted names, and the exact sequence of DNS record updates during cutover. A DNS architecture document is not optional, because improvised DNS changes during cutover are a common cause of extended downtime.

Data Migration Strategy

Data is the hardest part of any migration. Compute can be reprovisioned in minutes; data cannot. The choice of data migration approach determines the minimum achievable downtime at cutover.

Offline migration (dump and restore) is the simplest approach: stop the application, export the database, transfer the dump file, restore in the cloud, start the cloud application. Downtime = export time + transfer time + restore time. For databases over ~100GB, this typically means hours of downtime. Acceptable for: development/staging environments, low-traffic applications with maintenance windows, or applications where a brief planned outage is contractually permissible.
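The downtime formula above can be turned into a quick estimate. The throughput figures in this sketch are hypothetical placeholders; measure export and restore rates on your own hardware before relying on the result:

```python
def offline_downtime_hours(db_size_gb: float, export_gb_per_hour: float,
                           link_mbps: float, restore_gb_per_hour: float) -> float:
    """Downtime = export time + transfer time + restore time, in hours."""
    export_h = db_size_gb / export_gb_per_hour
    # Megabits/s -> gigabytes/hour: divide by 8 for MB/s, times 3600 s, divide by 1000.
    transfer_gb_per_hour = link_mbps / 8 * 3600 / 1000
    transfer_h = db_size_gb / transfer_gb_per_hour
    restore_h = db_size_gb / restore_gb_per_hour
    return export_h + transfer_h + restore_h

# Hypothetical: 200 GB database, 200 GB/h export, 1 Gbps link, 100 GB/h restore.
hours = offline_downtime_hours(200, 200, 1000, 100)
print(f"Estimated downtime: {hours:.1f} hours")  # Estimated downtime: 3.4 hours
```

The estimate assumes the link is fully utilised and the dump is not compressed, so treat it as a rough bound rather than a schedule.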

Online migration with replication achieves near-zero downtime by establishing continuous replication from the on-premise database to the cloud database while the application continues to serve traffic from on-premise. AWS Database Migration Service (DMS) supports this for most common database engines. The cutover sequence is: replication is running and the cloud database is caught up, the application is paused for seconds to minutes (not hours), replication is stopped, the cloud database is promoted, and the application is pointed at the new endpoint.

Storage migration. For block storage (application data on disk), AWS MGN continuously replicates the entire server — data, OS, applications — to the cloud. For object storage, AWS DataSync or S3 Transfer Acceleration handles large-scale file transfers. For databases stored on SAN/NAS, DMS is typically the right tool.

Data validation. Before any cutover, validate data integrity: row counts match, checksums match for critical tables, application-level functional tests pass against the cloud dataset. Automate as much of this validation as possible — manual spot-checking is insufficient for large datasets.
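A sketch of automated row-count and checksum comparison follows. It uses in-memory SQLite databases so it runs anywhere; against real source and target engines you would swap in the appropriate drivers, and the table and column names here are hypothetical:

```python
import hashlib
import sqlite3

def table_fingerprint(conn: sqlite3.Connection, table: str, key: str) -> tuple[int, str]:
    """Return (row_count, SHA-256 over all rows ordered by key)."""
    digest = hashlib.sha256()
    count = 0
    # Table and key names must come from a trusted allow-list, never from user input.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()

def make_db(rows: list[tuple[int, float]]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn

source = make_db([(1, 9.99), (2, 24.50)])
target_ok = make_db([(2, 24.50), (1, 9.99)])   # same rows, different insert order
target_bad = make_db([(1, 9.99), (2, 24.00)])  # same count, one value differs

print(table_fingerprint(source, "orders", "id") == table_fingerprint(target_ok, "orders", "id"))
print(table_fingerprint(source, "orders", "id") == table_fingerprint(target_bad, "orders", "id"))
```

The second comparison shows why row counts alone are not enough: the counts match but the checksum does not. When source and target are different engines, normalise types (decimals, timestamps) before hashing, since their textual representations can differ.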

The Hybrid Phase — Running Both Environments in Parallel

Zero-downtime migration requires a hybrid phase during which both the on-premise and cloud environments are running simultaneously. This phase typically lasts 2–8 weeks depending on complexity.

Traffic splitting strategy. During the hybrid phase, use a load balancer or DNS-based traffic splitting to route a configurable percentage of traffic to the cloud environment. Start with 1–5% of production traffic to validate cloud infrastructure under real load, then increase gradually as confidence grows. Route 53 weighted routing with health checks makes this straightforward. The key requirement is that the split can be adjusted instantly and reverted instantly — the load balancer is your rollback lever.

Bidirectional data sync. While traffic is split, both environments are accepting writes. You need bidirectional database replication to keep them consistent, or you need to route all writes to one environment (typically on-premise) while routing reads across both. True bidirectional replication with conflict resolution is complex; for most migrations, routing writes to one environment and replicating one-way is simpler and more reliable.

Monitoring for drift. The cloud environment should be producing identical outputs for identical inputs. Run synthetic tests that execute the same requests against both environments and compare responses. Alert on any divergence immediately — a divergence during the hybrid phase is a defect that must be fixed before full cutover.
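The comparison step of such a synthetic test might look like the sketch below. It assumes responses have already been fetched and parsed into dictionaries; the field names, including the set of ignored per-request fields, are hypothetical:

```python
MISSING = "<missing>"  # placeholder for a field present on only one side

# Fields expected to differ between environments on every request (assumed names).
VOLATILE_FIELDS = frozenset({"request_id", "served_by", "timestamp"})

def diff_responses(on_prem: dict, cloud: dict,
                   ignore: frozenset = VOLATILE_FIELDS) -> dict:
    """Return {field: (on_prem_value, cloud_value)} for every field that differs."""
    keys = (on_prem.keys() | cloud.keys()) - ignore
    return {
        k: (on_prem.get(k, MISSING), cloud.get(k, MISSING))
        for k in sorted(keys)
        if on_prem.get(k, MISSING) != cloud.get(k, MISSING)
    }

# Hypothetical responses to the same request from each environment.
a = {"status": 200, "total": "24.50", "request_id": "a1"}
b = {"status": 200, "total": "24.5", "request_id": "b7"}
print(diff_responses(a, b))  # {'total': ('24.50', '24.5')}
```

The example divergence is a formatting difference rather than a wrong value, which is typical: drift often shows up first in serialisation, locale, or timezone handling.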

Go/no-go criteria. Define in advance, in writing, the criteria that must be met before increasing traffic percentage or proceeding to full cutover. Typical criteria: error rate below X%, p95 latency below Y ms, database query times within Z% of baseline, no data integrity failures in validation suite. These must be objective metrics, not gut feelings, and must be sign-off points by both the engineering team and the business stakeholders.
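Writing the criteria as data makes the gate objective by construction. The thresholds below are hypothetical stand-ins for the X, Y, and Z values your team agrees on:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    observed: float
    limit: float  # the criterion passes when observed <= limit

def go_no_go(criteria: list[Criterion]) -> tuple[bool, list[str]]:
    """Return (go, failures). Any single failure means no-go."""
    failures = [f"{c.name}: {c.observed} > {c.limit}"
                for c in criteria if c.observed > c.limit]
    return (not failures, failures)

# Hypothetical observations at the current traffic percentage.
checks = [
    Criterion("error_rate_pct", observed=0.2, limit=0.5),
    Criterion("p95_latency_ms", observed=310.0, limit=250.0),
    Criterion("integrity_failures", observed=0, limit=0),
]
go, failures = go_no_go(checks)
print("GO" if go else "NO-GO", failures)
```

The list of failures doubles as the record attached to the sign-off, so a no-go decision carries its own justification.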

Database Migration in Detail

Databases deserve their own phase discussion because they're the most failure-prone component in any migration.

The expand-contract pattern for schema changes. If the cloud database needs a different schema than the on-premise database (a common situation when re-platforming from a self-managed DB to a managed service with different capabilities), use expand-contract: first deploy the "expand" migration that adds new columns/tables while keeping old ones, run both old and new code paths simultaneously, then run the "contract" migration that removes old structure once all code uses the new schema. Never run a breaking schema change during migration — it eliminates your rollback option.
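The three stages can be sketched end to end with an in-memory SQLite database. The table and column names are hypothetical, and the contract step is done by copy-and-rename only so the sketch runs on older SQLite versions; on most engines it would be a plain column drop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO customers (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new columns alongside the old one. Old code keeps working.
conn.execute("ALTER TABLE customers ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE customers ADD COLUMN last_name TEXT")

# Transition: backfill existing rows. During this stage new code writes
# both the old column and the new columns, so rollback stays possible.
for row_id, fullname in conn.execute("SELECT id, fullname FROM customers").fetchall():
    first, _, last = fullname.partition(" ")
    conn.execute(
        "UPDATE customers SET first_name = ?, last_name = ? WHERE id = ?",
        (first, last, row_id),
    )

# Contract: only after every reader and writer uses the new columns.
conn.execute(
    "CREATE TABLE customers_new (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO customers_new SELECT id, first_name, last_name FROM customers")
conn.execute("DROP TABLE customers")
conn.execute("ALTER TABLE customers_new RENAME TO customers")

rows = conn.execute("SELECT * FROM customers").fetchall()
print(rows)  # [(1, 'Ada', 'Lovelace')]
```

Until the contract step runs, the old column is intact and the old code path still works, which is what keeps rollback available during the migration.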

Dual-write during transition. For the most critical data, implement dual-write in the application layer: writes go to both on-premise and cloud databases simultaneously. This is more complex than replication-based sync but eliminates the possibility of data loss from replication lag during cutover. The trade-off is application complexity and the need for a reconciliation process to handle any dual-write failures.
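A minimal sketch of dual-write with a reconciliation step, assuming a simple key-value interface. `DictStore` is a hypothetical in-memory stand-in for the two databases, not a real client:

```python
class DictStore:
    """In-memory stand-in for a database, with a switch to simulate outages."""
    def __init__(self):
        self.data = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[key] = value

    def get(self, key):
        return self.data[key]

class DualWriter:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = set()  # keys whose secondary write failed

    def write(self, key, value):
        self.primary.put(key, value)  # a primary failure propagates to the caller
        try:
            self.secondary.put(key, value)
            self.pending.discard(key)
        except Exception:
            self.pending.add(key)

    def reconcile(self) -> int:
        """Re-copy the primary's current value for each pending key; return keys left."""
        for key in list(self.pending):
            try:
                self.secondary.put(key, self.primary.get(key))
                self.pending.discard(key)
            except Exception:
                pass  # still failing; keep the key for the next pass
        return len(self.pending)

on_prem, cloud = DictStore(), DictStore()
writer = DualWriter(primary=on_prem, secondary=cloud)
cloud.available = False
writer.write("order:1", "paid")   # lands on-premise only, key queued
cloud.available = True
print(writer.reconcile(), cloud.data)  # 0 {'order:1': 'paid'}
```

Reconciliation copies the primary's current value rather than replaying the failed write, so a stale queued value can never overwrite a newer one.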

Cutover sequencing. The database cutover sequence for a zero-downtime migration: (1) verify replication lag is at or near zero; (2) enable application-level write queuing or pause new write transactions for 30–120 seconds; (3) verify lag reaches zero; (4) stop replication; (5) promote cloud database to primary; (6) update connection strings or endpoint configuration; (7) drain write queue / resume writes; (8) verify new writes are landing in cloud database; (9) run post-cutover validation suite. The entire window for steps 2–8 should be under 2 minutes.
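Step 3 of the sequence above is the one most worth automating, because it needs a hard timeout: if lag doesn't reach zero inside the window, the right move is to abort and resume writes on-premise. A sketch, assuming a caller-supplied `get_lag_seconds` function (hypothetical; how lag is measured depends on the replication tool):

```python
import time
from typing import Callable

def wait_for_zero_lag(get_lag_seconds: Callable[[], float],
                      timeout_s: float = 60.0,
                      poll_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until replication lag is zero. Return False if the timeout expires first."""
    deadline = clock() + timeout_s
    while True:
        if get_lag_seconds() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)

# Simulated lag readings that drain to zero; no real sleeping in the demo.
readings = iter([3, 1, 0])
caught_up = wait_for_zero_lag(lambda: next(readings), sleep=lambda s: None)
print("proceed to stop replication" if caught_up else "abort: resume writes on-premise")
```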

Connection string management. Don't hardcode database connection strings in application code or configuration files. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) so that the connection string update at cutover is a single secrets rotation, not a redeployment across every service that accesses the database.
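For the rotation to take effect without a redeployment, applications have to re-read the secret at runtime rather than once at startup. The sketch below shows one way to do that with a short-lived cache. The `fetch` callable is an assumption: in practice it might wrap a secrets manager client call that returns the secret as a JSON string, but that wiring is not shown here, and the hostnames are hypothetical:

```python
import json
import time
from typing import Callable

class CachedSecret:
    """Re-fetch a JSON secret once its cached copy is older than ttl_s seconds."""
    def __init__(self, fetch: Callable[[], str], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl_s:
            self._value = json.loads(self._fetch())
            self._fetched_at = now
        return self._value

# Demo with a fake secret store and a fake clock.
store = {"db": '{"host": "onprem-db.internal", "port": 5432}'}
now = [0.0]
secret = CachedSecret(lambda: store["db"], ttl_s=30.0, clock=lambda: now[0])

print(secret.get()["host"])  # onprem-db.internal
store["db"] = '{"host": "cloud-db.example.com", "port": 5432}'  # rotation at cutover
now[0] = 31.0                                                   # cache has expired
print(secret.get()["host"])  # cloud-db.example.com
```

The cache TTL bounds how long services keep using the old endpoint after rotation, so it belongs in the cutover timing budget. Existing pooled connections still need to be refreshed separately.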

DNS Cutover and Traffic Management

DNS is the lever that switches production traffic from on-premise to cloud. Getting this sequence right is the difference between a clean cutover and an extended incident.

Low TTL preparation. At least 48 hours before cutover, reduce the TTL on all DNS records for migrating services to 60 seconds. With a 60-second TTL, most clients pick up a record change within about a minute rather than hours; a few resolvers and long-lived client processes cache longer, so expect a small tail. Don't wait until cutover day to change TTLs. If the existing TTL is 86400 (24 hours), resolvers that cached the record just before you lowered it will keep the old value, including the old TTL, for up to 24 hours, so a TTL change made at 9 AM isn't fully in effect until 9 AM the next day.
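The lead time follows from the old TTL: the lowered TTL is only fully in effect once every cached copy of the old record has expired. A small sketch of that arithmetic, with a hypothetical cutover time and an assumed 24-hour safety margin:

```python
from datetime import datetime, timedelta

def latest_ttl_change(cutover: datetime, old_ttl_s: int,
                      safety_margin: timedelta = timedelta(hours=24)) -> datetime:
    """Latest moment to lower the TTL so old cached records expire before cutover."""
    return cutover - timedelta(seconds=old_ttl_s) - safety_margin

# Hypothetical cutover time with an existing TTL of 86400 seconds (24 hours).
cutover = datetime(2025, 6, 14, 9, 0)
deadline = latest_ttl_change(cutover, old_ttl_s=86400)
print(deadline)  # 2025-06-12 09:00:00
```

With a 24-hour TTL and a 24-hour margin, the result lands exactly 48 hours before cutover. Records with longer existing TTLs need correspondingly more lead time.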

Weighted routing with health checks. During the hybrid phase, use Route 53 weighted routing with health checks on both the on-premise and cloud endpoints. The health checks provide automatic failback: if the cloud environment starts failing health checks during cutover, Route 53 automatically routes traffic back to on-premise without manual intervention. This is your safety net.

The cutover sequence. On cutover day: (1) database cutover as described above; (2) set cloud weight to 100%, on-premise weight to 0% — do NOT delete the on-premise DNS record yet; (3) monitor error rates and health check status; (4) if metrics are green after 15 minutes, proceed to post-cutover checklist; (5) if metrics degrade, set on-premise weight back to 100% — this is your instant rollback.
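Steps 2 and 5 are the same operation with different weights, which is worth making explicit in tooling. The sketch below only builds the change request as a dictionary, in the shape Route 53's ChangeResourceRecordSets API accepts; submitting it (for example as the `ChangeBatch` argument of boto3's `change_resource_record_sets`) is left out. The record names, targets, and set identifiers are hypothetical, and health check IDs are omitted for brevity:

```python
def weighted_change_batch(name: str, on_prem_target: str, cloud_target: str,
                          cloud_weight: int, ttl: int = 60) -> dict:
    """Build an UPSERT for both weighted records; weights always sum to 100."""
    if not 0 <= cloud_weight <= 100:
        raise ValueError("cloud_weight must be between 0 and 100")

    def record(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",  # update in place; neither record is ever deleted
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"cloud weight {cloud_weight}%",
        "Changes": [
            record("on-prem", on_prem_target, 100 - cloud_weight),
            record("cloud", cloud_target, cloud_weight),
        ],
    }

# Hypothetical names. Cutover is cloud_weight=100; rollback is cloud_weight=0.
cutover_batch = weighted_change_batch(
    "app.example.com.", "lb.onprem.example.com.", "lb.cloud.example.com.", 100)
rollback_batch = weighted_change_batch(
    "app.example.com.", "lb.onprem.example.com.", "lb.cloud.example.com.", 0)
print([c["ResourceRecordSet"]["Weight"] for c in cutover_batch["Changes"]])   # [0, 100]
print([c["ResourceRecordSet"]["Weight"] for c in rollback_batch["Changes"]])  # [100, 0]
```

Because both records are always upserted together, the on-premise record is never deleted, and rollback is the same call with a different number.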

Instant rollback via DNS. Keep the on-premise environment running and serving zero traffic for a minimum of 72 hours post-cutover. The rollback procedure should be documented as a 3-step runbook that any on-call engineer can execute in under 5 minutes: (1) set on-premise DNS weight to 100%, cloud to 0%; (2) reverse database replication direction (cloud back to on-premise) if writes occurred in cloud; (3) notify stakeholders. Practice this rollback in staging before cutover day.

Security Hardening for Cloud

Cloud security defaults are not equivalent to on-premise security controls. Several important differences must be addressed explicitly:

IAM least privilege. Cloud IAM is more granular than on-premise file permissions but also more complex to get right. Every service, application, and human user should have the minimum permissions required for their function. Use IAM roles for applications rather than IAM users with access keys. Audit IAM permissions with AWS IAM Access Analyzer or a third-party tool before go-live.

Secrets management migration. Any secrets that were stored as plaintext in configuration files, environment variables, or on-premise secrets management must be migrated to a cloud-native secrets manager. Applications must be updated to retrieve secrets at runtime from AWS Secrets Manager or Parameter Store, not from config files that get baked into container images or deployed alongside application code. Scan your codebase for hardcoded credentials as part of Phase 0.

Network security. On-premise networks often rely on network perimeter security ("everything inside the firewall is trusted"). Cloud security must be zero-trust: all traffic between services is authenticated and authorised, regardless of network location. Security groups should enforce service-to-service access controls: the application security group should allow inbound traffic only from the load balancer security group, the database security group should allow inbound traffic only from the application security group.

Audit logging. Enable CloudTrail for all API calls, VPC Flow Logs for network traffic, and application-level access logs. Configure log retention per your compliance requirements (HIPAA's six-year documentation retention rule is commonly applied to audit logs; PCI DSS requires at least 1 year of audit log history, with the most recent 3 months immediately available). Route logs to a separate AWS account or an immutable log store so they can't be deleted even with compromised credentials in the primary account.

The Go-Live Checklist

A 41-item checklist covering the day before cutover through the first hours after it:

Pre-Cutover (Day Before)

Replication lag confirmed at <30 seconds and holding
DNS TTLs confirmed at 60 seconds for all migrating records
Cloud environment health checks passing for >4 hours
Load test run against cloud environment at 120% of expected peak traffic — passed
Database connection string secrets updated in secrets manager, not yet active
Rollback runbook printed and physically in the hands of the on-call engineer
On-call engineer confirmed available and not running other deployments
Stakeholders notified of maintenance window
Customer support team briefed on status page update procedure
Post-cutover validation suite tested and confirmed running cleanly against cloud environment

Cutover Day — Before the Window

Replication lag confirmed at <10 seconds
All application services health check green on cloud
Monitoring dashboards open and ready
Incident channel open with all engineers on standby
On-premise environment confirmed healthy (you want a known-good baseline)
Backup of on-premise database taken within last 2 hours
Feature flags set to disable non-essential features during cutover window

Cutover Window

Write traffic paused / write queue enabled
Replication lag confirmed at zero
Replication stopped
Cloud database promoted to primary
Connection string secret rotated to cloud endpoint
Application services restarted or connection pools refreshed
Write queue drained / write traffic resumed
Test write confirmed landing in cloud database
DNS weight updated: cloud 100%, on-premise 0%
DNS propagation confirmed from multiple geographic locations
Error rate monitoring — no increase
Latency monitoring — within baseline
Post-cutover validation suite executed — all assertions passing
Business smoke tests confirmed by non-engineering stakeholder

Post-Cutover Immediate

Status page updated to reflect maintenance complete
Support team notified of go-live
On-premise environment left running but receiving zero traffic
Alerts confirmed routing to cloud environment metrics
On-call schedule confirmed for 72-hour post-migration watch
Incident retro scheduled for 5 business days post-cutover
Decommission timeline documented with named owner
Cost baseline initiated in cloud cost management tool
All temporary security group rules used for migration cleaned up
Migration tracking ticket updated to complete

Post-Migration: The First 30 Days

The migration isn't done when traffic moves to the cloud. The first 30 days determine whether you've actually succeeded.

Days 1–7: Intensive monitoring. Keep the on-call rotation elevated. Watch closely for: intermittent errors that weren't caught in testing (network timeouts from services that assumed local disk I/O speed, cache miss rate spikes, database connection pool exhaustion). Most post-migration issues surface within the first 72 hours under real production load patterns.

Days 7–14: Cleanup. Remove temporary firewall rules and security group entries created to facilitate migration. Revoke migration-specific IAM permissions. Clean up staging/test resources created during migration that are no longer needed. Each of these represents ongoing cost and security exposure.

Days 14–21: On-premise decommission timeline. Confirm that no on-premise services are still receiving traffic or being used. Coordinate with the infrastructure team on the decommission schedule — removing physical servers or returning leased hardware has lead times. Don't pay for on-premise infrastructure you no longer need.

Days 21–30: Cost baseline and optimisation. By day 21, you have enough cloud billing data to establish a cost baseline. Compare against the pre-migration on-premise cost estimate. Common optimisations at this stage: rightsizing instances based on actual utilisation metrics, enabling auto-scaling for workloads with variable load, purchasing Reserved Instances or Savings Plans for steady-state workloads (often 30–40% below on-demand for one-year commitments, more for three-year terms), and enabling S3 Intelligent-Tiering for data with infrequent or unpredictable access.

Tools That Make Migrations Safer

Several tools have become standard practice for zero-downtime migrations:

AWS Application Migration Service (MGN). Continuous block-level replication from on-premise to cloud. Supports cutover testing — you can launch a cloud instance from the replicated data, test it, and send it back to replication mode, repeating this as many times as needed before committing to cutover. The final cutover typically achieves downtime measured in minutes.

AWS Database Migration Service (DMS) with Schema Conversion Tool (SCT). DMS handles ongoing replication between databases (including cross-engine: Oracle to PostgreSQL, SQL Server to MySQL). SCT automates the conversion of database schemas and stored procedures between engines. For heterogeneous database migrations (changing database engine as part of the move), these two tools together dramatically reduce the manual conversion effort.

Terraform for Infrastructure as Code. The cloud environment should be fully defined in Terraform from day one. This gives you: reproducible environments (staging mirrors production exactly), a single source of truth for infrastructure state, the ability to destroy and recreate the environment from scratch if needed (critical for rollback and disaster recovery), and a review process for infrastructure changes (PRs, not manual console clicks).

ArgoCD for GitOps. Once applications are containerised and running on Kubernetes, ArgoCD provides continuous synchronisation between your Git repository (the desired state) and your cluster (the actual state). Deployments become Git commits. Rollbacks become Git reverts. The deployment process is auditable, reproducible, and doesn't require direct cluster access for routine changes.

Feature flags for traffic management. Tools like LaunchDarkly or Unleash let you roll out new cloud-based code paths to a percentage of users independently of deployment. This is particularly powerful during re-architecture migrations where a new service is being tested in production alongside the old one — you control which users hit the new path via a feature flag rather than via DNS routing.
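The core idea behind percentage rollouts is deterministic bucketing: hash a stable user identifier into a bucket and compare it to the rollout percentage. The sketch below illustrates the general technique only; it is not the algorithm used by LaunchDarkly or Unleash, and the flag and user names are hypothetical:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# The same user always gets the same answer for the same flag and percentage.
users = [f"user-{i}" for i in range(1000)]
enabled = sum(in_rollout("new-checkout-service", u, 10) for u in users)
print(f"{enabled} of {len(users)} users routed to the new path")
```

Because a user's bucket is fixed, raising the percentage only adds users to the new path and never flips existing ones back, which keeps individual sessions stable while the rollout ramps.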

Cloud migrations are complex projects that reward systematic preparation and punish improvisation.