The True Cost of Cloud Infrastructure (What Nobody Tells You Before You Scale)

The AWS bill that arrives after your first growth spike is a rite of passage. The number is always larger than expected. This post explains why — and what you can do about it before the spike, not after.

There is a specific kind of panic that hits engineering teams around the 20th of the month when they open the cloud console and see a bill that is two, three, or four times what they budgeted. It usually follows a launch, a press mention, a successful campaign, or just a month of steady growth. The product is working. The cloud bill is the punishment.

The frustrating thing is that most of these surprises are predictable — if you know what to look for. Cloud pricing is designed to be legible at the surface and opaque at the edges. The compute cost is visible. Everything else is not. This post is about everything else.

The Iceberg: What You Budget vs What You Actually Pay

When engineering teams build a cost estimate, they almost always start with compute. EC2 instances, Lambda invocations, ECS tasks — these are easy to model because they map directly to the infrastructure you're thinking about. A t3.medium running 24/7 costs around $30 a month. Multiply by your cluster size and you have a number.

The problem is that compute typically represents only 40–60% of a real-world cloud bill. The remaining costs are distributed across services that don't make it into early estimates:

Data transfer (egress) — You pay to send data out of AWS. Inbound traffic is free; outbound to the internet is not. At scale, this becomes significant fast.
NAT Gateway — AWS charges per hour and per gigabyte processed. If your private subnets talk to the internet through a NAT gateway, every byte costs money. This is one of the most common sources of shock bills.
CloudWatch log ingestion — $0.50 per GB ingested sounds trivial until your microservices are logging at debug level in production.
Support tier — Business support starts at 10% of monthly spend (minimum $100), with the percentage stepping down in tiers as spend grows. Enterprise is 10% of the first $150k, 7% of the next tranche, and so on. These percentages add up fast at scale.
API calls — S3 PUT/GET requests, DynamoDB read/write units, API Gateway calls — all have per-request charges that don't appear in initial estimates.
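
One rough way to turn a compute-only estimate into a whole-bill estimate is to divide by the share of the bill that compute usually represents. The sketch below assumes the 40–60% range quoted above; the function name and the ten-instance cluster are illustrative, not taken from a real account.

```python
def estimate_total_bill(compute_monthly_usd, share_low=0.40, share_high=0.60):
    """Return (low, high) whole-bill estimates from a compute-only estimate.

    If compute is a share s of the bill, the whole bill is compute / s.
    """
    if not 0 < share_low <= share_high <= 1:
        raise ValueError("shares must satisfy 0 < low <= high <= 1")
    low_total = compute_monthly_usd / share_high   # compute is a large share
    high_total = compute_monthly_usd / share_low   # compute is a small share
    return low_total, high_total


# Hypothetical cluster: ten t3.medium instances at roughly $30/month each.
compute = 10 * 30.0
low, high = estimate_total_bill(compute)
print(f"compute ${compute:.0f}/month -> total between ${low:.0f} and ${high:.0f}/month")
```

The point is not precision. It is to stop a compute-only number from being mistaken for a budget.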

The single most useful thing you can do today is open your cloud bill and filter by service. If you've been running for more than three months, you will find line items you didn't know existed.

The Five Surprise Bills

Based on infrastructure audits across dozens of companies, these are the five costs that consistently catch teams off guard:

1. Data Egress Costs

AWS charges $0.09 per GB for data transferred from your infrastructure to the internet (the first 100 GB per month is free). On GCP and Azure the rates are similar. If your application serves video, large file downloads, or high-frequency API responses, egress can easily eclipse your compute cost. The fix involves CDNs (CloudFront, Cloudflare) which serve content from edge locations — the internet-facing transfer is from the CDN to the user, priced at CDN rates rather than origin egress rates.
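
As a back-of-the-envelope check, the egress figures above reduce to a short calculation. This sketch assumes a flat $0.09/GB after the free 100 GB; real AWS pricing steps down at higher volumes, so treat the result as an upper bound. The function name and traffic figure are illustrative.

```python
def monthly_egress_cost(gb_out, rate_per_gb=0.09, free_gb=100.0):
    """Estimate monthly internet egress cost in USD at a flat per-GB rate."""
    billable_gb = max(0.0, gb_out - free_gb)
    return billable_gb * rate_per_gb


# Hypothetical service sending 5 TB to the internet in a month.
print(f"${monthly_egress_cost(5_000):.2f}")
```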

2. NAT Gateway Per-GB Charges

NAT Gateway charges $0.045 per GB of data processed — in both directions. An application in a private subnet that makes frequent API calls to external services (Stripe, Twilio, external data providers) can generate hundreds of gigabytes of NAT traffic per month. The mitigation is to use VPC endpoints for AWS services (S3, DynamoDB, SQS, and many others can be accessed without going through a NAT gateway) and to route external API traffic through a proxy or directly from public subnets where appropriate.
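
To see how quickly this adds up, here is a minimal sketch of a NAT Gateway monthly estimate. The $0.045/GB processing rate comes from the text above; the $0.045 hourly rate and the 730-hour month are assumptions based on typical us-east-1 pricing, and the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def nat_gateway_monthly_cost(gb_processed, gateways=1,
                             hourly_rate=0.045, per_gb_rate=0.045):
    """Estimate NAT Gateway cost: a fixed hourly fee per gateway plus data processing."""
    hourly_cost = gateways * hourly_rate * HOURS_PER_MONTH
    data_cost = gb_processed * per_gb_rate
    return hourly_cost + data_cost


# Hypothetical setup: one gateway per AZ across three AZs, 2 TB processed.
print(f"${nat_gateway_monthly_cost(2_000, gateways=3):.2f}")
```

Standard internet egress charges still apply on top of the processing fee for traffic that leaves AWS.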

3. CloudWatch Log Ingestion

This one surprises teams that have instrumented their applications thoroughly. Debug-level logging from a moderately busy service can generate gigabytes of logs per day. At $0.50/GB for ingestion and $0.03/GB per month for storage, the monthly bill from logging alone can exceed the bill from the service doing the work. The fix: log at INFO level in production, use structured logging so you can filter and sample rather than log everything, and set log retention policies. CloudWatch log groups never expire by default, so check what your compliance framework actually requires (90 days is a common figure) and stop keeping everything in CloudWatch forever.
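
The effect of log level and retention is easy to estimate. The sketch below uses the ingestion and storage rates quoted above and assumes a steady daily volume and a 30-day month; it ignores the compression CloudWatch applies to stored logs, so the storage part is an overestimate. The function name and volumes are illustrative.

```python
def cloudwatch_logs_monthly_cost(gb_per_day, retention_days,
                                 ingest_rate=0.50, storage_rate=0.03,
                                 days_in_month=30):
    """Estimate steady-state monthly CloudWatch Logs cost in USD."""
    ingested_gb = gb_per_day * days_in_month
    # At steady state the log group holds `retention_days` worth of data.
    stored_gb = gb_per_day * retention_days
    return ingested_gb * ingest_rate + stored_gb * storage_rate


# Hypothetical service: 20 GB/day at DEBUG versus 4 GB/day at INFO, 90-day retention.
print(f"DEBUG: ${cloudwatch_logs_monthly_cost(20, 90):.2f}")
print(f"INFO:  ${cloudwatch_logs_monthly_cost(4, 90):.2f}")
```

Ingestion dominates in both cases, which is why lowering verbosity usually saves more than shortening retention.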

4. Overprovisioned RDS Instances

A db.r6g.2xlarge RDS instance running 24/7 costs around $700/month. Many companies start with a large instance "just in case" during early development and never revisit it. Worse, RDS Multi-AZ doubles the cost. If your database CPU and memory utilization sit at 5–15% for most of the month — which is typical for early-stage production environments — you're paying for 85–95% of an instance you're not using.

5. Support Tier Creep

Teams often upgrade to Business or Enterprise support after an incident and then forget about it. As spend grows, so does the support cost, because it is percentage-based, not flat. Under the tiered Business rates (10% of the first $10k, 7% of spend between $10k and $80k), a $50k/month AWS account pays about $3,800/month, which is roughly $45k/year for access to a support team many engineering organizations rarely contact. Evaluate annually whether the support tier matches actual usage.
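
Percentage-based support pricing is easier to reason about with the tiers written out. This sketch assumes the tiered Business rates AWS has published (10% of the first $10k of monthly spend, 7% up to $80k, 5% up to $250k, 3% beyond, with a $100 minimum); check the current pricing page before relying on them. The table and function names are illustrative.

```python
# (upper bound of tier in USD, rate) -- assumed Business support tiers.
BUSINESS_TIERS = [
    (10_000, 0.10),
    (80_000, 0.07),
    (250_000, 0.05),
    (float("inf"), 0.03),
]


def tiered_support_cost(monthly_spend, tiers=BUSINESS_TIERS, minimum=100.0):
    """Apply each rate only to the slice of spend that falls inside its tier."""
    cost = 0.0
    lower = 0.0
    for upper, rate in tiers:
        if monthly_spend <= lower:
            break
        cost += (min(monthly_spend, upper) - lower) * rate
        lower = upper
    return max(cost, minimum)


for spend in (500, 10_000, 50_000, 300_000):
    print(f"${spend:>7,}/month spend -> ${tiered_support_cost(spend):,.0f}/month support")
```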

How Cloud Providers Make Pricing Confusing

This isn't accidental. The complexity in cloud pricing serves the providers — it makes it harder to build accurate models and easier to overspend.

The most confusing dimension is compute pricing models. For any given workload you can pay:

On-demand — Pay by the hour or second, no commitment, full price
Reserved Instances — 1-year or 3-year commitment, up to 72% discount
Savings Plans — Flexible commitment to a dollar amount per hour, applies across instance families (Compute Savings Plans) or specific families (EC2 Instance Savings Plans)
Spot Instances — Unused capacity, up to 90% discount, can be interrupted with 2-minute notice

Each option has different trade-offs, different billing mechanics, and different conditions under which it's the right choice. On top of this, pricing varies by region (us-east-1 is typically the cheapest AWS region), by instance generation (newer generations are usually cheaper per vCPU than older ones), and by OS (Linux vs Windows).

Free tier expiry is another common surprise. AWS free tier gives you 12 months of certain services at no charge. When the 12 months expire, billing starts silently. Many teams don't notice until they see an unexpected charge.

Right-Sizing: The Highest-ROI Cost Action

The single highest-return cost optimization action is almost always right-sizing — running instances that match your actual resource requirements rather than what you guessed you'd need.

Industry data consistently shows that most production workloads run at 20–40% of provisioned CPU and memory. A team that provisioned m5.xlarge instances (4 vCPU, 16GB RAM) during early development often finds that t3.large instances (2 vCPU, 8GB RAM, roughly 55% cheaper at on-demand rates) are adequate for the actual load, provided the workload suits the t3 family's burstable CPU model.

To right-size without breaking things:

Pull CloudWatch metrics for CPU utilization and memory utilization over the past 30 days for every instance (EC2 does not report memory by default; install the CloudWatch agent to collect it)
Identify instances where peak CPU stays below 40% — these are candidates for downsizing
Use AWS Compute Optimizer (free service) to get automated right-sizing recommendations based on actual usage
Downsize in staging first, measure for 48 hours, then roll out to production one instance at a time
Set CloudWatch alarms on CPU and memory so you catch any issues immediately after resizing
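
The screening step in the list above can be sketched as a small filter over utilization samples. The 40% peak-CPU threshold comes from the text; applying the same threshold to memory, the data class, and the sample fleet are assumptions for illustration. How you export the samples from CloudWatch is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceUsage:
    instance_id: str
    cpu_percent: List[float] = field(default_factory=list)  # samples over the window
    mem_percent: List[float] = field(default_factory=list)  # needs an agent to collect


def downsizing_candidates(usages, peak_threshold=40.0):
    """Return IDs whose peak CPU and peak memory both stay under the threshold."""
    candidates = []
    for usage in usages:
        if not usage.cpu_percent or not usage.mem_percent:
            continue  # missing data: do not guess, leave the instance alone
        if (max(usage.cpu_percent) < peak_threshold
                and max(usage.mem_percent) < peak_threshold):
            candidates.append(usage.instance_id)
    return candidates


# Hypothetical 30-day samples, heavily abbreviated.
fleet = [
    InstanceUsage("i-web-1", cpu_percent=[12, 18, 35], mem_percent=[22, 25, 31]),
    InstanceUsage("i-api-1", cpu_percent=[30, 55, 71], mem_percent=[40, 44, 48]),
    InstanceUsage("i-batch-1", cpu_percent=[5, 9, 11], mem_percent=[]),
]
print(downsizing_candidates(fleet))
```

Using the peak rather than the average is deliberately conservative: an instance that averages 10% but spikes to 90% during a nightly job is not a safe candidate.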

A disciplined right-sizing exercise across a mid-size AWS account typically yields 25–40% reduction in compute costs with no impact on performance.

Reserved Instances and Savings Plans

If you've been running on AWS for more than six months and your workload is reasonably stable, you should be using Reserved Instances or Savings Plans. Running on-demand for stable workloads is the equivalent of paying rack rate at a hotel when you knew months ago you'd be staying there.

The decision tree:

Compute Savings Plans — Most flexible. A commitment to spend $X/hour on EC2, Fargate, or Lambda compute, regardless of instance family, size, or region. Ideal if you're not sure which instance types you'll use 12 months from now.
EC2 Instance Savings Plans — Slightly more discount than Compute Savings Plans but locked to a specific instance family in a specific region (e.g., m5 in us-east-1). Good if your architecture is stable.
Standard Reserved Instances — Highest discount (up to 72%), locked to specific instance type, region, and OS. Good for well-understood, long-running workloads like databases.
Convertible Reserved Instances — Slightly lower discount than Standard, but you can exchange the reservation if your instance needs change. Better for growing teams.

The 1-year commitment typically breaks even in 7–9 months of on-demand equivalent usage, making it a straightforward decision for any workload with predictable baseload. Commit only to your baseline; keep variable capacity on Spot or on-demand.
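
The break-even figure follows from simple arithmetic: a commitment at discount d costs (1 - d) times the on-demand price for the whole term, so it equals the cost of running on-demand for term x (1 - d) months. A minimal sketch, assuming a constant workload and a single blended discount (the function name is illustrative):

```python
def break_even_months(discount, term_months=12):
    """Months of on-demand usage that cost as much as the whole discounted term."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return term_months * (1 - discount)


for discount in (0.25, 0.33, 0.40):
    months = break_even_months(discount)
    print(f"{discount:.0%} discount -> break-even after {months:.1f} months")
```

A 7–9 month break-even therefore corresponds to discounts of roughly 25–40%. If the workload might disappear before that point, the commitment loses money.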

Spot Instances: Free Money With Caveats

AWS Spot Instances offer 60–90% discounts over on-demand pricing in exchange for one constraint: AWS can reclaim the instance with a 2-minute warning when demand for the underlying capacity increases. For the right workloads, this discount is genuinely enormous.

The workloads where Spot shines:

Batch processing jobs (image processing, ETL pipelines, ML training runs)
CI/CD build agents
Stateless web tier nodes in an auto-scaling group (with at least 2 on-demand instances as fallback)
Development and staging environments

Spot Fleets let you specify multiple instance types and sizes, and AWS fills the request with whatever capacity is available at the current spot price. This dramatically reduces the probability of interruption — you're no longer dependent on capacity availability for a single instance type.

The critical engineering requirement: your application must handle interruption gracefully. For web services, this means the load balancer drains connections before the instance terminates. For batch jobs, it means checkpointing progress so you don't reprocess from scratch. Neither of these is especially difficult to implement, but they must be designed in from the start.
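
For the batch case, checkpointing can be as simple as recording the index of the next unprocessed item. The sketch below is one possible shape, not a definitive implementation: `interruption_pending` is a hypothetical callable you would supply (on EC2 it would typically poll the instance metadata service for a Spot interruption notice), and the JSON checkpoint format is an assumption.

```python
import json
import os
import tempfile


def _save_checkpoint(path, next_index):
    # Write to a temp file and rename, so a kill mid-write cannot corrupt the checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_with_checkpoints(items, handle, checkpoint_path, interruption_pending):
    """Process items in order, resuming from the last checkpoint.

    Returns True when all items are done, False if stopped by an interruption.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        if interruption_pending():
            return False  # stop cleanly; the checkpoint already covers finished work
        handle(items[i])
        _save_checkpoint(checkpoint_path, i + 1)
    return True
```

On a real Spot instance the checkpoint would live on durable storage such as S3 or EFS rather than local disk, since the local volume may disappear with the instance. Note that an item can be handled twice if the process dies between `handle` and the checkpoint write, so `handle` should be idempotent.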

Data Transfer Cost Reduction

Data transfer costs are one of the most impactful areas to optimize because they compound with scale — every additional user or API call generates more transfer.

Same-AZ transfers are free. Data transferred between instances in the same Availability Zone over private IP addresses costs nothing. Cross-AZ transfer costs $0.01/GB in each direction, so a gigabyte sent from one AZ to another is effectively billed at $0.02. For architectures that move large amounts of data between services, placing them in the same AZ reduces this cost significantly (with a trade-off in availability).
VPC endpoints eliminate NAT Gateway costs for AWS services. Gateway Endpoints are free but exist only for S3 and DynamoDB. Other services, such as SQS, use Interface Endpoints, which carry a small hourly charge plus a per-GB fee well below the NAT gateway rate. Either way, traffic from a private subnet to the AWS service bypasses the NAT gateway and its $0.045/GB processing charge.
CDN for internet-facing content. Serving static assets and cacheable API responses through CloudFront or Cloudflare reduces origin egress costs. CloudFront-to-internet pricing is lower than EC2-to-internet, and Cloudflare's standard plans, including the free tier, do not bill per GB of bandwidth.
S3 Transfer Acceleration for uploads. If users are uploading large files from around the world, S3 Transfer Acceleration routes traffic through AWS edge locations, improving upload speeds. It costs more per GB but reduces upload failure rates — evaluate based on your specific use case.
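
To compare the routing options for traffic between a private subnet and an AWS service, a rough sketch follows. The $0.045/GB NAT rate is from the text; the Interface Endpoint rates ($0.01 per hour per AZ and $0.01/GB) are assumptions based on typical us-east-1 pricing, and the function name is illustrative. The NAT gateway's own hourly fee is left out on the assumption that the gateway stays in place for other traffic.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def aws_service_path_costs(gb, azs=2, nat_per_gb=0.045,
                           iface_hourly=0.01, iface_per_gb=0.01):
    """Monthly data-path cost in USD for the same traffic over three routes."""
    return {
        "nat_gateway": gb * nat_per_gb,
        "gateway_endpoint": 0.0,  # S3 and DynamoDB only
        "interface_endpoint": azs * iface_hourly * HOURS_PER_MONTH + gb * iface_per_gb,
    }


# Hypothetical workload: 5 TB/month to an AWS service from two AZs.
for path, cost in aws_service_path_costs(5_000).items():
    print(f"{path:>18}: ${cost:,.2f}")
```

At low volumes the Interface Endpoint's fixed hourly fee can exceed the NAT processing charge it replaces, so the comparison is worth running per service.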

Storage Cost Optimisation

Storage is the category where costs accumulate most silently. Unlike compute, storage doesn't get cleaned up when a service stops.

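S3 Lifecycle Policies — Automatically transition objects to cheaper storage classes after a specified number of days. S3 Standard-Infrequent Access costs roughly 45% less per GB than Standard, and Glacier Instant Retrieval roughly 68% less than Standard-IA (about 80% less than Standard). Both add retrieval fees and minimum storage durations, so they suit data that is rarely read. Objects older than 90 days that are rarely accessed almost certainly don't need to be in Standard.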
S3 Intelligent-Tiering — Automatically moves objects between access tiers based on actual access patterns. A small monitoring fee per object applies, but for large buckets with unpredictable access patterns, it consistently reduces costs.
EBS Volume Audit — Detached EBS volumes continue to incur charges. A gp3 volume at 100GB costs ~$8/month. Across a large AWS account with frequent EC2 instance turnover, orphaned volumes accumulate. Run a monthly audit and delete detached volumes.
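Snapshot Retention — Automated EBS snapshots are cheap per GB but accumulate. Snapshots are incremental, so a 500GB production database with daily snapshots retained for a year does not store 365 full copies (that would be about 182TB), but on a busy database the changed blocks alone can still add up to several times the volume size. Set retention policies: daily for 7 days, weekly for 4 weeks, monthly for 12 months is a reasonable baseline.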
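
The daily/weekly/monthly baseline above can be expressed as a small selection function. This is a sketch under stated assumptions: "weekly" means the newest snapshot in each of the four most recent ISO weeks that contain one, "monthly" means the newest in each of the twelve most recent calendar months, and all names are illustrative. It only decides what to keep; deleting snapshots is left to whatever tooling you use.

```python
from datetime import date, timedelta


def _newest_per_bucket(newest_first, bucket_key, limit):
    """Pick the newest date from each of the `limit` most recent buckets."""
    chosen = {}
    for d in newest_first:
        key = bucket_key(d)
        if key in chosen:
            continue  # already kept a newer snapshot for this bucket
        if len(chosen) == limit:
            break
        chosen[key] = d
    return set(chosen.values())


def snapshots_to_keep(snapshot_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of snapshot dates the retention policy keeps."""
    newest_first = sorted(set(snapshot_dates), reverse=True)
    keep = {d for d in newest_first if 0 <= (today - d).days < daily}
    keep |= _newest_per_bucket(
        newest_first, lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= _newest_per_bucket(
        newest_first, lambda d: (d.year, d.month), monthly)
    return keep


# Hypothetical history: one snapshot per day for the past year.
today = date(2024, 6, 30)
history = [today - timedelta(days=i) for i in range(365)]
keep = snapshots_to_keep(history, today)
print(f"keep {len(keep)} of {len(history)} snapshots")
```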

FinOps Culture and Tooling

Cost optimisation as a one-time exercise doesn't work. Infrastructure evolves, teams grow, new services get provisioned, and the savings from last quarter's optimisation erode. The organisations that consistently manage cloud costs well treat it as an ongoing practice — FinOps.

The key practices:

Cost allocation tags — Tag every resource with at minimum: environment (prod/staging/dev), team, and project. Without tags, you can't attribute costs and you can't hold teams accountable.
Per-team budgets with alerts — Use AWS Budgets to set monthly spend limits per team and alert at 80% and 100% of budget. This shifts cost awareness to the teams creating the infrastructure.
AWS Cost Explorer — The built-in tool for analysing cost trends, identifying anomalies, and getting savings plan recommendations. Use it monthly at minimum.
Infracost — Open source tool that estimates cost changes in pull requests. When an engineer adds a new RDS instance or changes an instance type, the cost delta appears in the PR before it's merged.
CloudHealth or Spot.io — Third-party platforms that provide deeper cost analytics, automated right-sizing recommendations, and Spot Instance management. Worth the cost for accounts spending more than $10k/month.
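
Tag coverage is the practice that is easiest to check mechanically. A minimal sketch of a tag audit, assuming you have already exported resources as a mapping of resource ID to tag dictionary (the export step, the function names, and the sample resources are hypothetical):

```python
REQUIRED_TAGS = ("environment", "team", "project")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or blank on one resource."""
    # Tag keys are compared exactly: AWS treats 'Environment' and 'environment'
    # as different keys, and mixed casing is a common cause of attribution gaps.
    present = {key for key, value in resource_tags.items() if str(value).strip()}
    return [tag for tag in required if tag not in present]


def untagged_report(resources):
    """Map each non-compliant resource ID to the tags it is missing."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


# Hypothetical inventory.
inventory = {
    "i-0a1": {"environment": "prod", "team": "payments", "project": "checkout"},
    "i-0b2": {"Environment": "prod", "team": "search", "project": "indexer"},
    "vol-0c3": {"environment": "dev", "team": ""},
}
print(untagged_report(inventory))
```

Running a check like this on a schedule, and failing deployments that create untagged resources, keeps the cost data attributable as the account grows.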

A Cost Optimisation Checklist

Use this as a starting point for an infrastructure cost audit. Estimated savings are based on typical accounts; your mileage will vary.

Enable Cost Allocation Tags and set up Cost Explorer dashboards — prerequisite for everything else; no direct savings but enables all other work
Right-size EC2 instances using Compute Optimizer — typically 25–40% reduction on compute costs
Purchase Savings Plans for baseline compute — 30–60% reduction on committed compute
Move batch and CI/CD workloads to Spot — 60–80% reduction on those specific workloads
Implement VPC endpoints for S3 and DynamoDB — eliminates NAT gateway charges for these services; $0.045/GB savings
Audit and delete orphaned EBS volumes and old snapshots — varies but often $200–$500+/month on mature accounts
Implement S3 Lifecycle Policies — 40–60% reduction on object storage costs for older data
Reduce CloudWatch log retention and lower log verbosity in production — can cut logging costs by 50–80%
Move static assets behind CloudFront — reduces egress costs and improves performance
Review RDS instance sizing and Multi-AZ requirements — right-sizing databases often yields $200–$2,000+/month in savings
Set budget alerts per team and per environment — prevents uncontrolled spend growth going forward
Review and right-size support tier — can save $1,000–$10,000+/month depending on spend level

Cloud costs are manageable — but they require active management. The teams that run lean cloud infrastructure treat cost optimisation with the same rigor they apply to performance or security: scheduled reviews, clear ownership, tooling, and accountability. If you're ready to audit your infrastructure and identify where the money is going, see what we cover in our cloud infrastructure services.