Cloud Costs Are Rising - How to Optimize for Efficiency
Last quarter, our CFO called me into an unexpected meeting. "Our AWS bill has doubled in the past year," she said, sliding a chart across the table. "We need to get this under control without slowing down product development."
This wasn't a unique situation. Across the industry, companies are facing the same challenge: cloud costs are spiraling upward while budgets are tightening. The days of treating cloud resources as essentially unlimited are over.
After three months of focused effort, we reduced our cloud spend by 42% without compromising performance or reliability. In this post, I'll share the strategies, tools, and architectural patterns that worked for us, along with the hard-earned lessons from approaches that didn't.
Why Cloud Costs Are Rising
Before diving into solutions, it's worth understanding why cloud costs have become such a pressing issue:
- Cloud provider price increases - AWS, Azure, and GCP have all implemented price hikes on various services
- Scale of adoption - As more workloads move to the cloud, total bills naturally increase
- Complexity - Modern architectures with microservices, managed services, and data pipelines create intricate cost structures
- Inefficient defaults - Many cloud services have default settings optimized for convenience, not cost
- Lack of visibility - Complex billing makes it hard to attribute costs to specific teams or features
Understanding these factors helps frame a more strategic approach to optimization.
Our Systematic Approach to Cost Optimization
After analyzing our situation, we developed a methodical approach that balanced quick wins with sustainable long-term changes.
Phase 1: Visibility and Governance
You can't optimize what you can't measure. Our first step was implementing proper cost visibility tools and governance structures.
Tagging Strategy
The foundation of our cost visibility was a comprehensive tagging strategy:
# Required tags for all resources
Environment: [production, staging, development, test]
Team: [backend, frontend, data, platform, shared]
Product: [core-app, analytics, admin, api]
Project: [customer-feature-x, internal-initiative-y]
ManagedBy: [terraform, cloudformation, manual, service-name]
We enforced these tags through organizational policies and built automation to catch untagged resources:
# Sample Python script to find untagged resources on AWS
import boto3
import csv
from datetime import datetime

required_tags = ['Environment', 'Team', 'Product', 'Project', 'ManagedBy']

def check_ec2_tags():
    ec2 = boto3.resource('ec2')
    untagged = []
    for instance in ec2.instances.all():
        instance_tags = {tag['Key']: tag['Value'] for tag in instance.tags or []}
        missing_tags = [tag for tag in required_tags if tag not in instance_tags]
        if missing_tags:
            untagged.append({
                'ResourceId': instance.id,
                'ResourceType': 'EC2',
                'MissingTags': ', '.join(missing_tags),
                'Owner': instance_tags.get('Owner', 'Unknown')
            })
    return untagged

# Additional resource checks for RDS, S3, etc.
# ...

# Export results
def export_results(untagged_resources):
    with open(f'untagged_resources_{datetime.now().strftime("%Y-%m-%d")}.csv', 'w') as f:
        writer = csv.DictWriter(f, fieldnames=['ResourceType', 'ResourceId', 'MissingTags', 'Owner'])
        writer.writeheader()
        writer.writerows(untagged_resources)

# Main execution
all_untagged = []
all_untagged.extend(check_ec2_tags())
# Add other resource checks
export_results(all_untagged)
Cost Allocation Tools
Next, we implemented tools to visualize costs across different dimensions:
- AWS Cost Explorer with custom views for each team
- CloudHealth for deeper analysis and recommendations
- Custom Grafana dashboards showing costs alongside application metrics
The key insight here was correlating costs with business metrics. Instead of just looking at absolute dollars, we tracked metrics like "cost per user," "cost per transaction," and "cost per API call." This changed the conversation from "reduce costs" to "improve efficiency."
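As a rough illustration, here is a minimal sketch of how a "cost per transaction" number can be assembled from the Cost Explorer API. It assumes the Team tag has been activated as a cost allocation tag; the tag values and the transaction count (which would come from your own metrics store) are illustrative, not our exact implementation.

# Minimal sketch: "cost per transaction" for one team over a date range.
# Assumes the tag is an activated cost allocation tag; values are illustrative.
import boto3

def team_cost(tag_key: str, tag_value: str, start: str, end: str) -> float:
    """Sum unblended cost for resources carrying a given tag (dates as YYYY-MM-DD)."""
    ce = boto3.client('ce')
    resp = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        Filter={'Tags': {'Key': tag_key, 'Values': [tag_value]}},
    )
    return sum(float(day['Total']['UnblendedCost']['Amount'])
               for day in resp['ResultsByTime'])

def cost_per_transaction(cost: float, transactions: int) -> float:
    """Divide spend by a transaction count pulled from your own metrics store."""
    return cost / transactions if transactions else float('inf')

# Example usage (hypothetical numbers):
# print(cost_per_transaction(team_cost('Team', 'backend', '2024-05-01', '2024-06-01'), 1_250_000))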
Budget Alerts and Anomaly Detection
We set up automated alerts for:
- Budget overruns at team and service levels
- Unusual spending patterns
- Resources approaching reserved instance expirations (see the sketch after this list)
- Underutilized reserved instances
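For the reserved instance expiration alerts, a small script run on a schedule is enough. Here is a minimal sketch using boto3; the 30-day window is an arbitrary choice and the notification step (email, Slack, etc.) is omitted.

# Minimal sketch: flag active reserved instances that expire within N days.
import boto3
from datetime import datetime, timedelta, timezone

def expiring_reserved_instances(days=30):
    ec2 = boto3.client('ec2')
    cutoff = datetime.now(timezone.utc) + timedelta(days=days)
    resp = ec2.describe_reserved_instances(
        Filters=[{'Name': 'state', 'Values': ['active']}]
    )
    # Return the RIs whose end date falls inside the warning window
    return [
        {'Id': ri['ReservedInstancesId'],
         'Type': ri['InstanceType'],
         'Count': ri['InstanceCount'],
         'End': ri['End']}
        for ri in resp['ReservedInstances']
        if ri['End'] <= cutoff
    ]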
Here's a sample AWS CLI command to create a budget alert:
aws budgets create-budget \
--account-id 123456789012 \
--budget '{"BudgetName":"Backend Team Monthly Budget","BudgetLimit":{"Amount":"5000","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST","CostFilters":{"TagKeyValue":["user:Team$Backend"]}}' \
--notifications-with-subscribers '[{"Notification":{"ComparisonOperator":"GREATER_THAN","NotificationType":"ACTUAL","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"Address":"backend-team@example.com","SubscriptionType":"EMAIL"}]}]'
These alerts caught several cost spikes early, including a runaway data transfer issue that would have cost thousands if left unchecked.
Phase 2: Quick Wins
With visibility in place, we identified several quick optimization opportunities that delivered immediate savings with minimal effort.
Right-sizing Compute Resources
Many of our EC2 instances and RDS databases were overprovisioned. Using CloudWatch metrics, we identified instances consistently running at low utilization.
For EC2 instances, we created a simple script to analyze CloudWatch metrics and recommend right-sizing:
import boto3
import datetime

def get_instance_metrics(instance_id, metric_name, statistic, days=14):
    cloudwatch = boto3.client('cloudwatch')
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(days=days)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName=metric_name,
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour
        Statistics=[statistic]
    )
    if not response['Datapoints']:
        return None
    # Return the 95th percentile (sizing for peak load with some headroom)
    datapoints = sorted(response['Datapoints'], key=lambda x: x[statistic])
    index = min(int(len(datapoints) * 0.95), len(datapoints) - 1)
    return datapoints[index][statistic]

def recommend_instance_size(instance_id, instance_type):
    # Get the 95th percentile of hourly average CPU utilization
    cpu_utilization = get_instance_metrics(instance_id, 'CPUUtilization', 'Average')
    if cpu_utilization is None:
        return "No data available"
    # Simple sizing logic - can be expanded for more instance types
    if cpu_utilization < 10:
        recommendation = "Severely underutilized - consider downsizing by two sizes"
    elif cpu_utilization < 20:
        recommendation = "Underutilized - consider downsizing by one size"
    elif cpu_utilization > 80:
        recommendation = "Approaching capacity - consider upsizing by one size"
    else:
        recommendation = "Properly sized"
    return f"Current utilization: {cpu_utilization:.2f}% - {recommendation}"

# Example usage
ec2 = boto3.resource('ec2')
for instance in ec2.instances.filter(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]):
    recommendation = recommend_instance_size(instance.id, instance.instance_type)
    print(f"Instance {instance.id} ({instance.instance_type}): {recommendation}")
This analysis led to:
- Downsizing 30% of our EC2 instances
- Converting on-demand RDS instances to read replicas where appropriate
- Identifying several forgotten development instances that could be terminated
The savings from right-sizing alone covered 18% of our total cost reduction.
Storage Optimization
Storage was our second-largest cost center. We focused on:
- S3 Lifecycle Policies - Moving older data to cheaper storage classes:
{
  "Rules": [
    {
      "ID": "Move to Infrequent Access after 30 days",
      "Status": "Enabled",
      "Prefix": "logs/",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
- EBS Volume Cleanup - Identifying and removing unused volumes:
# Find unattached EBS volumes
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
--output table
- RDS Storage Optimization - Analyzing database usage patterns and implementing table partitioning and cleanup routines
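For the cleanup routines, the idea is simply to delete expired rows in small batches so the job never holds long locks or creates a replication lag spike. A minimal sketch, assuming a MySQL-compatible database and hypothetical table and column names (not our actual schema):

# Minimal sketch of a batched retention cleanup (MySQL syntax assumed).
# Table/column names are hypothetical; run from a scheduled job.
def purge_expired_rows(conn, table="events", ts_column="created_at",
                       retention_days=365, batch_size=5000):
    """Delete expired rows in small batches to keep locks and replica lag short."""
    cur = conn.cursor()
    while True:
        cur.execute(
            f"DELETE FROM {table} "
            f"WHERE {ts_column} < NOW() - INTERVAL {retention_days} DAY "
            f"LIMIT {batch_size}"
        )
        conn.commit()
        if cur.rowcount < batch_size:
            break
    cur.close()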
These storage optimizations delivered an additional 12% in cost savings.
Spot Instances and Reserved Instances
For predictable workloads, we implemented:
- Reserved Instances for baseline production capacity:
# Example of purchasing reserved instances
aws ec2 purchase-reserved-instances-offering \
--reserved-instances-offering-id r-123456 \
--instance-count 10
- Spot Instances for batch processing and test environments:
# Example AWS CDK (v2) code for an EC2 Auto Scaling group backed by Spot Instances
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_autoscaling as autoscaling
from constructs import Construct

class SpotInstanceStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_id="vpc-12345")

        # Auto Scaling group that launches Spot capacity up to a maximum hourly price
        autoscaling.AutoScalingGroup(
            self, "ASG",
            vpc=vpc,
            instance_type=ec2.InstanceType("c5.large"),
            machine_image=ec2.AmazonLinuxImage(),
            user_data=ec2.UserData.custom('#!/bin/bash\necho "Hello, World!"'),
            min_capacity=2,
            max_capacity=10,
            spot_price="0.04"  # Maximum price you're willing to pay per instance-hour
        )
- Savings Plans for more flexible compute commitments
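Cost Explorer can also suggest a Savings Plans commitment based on recent usage, which is a useful starting point before committing. A minimal sketch of pulling that recommendation with boto3; the term, payment option, lookback window, and the summary fields read from the response are illustrative choices rather than a prescription.

# Minimal sketch: fetch a Compute Savings Plans purchase recommendation.
import boto3

def savings_plan_recommendation():
    ce = boto3.client('ce')
    resp = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType='COMPUTE_SP',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='THIRTY_DAYS',
    )
    summary = resp.get('SavingsPlansPurchaseRecommendation', {}) \
                  .get('SavingsPlansPurchaseRecommendationSummary', {})
    # Surface just the headline numbers for review
    return {
        'hourly_commitment': summary.get('HourlyCommitmentToPurchase'),
        'estimated_monthly_savings': summary.get('EstimatedMonthlySavingsAmount'),
        'estimated_savings_percentage': summary.get('EstimatedSavingsPercentage'),
    }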
By implementing a mix of these purchasing options, we reduced our effective compute costs by about 45%.
Phase 3: Architectural Optimizations
The most sustainable cost reductions came from architectural changes. These took longer to implement but had the biggest long-term impact.
Implement Proper Data Lifecycle Management
We discovered we were storing and processing far more data than necessary. We implemented a comprehensive data lifecycle:
- Data classification - Categorizing data by business value and retention requirements
- Tiered storage - Using the appropriate storage for each data class
- Automated archiving and deletion - Enforcing retention policies
- Data sampling - For logs and metrics that don't need 100% retention (a minimal sampling sketch follows this list)
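For the sampling piece, the mechanism can be as simple as a per-severity sampling rate applied before a record is shipped. A minimal sketch; the rates and the stubbed-out sink are illustrative assumptions, not our production values.

# Minimal sketch of severity-based log sampling; rates and sink are illustrative.
import random

SAMPLE_RATES = {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.1, "DEBUG": 0.01}

def should_ship(level: str) -> bool:
    """Keep every error and warning, but only a fraction of lower-severity records."""
    return random.random() < SAMPLE_RATES.get(level, 1.0)

def ship_to_storage(record: dict) -> None:
    """Placeholder for the real sink (e.g., Firehose, S3, or a logging agent)."""
    print(record)

def process(record: dict) -> None:
    if should_ship(record.get("level", "INFO")):
        ship_to_storage(record)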
One key change was implementing log aggregation and filtering at the edge before sending to our centralized logging system:
// Example log filtering in Node.js (uses the winston and winston-cloudwatch packages)
const winston = require('winston');
const WinstonCloudWatch = require('winston-cloudwatch');

// Only ship warnings and errors to remote log storage in production
const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'warn' : 'info',
  format: winston.format.json(),
  transports: [
    // Send warning- and error-level logs to CloudWatch Logs
    // (credentials come from the standard AWS credential chain)
    new WinstonCloudWatch({
      logGroupName: 'application-logs',
      logStreamName: `${process.env.SERVICE_NAME}-${process.env.NODE_ENV}`,
      awsRegion: process.env.AWS_REGION
    }),
    // Also write to the local console for debugging
    new winston.transports.Console({
      level: 'debug'
    })
  ]
});
This reduced our logging costs by over 60%.
Serverless for Variable Workloads
We identified several services with highly variable load patterns that were inefficiently running on always-on infrastructure. Converting these to serverless significantly reduced costs:
// Example TypeScript code using AWS CDK for a Lambda function with API Gateway
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';

export class ServerlessApiStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create Lambda function from TypeScript code
    const handler = new NodejsFunction(this, 'Handler', {
      runtime: lambda.Runtime.NODEJS_16_X,
      entry: 'lambda/api-handler.ts',
      handler: 'handler',
      bundling: {
        minify: true,
        externalModules: ['aws-sdk'],
      },
      environment: {
        NODE_ENV: 'production',
      },
      memorySize: 256,
      timeout: cdk.Duration.seconds(10),
    });

    // Create API Gateway
    const api = new apigateway.RestApi(this, 'Endpoint', {
      deployOptions: {
        stageName: 'prod',
        // A cache cluster is required for method-level response caching
        cacheClusterEnabled: true,
        // Only cache responses for GET methods
        methodOptions: {
          '/*/*': {
            cachingEnabled: false,
          },
          '/*/GET': {
            cachingEnabled: true,
            cacheTtl: cdk.Duration.minutes(5),
          },
        },
      },
    });

    // Add Lambda integration
    const integration = new apigateway.LambdaIntegration(handler);
    api.root.addMethod('GET', integration);

    // Add a resource
    const items = api.root.addResource('items');
    items.addMethod('GET', integration);
    items.addMethod('POST', integration);
  }
}
This serverless approach reduced costs for variable workloads by 70-80% compared to running dedicated servers.
Caching Strategy Overhaul
We implemented a multi-tier caching strategy:
- Browser caching with appropriate Cache-Control headers (see the sketch after this list)
- CDN caching for static assets and API responses
- Application-level caching using Redis
- Database query caching
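For the first two tiers, most of the work is sending the right Cache-Control headers so browsers and the CDN do the caching for you. A minimal sketch, assuming a Flask service for illustration; the framework choice, routes, and max-age values are hypothetical.

# Minimal sketch of browser/CDN cache headers (Flask assumed; values illustrative).
from flask import Flask, jsonify, send_from_directory

app = Flask(__name__)

@app.get("/static/<path:filename>")
def static_asset(filename):
    resp = send_from_directory("static", filename)
    # Versioned static assets: cache aggressively at the browser and CDN
    resp.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return resp

@app.get("/api/products")
def products():
    resp = jsonify({"items": []})
    # Read-heavy API responses: short shared cache so the CDN absorbs repeated reads
    resp.headers["Cache-Control"] = "public, max-age=60"
    return resp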
For our API caching, we implemented a system that automatically adjusted TTLs based on data volatility:
// Example TypeScript code for dynamic cache TTL

// Minimal request shape used by the caching strategy
interface Request {
  method: string;
  path: string;
  query: Record<string, unknown>;
}

interface CachingStrategy {
  getKey(request: Request): string;
  getTtl(resource: string, data: any): number;
  shouldCache(request: Request): boolean;
}

class DynamicApiCache implements CachingStrategy {
  private volatilityMap: Map<string, number> = new Map();
  private updateFrequency: Map<string, number[]> = new Map();

  constructor() {
    // Initialize with default volatility scores
    this.volatilityMap.set('products', 0.3);  // Changes infrequently
    this.volatilityMap.set('prices', 0.8);    // Changes frequently
    this.volatilityMap.set('inventory', 0.9); // Changes very frequently
  }

  getKey(request: Request): string {
    return `${request.method}:${request.path}:${JSON.stringify(request.query)}`;
  }

  getTtl(resource: string, data: any): number {
    // Extract resource type from path
    const resourceType = resource.split('/')[1]; // e.g., /products/123 -> products
    // Get volatility score (0-1, where 1 is highly volatile)
    const volatility = this.volatilityMap.get(resourceType) || 0.5;
    // Calculate TTL based on volatility (inverse relationship)
    // Max 1 hour, min 10 seconds
    const maxTtl = 60 * 60; // 1 hour in seconds
    const minTtl = 10;      // 10 seconds
    return Math.round(minTtl + (1 - volatility) * (maxTtl - minTtl));
  }

  shouldCache(request: Request): boolean {
    // Only cache GET requests
    return request.method === 'GET';
  }

  // Update volatility based on write frequency
  trackWrite(resource: string): void {
    const resourceType = resource.split('/')[1];
    if (!this.updateFrequency.has(resourceType)) {
      this.updateFrequency.set(resourceType, []);
    }
    const timestamps = this.updateFrequency.get(resourceType)!;
    timestamps.push(Date.now());
    // Keep only the last hour of updates
    const oneHourAgo = Date.now() - 3600000;
    const recentUpdates = timestamps.filter(t => t > oneHourAgo);
    this.updateFrequency.set(resourceType, recentUpdates);
    // Recalculate volatility based on update frequency
    const updatesPerHour = recentUpdates.length;
    // Normalize to a 0-1 scale (assuming >60 updates/hour is max volatility)
    const normalizedFrequency = Math.min(updatesPerHour / 60, 1);
    this.volatilityMap.set(resourceType, normalizedFrequency);
  }
}
This caching strategy reduced our database load by 40% and API Gateway costs by 60%.
Phase 4: Operational Efficiency
The final phase focused on operational practices that continually optimize costs.
Automated Shutdown of Non-Production Resources
We implemented automated shutdown of development and testing environments during off-hours:
# Terraform example for scheduled start/stop of dev resources
resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
  recurrence             = "0 20 * * 1-5" # 8 PM Monday-Friday
  autoscaling_group_name = aws_autoscaling_group.dev_asg.name
}

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  min_size               = 1
  max_size               = 3
  desired_capacity       = 2
  recurrence             = "0 8 * * 1-5" # 8 AM Monday-Friday
  autoscaling_group_name = aws_autoscaling_group.dev_asg.name
}
This reduced our non-production costs by approximately 70% with minimal impact on developer productivity.
Cost-Aware CI/CD Pipeline
We integrated cost analysis into our CI/CD pipeline:
- Infrastructure as Code scanning for cost impact
- Cost estimates in pull requests
- Cost regression testing
Here's an example of using Infracost in a GitHub Actions workflow:
# .github/workflows/infracost.yml
name: Infracost
on:
  pull_request:
    paths:
      - '**.tf'
      - '.github/workflows/infracost.yml'

jobs:
  infracost:
    runs-on: ubuntu-latest
    name: Infracost
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Setup Infracost
        uses: infracost/actions/setup@v1
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Run Infracost
        run: |
          infracost breakdown --path=. \
            --terraform-var-file=environments/dev.tfvars \
            --format=json \
            --out-file=/tmp/infracost.json

      - name: Post Infracost comment
        uses: infracost/actions/comment@v1
        with:
          path: /tmp/infracost.json
          behavior: update
This made cost visibility a standard part of our development workflow.
FinOps Culture
Finally, we established a FinOps practice within our organization:
- Regular cost review meetings with all engineering teams
- Cost efficiency KPIs for engineering managers
- Recognition for cost-saving initiatives
- "Cost Efficiency Champions" program to promote best practices
Results and Lessons Learned
Over three months, we achieved a 42% reduction in overall cloud costs while actually improving application performance. The cost per transaction dropped by 68%, making our business fundamentally more scalable.
Here's the breakdown of our savings:
- Right-sizing and purchasing optimizations: 28%
- Storage and data lifecycle improvements: 22%
- Architectural changes: 32%
- Operational efficiency: 18%
What Worked Best
- Making costs visible to the teams responsible for them
- Focusing on business metrics rather than absolute costs
- Automating cost governance rather than relying on manual reviews
- Building cost awareness into the development process
What Didn't Work
- Arbitrary cost-cutting mandates without considering performance impacts
- One-size-fits-all approaches to resource allocation
- Overoptimizing rarely-used services with minimal cost impact
- Ignoring developer experience in optimization efforts
Conclusion: Sustainable Cost Optimization
Cloud cost optimization isn't a one-time project but an ongoing discipline. By embedding cost awareness into our engineering culture and processes, we've created a sustainable approach that balances innovation with efficiency.
The key insight from our journey is that cost optimization is ultimately about eliminating waste, not cutting corners. By focusing on architectural efficiency, appropriate resource sizing, and automating routine tasks, we've built a more resilient and cost-effective infrastructure.
As cloud costs continue to rise, these practices will only become more important. The companies that thrive will be those that treat cost efficiency as a competitive advantage rather than a necessary evil.
What cost optimization strategies have worked for your organization? Have you found specific tools or approaches particularly effective? Share your experiences in the comments below.