Cloud Costs Are Rising - How to Optimize for Efficiency
Last quarter, our CFO called me into an unexpected meeting. "Our AWS bill has doubled in the past year," she said, sliding a chart across the table. "We need to get this under control without slowing down product development."
This wasn't a unique situation. Across the industry, companies are facing the same challenge: cloud costs are spiraling upward while budgets are tightening. The days of treating cloud resources as essentially unlimited are over.
After three months of focused effort, we reduced our cloud spend by 42% without compromising performance or reliability. In this post, I'll share the strategies, tools, and architectural patterns that worked for us, along with the hard-earned lessons from approaches that didn't.
Why Cloud Costs Are Rising
Before diving into solutions, it's worth understanding why cloud costs have become such a pressing issue:
- Cloud provider price increases - AWS, Azure, and GCP have all implemented price hikes on various services
- Scale of adoption - As more workloads move to the cloud, total bills naturally increase
- Complexity - Modern architectures with microservices, managed services, and data pipelines create intricate cost structures
- Inefficient defaults - Many cloud services have default settings optimized for convenience, not cost
- Lack of visibility - Complex billing makes it hard to attribute costs to specific teams or features
Understanding these factors helps frame a more strategic approach to optimization.
Our Systematic Approach to Cost Optimization
After analyzing our situation, we developed a methodical approach that balanced quick wins with sustainable long-term changes.
Phase 1: Visibility and Governance
You can't optimize what you can't measure. Our first step was implementing proper cost visibility tools and governance structures.
Tagging Strategy
The foundation of our cost visibility was a comprehensive tagging strategy:
# Required tags for all resources
Environment: [production, staging, development, test]
Team: [backend, frontend, data, platform, shared]
Product: [core-app, analytics, admin, api]
Project: [customer-feature-x, internal-initiative-y]
ManagedBy: [terraform, cloudformation, manual, service-name]
We enforced these tags through organizational policies and built automation to catch untagged resources:
# Sample Python script to find untagged resources on AWS
import boto3
import csv
from datetime import datetime

required_tags = ['Environment', 'Team', 'Product', 'Project', 'ManagedBy']

def check_ec2_tags():
    ec2 = boto3.resource('ec2')
    untagged = []
    for instance in ec2.instances.all():
        instance_tags = {tag['Key']: tag['Value'] for tag in instance.tags or []}
        missing_tags = [tag for tag in required_tags if tag not in instance_tags]
        if missing_tags:
            untagged.append({
                'ResourceId': instance.id,
                'ResourceType': 'EC2',
                'MissingTags': ', '.join(missing_tags),
                'Owner': instance_tags.get('Owner', 'Unknown')
            })
    return untagged

# Additional resource checks for RDS, S3, etc.
# ...

# Export results
def export_results(untagged_resources):
    with open(f'untagged_resources_{datetime.now().strftime("%Y-%m-%d")}.csv', 'w') as f:
        writer = csv.DictWriter(f, fieldnames=['ResourceType', 'ResourceId', 'MissingTags', 'Owner'])
        writer.writeheader()
        writer.writerows(untagged_resources)

# Main execution
all_untagged = []
all_untagged.extend(check_ec2_tags())
# Add other resource checks
export_results(all_untagged)
Cost Allocation Tools
Next, we implemented tools to visualize costs across different dimensions:
- AWS Cost Explorer with custom views for each team
- CloudHealth for deeper analysis and recommendations
- Custom Grafana dashboards showing costs alongside application metrics
The key insight here was correlating costs with business metrics. Instead of just looking at absolute dollars, we tracked metrics like "cost per user," "cost per transaction," and "cost per API call." This changed the conversation from "reduce costs" to "improve efficiency."
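As a rough illustration, here is a minimal sketch of how a "cost per transaction" number can be assembled from the Cost Explorer API. It assumes the Team tag has been activated as a cost allocation tag; the tag values and the transaction count (which would come from your own metrics store) are illustrative, not our exact implementation.

# Minimal sketch: "cost per transaction" for one team over a date range.
# Assumes the tag is an activated cost allocation tag; values are illustrative.
import boto3

def team_cost(tag_key: str, tag_value: str, start: str, end: str) -> float:
    """Sum unblended cost for resources carrying a given tag (dates as YYYY-MM-DD)."""
    ce = boto3.client('ce')
    resp = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        Filter={'Tags': {'Key': tag_key, 'Values': [tag_value]}},
    )
    return sum(float(day['Total']['UnblendedCost']['Amount'])
               for day in resp['ResultsByTime'])

def cost_per_transaction(cost: float, transactions: int) -> float:
    """Divide spend by a transaction count pulled from your own metrics store."""
    return cost / transactions if transactions else float('inf')

# Example usage (hypothetical numbers):
# print(cost_per_transaction(team_cost('Team', 'backend', '2024-05-01', '2024-06-01'), 1_250_000))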
Budget Alerts and Anomaly Detection
We set up automated alerts for:
- Budget overruns at team and service levels
- Unusual spending patterns
- Resources approaching reserved instance expirations (see the sketch after this list)
- Underutilized reserved instances
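For the reserved instance expiration alerts, a small script run on a schedule is enough. Here is a minimal sketch using boto3; the 30-day window is an arbitrary choice and the notification step (email, Slack, etc.) is omitted.

# Minimal sketch: flag active reserved instances that expire within N days.
import boto3
from datetime import datetime, timedelta, timezone

def expiring_reserved_instances(days=30):
    ec2 = boto3.client('ec2')
    cutoff = datetime.now(timezone.utc) + timedelta(days=days)
    resp = ec2.describe_reserved_instances(
        Filters=[{'Name': 'state', 'Values': ['active']}]
    )
    # Return the RIs whose end date falls inside the warning window
    return [
        {'Id': ri['ReservedInstancesId'],
         'Type': ri['InstanceType'],
         'Count': ri['InstanceCount'],
         'End': ri['End']}
        for ri in resp['ReservedInstances']
        if ri['End'] <= cutoff
    ]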
Here's a sample AWS CLI command to create a budget alert:
aws budgets create-budget \
--account-id 123456789012 \
--budget '{"BudgetName":"Backend Team Monthly Budget","BudgetLimit":{"Amount":"5000","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST","CostFilters":{"TagKeyValue":["user:Team$Backend"]}}' \
--notifications-with-subscribers '[{"Notification":{"ComparisonOperator":"GREATER_THAN","NotificationType":"ACTUAL","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"Address":"backend-team@example.com","SubscriptionType":"EMAIL"}]}]'
These alerts caught several cost spikes early, including a runaway data transfer issue that would have cost thousands if left unchecked.
Phase 2: Quick Wins
With visibility in place, we identified several quick optimization opportunities that delivered immediate savings with minimal effort.
Right-sizing Compute Resources
Many of our EC2 instances and RDS databases were overprovisioned. Using CloudWatch metrics, we identified instances consistently running at low utilization.
For EC2 instances, we created a simple script to analyze CloudWatch metrics and recommend right-sizing:
import boto3
import datetime

def get_instance_metrics(instance_id, metric_name, statistic, days=14):
    cloudwatch = boto3.client('cloudwatch')
    end_time = datetime.datetime.utcnow()
    start_time = end_time - datetime.timedelta(days=days)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName=metric_name,
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1 hour
        Statistics=[statistic]
    )
    if not response['Datapoints']:
        return None
    # Return the 95th percentile (sizing for peak load with some headroom)
    datapoints = sorted(response['Datapoints'], key=lambda x: x[statistic])
    index = min(int(len(datapoints) * 0.95), len(datapoints) - 1)
    return datapoints[index][statistic]

def recommend_instance_size(instance_id, instance_type):
    # Get the 95th percentile of hourly average CPU utilization
    cpu_utilization = get_instance_metrics(instance_id, 'CPUUtilization', 'Average')
    if cpu_utilization is None:
        return "No data available"
    # Simple sizing logic - can be expanded for more instance types
    if cpu_utilization < 10:
        recommendation = "Severely underutilized - consider downsizing by two sizes"
    elif cpu_utilization < 20:
        recommendation = "Underutilized - consider downsizing by one size"
    elif cpu_utilization > 80:
        recommendation = "Approaching capacity - consider upsizing by one size"
    else:
        recommendation = "Properly sized"
    return f"Current utilization: {cpu_utilization:.2f}% - {recommendation}"

# Example usage
ec2 = boto3.resource('ec2')
for instance in ec2.instances.filter(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]):
    recommendation = recommend_instance_size(instance.id, instance.instance_type)
    print(f"Instance {instance.id} ({instance.instance_type}): {recommendation}")
This analysis led to:
- Downsizing 30% of our EC2 instances
- Converting on-demand RDS instances to read replicas where appropriate
- Identifying several forgotten development instances that could be terminated
The savings from right-sizing alone covered 18% of our total cost reduction.
Storage Optimization
Storage was our second-largest cost center. We focused on:
- S3 Lifecycle Policies - Moving older data to cheaper storage classes:
{
  "Rules": [
    {
      "ID": "Move to Infrequent Access after 30 days",
      "Status": "Enabled",
      "Prefix": "logs/",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
- EBS Volume Cleanup - Identifying and removing unused volumes:
# Find unattached EBS volumes
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}' \
--output table
- RDS Storage Optimization - Analyzing database usage patterns and implementing table partitioning and cleanup routines
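For the cleanup routines, the idea is simply to delete expired rows in small batches so the job never holds long locks or creates a replication lag spike. A minimal sketch, assuming a MySQL-compatible database and hypothetical table and column names (not our actual schema):

# Minimal sketch of a batched retention cleanup (MySQL syntax assumed).
# Table/column names are hypothetical; run from a scheduled job.
def purge_expired_rows(conn, table="events", ts_column="created_at",
                       retention_days=365, batch_size=5000):
    """Delete expired rows in small batches to keep locks and replica lag short."""
    cur = conn.cursor()
    while True:
        cur.execute(
            f"DELETE FROM {table} "
            f"WHERE {ts_column} < NOW() - INTERVAL {retention_days} DAY "
            f"LIMIT {batch_size}"
        )
        conn.commit()
        if cur.rowcount < batch_size:
            break
    cur.close()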
These storage optimizations delivered an additional 12% in cost savings.
Spot Instances and Reserved Instances
For predictable workloads, we implemented:
- Reserved Instances for baseline production capacity:
# Example of purchasing reserved instances
aws ec2 purchase-reserved-instances-offering \
--reserved-instances-offering-id r-123456 \
--instance-count 10
- Spot Instances for batch processing and test environments:
# Example AWS CDK (v2) code for an EC2 Auto Scaling group backed by Spot Instances
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_autoscaling as autoscaling
from constructs import Construct

class SpotInstanceStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_id="vpc-12345")

        # Auto Scaling group that launches Spot capacity up to a maximum hourly price
        autoscaling.AutoScalingGroup(
            self, "ASG",
            vpc=vpc,
            instance_type=ec2.InstanceType("c5.large"),
            machine_image=ec2.AmazonLinuxImage(),
            user_data=ec2.UserData.custom('#!/bin/bash\necho "Hello, World!"'),
            min_capacity=2,
            max_capacity=10,
            spot_price="0.04"  # Maximum price you're willing to pay per instance-hour
        )
- Savings Plans for more flexible compute commitments
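Cost Explorer can also suggest a Savings Plans commitment based on recent usage, which is a useful starting point before committing. A minimal sketch of pulling that recommendation with boto3; the term, payment option, lookback window, and the summary fields read from the response are illustrative choices rather than a prescription.

# Minimal sketch: fetch a Compute Savings Plans purchase recommendation.
import boto3

def savings_plan_recommendation():
    ce = boto3.client('ce')
    resp = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType='COMPUTE_SP',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='THIRTY_DAYS',
    )
    summary = resp.get('SavingsPlansPurchaseRecommendation', {}) \
                  .get('SavingsPlansPurchaseRecommendationSummary', {})
    # Surface just the headline numbers for review
    return {
        'hourly_commitment': summary.get('HourlyCommitmentToPurchase'),
        'estimated_monthly_savings': summary.get('EstimatedMonthlySavingsAmount'),
        'estimated_savings_percentage': summary.get('EstimatedSavingsPercentage'),
    }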
By implementing a mix of these purchasing options, we reduced our effective compute costs by about 45%.
Phase 3: Architectural Optimizations
The most sustainable cost reductions came from architectural changes. These took longer to implement but had the biggest long-term impact.
Implement Proper Data Lifecycle Management
We discovered we were storing and processing far more data than necessary. We implemented a comprehensive data lifecycle:
- Data classification - Categorizing data by business value and retention requirements
- Tiered storage - Using the appropriate storage for each data class
- Automated archiving and deletion - Enforcing retention policies
- Data sampling - For logs and metrics that don't need 100% retention (a minimal sampling sketch follows this list)
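For the sampling piece, the mechanism can be as simple as a per-severity sampling rate applied before a record is shipped. A minimal sketch; the rates and the stubbed-out sink are illustrative assumptions, not our production values.

# Minimal sketch of severity-based log sampling; rates and sink are illustrative.
import random

SAMPLE_RATES = {"ERROR": 1.0, "WARN": 1.0, "INFO": 0.1, "DEBUG": 0.01}

def should_ship(level: str) -> bool:
    """Keep every error and warning, but only a fraction of lower-severity records."""
    return random.random() < SAMPLE_RATES.get(level, 1.0)

def ship_to_storage(record: dict) -> None:
    """Placeholder for the real sink (e.g., Firehose, S3, or a logging agent)."""
    print(record)

def process(record: dict) -> None:
    if should_ship(record.get("level", "INFO")):
        ship_to_storage(record)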
One key change was implementing log aggregation and filtering at the edge before sending to our centralized logging system:
// Example log filtering in Node.js (uses the winston and winston-cloudwatch packages)
const winston = require('winston');
const WinstonCloudWatch = require('winston-cloudwatch');

// Only ship warnings and errors to remote log storage in production
const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'warn' : 'info',
  format: winston.format.json(),
  transports: [
    // Send warning- and error-level logs to CloudWatch Logs
    // (credentials come from the standard AWS credential chain)
    new WinstonCloudWatch({
      logGroupName: 'application-logs',
      logStreamName: `${process.env.SERVICE_NAME}-${process.env.NODE_ENV}`,
      awsRegion: process.env.AWS_REGION
    }),
    // Also write to the local console for debugging
    new winston.transports.Console({
      level: 'debug'
    })
  ]
});
This reduced our logging costs by over 60%.
Serverless for Variable Workloads
We identified several services with highly variable load patterns that were inefficiently running on always-on infrastructure. Converting these to serverless significantly reduced costs:
// Example TypeScript code using AWS CDK for a Lambda function with API Gateway
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';

export class ServerlessApiStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create Lambda function from TypeScript code
    const handler = new NodejsFunction(this, 'Handler', {
      runtime: lambda.Runtime.NODEJS_16_X,
      entry: 'lambda/api-handler.ts',
      handler: 'handler',
      bundling: {
        minify: true,
        externalModules: ['aws-sdk'],
      },
      environment: {
        NODE_ENV: 'production',
      },
      memorySize: 256,
      timeout: cdk.Duration.seconds(10),
    });

    // Create API Gateway
    const api = new apigateway.RestApi(this, 'Endpoint', {
      deployOptions: {
        stageName: 'prod',
        // A cache cluster is required for method-level response caching
        cacheClusterEnabled: true,
        // Only cache responses for GET methods
        methodOptions: {
          '/*/*': {
            cachingEnabled: false,
          },
          '/*/GET': {
            cachingEnabled: true,
            cacheTtl: cdk.Duration.minutes(5),
          },
        },
      },
    });

    // Add Lambda integration
    const integration = new apigateway.LambdaIntegration(handler);
    api.root.addMethod('GET', integration);

    // Add a resource
    const items = api.root.addResource('items');
    items.addMethod('GET', integration);
    items.addMethod('POST', integration);
  }
}
This serverless approach reduced costs for variable workloads by 70-80% compared to running dedicated servers.
Caching Strategy Overhaul
We implemented a multi-tier caching strategy:
- Browser caching with appropriate Cache-Control headers (see the sketch after this list)
- CDN caching for static assets and API responses
- Application-level caching using Redis
- Database query caching
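For the first two tiers, most of the work is sending the right Cache-Control headers so browsers and the CDN do the caching for you. A minimal sketch, assuming a Flask service for illustration; the framework choice, routes, and max-age values are hypothetical.

# Minimal sketch of browser/CDN cache headers (Flask assumed; values illustrative).
from flask import Flask, jsonify, send_from_directory

app = Flask(__name__)

@app.get("/static/<path:filename>")
def static_asset(filename):
    resp = send_from_directory("static", filename)
    # Versioned static assets: cache aggressively at the browser and CDN
    resp.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return resp

@app.get("/api/products")
def products():
    resp = jsonify({"items": []})
    # Read-heavy API responses: short shared cache so the CDN absorbs repeated reads
    resp.headers["Cache-Control"] = "public, max-age=60"
    return resp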
For our API caching, we implemented a system that automatically adjusted TTLs based on data volatility:
// Example TypeScript code for dynamic cache TTL

// Minimal request shape used by the caching strategy
interface Request {
  method: string;
  path: string;
  query: Record<string, unknown>;
}

interface CachingStrategy {
  getKey(request: Request): string;
  getTtl(resource: string, data: any): number;
  shouldCache(request: Request): boolean;
}

class DynamicApiCache implements CachingStrategy {
  private volatilityMap: Map<string, number> = new Map();
  private updateFrequency: Map<string, number[]> = new Map();

  constructor() {
    // Initialize with default volatility scores
    this.volatilityMap.set('products', 0.3);  // Changes infrequently
    this.volatilityMap.set('prices', 0.8);    // Changes frequently
    this.volatilityMap.set('inventory', 0.9); // Changes very frequently
  }

  getKey(request: Request): string {
    return `${request.method}:${request.path}:${JSON.stringify(request.query)}`;
  }

  getTtl(resource: string, data: any): number {
    // Extract resource type from path
    const resourceType = resource.split('/')[1]; // e.g., /products/123 -> products
    // Get volatility score (0-1, where 1 is highly volatile)
    const volatility = this.volatilityMap.get(resourceType) || 0.5;
    // Calculate TTL based on volatility (inverse relationship)
    // Max 1 hour, min 10 seconds
    const maxTtl = 60 * 60; // 1 hour in seconds
    const minTtl = 10;      // 10 seconds
    return Math.round(minTtl + (1 - volatility) * (maxTtl - minTtl));
  }

  shouldCache(request: Request): boolean {
    // Only cache GET requests
    return request.method === 'GET';
  }

  // Update volatility based on write frequency
  trackWrite(resource: string): void {
    const resourceType = resource.split('/')[1];
    if (!this.updateFrequency.has(resourceType)) {
      this.updateFrequency.set(resourceType, []);
    }
    const timestamps = this.updateFrequency.get(resourceType)!;
    timestamps.push(Date.now());
    // Keep only the last hour of updates
    const oneHourAgo = Date.now() - 3600000;
    const recentUpdates = timestamps.filter(t => t > oneHourAgo);
    this.updateFrequency.set(resourceType, recentUpdates);
    // Recalculate volatility based on update frequency
    const updatesPerHour = recentUpdates.length;
    // Normalize to a 0-1 scale (assuming >60 updates/hour is max volatility)
    const normalizedFrequency = Math.min(updatesPerHour / 60, 1);
    this.volatilityMap.set(resourceType, normalizedFrequency);
  }
}
This caching strategy reduced our database load by 40% and API Gateway costs by 60%.
Phase 4: Operational Efficiency
The final phase focused on operational practices that continually optimize costs.
Automated Shutdown of Non-Production Resources
We implemented automated shutdown of development and testing environments during off-hours:
# Terraform example for scheduled start/stop of dev resources
resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
  recurrence             = "0 20 * * 1-5" # 8 PM Monday-Friday
  autoscaling_group_name = aws_autoscaling_group.dev_asg.name
}

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  min_size               = 1
  max_size               = 3
  desired_capacity       = 2
  recurrence             = "0 8 * * 1-5" # 8 AM Monday-Friday
  autoscaling_group_name = aws_autoscaling_group.dev_asg.name
}
This reduced our non-production costs by approximately 70% with minimal impact on developer productivity.
Cost-Aware CI/CD Pipeline
We integrated cost analysis into our CI/CD pipeline:
- Infrastructure as Code scanning for cost impact
- Cost estimates in pull requests
- Cost regression testing
Here's an example of using Infracost in a GitHub Actions workflow:
# .github/workflows/infracost.yml
name: Infracost
on:
  pull_request:
    paths:
      - '**.tf'
      - '.github/workflows/infracost.yml'

jobs:
  infracost:
    runs-on: ubuntu-latest
    name: Infracost
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Setup Infracost
        uses: infracost/actions/setup@v1
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Run Infracost
        run: |
          infracost breakdown --path=. \
            --terraform-var-file=environments/dev.tfvars \
            --format=json \
            --out-file=/tmp/infracost.json

      - name: Post Infracost comment
        uses: infracost/actions/comment@v1
        with:
          path: /tmp/infracost.json
          behavior: update
This made cost visibility a standard part of our development workflow.
FinOps Culture
Finally, we established a FinOps practice within our organization:
- Regular cost review meetings with all engineering teams
- Cost efficiency KPIs for engineering managers
- Recognition for cost-saving initiatives
- "Cost Efficiency Champions" program to promote best practices
Results and Lessons Learned
Over three months, we achieved a 42% reduction in overall cloud costs while actually improving application performance. The cost per transaction dropped by 68%, making our business fundamentally more scalable.
Here's the breakdown of our savings:
- Right-sizing and purchasing optimizations: 28%
- Storage and data lifecycle improvements: 22%
- Architectural changes: 32%
- Operational efficiency: 18%
What Worked Best
- Making costs visible to the teams responsible for them
- Focusing on business metrics rather than absolute costs
- Automating cost governance rather than relying on manual reviews
- Building cost awareness into the development process
What Didn't Work
- Arbitrary cost-cutting mandates without considering performance impacts
- One-size-fits-all approaches to resource allocation
- Overoptimizing rarely-used services with minimal cost impact
- Ignoring developer experience in optimization efforts
Conclusion: Sustainable Cost Optimization
Cloud cost optimization isn't a one-time project but an ongoing discipline. By embedding cost awareness into our engineering culture and processes, we've created a sustainable approach that balances innovation with efficiency.
The key insight from our journey is that cost optimization is ultimately about eliminating waste, not cutting corners. By focusing on architectural efficiency, appropriate resource sizing, and automating routine tasks, we've built a more resilient and cost-effective infrastructure.
As cloud costs continue to rise, these practices will only become more important. The companies that thrive will be those that treat cost efficiency as a competitive advantage rather than a necessary evil.
What cost optimization strategies have worked for your organization? Have you found specific tools or approaches particularly effective? Share your experiences in the comments below.