
AWS ECS Cost Optimization: Migrate Fargate to EC2


Image by Arief JR

Today I will share my experience with cost optimization on AWS ECS by migrating from Fargate to EC2, using Terraform to automate the process. AWS ECS is a powerful container orchestration service, and it offers two launch types: EC2 and Fargate. Fargate is the serverless option, and it is what the company I work for currently uses.

AWS ECS with Fargate is a very good option for running many services (microservices) without provisioning, configuring, or managing the underlying EC2 instances.

Currently our traffic is low, so I moved to EC2 instances for cost optimization. With EC2 instances we can use Spot Instances to minimize cost, and we can enable VPC trunking so that one instance can run multiple services. If you want to know more about VPC trunking and the supported instance types, refer to the documentation here: VPC Trunking Supported Instances.
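
VPC trunking is an account-level ECS setting, so it needs to be enabled before the EC2 container instances register with the cluster. As a rough sketch with the AWS CLI (check your current account settings first; the region placeholder follows the same convention used later in this post):

# Check the effective awsvpcTrunking setting for this account
aws ecs list-account-settings --effective-settings --region {fill your aws region}

# Enable ENI trunking as the account default (applies to container instances launched afterwards)
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled --region {fill your aws region}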

Comparison between AWS ECS with EC2 and AWS ECS with Fargate

| Feature | AWS ECS with EC2 | AWS ECS with Fargate |
| --- | --- | --- |
| Infrastructure | User-managed EC2 instances | Fully managed by AWS (serverless compute engine for containers) |
| Management | Higher operational overhead (managing OS, ECS agent, patching, etc.) | Very low operational overhead |
| Control | Full instance-level control, high flexibility | Task-level control only, limited flexibility |
| Pricing Model | Per instance hour (plus EBS, data transfer, even for idle capacity) | Per vCPU/memory per second (pay-as-you-go for resources consumed) |
| Cost | Potentially lower for high utilization/commitments (e.g., with Reserved Instances/Spot Instances) | Generally higher per hour, but no idle cost |
| Scaling | Automatic scaling at the instance level (via Auto Scaling Group) | Automatic scaling at the task level |
| Instance Types | Choose from all available EC2 instance types | Predefined Fargate configurations only |
| GPU Support | Yes | No |
| Host Access | Yes, direct access to the EC2 instance | No direct access to the underlying host OS |
| Use Cases | Consistent workloads, workloads requiring specific instance types or specialized hardware (e.g., GPUs), fine-grained control, custom requirements | Bursty or unpredictable workloads, microservices, development/testing environments |

Cost Simulation between AWS ECS with EC2 and AWS ECS with Fargate

For example, assume there are 45 tasks, each with 2 vCPUs, 4 GB of memory, and 20 GB of storage, and all services run on Linux x86_64:

  • ECS With Fargate Spot

    The cost calculation is as follows:

    | Unit | Price |
    | --- | --- |
    | vCPU | 0.015168 USD per vCPU-hour |
    | Memory | 0.001659 USD per GB-hour |
    | Storage | 0.000133 USD per GB-hour |

    Monthly vCPU charges = (# of tasks) × (# of vCPUs per task) × (price per vCPU-hour) × (24 hours per day) × (# of days). Since the tasks run 24 hours a day, I price by the hour: 45 × 2 × 0.015168 × 24 × 30 = 982.886 USD.

    Monthly memory charges = (# of tasks) × (# of GB of memory per task) × (price per GB-hour) × (24 hours per day) × (# of days) = 45 × 4 × 0.001659 × 24 × 30 = 215.006 USD.

    Monthly storage charges = (# of tasks) × (# of GB of storage per task) × (price per GB-hour) × (24 hours per day) × (# of days) = 45 × 20 × 0.000133 × 24 × 30 = 86.184 USD.

    The total monthly charges = 982.886 + 215.006 + 86.184 = 1,284.076 USD

  • ECS With EC2 Spot

    Assume VPC trunking is enabled for the ECS cluster. As an example, I will use 4 m5.xlarge Spot Instances. The cost calculation is as follows:

    | Unit | Price |
    | --- | --- |
    | Instance | 0.0829 USD per hour |
    | Storage | 0.096 USD per GB-month |

    Monthly instance charges:
    4 instances x 0.24 USD on-demand hourly cost x 730 hours in a month = 700.80 USD
    700.80 USD - (700.80 USD x 0.72 Spot discount) = 196.224 USD
    Spot Instances (monthly): 196.224 USD

    Monthly EBS storage charges:
    2,920 total EC2 hours / 730 hours in a month = 4.00 instance-months
    100 GB x 4.00 instance-months x 0.096 USD = 38.40 USD
    Amazon Elastic Block Store (EBS) total cost (monthly): 38.40 USD

    The total monthly charges = 196.224 + 38.40 = 234.624 USD (the quick check after this list reproduces both totals).
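
For a quick sanity check of the arithmetic above, here is a small shell sketch that simply reproduces the numbers in this simulation (the prices and the Spot discount are the assumed values from the tables above, not live AWS pricing):

#!/usr/bin/env bash
# Reproduce the monthly cost estimates above; all prices are the assumed values from the tables.
HOURS_PER_MONTH=$((24 * 30))   # 720 hours, as used in the Fargate calculation

# Fargate Spot: 45 tasks, each 2 vCPU / 4 GB memory / 20 GB storage
fargate=$(echo "45*2*0.015168*$HOURS_PER_MONTH + 45*4*0.001659*$HOURS_PER_MONTH + 45*20*0.000133*$HOURS_PER_MONTH" | bc -l)

# EC2 Spot: 4 x m5.xlarge at 0.24 USD on-demand with a 72% Spot discount over 730 hours, plus 100 GB gp3 each
ec2=$(echo "4*0.24*730*(1-0.72) + 4*100*0.096" | bc -l)

printf "Fargate Spot: %.2f USD/month\nEC2 Spot:     %.2f USD/month\n" "$fargate" "$ec2"
# Prints roughly 1284.08 and 234.62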

Implementation Overview

For this setup, I'll configure AWS ECS with an Auto Scaling group of EC2 instances. The infrastructure specification is as follows:

m5.xlarge (Spot Instance):

  • 4 vCPUs
  • 16 GB of memory

The Auto Scaling group will be configured to launch a new instance when the current instance count is below the desired capacity, and to terminate instances when the count is above it. With the capacity provider's managed scaling enabled, ECS adjusts that desired capacity based on the tasks waiting for placement.

I will use Terraform to create the Auto Scaling group and attach it to the ECS cluster as a capacity provider. Here are a few selected excerpts of the Terraform code:

module "ecs" {
  source = "../modules/aws-ecs/"

  cluster_name = local.name

  cluster_configuration = {
    execute_command_configuration = {
      logging = "OVERRIDE",
      kmsKeyId = "arn:aws:kms:{fill your aws region}:xxxxxxxx:key/xxxxxxx-4bb9-xxxxx-xxxxx-xxxxxx"
      log_configuration = {
        cloud_watch_log_group_name = "/aws/ecs/${local.name}"
      }
    }
  }

  default_capacity_provider_use_fargate = false
  autoscaling_capacity_providers = {
    # Spot instances
    prod-env = {
      auto_scaling_group_arn         = module.autoscaling_prod["prod-env"].autoscaling_group_arn
      managed_termination_protection = "ENABLED"

      managed_scaling = {
        maximum_scaling_step_size = 1
        minimum_scaling_step_size = 1
        status                    = "ENABLED"
        target_capacity           = 100
      }

      lifecycle = {
        create_before_destroy = true
      }

      default_capacity_provider_strategy = {
        base = 1
        weight = 100
      }
    }
  }

  tags = local.tags
}

module "autoscaling_prod" {
  source  = "terraform-aws-modules/autoscaling/aws"
  version = "~> 6.5"

  for_each = {
    # Spot instances
    prod-env = {
      instance_type       = "m5.xlarge"
      min_size            = 0
      max_size            = 4
      desired_capacity    = 1
      use_mixed_instances_policy = true
      mixed_instances_policy = {
        instances_distribution = {
          on_demand_base_capacity                  = 0
          on_demand_percentage_above_base_capacity = 0
          spot_allocation_strategy                 = "price-capacity-optimized"
        }

        override = [
          {
            instance_type     = "m5.xlarge"
            weighted_capacity = "1"
          }
        ]
      }
      user_data = <<-EOT
        #!/bin/bash

        cat <<'EOF' >> /etc/ecs/ecs.config
        ECS_CLUSTER=${local.name}
        ECS_LOGLEVEL=debug
        ECS_CONTAINER_INSTANCE_TAGS=${jsonencode(local.tags)}
        ECS_ENABLE_TASK_IAM_ROLE=true
        ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
        EOF
      EOT
    }
  }

  name = "${local.name}-${each.key}"

  image_id      = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]
  instance_type = each.value.instance_type

  security_groups                 = [module.ecs_sg.security_group_id]
  user_data                       = base64encode(each.value.user_data)
  ignore_desired_capacity_changes = false

  create_iam_instance_profile = false
  iam_instance_profile_arn = "arn:aws:iam::xxxxxxx:instance-profile/XXXXX"

  block_device_mappings = [
    {
      # Root volume
      device_name = "/dev/xvda"
      no_device   = 0
      ebs = {
        delete_on_termination = true
        encrypted             = true
        volume_size           = 100
        volume_type           = "gp3"
      }
    }
  ]

  vpc_zone_identifier = local.subnet_id
  health_check_type   = "EC2"
  min_size            = each.value.min_size
  max_size            = each.value.max_size
  desired_capacity    = each.value.desired_capacity
  health_check_grace_period = 0
  create_scaling_policy = false

  termination_policies = [
    "Default"
  ]
  
  autoscaling_group_tags = {
    AmazonECSManaged = true
  }

  capacity_rebalance = true

  protect_from_scale_in = false

  instance_refresh = {
    strategy = "Rolling"
    preferences = {
      checkpoint_delay       = 300
      checkpoint_percentages = [35, 70, 100]
      instance_warmup        = 300
      min_healthy_percentage = 50
      max_healthy_percentage = 100
      skip_matching          = true
    }
    triggers = ["tag"]
  }

  initial_lifecycle_hooks = [
    {
      name                 = "ecs-managed-draining-termination-hook"
      default_result       = "CONTINUE"
      heartbeat_timeout    = 3600
      lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
      notification_metadata = jsonencode({ "event" = "instance_terminating", "cluster" = local.name, "asg" = "${local.name}-${each.key}" })
    }
  ]

  use_mixed_instances_policy = each.value.use_mixed_instances_policy
  mixed_instances_policy     = each.value.mixed_instances_policy

  tags = local.tags
}
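
After replacing the placeholders (KMS key, IAM instance profile, subnets, and tags) with your own values, the usual Terraform workflow applies the cluster and capacity provider; roughly (the plan file name is just an example):

# Run from the directory that contains the module calls above
terraform init
terraform plan -out=ecs-ec2.plan
terraform apply ecs-ec2.plan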

Next, update the existing task definition and the service: select EC2 as the launch type and choose the capacity provider created earlier. Select your existing task definition and create a new revision with JSON.

Here is an example of part of a task definition:

{
    "family": "{fill the task name, usually should same with container name}",
    "containerDefinitions": [
        {
            "name": "{fill the container name}",
            "image": "{fill the container image}",,
            "cpu": 0,
            "memoryReservation": 256,
            "portMappings": [
                {
                    "name": "{fill the container name}",
                    "containerPort": 3000,
                    "hostPort": 3000,
                    "protocol": "tcp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "environment": [
              ........
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "secrets": [
                {
                    "name": "XXXXXXXX",
                    "valueFrom": "arn:aws:ssm:{fill your aws region}:{fill your aws account id}:parameter/production/broom-ml-mrp-service/XXXXXXXX"
                },
                {
                    "name": "XXXXXXXXXXX",
                    "valueFrom": "arn:aws:ssm:{fill your aws region}:{fill your aws account id}:parameter/production/broom-ml-mrp-service/XXXXXXXXXXX"
                },
                {
                    "name": "XXXXXXXXXXX",
                    "valueFrom": "arn:aws:ssm:{fill your aws region}:{fill your aws account id}:parameter/xxxxxxxx/production/xxxxx/xxxxx"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/{log_group_name what you want}",
                    "mode": "non-blocking",
                    "awslogs-create-group": "true",
                    "max-buffer-size": "25m",
                    "awslogs-region": "{fill your aws region}",
                    "awslogs-stream-prefix": "ecs"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f https://localhost:3000/api/v1/health/ready -H 'action_by: 0' || exit 1"
                ],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 90
            },
            "systemControls": []
        }
    ],
    "taskRoleArn": "arn:aws:iam::xxxxxxxxx:role/xxxxxxx",
    "executionRoleArn": "arn:aws:iam::xxxxxxxxx:role/xxxxxxxxxx",
    "networkMode": "awsvpc",
    "volumes": [],
    "placementConstraints": [],
    "requiresCompatibilities": [
        "EC2"
    ],
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "enableFaultInjection": false
}
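
If you prefer the AWS CLI to the console, a revision like the one above can be registered from a JSON file (the file name here is just an example):

# Register a new revision of the task definition from the JSON above
aws ecs register-task-definition \
    --cli-input-json file://task-definition.json \
    --region {fill your aws region}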

Then, for the service, change it to use the EC2 capacity provider created above, as in this example:

(Screenshot: AWS ECS Update Service)

Then click the Update button, and you will see the service running on the EC2 instances.
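
The same change can also be made with the AWS CLI by overriding the service's capacity provider strategy, roughly like this (the capacity provider name prod-env comes from the Terraform example above; the other values are placeholders for your own cluster, service, and task definition revision):

# Move the service onto the EC2 capacity provider and force a new deployment
aws ecs update-service \
    --cluster {fill your cluster name} \
    --service {fill your service name} \
    --capacity-provider-strategy capacityProvider=prod-env,weight=100,base=1 \
    --task-definition {fill your task family}:{revision} \
    --force-new-deployment \
    --region {fill your aws region}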

Conclusion

  • Don’t forget to remove Fargate as the default capacity provider, or set default_capacity_provider_use_fargate = false in the Terraform code.
  • In the lines kmsKeyId = "arn:aws:kms:{fill your aws region}:xxxxxxxx:key/xxxxxxx-4bb9-xxxxx-xxxxx-xxxxxx" and iam_instance_profile_arn = "arn:aws:iam::xxxxxxx:instance-profile/XXXXX", replace the values with your existing KMS key and IAM instance profile. If you don’t have an IAM instance profile, set create_iam_instance_profile = true in the Terraform code so the module creates a new one.

Thanks for reading!

This post is licensed under CC BY 4.0 by the author.