Regional Fault-tolerant Training Cluster Deployment

Pulumi · Accepted Answer

To build a fault-tolerant training cluster within an AWS region, a common approach is to rely on managed services that handle most of the failure recovery for you. Amazon Elastic Container Service (ECS) is a good fit: it can run containers serverlessly on AWS Fargate or on EC2 capacity such as a Spot Fleet. Fargate removes server management, and Spot capacity lowers cost for workloads that tolerate interruption.

In our Pulumi program, we will use the following resources:

- **ECS Cluster**: This acts as the logical grouping of our resources for our training application.
- **Task Definition**: This describes our application's container and the parameters it runs with.
- **Service**: This maintains a specified number of instances of a task definition simultaneously.
- **Spot Fleet Request**: This requests a fleet of EC2 Spot Instances at a discount to On-Demand prices. AWS can reclaim a Spot Instance with a two-minute warning when it needs the capacity back or when the Spot price rises above your maximum price, so Spot suits work that can be interrupted and resumed, such as training jobs that checkpoint regularly.
- **Auto Scaling**: Auto-scaling policies adjust the service's desired task count in response to load. The program below does not configure auto scaling; it is listed here as a recommended addition.
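
The two-minute Spot warning mentioned above can be observed from inside an instance through the EC2 instance metadata service. The following is a minimal sketch, not part of the Pulumi program: the function names are made up for illustration, and it assumes the job runs directly on a Spot Instance with IMDSv2 reachable. The endpoint returns HTTP 404 while no interruption is scheduled.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw interruption notice, or None if none is pending
    or the metadata service is unreachable (e.g. not running on EC2)."""
    try:
        # IMDSv2 requires a short-lived session token.
        token_req = urllib.request.Request(
            f"{IMDS}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise
    except OSError:  # connection refused, timeout, no route
        return None


def seconds_until_interruption(body, now=None):
    """Parse a notice like {"action": "terminate", "time": "...Z"}."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

A training loop could call `fetch_instance_action()` every few seconds and write a checkpoint as soon as it returns a notice.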

Here is a Pulumi program that sets up such a training cluster:

```python
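import json

import pulumi
import pulumi_aws as aws

# Application-specific values come from stack configuration.
# The keys "imageUrl" and "containerName" are names chosen for this example.
config = pulumi.Config()
application_image_url = config.require("imageUrl")
container_name = config.get("containerName") or "training-app"

# Create an ECS cluster to host our services
ecs_cluster = aws.ecs.Cluster("training-cluster")

# Execution role that ECS uses to pull the image and write logs
execution_role = aws.iam.Role("task-execution-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
aws.iam.RolePolicyAttachment("task-execution-policy",
    role=execution_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
)

# Register a task definition for the training application
task_definition = aws.ecs.TaskDefinition("app-task",
    family="training-app",
    cpu="256",
    memory="512",
    network_mode="awsvpc",
    requires_compatibilities=["FARGATE"],  # Specify FARGATE to run containers without managing servers
    execution_role_arn=execution_role.arn,
    # json.dumps guarantees valid JSON, unlike a hand-written f-string
    container_definitions=json.dumps([{
        "name": container_name,
        "image": application_image_url,
        "portMappings": [{"containerPort": 80, "hostPort": 80}],
    }]),
)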

# Create a service that runs our task definition with ECS Fargate
service = aws.ecs.Service("app-svc",
    cluster=ecs_cluster.arn,
    task_definition=task_definition.arn,
    desired_count=3,  # Start with three instances for high availability
    launch_type="FARGATE",  # Serverless launch type
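    network_configuration={
        "assign_public_ip": True,  # Boolean in Pulumi; lets tasks in a public subnet reach the image registry
        # List subnets in at least two Availability Zones so one AZ outage does not stop every task
        "subnets": ["subnet-xxxxxxxxxxxxxxxxx", "subnet-yyyyyyyyyyyyyyyyy"],
        "security_groups": ["sg-xxxxxxxxxxxxxxxxx"],  # Specify your security group
    },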
)

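# Request a Spot Fleet to reduce costs.
# Note: as written, these instances do not join the ECS cluster, and the Fargate
# service above never schedules tasks on them. Joining them to the cluster would
# need an ECS-optimized AMI, an instance profile, and user data that sets ECS_CLUSTER.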
spot_fleet = aws.ec2.SpotFleetRequest("spot-fleet",
    spot_price="0.03",  # Specify the max price you are willing to pay per hour per instance
    target_capacity=5,  # Define the number of instances you want
    iam_fleet_role="arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",  # Specify your IAM role
    launch_specifications=[
        {
            "ami": "ami-0abcdef1234567890",  # Specify the AMI ID
            "instance_type": "t2.micro",  # Specify instance type
            "subnet_id": "subnet-xxxxxxxxxxxxxxxxx",  # Specify your subnet
        },
    ],
)

pulumi.export("ecs_cluster_name", ecs_cluster.name)
pulumi.export("service_name", service.name)
pulumi.export("spot_fleet_request_id", spot_fleet.id)
```

Let's break down this program:

1. We first create an ECS cluster, which is simply a logical collection of ECS tasks or services.
2. Then we register a task definition, which specifies how our containers should run: the container image, CPU and memory allocations, network mode, and the execution role that ECS uses to pull the image and write logs on the task's behalf.
3. We then create an ECS service that maintains a desired count of instances of the task definition and uses AWS Fargate for serverless execution.
4. We also create a Spot Fleet request. Spot Fleet launches and maintains the target number of Spot Instances, and it tries to replace interrupted instances while capacity is available at or below your maximum price.
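
Fargate accepts only certain CPU and memory pairings in a task definition, and an invalid pair is rejected at deployment time. The helper below is a hypothetical pre-flight check, not part of any AWS or Pulumi library. Its table covers only the sizes from 0.25 to 4 vCPU; larger sizes exist, so treat the AWS documentation as the source of truth.

```python
# Memory options (MiB) per CPU value (CPU units), for 0.25 to 4 vCPU only.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu, memory):
    """Check a cpu/memory pair as passed to the task definition.

    Accepts ints or numeric strings, since the task definition takes
    strings such as "256" and "512".
    """
    return int(memory) in FARGATE_MEMORY_MIB.get(int(cpu), [])
```

For the task definition above, `is_valid_fargate_size("256", "512")` returns `True`, while a pair such as `("256", "4096")` is rejected.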

The `desired_count` parameter tells the service to keep three copies of the task running. If a task fails or is stopped, the ECS service scheduler launches a replacement, so capacity recovers without manual action. The Spot Fleet is a cost-efficient way to run additional, interruption-tolerant work, but in this program it is separate from the Fargate service, and Spot Instances are not suitable for jobs that cannot handle interruptions.
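
Replacing a stopped task only helps a training job if the new task can resume where the old one left off. The sketch below shows one way to do that with periodic checkpoints and a SIGTERM handler, since ECS sends SIGTERM to a container before force-stopping it. All names here are illustrative, the "training step" is a placeholder counter, and the checkpoint is written to a local path; a real job would need durable storage such as S3 or EFS.

```python
import json
import os
import signal
import threading


def install_sigterm_handler(stop):
    """Set the stop event when the container receives SIGTERM."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def save_checkpoint(path, state):
    """Write atomically so a crash mid-write cannot corrupt the file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, stop, checkpoint_every=100):
    """Resume from the last checkpoint and run until done or stopped."""
    state = load_checkpoint(path)
    while state["step"] < total_steps and not stop.is_set():
        state["step"] += 1  # placeholder for one real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save on completion or shutdown
    return state
```

In a container entry point, you would create `stop = threading.Event()`, call `install_sigterm_handler(stop)` from the main thread, and then call `train(...)`.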

Make sure the subnet IDs and security group used by the service and the Spot Fleet belong to your VPC. For regional fault tolerance, spread the service's subnets across at least two Availability Zones. You also need IAM roles with the required permissions.

Lastly, we export the cluster name, service name, and spot fleet request ID for convenient access to these resources through the Pulumi CLI.

Please ensure you replace the placeholder values like `subnet-xxxxxxxxxxxxxxxxx`, `sg-xxxxxxxxxxxxxxxxx`, `ami-0abcdef1234567890`, and IAM role ARNs with actual values from your AWS setup. The image URL and container name should be specific to your application as well.

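Deploying this program with `pulumi up` gives you the core of a regional training cluster on AWS. How fault-tolerant it is in practice depends on using subnets in multiple Availability Zones and on the training job being able to resume after an interruption.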