Scalable GPU Instances for Deep Learning Models on AWS EC2

Pulumi · Accepted Answer

To run deep learning models on scalable GPU instances on AWS EC2, choose an instance type with GPU support and an AMI that's optimized for machine learning. AWS offers several GPU-equipped instance families, such as `p3` (NVIDIA V100) and `g4` (for example `g4dn`, with NVIDIA T4).

Here’s a high-level overview of what you typically do:

1. **Select a GPU Instance Type**: For deep learning, the `p3` and `g4` families both come with NVIDIA GPUs; `p3` targets heavier training workloads, while `g4` is usually the cheaper choice for inference and lighter training.
   
2. **Choose a Machine Learning AMI**: Amazon provides AMIs that are pre-installed with popular deep learning frameworks such as TensorFlow, PyTorch, and MXNet.

3. **Create a Launch Template**: This defines the instance type, AMI, and other settings such as Security Groups. Launch Configurations are the older alternative; AWS recommends Launch Templates, which is what the code below uses.
   
4. **Configure Auto Scaling**: This allows your EC2 instances to scale based on demand, and you can define the minimum, maximum, and desired capacity.

5. **Create an Auto Scaling Group**: This uses the Launch Template to create instances and is where scaling policies are attached.

Now, let's write the Pulumi code that accomplishes this. The following program will create an Auto Scaling Group with GPU instances suitable for deep learning:

```python
import pulumi
import pulumi_aws as aws

# Select a GPU instance type
gpu_instance_type = "p3.2xlarge"  # One NVIDIA V100 GPU; a common starting point for training.

# Choose an AMI that comes pre-installed with deep learning frameworks.
# The value below is a placeholder, not a real ID; AMI IDs also differ per region.
machine_learning_ami_id = "ami-1234567890abcdefg"

# Define a Security Group that allows SSH access
secgroup = aws.ec2.SecurityGroup('secgroup',
    description='Allow SSH inbound',
    ingress=[
        {
            'protocol': 'tcp',
            'from_port': 22,
            'to_port': 22,
            'cidr_blocks': ['0.0.0.0/0'],  # Open to any address; restrict to your own IP range in practice
        }
    ],
    egress=[
        {'protocol': '-1', 'from_port': 0, 'to_port': 0, 'cidr_blocks': ['0.0.0.0/0']},
    ]
)

# Create a Launch Template
launch_template = aws.ec2.LaunchTemplate('launch-template',
    image_id=machine_learning_ami_id,
    instance_type=gpu_instance_type,
    key_name='my-key-pair',  # Make sure you've created a key pair
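    vpc_security_group_ids=[secgroup.id],  # LaunchTemplate expects vpc_security_group_ids
    tag_specifications=[{
        'resource_type': 'instance',  # snake_case key, matching the other dict inputs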
        'tags': {
            'Name': 'DeepLearningGPU',
        },
    }],
)

# Configure Auto Scaling Group using the launch template
auto_scaling_group = aws.autoscaling.Group('auto-scaling-group',
    launch_template={
        'id': launch_template.id,
        'version': "$Latest",
    },
    vpc_zone_identifiers=['subnet-12345', 'subnet-67890'],  # Replace with your subnet IDs
    desired_capacity=2,
    min_size=1,
    max_size=10,
)

pulumi.export('asg_name', auto_scaling_group.name)
```

In this program, we're doing the following:

- Defining a GPU instance type (`p3.2xlarge`) which is suitable for deep learning tasks.
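- Providing an AMI ID for a deep learning AMI to be used with the instance. `'ami-1234567890abcdefg'` is a placeholder, not a real ID; replace it with a valid Deep Learning AMI ID for your region, since AMI IDs differ between regions.
- Creating a Security Group `secgroup` that allows SSH access, so you can connect to the instances for manual configuration or monitoring. The rule accepts SSH from any address (`0.0.0.0/0`); narrow it to your own IP range for anything beyond a quick test.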
- Creating a Launch Template `launch_template` that specifies the instance type, AMI, and Security Group. This includes a key pair name for SSH access, which you should have already set up in your AWS account.
- Using the `auto_scaling_group` resource to create an Auto Scaling Group which references the launch template. The group starts with a desired capacity of 2 instances and is bounded between 1 and 10 instances. Note that the program defines no scaling policy, so the instance count only changes when you adjust the desired capacity yourself or attach a policy to the group. Replace the `vpc_zone_identifiers` with the actual subnet IDs of your VPC where the instances should be launched.
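
The Auto Scaling Group sets capacity bounds, but a scaling policy is what actually moves the instance count in response to load. As a minimal sketch, the helper below builds the keyword arguments for an `aws.autoscaling.Policy` of type `TargetTrackingScaling`. The function name, the resource name in the usage comment, and the 60% default are illustrative choices that are not part of the original program, and the sketch assumes a `pulumi_aws` version that accepts plain dicts with snake_case keys, as the program above does. CPU utilization is used only because it is a predefined metric; GPU utilization is not published to CloudWatch by default, so a GPU-bound workload would likely need a custom metric instead.

```python
def target_tracking_policy_args(group_name, target_cpu_percent=60.0):
    """Build keyword arguments for aws.autoscaling.Policy (target tracking).

    group_name may be a plain string or a Pulumi Output such as
    auto_scaling_group.name; it is passed through unchanged.
    """
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target_cpu_percent must be in the range (0, 100]")
    return {
        "autoscaling_group_name": group_name,
        "policy_type": "TargetTrackingScaling",
        "target_tracking_configuration": {
            # Predefined metric: average CPU utilization across the group.
            "predefined_metric_specification": {
                "predefined_metric_type": "ASGAverageCPUUtilization",
            },
            "target_value": float(target_cpu_percent),
        },
    }


# Usage inside the Pulumi program (assumes `aws` and `auto_scaling_group`
# from the program above are in scope):
#
#   aws.autoscaling.Policy(
#       "gpu-asg-cpu-target",
#       **target_tracking_policy_args(auto_scaling_group.name),
#   )
```

With a policy like this attached, the group adds or removes instances between `min_size` and `max_size` to keep the metric near the target.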

This program exports the name of the Auto Scaling Group, so you can identify it in the AWS console or when using the AWS CLI.
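
The security group in the program accepts SSH from `0.0.0.0/0`, which is convenient for a demo but exposes port 22 to the whole internet. One way to tighten this, sketched below under the assumption that you know the IPv4 range you connect from, is to build the ingress rule from an explicit CIDR and refuse the open range. The helper name `ssh_ingress_rule` is hypothetical, and `203.0.113.7/32` is a documentation address standing in for your own IP.

```python
import ipaddress


def ssh_ingress_rule(cidr: str) -> dict:
    """Return an SSH ingress rule dict restricted to one IPv4 CIDR block."""
    # IPv4Network raises ValueError for malformed input, IPv6 input,
    # or a CIDR with host bits set (e.g. "10.0.0.1/24").
    network = ipaddress.IPv4Network(cidr)
    if network.prefixlen == 0:
        raise ValueError("refusing to open SSH to 0.0.0.0/0")
    return {
        'protocol': 'tcp',
        'from_port': 22,
        'to_port': 22,
        'cidr_blocks': [str(network)],
    }


# The result has the same shape as the ingress entry in the program, e.g.:
#   ingress=[ssh_ingress_rule("203.0.113.7/32")]
print(ssh_ingress_rule("203.0.113.7/32"))
```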

Please keep in mind that this program assumes you've already configured your AWS credentials for Pulumi. Run `pulumi up` to create the resources and `pulumi destroy` to clean them up. Before deploying, review and update parameters such as `key_name`, `machine_learning_ami_id`, and `vpc_zone_identifiers` to match your actual AWS environment.
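
Rather than editing those values in the source, you could read them from stack configuration and look up the AMI at deploy time. The fragment below is a sketch, not part of the original program: the config keys `keyName`, `subnetIds`, and `amiNamePattern` are hypothetical names chosen for illustration, and the name pattern itself is left for you to supply because Deep Learning AMI names vary by framework, OS, and release. It assumes the AMI you want is published by Amazon.

```python
import pulumi
import pulumi_aws as aws

config = pulumi.Config()

# Hypothetical config keys; set them per stack with `pulumi config set`.
key_name = config.require("keyName")
subnet_ids = config.require_object("subnetIds")      # expects a list of subnet IDs
ami_name_pattern = config.require("amiNamePattern")  # wildcard pattern for the AMI name

# Find the newest Amazon-owned AMI whose name matches the pattern,
# in the region the AWS provider is configured for.
dl_ami = aws.ec2.get_ami(
    most_recent=True,
    owners=["amazon"],
    filters=[aws.ec2.GetAmiFilterArgs(name="name", values=[ami_name_pattern])],
)

# Export the resolved ID so you can confirm which image was selected.
pulumi.export("resolved_ami_id", dl_ami.id)
```

In the main program, `dl_ami.id` would then replace `machine_learning_ami_id`, `key_name` would feed the launch template, and `subnet_ids` would be passed as `vpc_zone_identifiers`. List values can be set with, for example, `pulumi config set --path 'subnetIds[0]' subnet-12345`.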