1. Scaling Deep Learning Workloads with AWS EKS NodeGroups


    Scaling deep learning workloads efficiently often requires a flexible and powerful computing environment. Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that runs the Kubernetes control plane across multiple Availability Zones (AZs) for you, ensuring high availability. EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.

    To scale deep learning workloads with AWS EKS, we typically use NodeGroups, which are groups of EC2 instances that serve as workers to run your Kubernetes pods. Depending on your workload, you can choose instances with GPUs or other specific hardware that best suits your deep learning tasks.
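
    Because the right instance type varies by workload, it can help to make it a stack configuration value rather than hardcoding it. A small sketch, where the nodeInstanceType config key is hypothetical:

    ```python
    import pulumi

    config = pulumi.Config()
    # "nodeInstanceType" is a hypothetical config key; set it per stack
    # with, for example: pulumi config set nodeInstanceType g5.xlarge
    instance_type = config.get("nodeInstanceType") or "p3.2xlarge"
    ```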

    Below I will guide you through creating an EKS cluster and configuring a NodeGroup optimized for deep learning workloads. We will leverage the eks package to create and manage these resources more succinctly. This high-level Pulumi package abstracts away some of the complexity and provides more straightforward options for configuration compared to the lower-level aws or aws-native packages.
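
    To get a feel for that abstraction, a minimal cluster is a single resource. This sketch (the resource name is illustrative) stands in for the many lower-level VPC, IAM, and EKS resources you would otherwise wire together yourself:

    ```python
    import pulumi
    import pulumi_eks as eks

    # One high-level resource provisions the control plane, a default node
    # group, and networking, and assembles a kubeconfig.
    cluster = eks.Cluster("demo-cluster")

    pulumi.export("kubeconfig", cluster.kubeconfig)
    ```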

    Here's a program that sets up an EKS cluster and a managed NodeGroup with autoscaling enabled:

    ```python
    import pulumi
    import pulumi_aws as aws
    import pulumi_eks as eks

    # IAM role assumed by the NodeGroup's EC2 instances.
    nodegroup_role = aws.iam.Role(
        "nodegroup-role",
        assume_role_policy=aws.iam.get_policy_document(statements=[
            aws.iam.GetPolicyDocumentStatementArgs(
                effect="Allow",
                principals=[aws.iam.GetPolicyDocumentStatementPrincipalArgs(
                    type="Service",
                    identifiers=["ec2.amazonaws.com"],
                )],
                actions=["sts:AssumeRole"],
            ),
        ]).json,
    )

    # Attach the managed policies worker nodes need: joining the cluster,
    # managing pod networking (VPC CNI), and pulling images from ECR.
    policy_attachments = [
        aws.iam.RolePolicyAttachment(
            f"nodegroup-attachment-{i}",
            policy_arn=arn,
            role=nodegroup_role.name,
        )
        for i, arn in enumerate([
            "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
            "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
            "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
        ])
    ]

    # Create an EKS cluster, registering the node role so instances that
    # assume it are allowed to join the cluster.
    cluster = eks.Cluster(
        "my-cluster",
        skip_default_node_group=True,
        instance_roles=[nodegroup_role],
    )

    # Security group for the NodeGroup within the cluster's VPC. Managed
    # node groups use the cluster security group by default; a custom group
    # like this one is attached through a launch template.
    nodegroup_sg = aws.ec2.SecurityGroup(
        "nodegroup-sg",
        vpc_id=cluster.core.vpc_id,
        description="Node group security group",
    )

    # Managed NodeGroup with a GPU instance type and the GPU variant of the
    # EKS-optimized AMI for deep learning workloads, plus scaling bounds.
    nodegroup = eks.ManagedNodeGroup(
        "nodegroup",
        cluster=cluster,
        node_role_arn=nodegroup_role.arn,
        subnet_ids=cluster.core.subnet_ids,
        instance_types=["p3.2xlarge"],  # Example GPU-optimized instance.
        ami_type="AL2_x86_64_GPU",
        scaling_config=aws.eks.NodeGroupScalingConfigArgs(
            min_size=1,      # Minimum size of the NodeGroup
            desired_size=2,  # Desired initial number of instances
            max_size=5,      # Maximum size for autoscaling
        ),
        tags={"ManagedBy": "Pulumi"},
        labels={"workload-type": "deep-learning"},
    )

    # Export the cluster's kubeconfig and the NodeGroup's URN.
    pulumi.export("kubeconfig", cluster.kubeconfig)
    pulumi.export("nodegroup", nodegroup.urn)
    ```

    In the code above, we do the following:

    1. Create IAM Role: We start with an IAM role for the NodeGroup's EC2 instances, whose trust policy lets the EC2 service assume the role.

    2. Policy Attachments: We attach the AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly managed policies, which grant the permissions worker nodes need to join the cluster, manage pod networking, and pull container images.

    3. Create EKS Cluster: We create the cluster using Pulumi's higher-level EKS package, which simplifies cluster management by abstracting much of the underlying complexity. Registering the node role via instance_roles lets instances that assume it join the cluster, and we skip the default node group since we define our own.

    4. Security Group: We create a security group within the VPC of our EKS cluster to control the traffic allowed to and from the NodeGroup instances (a sketch of adding rules follows this list).

    5. Define NodeGroup: We define a managed NodeGroup with an instance type suited to deep learning workloads (here a p3.2xlarge with GPU support) and the GPU variant of the EKS-optimized AMI. The scaling_config sets the minimum, desired, and maximum sizes; these are bounds, and a component such as the Cluster Autoscaler (sketched at the end) performs the actual resizing based on load.

    6. Labels and Tags: We add custom labels and tags for organizational purposes; the workload-type label also lets pods target these nodes.
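
    As referenced in step 4, here is a sketch of attaching rules to that security group. The rule names and ranges are illustrative, and aws and nodegroup_sg come from the program above; remember that a managed NodeGroup only picks up a custom security group via a launch template.

    ```python
    # Continues the program above (aws and nodegroup_sg are defined there).
    # Illustrative rules: let nodes in this group talk to each other, and
    # allow all outbound traffic.
    node_internal = aws.ec2.SecurityGroupRule(
        "nodegroup-internal",
        type="ingress",
        from_port=0,
        to_port=0,
        protocol="-1",  # all protocols and ports
        self=True,      # the source is this same security group
        security_group_id=nodegroup_sg.id,
    )

    node_egress = aws.ec2.SecurityGroupRule(
        "nodegroup-egress",
        type="egress",
        from_port=0,
        to_port=0,
        protocol="-1",
        cidr_blocks=["0.0.0.0/0"],
        security_group_id=nodegroup_sg.id,
    )
    ```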

    Lastly, we export the kubeconfig needed to access the cluster and the NodeGroup's unique resource name (URN) for reference.
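
    The kubeconfig can also be consumed in the same program to schedule work onto the GPU nodes. The sketch below assumes the pulumi_kubernetes package and an NVIDIA device plugin DaemonSet already deployed to the cluster; the pod name and image are illustrative:

    ```python
    import pulumi
    import pulumi_kubernetes as k8s

    # Provider that talks to the new cluster using its kubeconfig.
    k8s_provider = k8s.Provider("eks-k8s", kubeconfig=cluster.kubeconfig_json)

    # Illustrative pod: the nodeSelector matches the NodeGroup's
    # "workload-type" label, and the limit requests one GPU (this relies on
    # the NVIDIA device plugin running on the nodes).
    training_pod = k8s.core.v1.Pod(
        "training-pod",
        spec=k8s.core.v1.PodSpecArgs(
            node_selector={"workload-type": "deep-learning"},
            restart_policy="Never",
            containers=[k8s.core.v1.ContainerArgs(
                name="trainer",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative
                command=["nvidia-smi"],
                resources=k8s.core.v1.ResourceRequirementsArgs(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )],
        ),
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )
    ```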

    With these components set up in AWS using Pulumi, you can dynamically scale your deep learning workloads on Kubernetes, making use of the powerful and flexible computing resources that AWS provides.
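
    One caveat: the scaling_config above only sets bounds, so something still has to resize the group when pods cannot be scheduled. A common choice is the Kubernetes Cluster Autoscaler. Below is a minimal sketch of installing its Helm chart with pulumi_kubernetes, reusing the k8s_provider from the previous sketch; the region value is illustrative, and the IAM permissions (typically granted via IRSA) and the Auto Scaling group discovery tags the autoscaler needs are omitted.

    ```python
    # Sketch only: IAM permissions (usually via IRSA) and the autoscaler's
    # ASG discovery tags are omitted here.
    autoscaler = k8s.helm.v3.Release(
        "cluster-autoscaler",
        chart="cluster-autoscaler",
        namespace="kube-system",
        repository_opts=k8s.helm.v3.RepositoryOptsArgs(
            repo="https://kubernetes.github.io/autoscaler",
        ),
        values={
            "autoDiscovery": {"clusterName": cluster.eks_cluster.name},
            "awsRegion": "us-west-2",  # illustrative; use your cluster's region
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )
    ```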