Kubernetes Clusters for Distributed Model Training on EKS

Question

Pulumi · Accepted Answer

Creating a Kubernetes cluster on AWS Elastic Kubernetes Service (EKS) specialized for distributed model training involves a few key components:

1. **EKS Cluster**: The managed Kubernetes control plane that your workloads are deployed to. It schedules your containers onto worker nodes and manages their lifecycle.

2. **Node Groups**: These are collections of Amazon EC2 instances which serve as the workers in the Kubernetes cluster. They run the containers and provide the compute resources needed for your model training tasks.

3. **IAM Roles**: These define permissions that allow your EKS cluster and its nodes to interact with other AWS services securely.

4. **Amazon VPC**: A Virtual Private Cloud (VPC) gives the cluster an isolated network, with subnets and routing that you control, for its nodes and containers.

With these components, you can build a Pulumi program that sets up an EKS cluster.

Below is a Pulumi program in Python that sets up an EKS cluster with the components needed to run distributed model-training workloads. The program creates the following resources:

- An EKS Cluster with a specified Kubernetes version.
- A node group whose instances carry a label and a taint that reserve them for model-training workloads.
- The necessary IAM roles for the cluster and node group.
- A VPC with public and private subnets for the EKS cluster to use.

Here's how to create the EKS cluster for distributed model training:

```python
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks

# Create an IAM role for EKS cluster
eks_cluster_role = aws.iam.Role("eksClusterRole",
    assume_role_policy=aws.iam.get_policy_document(statements=[{
        "actions": ["sts:AssumeRole"],
        "effect": "Allow",
        "principals": [{
            "identifiers": ["eks.amazonaws.com"],
            "type": "Service",
        }]
    }]).json
)

# Attach the Amazon EKS Cluster Policy to the role
eks_cluster_policy_attachment = aws.iam.RolePolicyAttachment("eksClusterPolicyAttachment",
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy",
    role=eks_cluster_role.name
)

# Attach the Amazon EKS VPC Resource Controller Policy to the role
eks_vpc_resource_controller_policy_attachment = aws.iam.RolePolicyAttachment("eksVpcResourceControllerPolicyAttachment",
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSVPCResourceController",
    role=eks_cluster_role.name
)

# Create an IAM role for EKS node group
eks_node_group_role = aws.iam.Role("eksNodeGroupRole",
    assume_role_policy=aws.iam.get_policy_document(statements=[{
        "actions": ["sts:AssumeRole"],
        "effect": "Allow",
        "principals": [{
            "identifiers": ["ec2.amazonaws.com"],
            "type": "Service",
        }]
    }]).json
)

# Attach necessary policies for the worker nodes
node_group_policy_attachments = [
    aws.iam.RolePolicyAttachment("amazonEKSWorkerNodePolicy",
        policy_arn="arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
        role=eks_node_group_role.name,
    ),
    aws.iam.RolePolicyAttachment("amazonEKS_CNI_Policy",
        policy_arn="arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
        role=eks_node_group_role.name,
    ),
    aws.iam.RolePolicyAttachment("amazonEC2ContainerRegistryReadOnly",
        policy_arn="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
        role=eks_node_group_role.name,
    ),
]

# Create a VPC with private and public subnets for the EKS cluster
vpc = aws.ec2.Vpc("eksVpc",
    cidr_block="10.100.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={"Name": "pulumi-eks-vpc"}
)

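# Create one public and one private /24 subnet per availability zone.
# The zone names assume the AWS provider is configured for us-west-2.
public_subnets = []
private_subnets = []
zone_suffixes = ["a", "b", "c"]
for index, az in enumerate(zone_suffixes):
    public_subnets.append(aws.ec2.Subnet(f"publicSubnet{az.upper()}",
        assign_ipv6_address_on_creation=False,
        vpc_id=vpc.id,
        map_public_ip_on_launch=True,
        cidr_block=f"10.100.{index}.0/24",  # 10.100.0.0/24 to 10.100.2.0/24
        availability_zone=f"us-west-2{az}",
        tags={"Name": f"pulumi-vpc-public-{az}"}
    ))
    private_subnets.append(aws.ec2.Subnet(f"privateSubnet{az.upper()}",
        assign_ipv6_address_on_creation=False,
        vpc_id=vpc.id,
        map_public_ip_on_launch=False,
        cidr_block=f"10.100.{3 + index}.0/24",  # 10.100.3.0/24 to 10.100.5.0/24
        availability_zone=f"us-west-2{az}",
        tags={"Name": f"pulumi-vpc-private-{az}"}
    ))

# Route the public subnets to the internet. Without this, worker nodes launched
# in them cannot reach the cluster's public API endpoint or pull images.
# The private subnets get no NAT gateway here, so they have no outbound route.
internet_gateway = aws.ec2.InternetGateway("eksInternetGateway", vpc_id=vpc.id)
public_route_table = aws.ec2.RouteTable("publicRouteTable",
    vpc_id=vpc.id,
    routes=[aws.ec2.RouteTableRouteArgs(
        cidr_block="0.0.0.0/0",
        gateway_id=internet_gateway.id,
    )]
)
for az, subnet in zip(zone_suffixes, public_subnets):
    aws.ec2.RouteTableAssociation(f"publicRouteTableAssociation{az.upper()}",
        subnet_id=subnet.id,
        route_table_id=public_route_table.id
    )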

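# Create an EKS cluster (control plane only; worker nodes are added below)
eks_cluster = eks.Cluster("eksCluster",
    service_role=eks_cluster_role,  # pulumi_eks takes the Role resource, not an ARN
    instance_roles=[eks_node_group_role],  # lets nodes that use this role join the cluster
    skip_default_node_group=True,  # the node group is defined explicitly below
    version="1.21",  # 1.21 is past end of support on EKS; use a currently supported version
    vpc_id=vpc.id,
    public_subnet_ids=[subnet.id for subnet in public_subnets],
    private_subnet_ids=[subnet.id for subnet in private_subnets]
)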

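# Create a managed node group for the EKS cluster.
# aws.eks.NodeGroup is the resource that accepts these managed node group
# arguments; eks.NodeGroup in pulumi_eks is a different, self-managed component.
node_group = aws.eks.NodeGroup("eksNodeGroup",
    cluster_name=eks_cluster.eks_cluster.name,
    node_group_name="pulumi-eks-nodegroup",
    node_role_arn=eks_node_group_role.arn,
    subnet_ids=[subnet.id for subnet in public_subnets],
    scaling_config=aws.eks.NodeGroupScalingConfigArgs(
        desired_size=2,
        max_size=3,
        min_size=1,
    ),
    labels={"workload-type": "model-training"},
    # The EKS API spells taint effects in upper snake case; on the nodes this
    # appears as the Kubernetes effect NoSchedule.
    taints=[aws.eks.NodeGroupTaintArgs(
        key="workload-type",
        value="model-training",
        effect="NO_SCHEDULE",
    )],
    disk_size=20,
    # Small general-purpose instance; pick a larger or GPU type for real training.
    instance_types=["t2.medium"],
    # Wait for the worker policies to be attached before launching nodes.
    opts=pulumi.ResourceOptions(depends_on=node_group_policy_attachments)
)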

# Export the kubeconfig to use outside of Pulumi
pulumi.export("kubeconfig", eks_cluster.kubeconfig)
```
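
As a quick check on the address layout, the following standalone sketch uses only Python's standard `ipaddress` module to carve the VPC range into one public and one private /24 per zone. The helper name `plan_subnets` is hypothetical and not part of Pulumi; it only illustrates the numbering scheme the program is aiming for (public blocks first, private blocks after them).

```python
import ipaddress


def plan_subnets(vpc_cidr: str, zone_suffixes: list[str]) -> dict[str, dict[str, str]]:
    """Assign one public and one private /24 per zone from the VPC range."""
    # Enumerate every /24 inside the VPC range, in address order.
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=24))
    zone_count = len(zone_suffixes)
    plan = {}
    for index, suffix in enumerate(zone_suffixes):
        plan[suffix] = {
            # Public subnets take the first blocks: .0, .1, .2 for three zones.
            "public": str(blocks[index]),
            # Private subnets take the next blocks: .3, .4, .5 for three zones.
            "private": str(blocks[zone_count + index]),
        }
    return plan


plan = plan_subnets("10.100.0.0/16", ["a", "b", "c"])
for suffix, cidrs in plan.items():
    print(f"us-west-2{suffix}: public={cidrs['public']} private={cidrs['private']}")
```

Deriving the blocks from the VPC range, instead of formatting the third octet by hand, makes it harder to produce an invalid or overlapping CIDR when the zone list changes.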

### What is happening here?

- **IAM Role Creation**: We create IAM roles with the necessary trust relationships to allow the EKS service and EC2 instances to assume them.

- **Attaching Policies**: We then attach AWS managed policies that grant the required permissions for the EKS service and the Node Group (the EC2 workers).

- **VPC Creation**: We allocate a new VPC for our cluster with one public and one private subnet in each of three availability zones, so the cluster's resources are spread across zones for high availability.

- **EKS Cluster Creation**: The `eks.Cluster` component creates an EKS cluster with the specified Kubernetes version inside the VPC and subnets defined above.

- **Node Group Creation**: We add a Node Group to the cluster with its own IAM role. The Node Group carries a label that workloads can select on and a taint that keeps any pod without a matching toleration off these nodes (useful if we want dedicated instances for model training jobs).

- **Exporting kubeconfig**: Finally, we export the `kubeconfig`, which provides the connection details for your local `kubectl` tool to interact with the new Kubernetes cluster.
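
To make the label and taint concrete, here is a minimal sketch of a workload that targets the training nodes. It builds a Kubernetes `Job` manifest as a plain Python dictionary, with a `nodeSelector` for the `workload-type` label and a toleration for the matching taint. The function name `training_job_manifest`, the image reference, and the choice of an Indexed Job for running parallel workers are assumptions for illustration, not something the program above creates.

```python
import json


def training_job_manifest(name: str, image: str, workers: int) -> dict:
    """Build a Job that runs `workers` pods on the tainted training nodes."""
    # Tolerate the taint placed on the node group; without this the
    # scheduler refuses to place the pods on those nodes.
    toleration = {
        "key": "workload-type",
        "operator": "Equal",
        "value": "model-training",
        "effect": "NoSchedule",
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run all workers at once; Indexed mode gives each pod a
            # distinct JOB_COMPLETION_INDEX it can use as its rank.
            "completions": workers,
            "parallelism": workers,
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Select only nodes labelled for model training.
                    "nodeSelector": {"workload-type": "model-training"},
                    "tolerations": [toleration],
                    "containers": [{"name": "trainer", "image": image}],
                }
            },
        },
    }


# The image below is a placeholder, not a real training image.
manifest = training_job_manifest("train-demo", "example.com/trainer:latest", 2)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file lets you submit it with `kubectl apply -f`. The toleration only permits scheduling on the tainted nodes; it is the `nodeSelector` that actually steers the pods there.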

Once `pulumi up` has finished applying this program, you'll have an EKS cluster with a node group reserved for model training. The cluster's kubeconfig is available as a stack output, which allows you to use `kubectl` to deploy your machine learning workloads.
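
One way to hand the exported kubeconfig to `kubectl` is sketched below. It assumes you have already captured the stack output as a JSON string, for example with `pulumi stack output kubeconfig --show-secrets`. The helper `write_kubeconfig` is hypothetical, and the sample string stands in for real output.

```python
import json
import os
import tempfile


def write_kubeconfig(raw_output: str, path: str) -> str:
    """Validate kubeconfig JSON and write it with owner-only permissions."""
    config = json.loads(raw_output)
    if "clusters" not in config:
        raise ValueError("output does not look like a kubeconfig")
    # The file holds cluster credentials, so request mode 0600 on creation.
    descriptor = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(descriptor, "w") as handle:
        json.dump(config, handle, indent=2)
    return path


# Placeholder content; in practice this is the captured stack output.
sample_output = json.dumps(
    {"apiVersion": "v1", "kind": "Config", "clusters": [], "contexts": [], "users": []}
)
target = os.path.join(tempfile.mkdtemp(), "kubeconfig.json")
write_kubeconfig(sample_output, target)
print(f"export KUBECONFIG={target}")
```

With `KUBECONFIG` pointing at the written file, a command such as `kubectl get nodes` should list the nodes in the group once they have joined the cluster.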