Managing GPU-enabled Pods for AI on AWS EKS NodeGroups

### Managing GPU-enabled Pods on AWS EKS with Pulumi

Running GPU-enabled pods for AI on AWS EKS involves two parts: the EKS cluster setup and the Kubernetes pod specification. The EKS NodeGroup must use instance types with NVIDIA GPUs (such as the `p3` or `g4dn` instance families), and the AMI used by the NodeGroup must be GPU-compatible, meaning it ships with the NVIDIA drivers.

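For the Kubernetes Pods running on EKS, you must declare the GPU requirement in the container's resource specification. Kubernetes does not manage GPUs natively; NVIDIA's device plugin advertises them to the scheduler as the extended resource `nvidia.com/gpu`. The plugin typically has to be running on the GPU nodes (usually as a DaemonSet) before any node reports `nvidia.com/gpu` capacity, and a Pod that requests a GPU stays `Pending` until one does. The program below does not install the plugin itself.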
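As a rough illustration of the resource specification, the sketch below builds the `resources` mapping for a container. The helper name `gpu_resources` is hypothetical and not part of Pulumi or Kubernetes. It relies on two Kubernetes rules for extended resources: GPUs can only be requested in whole units, and setting just the limit is sufficient because the request then defaults to the same value.

```python
def gpu_resources(gpu_count: int, resource_name: str = "nvidia.com/gpu") -> dict:
    """Build a container `resources` mapping that asks for whole GPUs.

    `gpu_resources` is a hypothetical helper used for illustration only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(gpu_count, bool) or not isinstance(gpu_count, int):
        raise TypeError("gpu_count must be an integer; GPUs cannot be requested fractionally")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # Quantities are passed as strings. Only the limit is set here: for extended
    # resources Kubernetes uses the limit as the request when none is given.
    return {"limits": {resource_name: str(gpu_count)}}


if __name__ == "__main__":
    print(gpu_resources(1))
```

The returned mapping has the same shape as the `limits` argument passed to `ResourceRequirementsArgs` in the program that follows.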

This program uses `pulumi_eks` to create an EKS cluster and a NodeGroup with a GPU instance type, then uses `pulumi_kubernetes` to deploy a Pod that requests one GPU.

Below is the Pulumi program in Python that accomplishes these tasks:

```python
import pulumi
import pulumi_eks as eks
import pulumi_kubernetes as k8s

# Step 1: Create an EKS cluster
# With default arguments, the eks.Cluster component creates the control plane and the IAM roles it needs,
# and uses the account's default VPC unless another one is specified.
# Reference: https://www.pulumi.com/registry/packages/eks/api-docs/cluster/
cluster = eks.Cluster("gpu-enabled-cluster")

# Step 2: Create an EKS NodeGroup with GPU instance types
# We specify GPU instance types, such as 'p3.2xlarge', which are suitable for GPU-based workloads.
# We also set the AMI type to 'AL2_x86_64_GPU' to use Amazon Linux 2 images that are optimized for GPU workloads.
# Reference: https://www.pulumi.com/registry/packages/eks/api-docs/nodegroup/
node_group = eks.NodeGroup(
    "gpu-nodegroup",
    cluster=cluster.core,
    instance_type="p3.2xlarge",
    desired_capacity=1,
    min_size=1,
    max_size=2,
    ami_type="AL2_x86_64_GPU",
)

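# Step 3: Deploy a GPU-enabled Kubernetes Pod
# The Pod requests one GPU. Note that the device plugin exposes the GPU under the 'nvidia.com/gpu' resource.
# An explicit provider built from the new cluster's kubeconfig makes the Pod target that cluster;
# without it, the Pod would be created in whatever cluster the ambient kubeconfig points to.
# `kubeconfig_json` is assumed to be available (recent pulumi_eks releases); on older releases,
# serialize `cluster.kubeconfig` with json.dumps instead.
# Reference: https://www.pulumi.com/registry/packages/kubernetes/api-docs/core/v1/pod/
k8s_provider = k8s.Provider("gpu-cluster-provider", kubeconfig=cluster.kubeconfig_json)

pod = k8s.core.v1.Pod(
    "gpu-pod",
    metadata={"namespace": "default"},
    spec=k8s.core.v1.PodSpecArgs(
        containers=[
            k8s.core.v1.ContainerArgs(
                name="gpu-container",
                image="tensorflow/tensorflow:latest-gpu",  # Example image with GPU support
                # Keeps the container running for inspection; replace with your workload's command.
                command=["sleep", "infinity"],
                resources=k8s.core.v1.ResourceRequirementsArgs(
                    # For extended resources such as GPUs, requests must equal limits.
                    limits={"nvidia.com/gpu": "1"},
                    requests={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider, depends_on=[node_group]),
)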

# Export the cluster name and kubeconfig
pulumi.export('cluster_name', cluster.eks_cluster.name)
pulumi.export('kubeconfig', cluster.kubeconfig)
```

#### How to run the program:

Before running the program, make sure the Pulumi CLI is installed and configured with your AWS credentials, and that Python and the Pulumi packages imported above (`pulumi`, `pulumi_eks`, `pulumi_kubernetes`) are installed.

To deploy, navigate to the directory containing this script and run `pulumi up`. Pulumi will then:

- Set up the necessary AWS resources for the EKS cluster.
- Configure the EKS NodeGroup with GPU instances.
- Deploy a Kubernetes Pod that requests GPU resources.

GPU instance types and AMI types change over time and are not offered in every region, so choose ones that suit your workload and the region you are working in. Treat the values in the program above (`p3.2xlarge` and `AL2_x86_64_GPU`) as a starting point and adjust them to match what AWS currently offers.
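One way to make those values easy to change is to collect them in a small settings object instead of hard-coding them. The following is a minimal sketch using only the standard library; the class `GpuNodeGroupSettings` and its `from_mapping` method are hypothetical names, and the defaults simply mirror the program above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class GpuNodeGroupSettings:
    """Hypothetical container for the NodeGroup values used in the program above."""

    instance_type: str = "p3.2xlarge"
    ami_type: str = "AL2_x86_64_GPU"
    min_size: int = 1
    desired_capacity: int = 1
    max_size: int = 2

    def __post_init__(self) -> None:
        # Catch inconsistent sizes before any cloud resources are created.
        if not (0 <= self.min_size <= self.desired_capacity <= self.max_size):
            raise ValueError("expected 0 <= min_size <= desired_capacity <= max_size")

    @classmethod
    def from_mapping(cls, overrides: Mapping[str, Any]) -> "GpuNodeGroupSettings":
        # Reject misspelled keys instead of silently ignoring them.
        unknown = set(overrides) - set(cls.__dataclass_fields__)
        if unknown:
            raise KeyError(f"unknown node group settings: {sorted(unknown)}")
        return cls(**overrides)


if __name__ == "__main__":
    settings = GpuNodeGroupSettings.from_mapping({"instance_type": "g4dn.xlarge"})
    print(asdict(settings))
```

Because the field names match the keyword arguments used in Step 2, the result could be passed along as `eks.NodeGroup("gpu-nodegroup", cluster=cluster.core, **asdict(settings))`. The overrides mapping could come from stack configuration, for example via `pulumi.Config().get_object(...)` with a key of your choosing.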

When the deployment finishes, Pulumi prints the `cluster_name` and `kubeconfig` stack outputs, which you can use to connect to the EKS cluster and manage your Kubernetes workloads.
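A useful first check with that kubeconfig is whether the GPU nodes actually advertise `nvidia.com/gpu`. The sketch below assumes you have saved the output of `kubectl get nodes -o json` and reads each node's `status.allocatable` field; the function name `allocatable_gpus` and the node names in the sample are made up for illustration.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list: dict) -> dict:
    """Map node name to allocatable GPU count, given a parsed `kubectl get nodes -o json` document."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without a running device plugin do not list the resource at all.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


if __name__ == "__main__":
    # Trimmed, hypothetical sample of the JSON that kubectl returns.
    sample = json.loads("""
    {
      "items": [
        {"metadata": {"name": "gpu-node-example"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-example"},
         "status": {"allocatable": {"cpu": "2"}}}
      ]
    }
    """)
    for name, count in allocatable_gpus(sample).items():
        print(f"{name}: {count} GPU(s)")
```

If every node reports zero, the device plugin is probably not running yet, and GPU Pods will remain `Pending`. To produce the input, you can save the kubeconfig with `pulumi stack output kubeconfig > kubeconfig.json` and point `kubectl` at it through the `KUBECONFIG` environment variable.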