Auto-Scaling GPU Resources for Deep Learning on EKS

Pulumi · Accepted Answer

To create an auto-scaling group of GPU nodes for deep learning on Amazon EKS (Elastic Kubernetes Service), you define the EKS cluster, configure the GPU nodes, and set the auto-scaling parameters. The steps are outlined below, followed by the Pulumi program in Python:

1. **Define the EKS Cluster**: The cluster provides the Kubernetes control plane that schedules and manages your workloads. We will use the `eks.Cluster` resource because it provides a high-level abstraction over the details of managing an EKS cluster, including setting up the necessary roles and policies.

2. **Configure GPU Nodes**: For GPU support, we must configure the EKS cluster with worker nodes that have GPUs attached. We use the `NodeGroup` or `ManagedNodeGroup` class from the Pulumi EKS package to do this, depending on the level of customization and management you want over the nodes. AWS provides instances like the `p2`, `p3`, `g4dn`, and `p4` families that are equipped with GPUs and are suitable for deep learning tasks. We will select an appropriate instance type and ensure that the AMI (Amazon Machine Image) used supports GPUs, typically the Amazon EKS-optimized accelerated AMI.

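3. **Enable Auto-Scaling**: To scale the GPU resources with the load, we will configure the auto-scaling parameters within the `NodeGroup` or `ManagedNodeGroup`. This includes setting the desired, minimum, and maximum number of nodes. These values only set the bounds of the underlying Auto Scaling group. To change the node count in response to actual resource demand, the cluster also needs a component such as the Kubernetes Cluster Autoscaler, which adds nodes when pods cannot be scheduled and removes nodes that are underused.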

4. **Install NVIDIA Kubernetes Device Plugin**: After the cluster and nodes are set up, we need to install the NVIDIA Kubernetes Device Plugin as a DaemonSet so that Kubernetes can use the GPUs. The plugin advertises each node's GPUs as the `nvidia.com/gpu` resource, which lets the scheduler place pods that request GPUs on nodes that have them.
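To make the sizing in step 3 concrete, here is a minimal sketch of how a node count could be derived from GPU demand and clamped to the group's bounds. It is a simplification for illustration, not the Cluster Autoscaler's actual algorithm. The function `nodes_needed` and its parameters are hypothetical names, and the default of one GPU per node assumes an instance type such as `g4dn.xlarge`.

```python
import math


def nodes_needed(requested_gpus: int, gpus_per_node: int = 1,
                 min_size: int = 1, max_size: int = 4) -> int:
    """Return a node count covering the GPU demand, clamped to [min_size, max_size]."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    # Round up: a partially used node still has to exist.
    wanted = math.ceil(max(requested_gpus, 0) / gpus_per_node)
    # Never go below the minimum or above the maximum group size.
    return max(min_size, min(wanted, max_size))


if __name__ == "__main__":
    for demand in (0, 3, 10):
        print(demand, "GPUs requested ->", nodes_needed(demand), "nodes")
```

With the bounds used in this answer (1 to 4 nodes), demand beyond four GPUs stays capped at four nodes, so pods requesting more would remain pending until `max_size` is raised.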

Here's the Pulumi program in Python that would create such an environment:

```python
import pulumi
import pulumi_eks as eks

# Create an EKS cluster whose default node group uses GPU instances.
# All node group settings live in node_group_options: pulumi-eks treats
# node_group_options and the top-level node group arguments (instance_type,
# min_size, max_size, desired_capacity, ...) as mutually exclusive.
cluster = eks.Cluster('gpu-enabled-eks-cluster',
    node_group_options=eks.ClusterNodeGroupOptionsArgs(
        desired_capacity=2,  # Desired number of nodes in the node group.
        min_size=1,  # Min number of nodes in the auto-scaling group.
        max_size=4,  # Max number of nodes in the auto-scaling group.
        instance_type='g4dn.xlarge',  # GPU-backed instance for deep learning tasks.
        gpu=True,  # Enable GPU support - this selects the GPU-optimized AMI.
        labels={"hardware": "gpu"},  # Lets workloads target these nodes.
        taints={"nvidia.com/gpu": {"value": "true", "effect": "NoSchedule"}},
    )
)

# Export the cluster's kubeconfig.
pulumi.export('kubeconfig', cluster.kubeconfig)

# Note: You will install the NVIDIA Kubernetes Device Plugin manually after cluster creation,
# or automate it with a script or additional Pulumi code, typically as a DaemonSet.
```

In the above program:

- We create an `eks.Cluster` named `gpu-enabled-eks-cluster`.  
- We enable GPU support by specifying `gpu=True` in `node_group_options`. This selects the GPU-optimized EKS AMI; the instance type itself is set separately with `instance_type`.
- The node group can range from 1 to 4 instances and starts with 2 (`desired_capacity`).
- Each `g4dn.xlarge` instance provides one NVIDIA T4 GPU, 4 vCPUs, and 16 GiB of memory, a reasonable starting point for smaller deep learning workloads.
- We've added a label to the GPU nodes to enable us to target them with GPU-specific workloads.
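- We've applied a `NoSchedule` taint with the key `nvidia.com/gpu` to the GPU nodes. Only pods with a matching toleration can be scheduled on these nodes, which keeps non-GPU workloads off them. Depending on the cluster's admission configuration, GPU pods may need to declare that toleration explicitly in their spec.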
- The kubeconfig required to interact with the cluster is exported at the end of the script.
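To show how a workload would line up with the label and taint above, here is a minimal sketch that builds a Pod manifest as a plain Python dictionary. It assumes the `hardware: gpu` label and the `nvidia.com/gpu=true:NoSchedule` taint from the program; the function name `gpu_pod_manifest`, the pod name, and the image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that can land on the labeled, tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Target the nodes labeled hardware=gpu by the node group.
            "nodeSelector": {"hardware": "gpu"},
            # Explicitly tolerate the nvidia.com/gpu=true:NoSchedule taint.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested through resource limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


if __name__ == "__main__":
    # The image below is a placeholder, not a real registry path.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-test:latest")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -` once the device plugin is running.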

After creating your cluster with this Pulumi program, you'll need to install the [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin). You can do this manually by applying the plugin's YAML manifest to the cluster with `kubectl create -f`, or by extending the Pulumi program to include it.
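Once the plugin is running, each GPU node should report an allocatable `nvidia.com/gpu` resource. As a rough way to check this, the sketch below reads a parsed NodeList (the structure returned by `kubectl get nodes -o json`) and reports the GPU count per node. The function name `gpu_counts` and the sample node names are made up for illustration.

```python
def gpu_counts(node_list: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count from a NodeList object."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes where the device plugin is not running do not report this resource.
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


if __name__ == "__main__":
    # Hand-written sample in the shape of `kubectl get nodes -o json` output.
    sample = {
        "items": [
            {"metadata": {"name": "gpu-node-a"},
             "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
            {"metadata": {"name": "cpu-node-b"},
             "status": {"allocatable": {"cpu": "2"}}},
        ]
    }
    for name, count in gpu_counts(sample).items():
        print(f"{name}: {count} GPU(s)")
```

To run it against a real cluster, replace the sample with `json.load(sys.stdin)` and pipe in the output of `kubectl get nodes -o json`. A GPU node that reports zero usually means the plugin's DaemonSet pod is not running on that node.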

For more details and configurations for your GPU-enabled EKS cluster:

- [EKS Cluster in Pulumi](https://www.pulumi.com/registry/packages/eks/api-docs/cluster/)
- [Managed Node Groups in Pulumi](https://www.pulumi.com/registry/packages/eks/api-docs/managednodegroup/)
- [EKS Node Group in Pulumi](https://www.pulumi.com/registry/packages/eks/api-docs/nodegroup/)
- [Amazon EKS-Optimized AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html)
- [NVIDIA Kubernetes Device Plugin Documentation](https://github.com/NVIDIA/k8s-device-plugin)