Auto-scaling GPU Workloads for Deep Learning on EKS

Pulumi · Accepted Answer

Auto-scaling GPU workloads for deep learning on Amazon EKS (Elastic Kubernetes Service) involves several steps: setting up an EKS cluster, configuring the Kubernetes environment for GPU workloads, and implementing scaling policies that account for the GPU resources required by your deep learning applications.

Below, I'll walk through creating an EKS cluster suitable for GPU workloads, installing the NVIDIA device plugin to enable GPU support in the cluster, and setting up an auto-scaling node group with GPU instances, using Pulumi in Python.

### Creating the EKS Cluster
First, we'll create an EKS cluster by defining an `eks.Cluster` resource. This high-level resource handles much of the complexity of setting up an EKS cluster for you.

### Configuring the Cluster for GPU Workloads
To use GPUs within our EKS cluster, we need nodes with NVIDIA GPU instance types (e.g., `p3` or `g4dn` instances) running an AMI that includes the NVIDIA drivers. We also need to install the NVIDIA device plugin on the cluster. It runs on each GPU node and advertises that node's GPUs to Kubernetes as the `nvidia.com/gpu` resource, which makes them available to our workloads.
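Once the device plugin is running, each GPU node should report `nvidia.com/gpu` under `status.allocatable`. Here is a minimal sketch for checking that, assuming you pass it the parsed output of `kubectl get nodes -o json`. The helper name `allocatable_gpus` and the sample node data are illustrative, not part of any library:

```python
def allocatable_gpus(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count.

    `node_list` is assumed to be the parsed JSON from `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without a working device plugin lack the key, so default to 0.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample data shaped like the Kubernetes NodeList API object.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
}

print(allocatable_gpus(sample))
```

A GPU node that reports 0 here usually means the device plugin pod is not running on it or the node's AMI lacks the NVIDIA drivers.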

### Autoscaling with GPU Instances
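To autoscale the GPU instances, we first define an EKS node group with a scaling configuration: the GPU-enabled instance type, the minimum and maximum size of the node group, and the desired capacity. These values only set the bounds of the underlying Auto Scaling group. For the node count to change in response to pending GPU pods, an autoscaler such as the Kubernetes Cluster Autoscaler must also be running in the cluster.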
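To make the roles of the minimum and maximum sizes concrete, here is a simplified sketch of the sizing decision a node autoscaler makes. It is not the Cluster Autoscaler's actual algorithm, which also weighs CPU, memory, taints, and scale-down delays. The function name and parameters are hypothetical:

```python
import math


def desired_gpu_nodes(requested_gpus: int, gpus_per_node: int,
                      min_size: int, max_size: int) -> int:
    """Return the node count needed to fit the requested GPUs, clamped to the group bounds."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    # Round up: a partially used node still has to exist.
    needed = math.ceil(requested_gpus / gpus_per_node)
    # The node group never shrinks below min_size or grows beyond max_size.
    return max(min_size, min(max_size, needed))


# A p3.2xlarge has one GPU, so three single-GPU pods need three nodes.
print(desired_gpu_nodes(requested_gpus=3, gpus_per_node=1, min_size=1, max_size=4))   # 3
# Demand beyond the maximum is capped; the remaining pods stay pending.
print(desired_gpu_nodes(requested_gpus=10, gpus_per_node=1, min_size=1, max_size=4))  # 4
# With no demand, the group still keeps its minimum size.
print(desired_gpu_nodes(requested_gpus=0, gpus_per_node=1, min_size=1, max_size=4))   # 1
```

Because GPU instances are expensive, the minimum size is the main cost lever: it is what you pay for even when no workloads are running.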

Let's begin with the Pulumi program, and I'll provide detailed comments at each step of the way:

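```python
import json

import pulumi
import pulumi_aws as aws
import pulumi_eks as eks
import pulumi_kubernetes as k8s

# IAM role for the worker nodes. A managed node group needs a role that is
# registered with the cluster, so we create one and share it with the cluster.
node_role = aws.iam.Role("gpu-node-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }))
for i, policy in enumerate(["AmazonEKSWorkerNodePolicy",
                            "AmazonEKS_CNI_Policy",
                            "AmazonEC2ContainerRegistryReadOnly"]):
    aws.iam.RolePolicyAttachment(f"gpu-node-role-policy-{i}",
        role=node_role.name,
        policy_arn=f"arn:aws:iam::aws:policy/{policy}")

# Create an EKS cluster with mostly default settings.
# This provisions the EKS control plane and a default (non-GPU) node group.
# Unless you pass your own VPC and subnets, the account's default VPC is used.
cluster = eks.Cluster("gpu-cluster", instance_role=node_role)

# Kubernetes provider that talks to the new cluster through its kubeconfig.
k8s_provider = k8s.Provider("gpu-cluster-k8s",
    kubeconfig=cluster.kubeconfig.apply(json.dumps))

# Define the instance type for our GPU workloads. For example, "p3.2xlarge" is a GPU-enabled instance.
instance_type = "p3.2xlarge"

# To use GPU resources in Kubernetes, we need to install the NVIDIA device plugin.
# It runs as a DaemonSet and advertises the GPUs on each node as the
# `nvidia.com/gpu` resource, so the scheduler can place pods that request them.
# The image tag below is the one from the original answer; check the plugin's
# releases and pin a current version before using this in production.
nvidia_device_plugin = """apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:v0.9.0
        securityContext:
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins"""

# Apply the NVIDIA device plugin manifest to the cluster.
node_plugin = k8s.yaml.ConfigGroup("nvidia-device-plugin",
    yaml=[nvidia_device_plugin],
    opts=pulumi.ResourceOptions(provider=k8s_provider))

# Create a managed node group for GPU instances.
# `scaling_config` sets the bounds of the underlying Auto Scaling group. The node
# count only changes with demand once an autoscaler (for example, the Kubernetes
# Cluster Autoscaler) is installed in the cluster; that is not done here.
gpu_node_group = eks.ManagedNodeGroup("gpu-node-group",
    cluster=cluster,
    node_role=node_role,
    instance_types=[instance_type],
    # GPU-optimized AMI with NVIDIA drivers. Newer Kubernetes versions may
    # require "AL2023_x86_64_NVIDIA" instead.
    ami_type="AL2_x86_64_GPU",
    scaling_config=aws.eks.NodeGroupScalingConfigArgs(
        min_size=1,
        max_size=4,      # Adjust the maximum size according to your requirements.
        desired_size=2,  # Initial node count for your workloads.
    ),
    labels={"workload-type": "gpu-intensive"},
    # Keep non-GPU pods off these nodes. The EKS API spells the effect "NO_SCHEDULE".
    taints=[aws.eks.NodeGroupTaintArgs(key="nvidia.com/gpu", value="true", effect="NO_SCHEDULE")],
)

# Export the kubeconfig.
pulumi.export('kubeconfig', cluster.kubeconfig)
```

### Explanations:

- `aws.iam.Role`: The IAM role assumed by the worker nodes, with the three AWS managed policies that EKS nodes need. It is shared by the cluster's default node group and the GPU node group.

- `eks.Cluster`: This resource creates the EKS control plane and a default node group. It uses the account's default VPC and subnets unless you supply your own.

- `k8s.yaml.ConfigGroup`: By applying this resource, we install the NVIDIA device plugin onto our cluster, which is necessary for Kubernetes to see the GPUs and schedule pods that request `nvidia.com/gpu`.

- `eks.ManagedNodeGroup`: This managed node group uses GPU instance types and can hold between the minimum and maximum number of nodes. The actual scaling between those bounds is driven by an autoscaler such as the Kubernetes Cluster Autoscaler, which has to be installed separately.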

The program exports the cluster's kubeconfig as a stack output named `kubeconfig`, which you can use to interact with your Kubernetes cluster using `kubectl` or other Kubernetes management tools.
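To check that GPU scheduling works end to end, you can submit a small pod that requests one GPU. The sketch below only builds the manifest as JSON, which `kubectl apply -f` accepts. It assumes the `workload-type: gpu-intensive` label and the `nvidia.com/gpu` taint from the node group above. The function name, pod name, and container image are illustrative assumptions, so substitute a CUDA image you trust:

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests GPUs and targets the tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Land only on nodes carrying the label set on the GPU node group.
            "nodeSelector": {"workload-type": "gpu-intensive"},
            # Tolerate the taint that keeps non-GPU pods off those nodes.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [{
                "name": name,
                "image": image,
                "command": ["nvidia-smi"],
                # GPUs are requested under `limits`; they cannot be shared or overcommitted.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# The image name is an assumption for illustration.
manifest = gpu_pod_manifest("gpu-smoke-test", "nvidia/cuda:12.2.0-base-ubuntu22.04")
print(json.dumps(manifest, indent=2))
```

Save the printed JSON to a file and apply it with `kubectl apply -f <file>`. If the pod stays `Pending`, check that the device plugin pod is running on the GPU nodes and that the nodes report allocatable `nvidia.com/gpu`.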

Remember, this is just a starting point for setting up an auto-scaling GPU-enabled EKS cluster. You can adjust the instance types, scaling parameters, and other settings according to your specific requirements and budget.