Auto-Scaling GPU Nodes for ML Workloads with Karpenter

Pulumi · Accepted Answer

Machine learning workloads often need GPU capacity that grows and shrinks with demand. Since you're interested in setting up auto-scaling on a Kubernetes cluster, we can use Karpenter, an open-source node auto-scaling project built for Kubernetes. Karpenter is designed to launch new nodes quickly and help clusters use their resources efficiently.

In this example, I will walk through a Pulumi Python program that sets up a Kubernetes cluster on AWS EKS (Elastic Kubernetes Service) and configures Karpenter to auto-scale GPU nodes for ML workloads. Here's how we will proceed:

1. Create an EKS cluster.
2. Create an IAM role and instance profile for Karpenter.
3. Deploy Karpenter to the EKS cluster.
4. Set up Karpenter to provision GPU instances when needed for ML workloads.

Here's the Pulumi Python program that accomplishes these tasks:

```python
import pulumi
import pulumi_eks as eks
import pulumi_kubernetes as k8s
from pulumi_aws import iam
from pulumi_kubernetes.helm.v3 import Chart, ChartOpts

# Create an EKS cluster with the pulumi_eks component. Unlike the raw
# pulumi_aws.eks.Cluster resource, it works with default arguments and
# exposes the `kubeconfig` output used further down.
eks_cluster = eks.Cluster("eks-cluster")

# Create the IAM role that nodes launched by Karpenter will assume
# (the trust policy below allows the EC2 service to assume it).
karpenter_role = iam.Role("karpenter-role",
                          assume_role_policy="""{
                            "Version": "2012-10-17",
                            "Statement": [{
                                "Effect": "Allow",
                                "Principal": {
                                    "Service": "ec2.amazonaws.com"
                                },
                                "Action": "sts:AssumeRole"
                            }]
                          }""")

# Attach the necessary policies to the Karpenter role.
iam.RolePolicyAttachment("karpenter-attach-ec2",
                         role=karpenter_role.name,
                         policy_arn="arn:aws:iam::aws:policy/AmazonEC2FullAccess")
iam.RolePolicyAttachment("karpenter-attach-eks",
                         role=karpenter_role.name,
                         policy_arn="arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy")
# The managed CNI policy is named AmazonEKS_CNI_Policy (with underscores).
iam.RolePolicyAttachment("karpenter-attach-cni",
                         role=karpenter_role.name,
                         policy_arn="arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy")

# Create an instance profile for Karpenter nodes.
karpenter_instance_profile = iam.InstanceProfile("karpenter-instance-profile",
                                                 role=karpenter_role.name)

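# Create a Kubernetes provider from the cluster's kubeconfig so the resources
# below are deployed to the new EKS cluster.
k8s_provider = k8s.Provider('eks-k8s', kubeconfig=eks_cluster.kubeconfig)

# The Chart resource does not create its target namespace, so create it first.
karpenter_namespace = k8s.core.v1.Namespace(
    "karpenter-namespace",
    metadata={"name": "karpenter"},
    opts=pulumi.ResourceOptions(provider=k8s_provider))

# Deploy Karpenter to the EKS cluster.
# Note: use a chart version whose CRDs serve the `karpenter.sh/v1alpha5`
# API that the Provisioner below relies on.
karpenter_chart = Chart("karpenter",
                        config=ChartOpts(
                            chart="karpenter",
                            version="0.4.4",
                            namespace="karpenter",
                            fetch_opts=k8s.helm.v3.FetchOpts(
                                repo="https://charts.karpenter.sh",
                            ),
                        ),
                        opts=pulumi.ResourceOptions(provider=k8s_provider,
                                                    depends_on=[karpenter_namespace]))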

# Set up auto-scaling with Karpenter.
# Below, we set up a Provisioner that tells Karpenter how to launch new nodes.
# You would adjust the labels, taints, and resources based on your ML workload needs.
k8s.apiextensions.CustomResource(
    'gpu-node-provisioner',
    api_version='karpenter.sh/v1alpha5',
    kind='Provisioner',
    # Provisioners are cluster-scoped in v1alpha5, so no namespace is set.
    metadata={'name': 'gpu-nodes'},
    spec={
        'requirements': [
            {'key': 'kubernetes.io/arch', 'operator': 'In', 'values': ['amd64']},
            {'key': 'node.kubernetes.io/instance-type', 'operator': 'In', 'values': ['p3.2xlarge']},
            {'key': 'topology.kubernetes.io/zone', 'operator': 'In', 'values': ['us-west-2a', 'us-west-2b', 'us-west-2c']}
        ],
        'provider': {
            'instanceProfile': karpenter_instance_profile.name,
        },
        'ttlSecondsAfterEmpty': 30
    },
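    # Wait for the chart so the Provisioner CRD exists before this is created.
    opts=pulumi.ResourceOptions(provider=k8s_provider, depends_on=[karpenter_chart])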
)

# Export the cluster's kubeconfig.
pulumi.export('kubeconfig', eks_cluster.kubeconfig)
```

Let's break down what this program does:

- We first set up a new AWS EKS cluster to host our Kubernetes workloads (`eks.Cluster`).
- Then, we create a new IAM role (with `iam.Role`) whose trust policy lets EC2 instances assume it; this is the role that nodes launched by Karpenter run under. We attach policies granting full access to EC2, as well as permissions for working with EKS and the VPC CNI plugin for Kubernetes (`iam.RolePolicyAttachment`).
- We next create an instance profile (`iam.InstanceProfile`), which wraps that role so it can be attached to the EC2 instances that Karpenter launches.
- After that, we deploy Karpenter onto our Kubernetes cluster from its Helm chart (`Chart`), specifying the chart name, version, and repository.
- With Karpenter installed, we create a `CustomResource` for a Provisioner, which is a Karpenter resource that defines requirements for node provisioning such as instance types, zones, and other constraints.
- The `k8s.apiextensions.CustomResource` for the Provisioner specifies `p3.2xlarge` instances, each of which provides one NVIDIA V100 GPU and is suitable for GPU-based ML workloads. You may need to adjust the instance type based on your specific workload and the GPU instance types available in your region.
- Finally, we export the kubeconfig of our EKS cluster, which you can use to interact with your Kubernetes cluster using `kubectl` or other Kubernetes tools.
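
Karpenter only launches a node when a pod cannot be scheduled, so the Provisioner does nothing until a workload asks for a GPU. The following is a minimal sketch of such a workload, written as a plain Python function that builds a Pod manifest. It assumes the NVIDIA device plugin is running in the cluster so that nodes advertise the `nvidia.com/gpu` resource; the function name `gpu_pod_manifest`, the pod name, and the image are hypothetical placeholders, not part of the program above.

```python
def gpu_pod_manifest(name: str, image: str, gpus: int = 1,
                     instance_type: str = "p3.2xlarge") -> dict:
    """Build a Pod manifest that requests `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Pin the pod to the instance type the Provisioner is allowed to launch.
            "nodeSelector": {"node.kubernetes.io/instance-type": instance_type},
            "containers": [{
                "name": "trainer",
                "image": image,
                # Extended resources such as GPUs are requested through limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# Hypothetical image name, used only for illustration.
manifest = gpu_pod_manifest("train-job", "example.com/ml/trainer:latest")
```

Serializing `manifest` with `json.dumps` and piping it to `kubectl apply -f -` should leave the pod pending until a matching GPU node has joined the cluster.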

This example assumes you have Pulumi and the AWS CLI installed and configured (ensure `aws configure` has been run with your access key, secret key, and default region). If you don't have Pulumi installed, please download it from [Pulumi's website](https://www.pulumi.com/docs/get-started/aws/begin/).

Before you run `pulumi up`, make sure the region you have selected offers both EKS and GPU-based EC2 instances, and that the zones listed in the Provisioner belong to that region.
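
One way to check availability up front is to ask EC2 which availability zones offer the instance type. The sketch below assumes a boto3 EC2 client is passed in; `zones_offering` is a hypothetical helper name, and for brevity it ignores pagination, which should be acceptable when filtering on a single instance type.

```python
def zones_offering(ec2_client, instance_type: str) -> list:
    """Return the sorted availability zones that offer `instance_type`."""
    response = ec2_client.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return sorted(o["Location"] for o in response["InstanceTypeOfferings"])


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   print(zones_offering(ec2, "p3.2xlarge"))
```

If the returned list is empty, or does not include the zones named in the Provisioner's `topology.kubernetes.io/zone` requirement, Karpenter will have nowhere to launch the instance.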

After deploying this code using Pulumi, your EKS cluster will be capable of auto-scaling nodes with GPUs, which can be used to efficiently run your ML workloads. Remember to adjust the instance types, labels, taints, and other specifications to suit your specific requirements.
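
As a rough illustration of those adjustments, the sketch below builds a Provisioner `spec` with a label and a taint so that only pods that explicitly tolerate GPU nodes land on them. The field names (`requirements`, `labels`, `taints`, `provider`, `ttlSecondsAfterEmpty`) assume the `karpenter.sh/v1alpha5` schema and should be verified against the CRD installed by your chart version; the helper name `gpu_provisioner_spec` and the `workload-type` label are hypothetical.

```python
def gpu_provisioner_spec(instance_types, zones, instance_profile,
                         ttl_seconds_after_empty=30):
    """Build a Provisioner spec that labels and taints GPU nodes."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "requirements": [
            {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
            {"key": "node.kubernetes.io/instance-type", "operator": "In",
             "values": list(instance_types)},
            {"key": "topology.kubernetes.io/zone", "operator": "In",
             "values": list(zones)},
        ],
        # Hypothetical label that workloads can target with a nodeSelector.
        "labels": {"workload-type": "ml-gpu"},
        # Keep pods without a matching toleration off the GPU nodes.
        "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
        "provider": {"instanceProfile": instance_profile},
        "ttlSecondsAfterEmpty": ttl_seconds_after_empty,
    }


spec = gpu_provisioner_spec(
    ["p3.2xlarge", "p3.8xlarge"],
    ["us-west-2a", "us-west-2b"],
    "karpenter-instance-profile",  # placeholder; in Pulumi, pass the profile's name output
)
```

The resulting dictionary can be passed as the `spec` argument of the `CustomResource` in the program above. With the taint in place, GPU pods need a toleration for `nvidia.com/gpu` with effect `NoSchedule`, otherwise they will not be scheduled onto these nodes.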