1. Auto-scaling GPU Clusters for Deep Learning on GKE


    To build an auto-scaling GPU cluster for deep learning on Google Kubernetes Engine (GKE) with Pulumi, you'll need to create a GKE cluster and configure it with a node pool that includes GPU-enabled nodes. You will then define auto-scaling parameters for the node pool so it can automatically adjust the number of nodes based on the workload.

    Here's a high-level overview of what you will do:

    1. Create a GKE Cluster: This will be the Kubernetes cluster where your deep learning workloads will run.
    2. Add a GPU-enabled Node Pool: Configure a node pool with GPU accelerators and install the necessary NVIDIA drivers.
    3. Configure Cluster Autoscaling: Enable autoscaling on your GPU node pool to allow it to automatically scale up or down.
    4. Define Resource Requirements: Ensure that your workloads request GPU resources so that the scheduler knows which pods need to be placed on GPU-enabled nodes (a sketch of such a workload follows this list).
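
    For step 4, a workload signals that it needs a GPU by setting a resource limit of nvidia.com/gpu. Below is a minimal sketch using the pulumi_kubernetes package; the pod name and container image are illustrative only, and it assumes a Kubernetes provider is already configured against the cluster created by the program further down.

    import pulumi_kubernetes as k8s

    # Illustrative pod that requests one GPU, so the scheduler places it on a
    # node from the GPU node pool. Assumes a Kubernetes provider configured
    # against the GKE cluster created below.
    gpu_pod = k8s.core.v1.Pod(
        "cuda-test-pod",
        spec=k8s.core.v1.PodSpecArgs(
            containers=[
                k8s.core.v1.ContainerArgs(
                    name="cuda-container",
                    image="nvidia/cuda:11.0.3-base-ubuntu20.04",  # illustrative image
                    resources=k8s.core.v1.ResourceRequirementsArgs(
                        # The nvidia.com/gpu limit is what tells the scheduler this
                        # pod must land on a GPU-enabled node.
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )
            ],
        ),
    )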

    Below is a Pulumi program written in Python which sets up an auto-scaling GPU cluster for deep learning on GKE:

    import pulumi
    import pulumi_gcp as gcp

    # Create a GKE cluster.
    cluster = gcp.container.Cluster("gpu-cluster",
        initial_node_count=1,
        min_master_version="latest",
        node_config={
            "oauth_scopes": [
                "https://www.googleapis.com/auth/compute",
                "https://www.googleapis.com/auth/devstorage.read_only",
                "https://www.googleapis.com/auth/logging.write",
                "https://www.googleapis.com/auth/monitoring",
            ],
        },
        node_version="latest",
    )

    # Add a GPU-enabled node pool with auto-scaling enabled, suitable for deep learning workloads.
    gpu_node_pool = gcp.container.NodePool("gpu-node-pool",
        cluster=cluster.name,
        initial_node_count=1,
        autoscaling={
            "min_node_count": 1,
            "max_node_count": 5,  # Adjust the maximum number of nodes as needed.
        },
        node_config={
            "machine_type": "n1-standard-1",  # Choose an appropriate machine type.
            "oauth_scopes": [
                "https://www.googleapis.com/auth/compute",
                "https://www.googleapis.com/auth/devstorage.read_only",
                "https://www.googleapis.com/auth/logging.write",
                "https://www.googleapis.com/auth/monitoring",
            ],
            # Specify the type and count of GPUs you want to use.
            "guest_accelerators": [{
                "type": "nvidia-tesla-k80",  # For example, use NVIDIA Tesla K80 GPUs.
                "count": 1,
            }],
            # The following metadata enables the GPU driver add-on for GKE,
            # which automatically installs the NVIDIA GPU driver.
            "metadata": {
                "install-nvidia-driver": "True",
            },
        },
        management={
            "auto_repair": True,
            "auto_upgrade": True,
        },
        opts=pulumi.ResourceOptions(depends_on=[cluster]),
    )

    pulumi.export("cluster_name", cluster.name)
    pulumi.export("gpu_node_pool_name", gpu_node_pool.name)

    Here's what each part of the code does:

    • The gcp.container.Cluster resource creates a new GKE cluster with the specified settings. It includes an initial number of nodes and specifies the required OAuth scopes for the nodes to interact with Google Cloud services.

    • The gcp.container.NodePool resource creates a new node pool within the GKE cluster. This pool is configured with the specified machine type, the initial number of nodes, and the GPU accelerator type (Nvidia Tesla K80 GPUs in the example). It also turns on auto-scaling with a specified minimum and maximum number of nodes, enabling the GKE cluster to automatically adjust to your workload demands.

    • The auto-repair and auto-upgrade options are enabled, so GKE automatically keeps the nodes healthy and up to date.

    • The metadata entry instructs GKE to install the NVIDIA GPU driver on each node in the node pool. Without this driver the nodes cannot advertise their GPUs as schedulable nvidia.com/gpu resources, so the Kubernetes scheduler cannot place GPU workloads on them. (Google's GKE documentation also describes installing the driver with a dedicated DaemonSet; a sketch of that approach appears after this list.)

    • Lastly, the program exports the cluster and node pool names. These can be used to interact with the cluster or for reference in other Pulumi stacks.
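
    For example, a common pattern is to assemble a kubeconfig from the cluster's outputs so that kubectl or a pulumi_kubernetes provider can reach the new cluster. The following is only a sketch; it builds on the cluster resource from the program above and assumes the gke-gcloud-auth-plugin is installed wherever the kubeconfig is used.

    import json

    # Sketch: build a kubeconfig (in JSON form, which kubectl also accepts) from
    # the cluster outputs. Assumes the gke-gcloud-auth-plugin is available on the
    # machine that uses this kubeconfig.
    cluster_ca_cert = cluster.master_auth.apply(lambda auth: auth.cluster_ca_certificate)

    kubeconfig = pulumi.Output.all(cluster.name, cluster.endpoint, cluster_ca_cert).apply(
        lambda args: json.dumps({
            "apiVersion": "v1",
            "kind": "Config",
            "clusters": [{
                "name": args[0],
                "cluster": {
                    "certificate-authority-data": args[2],
                    "server": f"https://{args[1]}",
                },
            }],
            "contexts": [{
                "name": args[0],
                "context": {"cluster": args[0], "user": args[0]},
            }],
            "current-context": args[0],
            "users": [{
                "name": args[0],
                "user": {
                    "exec": {
                        "apiVersion": "client.authentication.k8s.io/v1beta1",
                        "command": "gke-gcloud-auth-plugin",
                    },
                },
            }],
        })
    )

    pulumi.export("kubeconfig", kubeconfig)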
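
    As for the NVIDIA driver, Google's GKE documentation also describes installing it by applying a driver-installer DaemonSet to the cluster. The sketch below shows what that could look like with the pulumi_kubernetes package; it assumes a Kubernetes provider configured for this cluster, and the manifest URL should be verified against the current GKE GPU documentation.

    import pulumi_kubernetes as k8s

    # Illustrative: apply Google's NVIDIA driver installer DaemonSet (for
    # Container-Optimized OS nodes). Verify the manifest URL against the current
    # GKE GPU documentation.
    driver_installer = k8s.yaml.ConfigFile(
        "nvidia-driver-installer",
        file="https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml",
    )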

    Please note that this is a simplified example intended for learning purposes. For production use, you would need to consider additional aspects, such as network configuration, security settings, more complex resource specifications, and appropriate sizing for your specific workload. You should also keep in mind the costs associated with running an auto-scaling GPU cluster on GKE.