Deploy the gpu-operator Helm chart on Azure Kubernetes Service (AKS)

Question

Pulumi · Accepted Answer

To deploy the gpu-operator Helm chart on an Azure Kubernetes Service (AKS) cluster using Pulumi, we'll perform the following steps:

1. **Create an AKS Cluster**: We'll begin by creating an AKS cluster that can support GPUs. We won't cover GPU node pool configuration in detail, because it depends on choosing a GPU-capable VM size, and the available sizes vary by region and change over time. The cluster is created with the `ManagedCluster` resource from the `azure-native` package.

2. **Install the GPU Operator**: Once the cluster is set up, we will deploy the `gpu-operator` Helm chart. Helm charts are packages of pre-configured Kubernetes resources. The GPU Operator automates the management of the NVIDIA software components needed to provision GPUs. Pulumi fetches the chart directly from the repository URL you give it, so no separate `helm repo add` step is required.

3. **Configure Pulumi to Use Helm**: We'll install the GPU Operator by creating a `Chart` resource from the `kubernetes` package, pointed at the new cluster through a Kubernetes provider built from the cluster's kubeconfig.

Here's a Pulumi program in TypeScript that sets up the AKS cluster and deploys the gpu-operator Helm chart:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as azure_native from "@pulumi/azure-native";
import * as k8s from "@pulumi/kubernetes";

// Step 1: Create an Azure Kubernetes Service (AKS) cluster

// Define the AKS cluster
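const aksCluster = new azure_native.containerservice.ManagedCluster("aksCluster", {
    resourceGroupName: "myResourceGroup", // This resource group must already exist
    dnsPrefix: "aksgpu",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "gpupool",
        mode: "System",
        count: 1,
        // Example only: pick a GPU-capable VM size that is available in your region
        vmSize: "Standard_NC6s_v3",
        osType: "Linux",
    }],
});

// ManagedCluster has no kubeconfig output, so fetch the user credentials separately
const creds = azure_native.containerservice.listManagedClusterUserCredentialsOutput({
    resourceGroupName: "myResourceGroup",
    resourceName: aksCluster.name,
});

// Export the kubeconfig (returned base64-encoded) as a secret
export const kubeconfig = pulumi.secret(
    creds.kubeconfigs.apply(c => Buffer.from(c[0].value, "base64").toString()),
);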

// Step 2: Deploy the gpu-operator Helm chart

const gpuOperatorChart = new k8s.helm.v3.Chart("gpu-operator", {
    // `fetchOpts.repo` makes Pulumi pull the chart straight from the repository,
    // so no local `helm repo add` is needed
    chart: "gpu-operator",
    version: "x.y.z", // Replace with the specific version you want to deploy
    fetchOpts: {
        repo: "https://helm.ngc.nvidia.com/nvidia", // NVIDIA's Helm repository, which hosts gpu-operator
    },
    // You may need to provide specific values for the Helm chart depending on your requirements
    values: {
        // Define any specific configurations needed by the gpu-operator here
    },
}, { provider: new k8s.Provider("k8sProvider", { kubeconfig: kubeconfig }) });

// Chart has no `status` output; export the names of the resources it rendered instead
export const gpuOperatorResources = gpuOperatorChart.resources.apply(r => Object.keys(r));
```
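AKS returns cluster credentials as a base64-encoded kubeconfig, which has to be decoded before it is handed to the Kubernetes provider. The following is a minimal sketch of that decoding step. The helper name `decodeKubeconfig` and its sanity check are illustrative assumptions, not part of any Pulumi API.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: turn a base64-encoded kubeconfig into YAML text.
function decodeKubeconfig(encoded: string): string {
    const yaml = Buffer.from(encoded, "base64").toString("utf8");
    // Cheap sanity check so an obviously wrong value fails early
    if (!yaml.includes("apiVersion")) {
        throw new Error("decoded value does not look like a kubeconfig");
    }
    return yaml;
}
```

Inside a Pulumi program, a function like this would typically be called from within `.apply(...)` on the credentials output, and the result wrapped in `pulumi.secret(...)` because it contains cluster credentials.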

**Key Aspects of the Program**:

- The `ManagedCluster` resource creates a new AKS cluster within a resource group named `myResourceGroup`. You need to specify the VM size and other details to suit your requirements, and in particular make sure the chosen VM size supports GPUs.

- Once the cluster is provisioned, its kubeconfig is exported as a stack output. You can use it with kubectl or any other Kubernetes client, and the program passes it to the Kubernetes provider. Treat it as a secret, because it contains cluster credentials.

- The `Chart` resource deploys the gpu-operator Helm chart from NVIDIA's Helm repository. Replace `x.y.z` with the actual chart version you intend to use. The chart is given an explicit Kubernetes provider built from the cluster's kubeconfig, so its resources are created in that cluster rather than in whichever cluster your local kubectl context points to.
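Because the `x.y.z` placeholder is easy to leave in by accident, it can help to validate the version string before it reaches the chart. This is a minimal sketch. The function name `requireChartVersion` is hypothetical, and the pattern assumes chart versions follow a `MAJOR.MINOR.PATCH` shape with an optional leading `v` and optional pre-release suffix.

```typescript
// Hypothetical guard: reject the "x.y.z" placeholder and other non-concrete versions.
function requireChartVersion(version: string): string {
    // Accepts e.g. "1.2.3", "v1.2.3" or "1.2.3-rc.1"
    const pattern = /^v?\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;
    if (!pattern.test(version)) {
        throw new Error(`"${version}" is not a concrete chart version`);
    }
    return version;
}
```

The return value could then be passed as the `version` argument of the `Chart` resource, so a forgotten placeholder fails when the program runs rather than when Helm tries to fetch the chart.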

Remember that the values and configuration needed for the GPU Operator depend on your GPU requirements and cluster setup. Refer to NVIDIA's GPU Operator documentation and the Azure documentation on GPU support in AKS for the exact deployment settings.
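As one illustration of how the chart values might vary with the setup, the sketch below builds a `values` object depending on whether the GPU nodes already ship the NVIDIA driver and container toolkit, as AKS GPU node images commonly do. The helper and its option name are hypothetical. The `driver.enabled` and `toolkit.enabled` keys are assumed from the chart's published values, so verify them against the chart version you deploy.

```typescript
// Hypothetical options for this sketch
interface GpuOperatorOptions {
    // True when the node image already includes the NVIDIA driver and container toolkit
    driverPreinstalled: boolean;
}

// Build Helm values: let the operator install the driver and toolkit
// only when the node image does not already provide them.
function gpuOperatorValues(opts: GpuOperatorOptions): Record<string, unknown> {
    return {
        driver: { enabled: !opts.driverPreinstalled },
        toolkit: { enabled: !opts.driverPreinstalled },
    };
}
```

The returned object could be passed as the `values` argument of the `Chart` resource, for example `values: gpuOperatorValues({ driverPreinstalled: true })`.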