Deploy the gpu-operator helm chart on AWS EKS

To deploy the GPU Operator Helm chart on AWS Elastic Kubernetes Service (EKS), follow these steps:

### Prerequisites

Before you begin, ensure you have:

- An AWS account with appropriate permissions to create EKS clusters, IAM roles, and EC2 instances with GPUs.
- Pulumi CLI installed and configured with your AWS credentials.
- Helm CLI installed on your machine (optional, only needed to inspect the Helm chart locally).
- kubectl CLI installed on your machine to interact with your Kubernetes cluster.
  
### Overview

We're going to create an EKS cluster on AWS and then install the GPU Operator Helm chart, which manages NVIDIA GPUs in the cluster. The GPU Operator deploys and maintains the software needed to expose GPUs to workloads, including the NVIDIA driver (run as a container), the Kubernetes device plugin, and the NVIDIA container toolkit, all packaged in a single Helm chart.

Here's a step-by-step Pulumi program that will:

1. Provision an EKS cluster with GPU-enabled nodes.
2. Deploy the GPU Operator Helm chart to the cluster.

### Pulumi Program

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as eks from "@pulumi/eks";
import * as k8s from "@pulumi/kubernetes";

// Step 1: Create an EKS cluster with GPU-enabled nodes.
const cluster = new eks.Cluster("gpu-cluster", {
    instanceType: "p2.xlarge", // Example GPU instance type; confirm your account has EC2 quota for it.
    desiredCapacity: 1,        // Number of GPU worker nodes.
    minSize: 1,
    maxSize: 2,
});

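// Step 2: Deploy the GPU Operator using a Helm chart.
// The Chart resource does not create its target namespace, so create it first.
const gpuOperatorNamespace = new k8s.core.v1.Namespace("gpu-operator", {
    metadata: { name: "gpu-operator" },
}, { provider: cluster.provider });

const gpuOperatorChart = new k8s.helm.v3.Chart("gpu-operator", {
    chart: "gpu-operator",
    version: "1.7.0",  // Replace with the desired version of the GPU Operator Helm chart.
    // Referencing the namespace output makes the chart wait for the namespace.
    namespace: gpuOperatorNamespace.metadata.name,
    fetchOpts: {
        repo: "https://nvidia.github.io/gpu-operator", // Repository for the GPU Operator Helm chart.
    },
}, { provider: cluster.provider });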

// Export the cluster's kubeconfig.
export const kubeconfig = cluster.kubeconfig;

// Export the GPU Operator Helm chart resources.
export const gpuOperatorResources = gpuOperatorChart.resources;
```

### Explanation

- **eks.Cluster**: Creates a managed Kubernetes cluster on AWS. Here we specify an instance type that is GPU capable. [`eks.Cluster`](https://www.pulumi.com/registry/packages/eks/api-docs/cluster/)
  
- **k8s.helm.v3.Chart**: Deploys Helm charts in your Kubernetes cluster. In this case, we're deploying the `gpu-operator` Helm chart. Helm is a package manager for Kubernetes that allows you to package, configure, and manage Kubernetes resources. [`k8s.helm.v3.Chart`](https://www.pulumi.com/registry/packages/kubernetes/api-docs/helm.sh/v3/chart/)
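
The chart can also be customized through Helm values, which `k8s.helm.v3.Chart` accepts via its `values` argument. The following is a minimal sketch of building such a values object, not part of the program above. It assumes the worker node image may already ship the NVIDIA driver or container toolkit, in which case the operator-managed copies are switched off. The helper name `gpuOperatorValues` and its option names are hypothetical; `driver.enabled` and `toolkit.enabled` are keys from the chart's `values.yaml`, and should be checked against the chart version you pin.

```typescript
// Hypothetical options describing what the worker node image already provides.
interface GpuOperatorOptions {
    // True when the node image already ships the NVIDIA driver.
    driverPreinstalled: boolean;
    // True when the node image already ships the NVIDIA container toolkit.
    toolkitPreinstalled: boolean;
}

// Subset of the gpu-operator chart values used in this sketch.
interface GpuOperatorValues {
    driver: { enabled: boolean };
    toolkit: { enabled: boolean };
}

// Builds a Helm values object: a component is enabled in the operator
// only when the node image does not already provide it.
function gpuOperatorValues(opts: GpuOperatorOptions): GpuOperatorValues {
    return {
        driver: { enabled: !opts.driverPreinstalled },
        toolkit: { enabled: !opts.toolkitPreinstalled },
    };
}

// Example: the image ships the driver but not the container toolkit.
const exampleValues = gpuOperatorValues({
    driverPreinstalled: true,
    toolkitPreinstalled: false,
});
console.log(JSON.stringify(exampleValues, null, 2));
```

The resulting object could then be passed as `values: exampleValues` alongside `chart`, `version`, and `namespace` in the chart arguments.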

### Next Steps

1. Run `pulumi up` to deploy both the EKS cluster and the GPU Operator to it.
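1. Once the deployment is complete, write the exported kubeconfig to a file with `pulumi stack output kubeconfig` (add `--show-secrets` if the output is marked secret) and point `kubectl` at that file to manage the Kubernetes resources.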
1. You should see the GPU Operator pods running in the `gpu-operator` namespace.
1. You can also use the Pulumi console to view the status of your deployments and inspect the resources created.
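
Beyond checking that the operator pods are running, one way to confirm that GPUs are actually schedulable is a short-lived pod that requests a GPU and runs `nvidia-smi`. The sketch below only builds the pod manifest as a plain object. The helper name `gpuTestPod`, the pod name `gpu-smoke-test`, and the image placeholder are assumptions for illustration; `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin.

```typescript
// Shape of the minimal pod manifest used for the smoke test.
interface GpuTestPod {
    apiVersion: "v1";
    kind: "Pod";
    metadata: { name: string; namespace: string };
    spec: {
        restartPolicy: "Never";
        containers: {
            name: string;
            image: string;
            command: string[];
            resources: { limits: { [resource: string]: string } };
        }[];
    };
}

// Hypothetical helper: returns a pod that requests one GPU and runs nvidia-smi once.
// The caller supplies a CUDA-capable image that contains the nvidia-smi binary.
function gpuTestPod(image: string, namespace: string = "default"): GpuTestPod {
    return {
        apiVersion: "v1",
        kind: "Pod",
        metadata: { name: "gpu-smoke-test", namespace },
        spec: {
            restartPolicy: "Never", // Run once; do not restart after completion.
            containers: [{
                name: "nvidia-smi",
                image,
                command: ["nvidia-smi"],
                // The pod stays Pending unless a node advertises this resource.
                resources: { limits: { "nvidia.com/gpu": "1" } },
            }],
        },
    };
}

// Placeholder image reference; substitute a real CUDA image and tag.
const smokeTestPod = gpuTestPod("nvidia/cuda:<tag>");
console.log(JSON.stringify(smokeTestPod, null, 2));
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, or the `metadata` and `spec` fields can be passed to a `k8s.core.v1.Pod` resource that uses `cluster.provider`. If the pod completes and its logs list a GPU, the driver and device plugin are working.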

Remember that the resources this program creates, GPU instances in particular, are billed by AWS for as long as they exist. Clean them up with `pulumi destroy` when they're no longer needed to avoid unnecessary charges.