Kubernetes Operators for AI Workflows on EKS

Pulumi · Accepted Answer

Creating Kubernetes Operators for AI workflows on Amazon EKS starts with the infrastructure. Here, I'll guide you through a Pulumi program that sets up an Amazon EKS (Elastic Kubernetes Service) cluster, which you can later use to deploy Kubernetes Operators for AI workflows. Operators in Kubernetes are custom controllers that extend the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of Kubernetes users.

We'll use three main resources from Pulumi's libraries to accomplish our task:

1. `eks.Cluster`: This high-level component from the Pulumi EKS library simplifies creating and configuring an EKS Kubernetes cluster. We use it instead of lower-level resources because it manages the underlying pieces, such as the EKS cluster, node groups, IAM roles, and security groups, as a single unit.

2. `aws.ecr.Repository`: We'll set up an Amazon Elastic Container Registry (ECR) repository to store the container images for our AI workflows. Kubernetes Operators that orchestrate AI workflows often need to pull images containing the AI models and workflow code, so the cluster needs a registry it can reach.

3. `pulumi_kubernetes.apiextensions.v1.CustomResourceDefinition` (CRD): We use this to define the schema for Custom Resources, which are extensions of the Kubernetes API that our operators can interact with. In real-world usage, these CRDs are defined as part of the operator's codebase, but for the sake of this demonstration, we'll focus on setting up the infrastructure rather than developing an operator.
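The program in this answer leaves the `spec` part of the CRD schema empty. As a rough sketch of what it could contain, the dictionary below fills in an `openAPIV3Schema` with three hypothetical fields (`image`, `command`, `replicas`). These field names and the `missing_required_fields` helper are assumptions made for illustration; they do not come from any existing operator.

```python
# Hypothetical schema for the AIWorkflow resource. The field names below
# (image, command, replicas) are illustrative assumptions only.
AI_WORKFLOW_SCHEMA = {
    "type": "object",
    "properties": {
        "spec": {
            "type": "object",
            "required": ["image"],
            "properties": {
                "image": {"type": "string"},
                "command": {"type": "array", "items": {"type": "string"}},
                "replicas": {"type": "integer", "minimum": 1},
            },
        }
    },
}


def missing_required_fields(resource: dict, schema: dict = AI_WORKFLOW_SCHEMA) -> list:
    """Return the required spec fields that a custom resource does not set.

    Illustration only: it mirrors the "required" check that the schema
    expresses, and is not a general OpenAPI validator.
    """
    spec_schema = schema["properties"]["spec"]
    spec = resource.get("spec", {})
    return [name for name in spec_schema.get("required", []) if name not in spec]
```

A dictionary like this could be used as the value of the `"openAPIV3Schema"` key in the CRD. Once the CRD is installed, the Kubernetes API server validates incoming resources against the schema itself, so the helper is only there to make the effect of `required` visible.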

Here's how you could write such a program in Pulumi using Python:

```python
import pulumi
import pulumi_eks as eks
import pulumi_aws as aws
import pulumi_kubernetes as kubernetes

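# 1. Create an EKS cluster whose default node group scales between 1 and 3 instances
cluster = eks.Cluster('ai-eks-cluster',
    desired_capacity=2,
    min_size=1,
    max_size=3,
    instance_type="t2.medium",
    # The EKS-optimized AMI needs a root volume of at least 20 GiB
    node_root_volume_size=20)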

# 2. Create an ECR repository
ecr_repo = aws.ecr.Repository('ai-workflows-repo')

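# 3. Define a CRD for our AI workflow.
# An explicit provider built from the cluster's kubeconfig makes sure the CRD
# is created in the new EKS cluster rather than in whatever cluster the
# ambient kubeconfig points to.
k8s_provider = kubernetes.Provider(
    "ai-eks-k8s",
    kubeconfig=pulumi.Output.json_dumps(cluster.kubeconfig),
)

ai_workflow_crd = kubernetes.apiextensions.v1.CustomResourceDefinition(
    "aiworkflowcrd",
    opts=pulumi.ResourceOptions(provider=k8s_provider),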
    metadata={"name": "aiworkflows.example.com"},
    spec={
        "group": "example.com",
        "versions": [
            {
                "name": "v1",
                "served": True,
                "storage": True,
                "schema": {
                    "openAPIV3Schema": {
                        "type": "object",
                        "properties": {
                            "spec": {
                                "type": "object",
                                "properties": {
                                    # Define your AI workflow spec here
                                }
                            }
                        }
                    }
                },
            }
        ],
        "scope": "Namespaced",
        "names": {
            "plural": "aiworkflows",
            "singular": "aiworkflow",
            "kind": "AIWorkflow",
            "shortNames": ["aiwf"]
        },
    }
)

# Export the cluster's kubeconfig and ECR repository URL.
# cluster.kubeconfig is an Output holding the generated kubeconfig.
kubeconfig = cluster.kubeconfig
ecr_repo_url = ecr_repo.repository_url

pulumi.export('kubeconfig', kubeconfig)
pulumi.export('ecr_repo_url', ecr_repo_url)
```

This program will set up the basic infrastructure for running AI operators on EKS:

- Creates the EKS cluster with a default node group, using `desired_capacity`, `min_size`, and `max_size` to control the number of EC2 instances.
- Sets up an ECR repository to host our AI workflows' container images.
- Defines a Custom Resource Definition (CRD) for the AI workflows we intend to manage.

The Pulumi EKS package abstracts much of the complexity involved in creating an EKS cluster, which is why we use `eks.Cluster` instead of manually wiring up each resource with the AWS and Kubernetes providers. Similarly, `aws.ecr.Repository` gives us a container image repository in ECR with a single resource. Lastly, the CRD registers the `AIWorkflow` resource type, which operators can then use to manage AI workloads within our EKS cluster.

After running this Pulumi program, you would typically implement your Kubernetes operator in your preferred programming language, using a client library such as the Python `kubernetes` package or the Go `client-go` library. The operator would watch for changes to `AIWorkflow` resources in the cluster and take actions, such as scheduling training jobs or serving models, based on the definitions in those resources.
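To make the "watch and act" idea more concrete, here is a minimal sketch of the decision step an operator might run for each `AIWorkflow` object. It is pure Python with no cluster access: it only computes which `batch/v1` Job should exist. The `image` and `command` spec fields, the `-job` naming scheme, and the function names are all assumptions for illustration.

```python
def desired_job(workflow: dict) -> dict:
    """Map an AIWorkflow resource to the batch/v1 Job manifest it should produce.

    Assumes hypothetical spec fields "image" (required) and "command" (optional).
    """
    meta = workflow["metadata"]
    spec = workflow.get("spec", {})

    container = {"name": "workflow", "image": spec["image"]}
    if "command" in spec:
        container["command"] = list(spec["command"])

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{meta['name']}-job",
            "namespace": meta.get("namespace", "default"),
            "labels": {"aiworkflow": meta["name"]},
        },
        "spec": {
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            }
        },
    }


def reconcile(workflow: dict, existing_job_names: set) -> list:
    """Return the Job manifests that still need to be created.

    If the Job for this workflow already exists, nothing is returned,
    which keeps repeated reconcile calls idempotent.
    """
    job = desired_job(workflow)
    if job["metadata"]["name"] in existing_job_names:
        return []
    return [job]
```

In a real operator, a function like `reconcile` would sit inside a watch loop provided by the client library, and the returned manifests would be submitted to the Kubernetes API. Keeping the decision logic free of API calls makes it easy to unit test.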

To deploy this Pulumi program, run `pulumi up`. Pulumi shows a preview of the changes, and you can then choose to apply them to provision the resources. After a successful deployment, you will see two outputs: the `kubeconfig`, which you use to connect to your cluster with `kubectl`, and the `ecr_repo_url`, which you use to push images to and pull images from your ECR repository.
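When scripting image pushes, it can help to split the `ecr_repo_url` output into its parts, since the registry host and region are needed for the Docker login step. The sketch below assumes the standard ECR URL shape `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>`; the function names and the example URL are hypothetical.

```python
def parse_ecr_repo_url(repo_url: str) -> dict:
    """Split an ECR repository URL into registry host, account, region, and repository.

    Expects the shape <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>.
    """
    registry, _, repository = repo_url.partition("/")
    parts = registry.split(".")
    if len(parts) < 6 or parts[1:3] != ["dkr", "ecr"] or not repository:
        raise ValueError(f"not an ECR repository URL: {repo_url!r}")
    return {
        "registry": registry,
        "account_id": parts[0],
        "region": parts[3],
        "repository": repository,
    }


def image_ref(repo_url: str, tag: str) -> str:
    """Build the full image reference to use with docker tag / docker push."""
    return f"{repo_url}:{tag}"


if __name__ == "__main__":
    # Placeholder value; in practice this comes from `pulumi stack output ecr_repo_url`.
    url = "123456789012.dkr.ecr.us-west-2.amazonaws.com/ai-workflows-repo"
    print(parse_ecr_repo_url(url))
    print(image_ref(url, "v1"))
```

The `region` and `registry` values are what you would pass to `aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <registry>` before pushing.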

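When you write the operators, make sure their pods can pull images from the ECR repository: the role assumed by your node group needs read access to ECR. If an operator calls AWS APIs such as EKS directly, the role it runs under needs those permissions as well. Inside the cluster, the operator's service account also needs Kubernetes RBAC permissions to watch and update `AIWorkflow` resources.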
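The AWS managed policy `AmazonEC2ContainerRegistryReadOnly` covers image pulls. If you prefer a narrower grant, a policy along the following lines could be attached to the node role instead. This is a sketch under assumptions: the `ecr_pull_policy` helper is hypothetical, and the ARN in the usage example is a placeholder.

```python
import json


def ecr_pull_policy(repository_arn: str) -> str:
    """Return an IAM policy document (as JSON) that allows pulling from one ECR repository."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Obtaining a registry auth token cannot be scoped to a single repository.
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken"],
                "Resource": "*",
            },
            {
                # Read-only image access, limited to the given repository.
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repository_arn,
            },
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    print(ecr_pull_policy("arn:aws:ecr:us-west-2:123456789012:repository/ai-workflows-repo"))
```

In a Pulumi program, the repository ARN is itself an output, so the JSON string would be built inside an `apply` on `ecr_repo.arn` before being handed to an IAM policy resource.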