Scaling Kubernetes Training Jobs Based on Pod Metrics

Pulumi · Accepted Answer

To scale Kubernetes training jobs based on pod metrics, you can use the Horizontal Pod Autoscaler (HPA). An HPA automatically adjusts the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization or other selected metrics.

Here's what you need to set up HPA for a Kubernetes training job:

1. **Deployment**: This is where you define the training job. Each Pod in the Deployment will run a single instance of your training job.
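2. **Resource requests**: Ensure that each container in your Pods specifies resource requests for CPU and memory. The HPA computes CPU utilization as a percentage of the requested CPU, so without a CPU request it cannot make utilization-based scaling decisions.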
3. **HorizontalPodAutoscaler**: This resource targets a Deployment and specifies the scaling policies and metrics to be used.
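
To make the role of resource requests concrete, here is a minimal sketch of how CPU utilization relates to the request. The function name and the sample numbers are illustrative assumptions, not part of any Kubernetes or Pulumi API, and the real controller handles details (multiple containers, unready pods) that are left out here.

```python
def cpu_utilization_percent(usage_millicores: list[float], request_millicores: float) -> float:
    """Average CPU usage across pods, as a percentage of the per-pod CPU request."""
    if not usage_millicores:
        raise ValueError("at least one pod is required")
    if request_millicores <= 0:
        raise ValueError("a positive CPU request is required")
    average_usage = sum(usage_millicores) / len(usage_millicores)
    return 100.0 * average_usage / request_millicores


# Hypothetical sample: three pods, each requesting 500m of CPU.
utilization = cpu_utilization_percent([400, 350, 450], 500)
print(f"{utilization:.1f}%")
```

With a utilization target of 50%, a reading like this one would sit above the target, which is what prompts the HPA to add pods.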

### Program Demonstration

Let's create a Pulumi program that sets up a Deployment and an HPA to scale based on CPU usage.

```python
import pulumi
import pulumi_kubernetes as k8s

# Creating a Kubernetes namespace for the training job.
training_ns = k8s.core.v1.Namespace("training-namespace",
    metadata={"name": "training-jobs"})

# Define a Deployment for the training job. You would define the container image, resources, etc.
training_deployment = k8s.apps.v1.Deployment("training-deployment",
    metadata={
        "namespace": training_ns.metadata["name"],
    },
    spec={
        "selector": {
            "matchLabels": {"app": "training-job"}
        },
        "replicas": 1, # Start with 1 replica, the HPA will adjust this number based on CPU load.
        "template": {
            "metadata": {
                "labels": {"app": "training-job"}
            },
            "spec": {
                "containers": [{
                    "name": "training-container",
                    "image": "your-training-job-image:latest",  # Replace with your actual container image
                    "resources": {  # Define the resources required for a single instance of the training job.
                        "requests": {
                            "cpu": "500m",
                            "memory": "1Gi"
                        },
                        "limits": {
                            "cpu": "1",
                            "memory": "2Gi"
                        },
                    },
                }],
            },
        },
    })

# Create an HPA to scale the training job based on CPU usage.
# Note: autoscaling/v1 supports only a CPU utilization target; other metrics need autoscaling/v2.
training_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler("training-hpa",
    metadata={
        "namespace": training_ns.metadata["name"],
    },
    spec={
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": training_deployment.metadata["name"],
        },
        "minReplicas": 1,
        "maxReplicas": 10,  # Maximum number of pods that can be spawned.
        "targetCPUUtilizationPercentage": 50,  # Target average CPU utilization, as a percentage of each pod's CPU request.
    })

# Export the namespace and deployment names
pulumi.export("namespace", training_ns.metadata["name"])
pulumi.export("deployment_name", training_deployment.metadata["name"])

```

This Pulumi program performs the following steps:

1. Creates a new Kubernetes namespace named "training-jobs" for organizational purposes.
2. Defines a deployment with a single pod initially, which runs your training job image. Replace `your-training-job-image:latest` with your actual training job container image. The resource request defines how much CPU and memory each instance of the training job should be guaranteed.
3. Sets up the Horizontal Pod Autoscaler to monitor the CPU utilization of the pods running the training job. The HPA will scale the number of pods between 1 and 10 to keep average CPU utilization near 50% of the requested CPU (about 250m per pod, given the 500m request).
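
The scaling decision in step 3 can be sketched with the replica formula described in the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The function below is a simplified illustration under stated assumptions: the name `desired_replicas` is made up for this example, the 0.1 tolerance mirrors the controller's documented default, and stabilization windows and pod readiness are ignored.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 50.0,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation, clamped to the configured bounds."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band around the target, the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))   # load well above target: scale up
print(desired_replicas(4, 20))   # load well below target: scale down
print(desired_replicas(8, 100))  # scale-up request capped at max_replicas
print(desired_replicas(3, 52))   # within tolerance: no change
```

The defaults match the values used in the program above (target 50%, between 1 and 10 replicas).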

This is a basic example to get you started. CPU-based scaling relies on the Kubernetes Metrics Server being installed in the cluster. In more complex scenarios, you might want to scale based on custom metrics, such as the number of jobs in a queue. This requires the `autoscaling/v2` API and additional components that serve the custom or external metrics APIs, such as Prometheus together with a metrics adapter.
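
As a rough sketch of what queue-based scaling could look like, the helper below builds an `autoscaling/v2` HPA spec that uses an `External` metric. The helper name `queue_hpa_spec` and the metric name `training_queue_length` are hypothetical; the metric only exists if an adapter in your cluster exposes it through the external metrics API.

```python
def queue_hpa_spec(deployment_name: str,
                   metric_name: str,
                   jobs_per_pod: int,
                   min_replicas: int = 1,
                   max_replicas: int = 10) -> dict:
    """Build an autoscaling/v2 HPA spec that targets an average queue length per pod."""
    return {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": deployment_name,
        },
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
        "metrics": [{
            "type": "External",
            "external": {
                # Hypothetical metric; it must be served by a metrics adapter.
                "metric": {"name": metric_name},
                # AverageValue divides the metric by the current number of pods.
                "target": {"type": "AverageValue", "averageValue": str(jobs_per_pod)},
            },
        }],
    }


# Assumed values for illustration: aim for roughly 5 queued jobs per pod.
spec = queue_hpa_spec("training-deployment", "training_queue_length", 5)
print(spec["metrics"][0]["external"]["target"])
```

In a Pulumi program, a dict like this could be passed as the `spec` of `k8s.autoscaling.v2.HorizontalPodAutoscaler` in place of the `autoscaling/v1` resource, with `deployment_name` taken from the Deployment's metadata.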

Remember to deploy this Pulumi program using the `pulumi up` command after setting your Kubernetes context to the desired cluster where you want to run the training jobs.