EKS NodeGroups with Spot Instances for Cost-Effective ML Training

Question

Pulumi · Accepted Answer

In Amazon EKS (Elastic Kubernetes Service), a node group is a set of EC2 instances, managed together, that run your container workloads. Choosing Spot Instances for a node group lets you use spare EC2 capacity at a discount relative to On-Demand rates. The trade-off is that AWS can reclaim Spot capacity at short notice, so Spot is best suited to workloads that tolerate interruptions, such as machine learning training jobs that can resume from a saved state.
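To make "tolerating interruptions" concrete, here is a minimal sketch of a training loop that checkpoints periodically and resumes from the last checkpoint. It is an illustration under stated assumptions, not part of the Pulumi program: the `train` function, the JSON checkpoint format, and the `fail_at` parameter (which simulates a Spot interruption) are all hypothetical, and a real job would save model weights to durable storage rather than a local file.

```python
import json
import os


def train(total_steps, checkpoint_path, checkpoint_every=10, fail_at=None):
    """Run a toy training loop that resumes from the last checkpoint, if any."""
    # Start fresh unless a checkpoint from an earlier (interrupted) run exists.
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            # Hypothetical stand-in for a Spot interruption reclaiming the node.
            raise RuntimeError("simulated interruption")

        state["step"] += 1
        state["loss"] *= 0.99  # placeholder for a real optimisation step

        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            # Write to a temporary file and rename it, so an interruption
            # in the middle of a write cannot corrupt the last good checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)

    return state
```

With this shape, a job that is interrupted and rescheduled onto another node loses at most `checkpoint_every` steps of work, provided the checkpoint path points at storage that outlives the node.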

In this Pulumi program, we will create the following resources:
1. An EKS Cluster: This is the managed Kubernetes service provided by AWS.
2. A NodeGroup consisting of Spot Instances: This group of instances is part of our EKS cluster, which will run our machine learning workload.

The key settings for a NodeGroup that uses Spot Instances are:
- `capacity_type` set to `SPOT`, so that the NodeGroup requests Spot capacity instead of On-Demand capacity.
- `instance_types`, a list of the EC2 instance types the NodeGroup may launch. Listing several types improves the chance of obtaining Spot capacity.
- `scaling_config`, which sets the desired, minimum, and maximum number of nodes for your workload.

Below is a Pulumi program written in Python that provisions these resources as a starting point for cost-effective machine learning training workloads.

```python
import json

import pulumi
import pulumi_aws as aws
import pulumi_eks as eks

# IAM role assumed by the worker nodes. A managed node group needs an explicit node role.
node_role = aws.iam.Role('spot-ng-role',
    assume_role_policy=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'ec2.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }))

# Attach the AWS-managed policies that EKS worker nodes require.
for i, policy_arn in enumerate([
    'arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy',
    'arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy',
    'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly',
]):
    aws.iam.RolePolicyAttachment(f'spot-ng-policy-{i}', role=node_role.name, policy_arn=policy_arn)

# Create an EKS cluster without the default node group and register the node role.
# The original pinned version='1.21', which EKS no longer supports; with no version
# set, the provider chooses one. Pin `version` to a currently supported release if needed.
cluster = eks.Cluster('eks-cluster',
    skip_default_node_group=True,
    instance_roles=[node_role])

# Define a managed node group backed by Spot Instances.
# Spot capacity is requested with capacity_type='SPOT'; managed node groups
# take no maximum Spot price, so no spot_price is set.
spot_node_group = eks.ManagedNodeGroup('spot-ng',
    cluster=cluster,  # Referencing the cluster created above.
    node_role=node_role,
    capacity_type='SPOT',
    instance_types=['t3.medium'],  # List of instance types the group may launch.
    scaling_config=aws.eks.NodeGroupScalingConfigArgs(
        desired_size=2,  # Number of nodes at creation.
        min_size=1,      # Lower bound when scaling in.
        max_size=3,      # Upper bound when scaling out.
    ),
    labels={'workload-type': 'ml-training'},  # Labels for selecting these nodes.
    # Only pods that tolerate this taint are scheduled on these nodes.
    taints=[aws.eks.NodeGroupTaintArgs(
        key='special-resource',
        value='ml',
        effect='NO_SCHEDULE',
    )],
)

# Export the cluster name and the NodeGroup ARN (Amazon Resource Name).
pulumi.export('cluster_name', cluster.eks_cluster.name)
pulumi.export('node_group_arn', spot_node_group.node_group.arn)
```

In this program:
- We first import the necessary Pulumi modules.
- We create an EKS cluster using the `eks.Cluster` class.
- We then define a NodeGroup within the EKS cluster, specify the necessary parameters for spot instances, and define auto-scaling configurations for the NodeGroup.
- Finally, we export the cluster name and the NodeGroup ARN as outputs of our Pulumi program. These exports can be used to obtain information about your infrastructure after deployment, which can be helpful for integration with other systems or for reference purposes.
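Because the NodeGroup is tainted, a training pod is scheduled onto these nodes only if it carries a matching toleration, and the `workload-type` label lets the pod select them explicitly. The following sketch builds such a pod spec fragment as a plain Python dictionary. The helper name `ml_training_pod_spec`, the container name, and the image are hypothetical; the label and taint values are taken from the program above, and the Kubernetes-side spelling of the taint effect is `NoSchedule`.

```python
import json


def ml_training_pod_spec(image, command):
    """Build a pod spec fragment that targets the tainted Spot nodes."""
    return {
        # Restart the container if it fails, as is common for training Jobs.
        "restartPolicy": "OnFailure",
        # Select nodes carrying the label set on the NodeGroup.
        "nodeSelector": {"workload-type": "ml-training"},
        # Tolerate the taint set on the NodeGroup; without this the pod
        # is not scheduled onto these nodes.
        "tolerations": [{
            "key": "special-resource",
            "operator": "Equal",
            "value": "ml",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "trainer",  # hypothetical container name
            "image": image,
            "command": command,
        }],
    }


if __name__ == "__main__":
    # Hypothetical image and command, shown only to illustrate the output shape.
    spec = ml_training_pod_spec("example.com/ml/trainer:latest", ["python", "train.py"])
    print(json.dumps(spec, indent=2))
```

A dictionary like this could serve as the pod template `spec` of a Kubernetes Job, whether you apply it as a manifest or pass it to a Kubernetes resource defined in the same Pulumi program.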

Running `pulumi up` provisions these resources in your AWS account using your configured AWS credentials. You can later update them by editing the program and running `pulumi up` again, or remove them with `pulumi destroy`, so the whole lifecycle of the infrastructure stays under Pulumi's control.