1. Spot Instances for Interruptible AI Model Training Jobs


    The goal is to leverage Spot Instances for cost-effective AI model training jobs on the cloud. Spot Instances are unused compute capacity in the cloud available at steep discounts compared to the standard price. The risk with these instances is that they can be reclaimed on very short notice (on Google Cloud, typically about 30 seconds) if the cloud provider needs the capacity back. However, for workloads like AI model training, which can often be checkpointed and restarted, using Spot Instances can be very cost-effective.
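    To see why the trade-off tends to favor Spot Instances for checkpointed workloads, here is a back-of-the-envelope cost sketch. All prices, discount rates, and interruption counts below are illustrative assumptions, not actual Google Cloud pricing:

```python
# Illustrative cost comparison for spot vs. on-demand training.
# Every number here is an assumption for the sake of example,
# not actual cloud pricing.

def expected_spot_cost(on_demand_rate, discount, hours,
                       restart_overhead_hours, expected_interruptions):
    """Expected cost of a checkpointed job on spot capacity.

    Each interruption re-runs roughly `restart_overhead_hours` of work
    (the time elapsed since the last checkpoint).
    """
    spot_rate = on_demand_rate * (1 - discount)
    total_hours = hours + expected_interruptions * restart_overhead_hours
    return spot_rate * total_hours

on_demand = 2.00 * 10              # assumed $2.00/hour for 10 hours = $20.00
spot = expected_spot_cost(
    on_demand_rate=2.00,           # assumed on-demand price per hour
    discount=0.70,                 # assumed 70% spot discount
    hours=10,                      # useful training time
    restart_overhead_hours=0.25,   # work lost per interruption
    expected_interruptions=4,      # assumed number of preemptions
)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```

    Even with several interruptions and re-run overhead, the spot job comes out well under half the on-demand cost in this sketch; the picture only reverses if interruptions are frequent and checkpoints are far apart.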

    In this case, we can use Pulumi to orchestrate the creation and management of such instances and handle the AI model training jobs. For this example, I will demonstrate how to set up a machine learning model training job using Google Cloud Platform's AI and machine learning services with Pulumi.

    The resources we'll use are:

    1. google-native.ml/v1.Job: This resource will define the parameters for our AI model training job, specifically on Google Cloud ML Engine. It will allow us to specify the training code, input data, and the machine type, which in this case will be a Spot Instance.

    2. google-native.batch/v1.Job: This resource batches work so we can make effective use of Spot Instances. It is particularly useful if you have multiple training jobs or a single job that can be parallelized.

    3. google-native.tpu/v2alpha1.QueuedResource: This resource queues requests for TPU capacity. TPUs are Google's specialized hardware accelerators designed for machine learning tasks. It may be necessary if your training job requires TPUs and you want to use Spot capacity for those as well.
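    The batch resource above pays off when a job can be split into independent shards, one per batch task. A minimal, framework-agnostic sketch of how input files might be divided across tasks (the bucket path and file names are placeholders):

```python
def shard(items, num_tasks, task_index):
    """Return the slice of `items` assigned to one batch task.

    Uses round-robin assignment so shard sizes differ by at most one,
    and every item is assigned to exactly one task.
    """
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index out of range")
    return items[task_index::num_tasks]

# Placeholder input files; a real job would list objects in its bucket.
files = [f"gs://my-bucket/data/part-{i:03d}" for i in range(10)]

# Task 0 of 4 processes parts 0, 4, 8; task 1 processes 1, 5, 9; and so on.
print(shard(files, num_tasks=4, task_index=0))
```

    Each batch task would read only its own shard, so a preempted task can be retried without redoing the others' work.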

    Below is an example of how to orchestrate such a training job with Pulumi in Python:

    import pulumi
    import pulumi_google_native as google_native

    # Define the training input for the machine learning model job.
    training_input = google_native.ml.v1.JobTrainingInputArgs(
        python_module='trainer.task',  # Your training module
        args=['--training_dataset=gs://my-bucket/data'],  # Arguments for your training job
        region='us-central1',  # The region to run the training job
        job_dir='gs://my-bucket/training-output',  # The Cloud Storage location to store the training results
        scale_tier='BASIC_TPU',  # Use this tier to select TPU in a low cost setting
    )

    # Create a Google Cloud ML Engine training job.
    training_job = google_native.ml.v1.Job(
        "training-job",
        job_id="my_training_job",
        project="my-project-id",  # Replace with your project ID
        training_input=training_input,
        labels={
            "type": "training",
            "team": "ai",
        },
    )

    # Export the training job name for reference.
    pulumi.export("training_job_name", training_job.job_id)

    In the code above, we're importing the necessary Pulumi modules for Google Cloud and defining a training job input configuration. This defines things like the Python module to run for training, any arguments it requires, the Google Cloud region, and the bucket for storing job outputs.

    We specify a scale_tier of 'BASIC_TPU', which runs the job on a single worker with a TPU at the lowest-cost TPU tier. Note that the scale tier selects the machine configuration rather than guaranteeing Spot (preemptible) capacity, so check the current documentation for how interruptible capacity is requested in your setup. Also note that Google Cloud ML Engine has been migrated to Vertex AI; while these resources are still available, the precise configuration parameters and service names may differ in the most up-to-date Google Cloud services.

    Declaring the google_native.ml.v1.Job resource submits the job when the stack is deployed (for example, with pulumi up), and we export the job name for reference. You can use the exported job ID to monitor and manage the job post-deployment.

    This setup will start a training job that uses Spot Instances, benefiting from the lower cost of interruptible instances without the need for upfront reservation costs. If you're running this in a production environment, be sure to implement checkpointing in your training code, so you can handle preemptions gracefully.
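    To make the checkpointing advice concrete, here is a minimal local sketch of a checkpoint/resume loop. The path, step logic, and state are placeholders; a real training job would checkpoint model weights to Cloud Storage rather than a local JSON file:

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # placeholder; use a GCS path in a real job

def save_checkpoint(step, state):
    # Write to a temp file and rename atomically, so a preemption
    # mid-write cannot leave a corrupt checkpoint behind.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def load_checkpoint():
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # fresh start

def train(total_steps=100, checkpoint_every=10):
    # Resume from the last checkpoint if a previous run was preempted.
    step, state = load_checkpoint()
    while step < total_steps:
        state = {"loss": 1.0 / (step + 1)}  # placeholder for a real training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(step, state)
    return step, state
```

    When the Spot Instance is preempted and the job is restarted, the loop resumes from the last saved step instead of step zero, so at most checkpoint_every steps of work are lost per interruption.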