1. Serverless Async Workflow for AI Jobs on GCP


    To construct a serverless asynchronous workflow for AI jobs on Google Cloud Platform (GCP), we will leverage Cloud Workflows and AI Platform Jobs. The Cloud Workflows service orchestrates and automates Google Cloud tasks and services, creating a sequence of steps that define a workflow. Meanwhile, AI Platform Jobs allow us to run various machine learning jobs, such as training models, on fully managed infrastructure.

    First, we'll define a workflow in Cloud Workflows that specifies the steps needed to trigger an AI Platform job, such as a training job. These steps include setting the parameters for the job, like the training application and the input and output data locations.

    Next, we'll set up an AI Platform Job for executing our machine learning tasks. This might involve training a model with specific machine learning frameworks like TensorFlow or PyTorch, using custom code or pre-built containers provided through the AI Platform.
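    Since the workflow below points its training step at the Python module `trainer.task`, it helps to see what that entry point might look like. The following is a minimal, hypothetical sketch of `trainer/task.py`: the `--dataset` and `--model` flags match the arguments the workflow passes to the job, the defaults are placeholders, and the actual training logic is elided.

```python
# trainer/task.py -- hypothetical sketch of the training entry point referenced
# by pythonModule: "trainer.task". AI Platform runs it as `python -m trainer.task`
# and also passes --job-dir when jobDir is set on the job.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="AI Platform training task")
    # Defaults are placeholders; in practice these flags are supplied by the job.
    parser.add_argument("--dataset", default="your-dataset-id", help="Dataset ID to train on")
    parser.add_argument("--model", default="your-model-id", help="Model ID to produce")
    parser.add_argument("--job-dir", default=None, help="Cloud Storage path for job output")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: load args.dataset, train, write artifacts to args.job_dir.
    print(f"Training model {args.model} on dataset {args.dataset}")
    return args


if __name__ == "__main__":
    main()
```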

    In this Pulumi program example, we'll create a Workflow resource whose definition includes a step that submits the ML training job, with all of its details, to AI Platform.

    Let's begin by defining our serverless workflow:

```python
import pulumi
import pulumi_gcp as gcp

# Set up a Workflows workflow.
# The workflow orchestrates the execution of the serverless async AI job.
ai_workflow = gcp.workflows.Workflow(
    "aiWorkflow",
    description="An asynchronous serverless workflow for AI jobs on GCP",
    region="us-central1",  # Specify the appropriate region for your workflow
    source_contents="""
- initialize:
    assign:
      - project: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
      - location: "us-central1"  # Same as the region specified for the Workflow resource
      - dataset: "your-dataset-id"  # Replace with your dataset ID
      - model: "your-model-id"  # Replace with your model ID
- train_model:
    call: googleapis.ml.v1.projects.jobs.create
    args:
      parent: ${"projects/" + project}  # AI Platform training jobs are created under the project
      body:
        jobId: "your_ai_job_name"  # Replace with your own job identifier
        trainingInput:
          scaleTier: CUSTOM  # masterType may only be set with the CUSTOM tier
          masterType: "n1-standard-4"  # Specify the machine type
          region: ${location}
          args:
            - "--dataset"
            - ${dataset}
            - "--model"
            - ${model}
          runtimeVersion: "1.15"  # Specify a supported runtime version
          pythonVersion: "3.7"  # Specify the Python version
          jobDir: "gs://your-bucket-name/ai-job-dir"  # Replace with your Cloud Storage bucket
          packageUris:
            - "gs://your-bucket-name/packages/trainer.tar.gz"  # Replace with the path to your training package
          pythonModule: "trainer.task"  # Replace with your Python training module
        labels:
          type: "ai_job"
""",
)

# Export the workflow ID. The AI Platform training job itself is created at
# run time, each time the workflow executes its train_model step; it is not
# provisioned as a separate Pulumi resource.
pulumi.export("workflow_id", ai_workflow.id)
```

    In the program above, there are two key pieces:

    1. gcp.workflows.Workflow: This resource sets up the Cloud Workflow that defines our serverless process. The initialize step assigns variables for the project, location, dataset, and model.

    2. The train_model step: This step calls the Workflows connector for AI Platform Training (googleapis.ml.v1.projects.jobs.create) to submit the actual training job. It configures the scale tier and machine type, the runtime and Python versions, the Cloud Storage job directory, and the package URI and Python module containing the training application. Note that the job is created each time the workflow runs, not when the Pulumi program is deployed; the classic pulumi_gcp provider does not expose AI Platform training jobs as a standalone resource.
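    Because projects.jobs.create returns as soon as the job is queued, checking its progress is a separate, asynchronous step. Below is a sketch of polling the job state from Python using the google-api-python-client library (assumed to be installed, with application default credentials available); the job name format projects/{project}/jobs/{job_id} comes from the AI Platform Training API.

```python
# Poll the state of a submitted AI Platform training job.
def job_name(project, job_id):
    """Fully qualified AI Platform job name used by the jobs.get API."""
    return f"projects/{project}/jobs/{job_id}"


def get_job_state(project, job_id):
    # Imported lazily so job_name() works without the client library installed.
    from googleapiclient import discovery

    ml = discovery.build("ml", "v1")
    job = ml.projects().jobs().get(name=job_name(project, job_id)).execute()
    return job["state"]  # e.g. QUEUED, RUNNING, SUCCEEDED, FAILED
```

    Alternatively, the same polling can be done inside the workflow itself with the googleapis.ml.v1.projects.jobs.get connector, keeping the whole lifecycle serverless.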

    Before running this Pulumi program, you should have the training application packaged as trainer.tar.gz and uploaded to a Cloud Storage bucket, with the Python module trainer.task ready for execution. Make sure to replace the placeholders (your-dataset-id, your-model-id, your-bucket-name, your_ai_job_name) with actual values relevant to your GCP project.
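    The trainer.tar.gz archive is typically produced as a setuptools sdist (python setup.py sdist), but any gzipped tarball containing the package directory works. Here is a minimal stand-in using only the standard library; the paths are placeholders.

```python
# Package the trainer module into trainer.tar.gz so it can be uploaded to
# gs://your-bucket-name/packages/ (e.g. with `gsutil cp`).
import tarfile
from pathlib import Path


def package_trainer(source_dir, output_path):
    """Create a gzipped tarball containing the trainer package directory."""
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # Store files under a top-level directory named after the package,
        # so the archive unpacks to trainer/__init__.py, trainer/task.py, etc.
        tar.add(source, arcname=source.name)
    return output_path
```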

    Please note that before launching this Pulumi program, you need to ensure your Pulumi CLI is configured with the correct GCP credentials, and that the service account associated with the workflow has permission to create AI Platform jobs (for example, via the roles/ml.developer role) in addition to the permissions needed to manage Cloud Workflows resources.
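    Once deployed, each execution of the workflow submits a new training job. You can start an execution from the console, with gcloud, or programmatically; the sketch below assumes the google-cloud-workflows client library (pip install google-cloud-workflows) and uses a hypothetical project name.

```python
# Trigger a run of the deployed workflow; its train_model step submits the job.
def workflow_parent(project, location, workflow):
    """Build the fully qualified workflow name that executions are created under."""
    return f"projects/{project}/locations/{location}/workflows/{workflow}"


def run_workflow(project, location, workflow):
    # Imported lazily so workflow_parent() works without the client installed.
    from google.cloud.workflows import executions_v1

    client = executions_v1.ExecutionsClient()
    execution = client.create_execution(
        parent=workflow_parent(project, location, workflow)
    )
    return execution.name


# Example (requires the deployed workflow and valid credentials):
# run_workflow("my-gcp-project", "us-central1", "aiWorkflow")
```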