Service Account Managed Training Jobs on GCP AI Platform
In this guide, I'll demonstrate how to use Pulumi to create and manage a service account and to set up a training job on Google Cloud's AI Platform. The goal of this Pulumi program is to automate the creation of these resources in a repeatable, controlled manner, following infrastructure-as-code principles.
Below are the steps we'll follow:
- Create a Service Account: Service accounts are special Google Cloud accounts that applications, virtual machines, and other services use to interact with the rest of the Google Cloud platform. We'll create a service account that the training job will use to access the necessary GCP services securely.
- Assign Necessary Permissions: We'll bind the necessary roles to the service account so that it has the permissions to manage and execute training jobs on the AI Platform.
- Create a Training Job: We'll define a training job on the AI Platform, using the previously created service account for authentication, and specify its training inputs such as the region, job configuration, and the machine learning model parameters.
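To make the second step more concrete: an IAM role binding pairs a role with a list of members, and a service account is named as a member by its email address prefixed with serviceAccount:. The following is a minimal sketch in plain Python, independent of Pulumi, of how that member string and binding are put together. The helper names and the sample account and project IDs are illustrative and not part of any library.

```python
def service_account_email(account_id: str, project: str) -> str:
    """Return the email address of a user-managed service account."""
    return f"{account_id}@{project}.iam.gserviceaccount.com"


def service_account_member(account_id: str, project: str) -> str:
    """Return the IAM member string that identifies the service account."""
    return f"serviceAccount:{service_account_email(account_id, project)}"


def role_binding(role: str, account_id: str, project: str) -> dict:
    """Build a single IAM binding that grants `role` to the service account."""
    return {"role": role, "members": [service_account_member(account_id, project)]}


# Illustrative values only; substitute your own account ID and project.
binding = role_binding("roles/ml.developer", "trainer-sa", "my-project")
print(binding)
```

The member is derived from the account ID and the project ID, not from the full resource name of the service account.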
Let's start writing the Pulumi program in Python:
    import pulumi
    import pulumi_google_native.cloudresourcemanager.v3 as google_crm
    import pulumi_google_native.iam.v1 as google_iam
    import pulumi_google_native.ml.v1 as google_ml

    # Replace these values with your desired settings
    project = 'your-gcp-project-id'
    region = 'us-central1'
    service_account_id = 'your-unique-service-account-id'
    job_id = 'your_unique_job_id'  # job IDs may contain letters, digits, and underscores only
    bucket = 'your-staging-bucket'  # placeholder: bucket holding the trainer package and job output
    training_image = 'gcr.io/cloud-ml-public/training/pytorch-gpu.1-4'  # not used below; relevant only for custom-container jobs
    python_module = 'trainer.task'
    scale_tier = 'BASIC_GPU'

    # Create a Google Cloud service account
    service_account = google_iam.ServiceAccount('service-account',
        project=project,
        account_id=service_account_id)

    # Grant the service account a project-level role for AI Platform jobs.
    # Assumption: ProjectIamMember and the form of its `name` argument reflect my
    # understanding of the google-native provider; verify both against the provider
    # reference for your installed version.
    ml_role = google_crm.ProjectIamMember('service-account-ml-role',
        name=project,
        role='roles/ml.developer',  # This role allows the service account to access AI Platform resources
        member=service_account.email.apply(lambda email: f"serviceAccount:{email}"))

    # Define the input parameters for the ML training job
    training_input = {
        'args': ['--some_arg', 'value'],
        'region': region,
        'jobDir': f"gs://{bucket}/training",
        'scaleTier': scale_tier,  # BASIC_GPU is a scale tier, not a machine type
        'pythonModule': python_module,
        'packageUris': [f"gs://{bucket}/trainer.tar.gz"],
        'serviceAccount': service_account.email,  # run the job as the service account
    }

    # Create a training job on AI Platform once the role has been granted
    training_job = google_ml.Job('training-job',
        project=project,
        job_id=job_id,
        training_input=training_input,
        labels={'type': 'training_job'},
        opts=pulumi.ResourceOptions(depends_on=[ml_role]))

    # Export the service account email and training job ID as stack outputs
    pulumi.export('service_account_email', service_account.email)
    pulumi.export('training_job_id', training_job.job_id)
In the code above, we:
- Defined our project, region, service account ID, and job ID.
- Created a service account with the ServiceAccount resource.
- Assigned the "roles/ml.developer" role to our service account to grant it the permissions needed for AI Platform operations.
- Defined a training_input dictionary with the necessary parameters for a job.
- Created a training job on the AI Platform with the Job resource that uses the service account.
The program exports the email address of the service account and the ID of the training job, which can be useful if you need to reference these resources later on.
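One way to reference these exports later is through the Pulumi CLI, which can print a stack's outputs as JSON with the command pulumi stack output --json. The sketch below wraps that command in Python. It assumes the pulumi CLI is on the PATH and that the script runs from the project directory with the right stack selected. The function names and the sample values are illustrative.

```python
import json
import subprocess


def parse_stack_outputs(raw_json: str) -> dict:
    """Parse the JSON object printed by `pulumi stack output --json`."""
    outputs = json.loads(raw_json)
    if not isinstance(outputs, dict):
        raise ValueError("expected a JSON object of stack outputs")
    return outputs


def read_stack_outputs() -> dict:
    """Run the Pulumi CLI in the current directory and return the stack outputs."""
    result = subprocess.run(
        ["pulumi", "stack", "output", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_stack_outputs(result.stdout)


# Illustrative sample only; real values come from your own stack.
sample = (
    '{"service_account_email": "trainer-sa@my-project.iam.gserviceaccount.com",'
    ' "training_job_id": "my_job_001"}'
)
outputs = parse_stack_outputs(sample)
print(outputs["training_job_id"])
```

Keeping the parsing separate from the CLI call makes the helper easy to check without a deployed stack.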
Make sure to replace the placeholder values with your specific settings, such as project ID, service account name, job buckets, etc.
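Because the placeholders are replaced by hand, a quick check before deploying can catch malformed values early. The sketch below encodes two naming rules as I understand them from Google Cloud's documentation: a service account ID is 6 to 30 characters of lowercase letters, digits, and hyphens, starts with a letter, and does not end with a hyphen; an AI Platform job ID starts with a letter and contains only letters, digits, and underscores, up to 128 characters. Treat both patterns as assumptions to verify. The function names are illustrative.

```python
import re

# Assumed rule: 6-30 chars, lowercase letters, digits, hyphens;
# starts with a letter and does not end with a hyphen.
SERVICE_ACCOUNT_ID = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")

# Assumed rule: starts with a letter; letters, digits, underscores only;
# at most 128 characters.
JOB_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_service_account_id(value: str) -> bool:
    """Check a candidate service account ID against the assumed pattern."""
    return SERVICE_ACCOUNT_ID.fullmatch(value) is not None


def is_valid_job_id(value: str) -> bool:
    """Check a candidate training job ID against the assumed pattern."""
    return JOB_ID.fullmatch(value) is not None


for candidate in ("my_job_001", "my-job-001"):
    print(candidate, is_valid_job_id(candidate))
```

Note that under these rules a job ID containing hyphens is rejected, while hyphens are fine in a service account ID.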
To run this Pulumi program, save it to a file named __main__.py, ensure you have Pulumi installed and configured for GCP, and then execute the command pulumi up in a terminal in the same directory as your program. This will start provisioning the resources defined in your code.