1. Orchestrating BigQuery ML Workflows with GCP Dataform

    You can orchestrate BigQuery ML workflows with GCP's Dataform, which lets you write SQL workflows for BigQuery and manage them as code. Machine learning teams that work with large datasets commonly use Dataform to create and manage their data transformations in BigQuery.

    With Pulumi's infrastructure as code, you can define and deploy your BigQuery ML workflows along with other cloud infrastructure. Below is a Pulumi program in Python that demonstrates how to create a Dataform repository for orchestrating BigQuery ML workflows.

    In the following program, we use various resources:

    • pulumi_gcp package, which contains the GCP resources we want to use.
    • gcp.dataform.Repository is a resource for creating a Dataform repository that hosts our SQL workflows for BigQuery.
    • gcp.dataform.RepositoryReleaseConfig is a resource that tells Dataform which Git commitish (branch, tag, or commit) of the repository to compile for execution.
    • gcp.dataform.RepositoryWorkflowConfig is a resource for scheduling runs of the compiled project, letting us specify the cron schedule, time zone, and tags for our ML workflows.

    Let's look at the code:

    import pulumi
    import pulumi_gcp as gcp

    # Initialize your GCP project and region; replace these with your own identifiers.
    gcp_project = 'my-gcp-project'
    gcp_region = 'us-central1'  # Choose a region that makes sense for your scenario

    # Define a Dataform repository linked to the Git repository that contains
    # your Dataform SQL workflows.
    dataform_repository = gcp.dataform.Repository(
        "my-dataform-repository",
        project=gcp_project,
        region=gcp_region,
        git_remote_settings=gcp.dataform.RepositoryGitRemoteSettingsArgs(
            url="https://github.com/my-org/my-dataform-repo.git",  # Your Git repository URL
            # Secret Manager secret version that holds the Git authentication token
            authentication_token_secret_version="projects/my-gcp-project/secrets/my-secret/versions/latest",
            default_branch="master",  # Default branch of your Git repository
        ))

    # Define a release config that tells Dataform which Git commitish (branch,
    # tag, or commit SHA) of the repository to compile for execution.
    dataform_release_config = gcp.dataform.RepositoryReleaseConfig(
        "my-dataform-release-config",
        project=gcp_project,
        region=gcp_region,
        repository=dataform_repository.name,
        name="my-release-config",
        git_commitish="master")

    # Define a workflow config to orchestrate BigQuery ML workflows.
    # It schedules runs of the release's compiled project and selects which
    # Dataform actions to execute by tag.
    dataform_workflow_config = gcp.dataform.RepositoryWorkflowConfig(
        "my-dataform-workflow-config",
        project=gcp_project,
        region=gcp_region,
        repository=dataform_repository.name,
        name="my-workflow-config",
        release_config=dataform_release_config.id,
        invocation_config=gcp.dataform.RepositoryWorkflowConfigInvocationConfigArgs(
            included_tags=["my_ml_workflow"],  # Tags identifying the actions to run within this workflow
        ),
        cron_schedule="0 9 * * *",  # Cron schedule; here, every day at 9 AM
        time_zone="UTC")  # Time zone for the schedule

    # Export the Dataform repository URL to access it later
    pulumi.export('dataform_repository_url', dataform_repository.git_remote_settings.url)

    Here's what's going on in the code:

    • We create a Dataform repository by defining gcp.dataform.Repository. This repository is linked to a Git repository that contains your Dataform SQL scripts.
    • In git_remote_settings, set url to the Git repository you want to connect to Dataform. The authentication_token_secret_version references a secret version in GCP's Secret Manager that holds the Git authentication token (a sketch of managing this secret with Pulumi follows this list), and default_branch is the branch Dataform uses by default when running operations.
    • Next, we create a gcp.dataform.RepositoryReleaseConfig resource, which tells Dataform which Git commitish of the repository to compile, and a gcp.dataform.RepositoryWorkflowConfig resource that defines how the compiled project runs: the schedule in cron_schedule, the time zone for that schedule, and the tags for the Dataform actions you want to run as part of the workflow.
    • The name argument assigns an identifier to each config, while repository connects both to the previously defined Dataform repository.
    • The release_config argument points the workflow config at the release config's fully qualified resource name, so each scheduled invocation executes that release's compilation result.
    • Finally, we export the Dataform repository URL so that you can access the Git repository directly from your Pulumi stack's outputs.
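
    The authentication_token_secret_version above points at a secret version by a hard-coded path. If you would rather manage that secret from the same Pulumi program, a minimal sketch could look like the following; the config key git_token, the secret id my-secret, and the automatic replication policy are illustrative assumptions rather than part of the original program:

    import pulumi
    import pulumi_gcp as gcp

    # Hypothetical config value holding the Git personal access token.
    # Set it with: pulumi config set --secret git_token <TOKEN>
    config = pulumi.Config()
    git_token = config.require_secret("git_token")

    # Create the Secret Manager secret that will hold the token.
    git_token_secret = gcp.secretmanager.Secret(
        "git-token-secret",
        secret_id="my-secret",  # Matches the secret name the repository references
        replication=gcp.secretmanager.SecretReplicationArgs(
            # Automatic replication; older pulumi_gcp releases spell this automatic=True
            auto=gcp.secretmanager.SecretReplicationAutoArgs(),
        ))

    # Store the token as a new secret version.
    git_token_secret_version = gcp.secretmanager.SecretVersion(
        "git-token-secret-version",
        secret=git_token_secret.id,
        secret_data=git_token)

    # git_token_secret_version.name yields the fully qualified
    # projects/.../secrets/.../versions/... path, which can be passed to
    # authentication_token_secret_version instead of a hard-coded string.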

    This Pulumi program sets up a Dataform repository, a release config, and a workflow config that together orchestrate your BigQuery ML workflows. All the SQL scripts that define and transform your data for machine learning can be version-controlled in your Git repository and scheduled through Dataform to run in BigQuery.
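
    To deploy the program, run pulumi up; once the stack is up, pulumi stack output dataform_repository_url prints the exported repository URL, and Dataform invokes the workflow on the cron schedule configured above.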