1. BigQuery ML for Creating and Executing Machine Learning Models


    BigQuery ML enables you to create and execute machine learning models in Google BigQuery using SQL queries. This is valuable because models are trained and run directly against data already stored in BigQuery's data warehouse, so there is no need to move data around.
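    To make this concrete, here is a minimal sketch of the two SQL statements at the heart of BigQuery ML: CREATE MODEL to train, and ML.PREDICT to score. The dataset, table, and model names (mydata, training_data, new_rows, my_model) are placeholders for illustration only; they are not part of the program below:

    -- Train a model from a table that already lives in BigQuery
    CREATE OR REPLACE MODEL `mydata.my_model`
    OPTIONS(model_type='logistic_reg', input_label_cols=['label']) AS
    SELECT * FROM `mydata.training_data`;

    -- Apply the trained model to new rows
    SELECT * FROM ML.PREDICT(MODEL `mydata.my_model`, TABLE `mydata.new_rows`);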

    Here's a step-by-step explanation of what we will do in the Pulumi program:

    1. We'll create a BigQuery dataset, which is a container for tables, views, and models.
    2. Then, we'll set up a BigQuery table to store the data that will be used for training the model.
    3. We'll create a BigQuery ML model within the dataset. The model is defined with an example query to demonstrate how you train a model from existing data in your table; the actual SQL statement for creating a model will depend on your specific data and use case.
    4. Lastly, we'll create a job to execute a query. In this case, the query predicts outcomes using the ML model we just created. Think of this as applying the trained model to new or existing data to make predictions.

    Let's write the Pulumi program that accomplishes these steps:

    import pulumi
    import pulumi_gcp as gcp

    # Step 1: Create a BigQuery dataset where we'll store our table and model.
    dataset_id = "my_ml_dataset"

    bigquery_dataset = gcp.bigquery.Dataset("my_dataset",
        dataset_id=dataset_id,
        location="US",  # you can choose other locations
        description="This is my dataset for ML models.")

    # Step 2: Create a BigQuery table to store the data that will be used for training.
    # (You will need to load training rows into this table before the model job can succeed.)
    bigquery_table = gcp.bigquery.Table("my_table",
        dataset_id=bigquery_dataset.dataset_id,
        table_id="my_training_data",
        deletion_protection=False,
        schema="""[
            {"name": "feature1", "type": "FLOAT64", "mode": "NULLABLE"},
            {"name": "feature2", "type": "FLOAT64", "mode": "NULLABLE"},
            {"name": "label",    "type": "INT64",   "mode": "NULLABLE"}
        ]""")

    # Step 3: Create a BigQuery ML model in the dataset using SQL.
    #
    # NOTE: The `CREATE MODEL` statement should be defined based on your data and
    # requirements. Below is a generic example of a logistic regression for binary
    # classification:
    #   - `input_label_cols` names the column you want to predict, here 'label'.
    #   - `data_split_method='RANDOM'` determines how rows are split between the
    #     training and evaluation sets.
    # Please replace this query with one suitable for your case.
    model_query = f"""
    CREATE OR REPLACE MODEL `{dataset_id}.my_model`
    OPTIONS(model_type='logistic_reg',
            input_label_cols=['label'],
            data_split_method='RANDOM') AS
    SELECT * FROM `{dataset_id}.my_training_data`
    """

    bigquery_model_job = gcp.bigquery.Job("my_model_job",
        job_id=pulumi.Output.concat(pulumi.get_project(), "_my_model_job"),
        query=gcp.bigquery.JobQueryArgs(
            query=model_query,
            use_legacy_sql=False,  # BigQuery ML statements require standard SQL
        ),
        # The training table must exist before the CREATE MODEL job runs.
        opts=pulumi.ResourceOptions(depends_on=[bigquery_table]))

    # Step 4: Execute a query - make predictions using the trained ML model.
    #
    # This step typically happens after the model has been trained; it is chained
    # here for simplicity. The SELECT statement applies the trained model to rows
    # from a prediction table. Note that "my_prediction_table" must point to a
    # table prepared for predictions (i.e. it has the same feature columns as the
    # training data).
    predict_query = f"""
    SELECT feature1, feature2, predicted_label
    FROM ML.PREDICT(MODEL `{dataset_id}.my_model`,
                    TABLE `{dataset_id}.my_prediction_table`)
    """

    bigquery_predict_job = gcp.bigquery.Job("my_predict_job",
        job_id=pulumi.Output.concat(pulumi.get_project(), "_my_predict_job"),
        query=gcp.bigquery.JobQueryArgs(
            query=predict_query,
            use_legacy_sql=False,
        ),
        # Predictions only make sense once the model-training job has run.
        opts=pulumi.ResourceOptions(depends_on=[bigquery_model_job]))

    # Export the BigQuery dataset ID as an output of the Pulumi stack.
    pulumi.export('dataset_id', bigquery_dataset.dataset_id)

    In this program:

    • gcp.bigquery.Dataset is used to create a BigQuery dataset.
    • gcp.bigquery.Table creates a BigQuery table.
    • gcp.bigquery.Job represents a BigQuery job; we use one job to create the ML model and a second to run predictions. Its query argument takes a gcp.bigquery.JobQueryArgs block whose query field holds the SQL statement, with use_legacy_sql=False because BigQuery ML requires standard SQL.
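    If you want to inspect the training and prediction jobs after deployment (for example with bq show -j <job_id> from the Cloud SDK), you can optionally export their job IDs as stack outputs as well. This is a small optional addition, not something the program above requires:

    # Optional: expose the BigQuery job IDs so they can be looked up later
    pulumi.export('model_job_id', bigquery_model_job.job_id)
    pulumi.export('predict_job_id', bigquery_predict_job.job_id)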

    Replace model_query and predict_query with queries tailored to your specific dataset and machine learning objectives.
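    For instance, if your label were a continuous numeric value rather than a class, a hypothetical regression variant of model_query might look like the following (linear_reg instead of logistic_reg; the model and column names are placeholders):

    model_query = f"""
    CREATE OR REPLACE MODEL `{dataset_id}.my_regression_model`
    OPTIONS(model_type='linear_reg', input_label_cols=['label']) AS
    SELECT feature1, feature2, label FROM `{dataset_id}.my_training_data`
    """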

    After you write the Pulumi program, deploy it using the pulumi up command. This will instruct Pulumi to create the specified resources in the cloud. Note that you'll need to have Google Cloud credentials configured for Pulumi to interact with your GCP account.
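    Assuming you authenticate through the gcloud CLI and your GCP project is called my-gcp-project (a placeholder), a typical sequence looks like this:

    gcloud auth application-default login        # credentials Pulumi can use
    pulumi config set gcp:project my-gcp-project
    pulumi up                                    # preview and create the resources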