BigQuery ML for Large-scale Machine Learning Models
BigQuery ML enables users to create and execute machine learning models in Google BigQuery using SQL queries. It is designed for large-scale ML workloads and lets analysts build models on large datasets without moving data out of the warehouse.
Using Pulumi with the Google Cloud provider, you can provision BigQuery datasets and tables, and then use BigQuery ML SQL statements for model creation and predictions. The main resources involved in setting up a BigQuery ML solution using Pulumi are:

- `Dataset`: Represents a BigQuery dataset, which organizes and controls access to your tables and views.
- `Table`: Represents a BigQuery table, which holds your data in the form of rows and columns.
- `Job`: Represents a BigQuery job, such as a query or a model creation job.
In order to create a machine learning model in BigQuery using Pulumi, you'll:

- Define a `Dataset` where the ML model will be stored.
- Define a `Table` with the necessary schema for training data.
- Use a `Job` to execute a SQL query that creates the ML model using the `CREATE MODEL` statement.
Below is a program that sets up the necessary infrastructure and creates a simple linear regression model in BigQuery.
First, make sure you have the `pulumi_gcp` Python package installed:

```
pip install pulumi_gcp
```

Now, here is a Pulumi Python program that will create the Dataset, Table, and ML model in BigQuery:
```python
import pulumi
import pulumi_gcp as gcp

# Create a new dataset to house the training data and the BigQuery ML model
dataset = gcp.bigquery.Dataset("my_dataset",
    dataset_id="my_ml_dataset",
    location="US")

# Define a schema for the table where training data will be stored
schema = """[
  {"name": "x", "type": "FLOAT64"},
  {"name": "y", "type": "FLOAT64"}
]"""

# Create a table for the training data
table = gcp.bigquery.Table("my_table",
    dataset_id=dataset.dataset_id,
    table_id="my_training_data",
    schema=schema)

# SQL query to create a BigQuery ML model. dataset_id and table_id are
# Pulumi Outputs, so the query string must be assembled with .apply()
# rather than an f-string over the raw resources.
create_model_query = pulumi.Output.all(dataset.dataset_id, table.table_id).apply(
    lambda args: f"""
    CREATE OR REPLACE MODEL `{args[0]}.my_linear_model`
    OPTIONS(model_type='linear_reg', input_label_cols=['y']) AS
    SELECT * FROM `{args[0]}.{args[1]}`
    """)

# Run the SQL query to create the ML model. CREATE MODEL is a DDL
# statement, so it requires standard (non-legacy) SQL and must not set a
# destination table or write disposition.
job = gcp.bigquery.Job("my_ml_model_creation_job",
    job_id="my_ml_model_creation_job",
    query=gcp.bigquery.JobQueryArgs(
        query=create_model_query,
        use_legacy_sql=False,
    ))

pulumi.export('dataset_id', dataset.dataset_id)
pulumi.export('table_id', table.table_id)
pulumi.export('model_creation_job_id', job.job_id)
```
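Note that the table is provisioned empty, and `CREATE MODEL` needs training rows to succeed. As one hedged illustration (the dataset and table IDs match the program above; the sample values are made up), you could seed the table with a plain `INSERT` statement, run in the console or as another `gcp.bigquery.Job`:

```python
# Build an INSERT statement that seeds the training table with a few
# points along the line y = 2x + 1. The dataset and table IDs here match
# the program above; adjust them if you changed those IDs.
def seed_query(dataset_id: str, table_id: str) -> str:
    rows = [(float(x), 2.0 * x + 1.0) for x in range(5)]
    values = ", ".join(f"({x}, {y})" for x, y in rows)
    return f"INSERT INTO `{dataset_id}.{table_id}` (x, y) VALUES {values}"

print(seed_query("my_ml_dataset", "my_training_data"))
```

Because the statement is ordinary DML, it can be executed anywhere BigQuery accepts standard SQL.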
This program will create a new dataset and table in BigQuery, and then initiate the creation of a linear regression model called `my_linear_model` using the `CREATE MODEL` statement in SQL.

- The `Dataset` resource is where the ML model will be stored. It's given an ID of `my_ml_dataset`; you can change this ID to something more relevant to your context.
- The `Table` resource is where the training data will be stored. It uses the schema defined in the `schema` variable, which currently has two columns, `x` and `y`, both of which are floating-point numbers.
- The `Job` resource runs the SQL statement to create the ML model. The `CREATE MODEL` statement is standard SQL for BigQuery ML and specifies the type of model (`linear_reg` for linear regression) and the data to use for training (all columns from the training data table).
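Predictions against the trained model are also plain SQL, via BigQuery ML's `ML.PREDICT` function. A minimal sketch as a plain-string helper (the dataset and model names match the program above; the input row is hypothetical, and the prediction column BigQuery returns is named after the model's label column, e.g. `predicted_y`):

```python
# Build an ML.PREDICT query against the trained model. ML.PREDICT takes
# the model plus a subquery supplying the feature columns; it returns the
# input rows along with the model's prediction column.
def predict_query(dataset_id: str, model_id: str) -> str:
    return (
        f"SELECT * FROM ML.PREDICT("
        f"MODEL `{dataset_id}.{model_id}`, "
        f"(SELECT 10.0 AS x))"
    )

print(predict_query("my_ml_dataset", "my_linear_model"))
```

This query could be run interactively in the console or wired up as another `gcp.bigquery.Job` in the same Pulumi program.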
After running this program with Pulumi, you can go to the Google Cloud Console to see your new dataset, training data table, and the linear regression model.
Remember, Pulumi manages the state and dependencies of your infrastructure, making it easy to repeat and update these deployment processes in a safe and predictable manner.
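To sanity-check the trained model, BigQuery ML also offers `ML.EVALUATE`, which returns regression metrics such as mean squared error and R². A short sketch under the same naming assumptions as the program above:

```python
# Build an ML.EVALUATE query for the trained model. When no evaluation
# table is supplied, BigQuery ML evaluates against the training data.
def evaluate_query(dataset_id: str, model_id: str) -> str:
    return f"SELECT * FROM ML.EVALUATE(MODEL `{dataset_id}.{model_id}`)"

print(evaluate_query("my_ml_dataset", "my_linear_model"))
```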