Automated Machine Learning Workflows with GCP Dataproc

Question

Pulumi · Accepted Answer

To create automated machine learning workflows with Google Cloud Platform's Dataproc, you can use Pulumi to orchestrate the necessary resources, including clusters, jobs, and workflow templates. Dataproc provides a managed Hadoop MapReduce, Spark, Pig, and Hive service. It simplifies the management of these open source data processing technologies and enables you to create machine learning workflows on managed clusters rapidly.

The primary resources you will need to create within Pulumi are:

1. **Dataproc Cluster**: This is the managed group of virtual machines that will run the data processing jobs. You can customize configurations such as the number of machines, machine type, and disk sizes.

2. **Dataproc Job**: Represents a job that runs on a Dataproc cluster. You can submit Spark, Hadoop, Hive, Pig, and PySpark job types.

3. **Dataproc WorkflowTemplate**: It enables you to define a graph of jobs with information on where to run those jobs.

Before starting the program, you should have the GCP SDK installed and configured.

Below is a Pulumi program in Python that defines a simple machine learning workflow on GCP's Dataproc service:

```python
import pulumi
import pulumi_gcp as gcp

# A Dataproc cluster is created to run our machine learning jobs.
# Here, we define a simple cluster with some of the default settings.
# Depending on the machine learning job requirements, you can customize
# the cluster configuration. For example, choosing the machine type,
# disk size, number of worker nodes, etc.
dataproc_cluster = gcp.dataproc.Cluster("ml-cluster",
    region="us-central1",
    cluster_config={
        "master_config": {
            "num_instances": 1,
            "machine_type": "n1-standard-1",
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type": "n1-standard-1",
        },
    })

# A Dataproc job definition, linked to the cluster by the cluster name.
# This particular example shows a PySpark job, though the type could be
# Spark, Hadoop, or any other supported by Dataproc.
dataproc_job = gcp.dataproc.Job("ml-job",
    region="us-central1",
    placement={"cluster_name": dataproc_cluster.name},
    pyspark_config={
        "main_python_file_uri": "gs://my-bucket/my-pyspark-job.py",
        "args": ["arg1", "arg2"],
        # You can also specify Python file URIs, jar file URIs, and more.
    })

# A Dataproc workflow template allows for the creation of a set of data
# processing jobs (such as Hadoop MapReduce, Spark, Pig jobs) that are
# managed as a single entity. Below you see how you can define one with
# a single step that references the previously defined job.
dataproc_workflow_template = gcp.dataproc.WorkflowTemplate("ml-workflow-template",
    region="us-central1",
    jobs=[{
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/my-pyspark-job.py",
            "args": ["arg1", "arg2"],
        },
        "step_id": "step-one",
    }],
    placement={
        "managed_cluster": {
            "cluster_name": "managed-cluster",
            "config": {
                "master_config": {
                    "num_instances": 1,
                    "machine_type": "n1-standard-1",
                },
                "worker_config": {
                    "num_instances": 2,
                    "machine_type": "n1-standard-1",
                },
            },
        },
    })

# Export the Dataproc cluster name and the Workflow Template ID
pulumi.export("dataproc_cluster_name", dataproc_cluster.name)
pulumi.export("dataproc_workflow_template_id", dataproc_workflow_template.id)
```

In the program above, we start by defining a `gcp.dataproc.Cluster` which sets up our cluster's configuration. We specify the number of master and worker instances along with their machine types.

Next, we define a `gcp.dataproc.Job` that specifies a PySpark job that will run on our cluster. The `main_python_file_uri` points to a Python script in a GCS bucket. The `args` provides an array of command-line arguments for the PySpark job.

We also define a `gcp.dataproc.WorkflowTemplate`. This template contains the data processing job configuration for use in the workflow. We include one job in this template (although you could include more), and specify the step's Python file URI and arguments.

We end the program by exporting the cluster name and the workflow template ID, which can be useful for querying and monitoring our resources within GCP and Pulumi's console.

Please review the [Cluster](https://www.pulumi.com/registry/packages/gcp/api-docs/dataproc/cluster/), [Job](https://www.pulumi.com/registry/packages/gcp/api-docs/dataproc/job/), and [WorkflowTemplate](https://www.pulumi.com/registry/packages/gcp/api-docs/dataproc/workflowtemplate/) API reference for more details on how to customize these resources to match your specific requirements.

Remember that the file `my-pyspark-job.py` needs to be in a Google Cloud Storage bucket and the path must be provided correctly for the job to be able to execute the script. Also, when deploying such infrastructure, ensure that you have sufficient permissions to create resources within the GCP project.