1. Databricks Jobs for Distributed ML Model Evaluation

    Python

    When working with Databricks on a cloud platform such as AWS, Azure, or GCP, you can create and manage jobs that run tasks such as training machine learning models, executing ETL (Extract, Transform, Load) pipelines, and running notebooks. Pulumi lets you define this infrastructure as code, including the Databricks jobs that automate these operations.

    In this case, we are going to create a Databricks job meant for distributed machine learning model evaluation using Pulumi and the Databricks provider. To achieve this, we need to define the following:

    1. Databricks Job: This is the main resource, which defines what the job does, when it runs, and the computational resources it uses. For our distributed ML model evaluation, we need to specify the compute (the cluster configuration), the runtime environment (libraries and versions), and the task itself, such as the notebook path or JAR that runs the evaluation.

    Before we dig into the code, here is a high-level overview of what the Pulumi code does:

    • Creating a Cluster: The job will run on a Databricks cluster, which we need to configure to have the desired specifications for our machine learning task.
    • Defining the Job's Tasks: A task can run a notebook, a Python script, or a JAR, and can attach libraries (for example, ML packages) that the code depends on. Tasks define the actual work the job performs.
    • Job Schedule: If needed, we can define a schedule for our job to run periodically.
    • Job Configuration: Includes additional settings like email notifications, job timeout, and maximum retries for failed tasks.

    Now, let's jump into the Pulumi code to create a simple Databricks job for distributed ML model evaluation:

    import pulumi
    import pulumi_databricks as databricks

    # Define a standalone (all-purpose) Databricks cluster sized for our ML workload.
    # Note: the job below creates its own ephemeral cluster via `new_cluster`; this
    # resource shows how an interactive cluster you could reuse instead is defined.
    cluster = databricks.Cluster("ml-cluster",
        num_workers=2,
        spark_version="7.3.x-scala2.12",
        node_type_id="Standard_D3_v2",
        spark_conf={
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
            "spark.databricks.delta.preview.enabled": "true",
        },
    )

    # Define the Databricks job for the ML model evaluation
    ml_evaluation_job = databricks.Job("ml-evaluation-job",
        # Ephemeral job cluster created for each run and terminated afterwards
        new_cluster={
            "num_workers": 2,
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_D3_v2",
        },
        # The notebook that contains the evaluation code
        notebook_task={
            "notebook_path": "/Users/user.name@databricks.com/ML_Evaluation",
        },
        # Configure email notifications for job runs
        email_notifications={
            "on_start": [],
            "on_success": ["user.name@databricks.com"],
            "on_failure": ["user.name@databricks.com"],
            "no_alert_for_skipped_runs": False,
        },
        # Specify a timeout for the job, after which it will be terminated
        timeout_seconds=3600,
        # Optional schedule: here the job runs daily at 6:00 AM Pacific time
        schedule={
            "quartz_cron_expression": "0 0 6 * * ?",
            "timezone_id": "America/Los_Angeles",
        },
    )

    # Export the URL of the job so it can be opened directly in the Databricks workspace
    pulumi.export("job_url", ml_evaluation_job.url)

    In this code:

    • We first define a Cluster resource, a standalone all-purpose cluster sized for small-to-medium data processing tasks. The spark_conf dictionary can be populated with any additional Spark configuration the job needs. The job below does not reference this cluster directly; it is included to show how an interactive cluster you could reuse is defined.
    • Next, we define a Job and specify the cluster configuration directly in the job using new_cluster, which creates an ephemeral job cluster for each run. You can instead attach the job to an existing all-purpose cluster with existing_cluster_id (a sketch of that alternative follows this list), though ephemeral job clusters are usually more cost-effective because they bill at the lower jobs-compute rate and terminate when the run finishes.
    • The notebook_task specifies the path to the notebook within Databricks that contains the evaluation code. Replace this with the actual path to your notebook. Other task types, such as Python scripts or JARs, are also supported; see the second sketch after this list.
    • We add email_notifications to get email alerts on the job's progress and any potential issues that need attention.
    • We also specify timeout_seconds, which ensures that the job doesn't run indefinitely and incur unexpected costs.
    • The schedule is an optional configuration that sets the job to run on a predetermined routine based on the cron expression provided.
    • Finally, to get quick access to the job, we export the job URL from the Pulumi stack (for example, via pulumi stack output job_url), which lets you open the job in the Databricks workspace directly.
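    As a sketch of the existing-cluster alternative mentioned above (assuming the cluster resource from the program is in scope, and keeping the illustrative notebook path), the job can be attached to the all-purpose cluster by its ID instead of defining new_cluster:

    # Sketch: reuse the all-purpose cluster defined earlier instead of an ephemeral job cluster.
    ml_evaluation_job_shared = databricks.Job("ml-evaluation-job-shared",
        existing_cluster_id=cluster.id,  # attach runs to the interactive cluster defined above
        notebook_task={
            "notebook_path": "/Users/user.name@databricks.com/ML_Evaluation",
        },
        timeout_seconds=3600,
    )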
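    And a second sketch for the case where the evaluation logic lives in a Python script rather than a notebook: the notebook_task can be swapped for a spark_python_task, and libraries can be attached to the job cluster. The DBFS path and the PyPI package below are placeholders, not part of the original program:

    # Sketch: run a Python script instead of a notebook, with a PyPI library attached.
    ml_evaluation_script_job = databricks.Job("ml-evaluation-script-job",
        new_cluster={
            "num_workers": 2,
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_D3_v2",
        },
        spark_python_task={
            "python_file": "dbfs:/FileStore/scripts/ml_evaluation.py",  # placeholder path
        },
        libraries=[
            {"pypi": {"package": "scikit-learn"}},  # placeholder dependency
        ],
        timeout_seconds=3600,
    )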

    Please note that this program is a starting point. Depending on the complexity of the machine learning task, you may need to adjust the cluster size, timeout, retries, and other configurations; a minimal sketch of making these values configurable follows below. Ensure that you replace placeholders with actual values that reflect your Databricks setup and ML task requirements.
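    For example, here is a minimal sketch of reading the cluster size and timeout from Pulumi stack configuration and enabling the retry settings mentioned earlier. The configuration key names (evalNumWorkers, evalTimeoutSeconds) are hypothetical; use whatever fits your project:

    import pulumi
    import pulumi_databricks as databricks

    # Hypothetical config keys; set them with e.g. `pulumi config set evalNumWorkers 8`
    config = pulumi.Config()
    num_workers = config.get_int("evalNumWorkers") or 2
    timeout_seconds = config.get_int("evalTimeoutSeconds") or 3600

    configurable_job = databricks.Job("ml-evaluation-job-configurable",
        new_cluster={
            "num_workers": num_workers,
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_D3_v2",
        },
        notebook_task={
            "notebook_path": "/Users/user.name@databricks.com/ML_Evaluation",
        },
        timeout_seconds=timeout_seconds,
        max_retries=2,                    # retry a failed run up to twice
        min_retry_interval_millis=60000,  # wait one minute between retries
    )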