1. Serving Machine Learning Batch Inference Outputs via DBFS


    In a machine learning workflow, batch inference is a pattern where a trained model is applied to a dataset to generate predictions. The Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on its clusters, letting you interact with cloud object storage as if it were a local file system. To serve machine learning batch inference outputs via DBFS, you would typically run the batch inference job within Databricks and then write the results to DBFS.
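    To make the context concrete, here is a minimal, purely illustrative sketch of the batch inference step as it might run in a Databricks notebook; the model artifact, input dataset, and column names are placeholders and not part of the Pulumi program that follows:

    ```python
    # Illustrative only: assumes a scikit-learn-style model saved with joblib and a
    # Parquet file of features already present in DBFS. All paths are placeholders.
    import joblib
    import pandas as pd

    model = joblib.load("/dbfs/models/churn_model.pkl")
    features = pd.read_parquet("/dbfs/mnt/input/scoring_batch.parquet")

    predictions = features[["customer_id"]].copy()
    predictions["prediction"] = model.predict(features.drop(columns=["customer_id"]))

    # The /dbfs FUSE mount on Databricks clusters lets local file APIs write into DBFS.
    predictions.to_csv("/dbfs/mnt/output/batch_inference_results.csv", index=False)
    ```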

    Below is a Pulumi program that accomplishes the following:

    1. Targets an existing Databricks workspace through the Databricks provider
    2. Configures a DBFS file resource to store the batch inference outputs

    Let's break down the steps before seeing the full program.

    1. Databricks workspace: The workspace is the environment for running your machine learning and data engineering tasks, providing an integrated home for ETL (Extract, Transform, Load), machine learning, and analytics. Note that the pulumi-databricks provider does not create the workspace itself; it assumes one already exists (provisioned through your cloud provider or, on AWS, databricks.MwsWorkspaces) and is reachable through the provider's host and credential configuration.

    2. DBFS file: This resource uploads a file into the Databricks File System, where you would store the outputs of your machine learning batch inference. The file might contain inference results, such as record IDs and predicted values, depending on your exact use case.

    3. Model Serving: Although not directly involved in the Pulumi code below, you would need to set up a model serving resource on Databricks, where you deploy and manage machine learning models. This part is typically done within the Databricks environment.
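    If you do want to manage the serving endpoint with Pulumi as well, the provider exposes a ModelServing resource. Below is a rough sketch under the assumption that a model named churn_model is already registered in the MLflow Model Registry; the endpoint name, model name, and version are placeholders:

    ```python
    import pulumi_databricks as databricks

    # Hypothetical serving endpoint for a registered MLflow model.
    # The endpoint name, model name, and version below are placeholders.
    serving_endpoint = databricks.ModelServing(
        "batch-model-endpoint",
        name="churn-model-endpoint",
        config={
            "served_models": [{
                "model_name": "churn_model",      # registered model name (placeholder)
                "model_version": "1",             # registered model version (placeholder)
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }],
        },
    )
    ```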

    Here is the Pulumi Python program:

    ```python
    import pulumi
    import pulumi_databricks as databricks

    # The pulumi-databricks provider operates against an existing workspace whose
    # host and credentials are supplied through provider configuration (see the
    # note at the end of this section). Creating the workspace itself happens
    # outside this program, e.g. via your cloud provider's resources or
    # databricks.MwsWorkspaces on AWS.

    # Upload the batch inference results to DBFS.
    dbfs_file = databricks.DbfsFile(
        "batch-inference-output",
        # Path in DBFS where the inference results will be stored.
        path="/mnt/output/batch_inference_results.csv",
        # Local file holding the results produced by your batch inference job.
        # Replace this with the actual output of that job.
        source="./batch_inference_results.csv",
    )

    # Export the DBFS URI of the stored batch inference results.
    pulumi.export("dbfs_file_url", pulumi.Output.concat("dbfs:", dbfs_file.path))
    ```

    This program uploads a file to DBFS in your existing Databricks workspace. The DbfsFile resource assumes that you've already produced a local file (in this example './batch_inference_results.csv') containing the results of the ML batch inference.

    The exported dbfs_file_url will provide the DBFS path for the stored batch inference results. This can be used to access the data for further processing or visualization.
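    For instance, inside a Databricks notebook you could load the stored results with Spark; spark here is the SparkSession that Databricks notebooks provide automatically, and the path mirrors the exported value:

    ```python
    # Read the batch inference results back from DBFS for inspection or downstream use.
    results = spark.read.option("header", True).csv(
        "dbfs:/mnt/output/batch_inference_results.csv"
    )
    results.show(10)
    ```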

    Keep in mind that in the actual machine learning workflow, additional steps like training models, performing the batch inference job within Databricks notebooks, and setting up model serving are required. The Pulumi code here is meant for infrastructure setup and assumes that the ML workflow is already in place.

    Remember, before you run this Pulumi program, you need to configure the Databricks provider with your workspace host and credentials, for example via pulumi config set databricks:host and pulumi config set --secret databricks:token, or by reusing an authenticated Databricks CLI profile. You would typically run the program with the Pulumi CLI, using pulumi up to deploy the resources and pulumi destroy to remove them when no longer needed.
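    If you prefer to configure the provider explicitly in code instead of through stack configuration, a minimal sketch looks like this; the workspace URL is a placeholder and the token is read from Pulumi secret config:

    ```python
    import pulumi
    import pulumi_databricks as databricks

    cfg = pulumi.Config()

    # Explicit provider instance pointing at your existing workspace.
    # The host URL below is a placeholder; replace it with your workspace URL.
    databricks_provider = databricks.Provider(
        "databricks-provider",
        host="https://adb-1234567890123456.7.azuredatabricks.net",
        token=cfg.require_secret("databricksToken"),
    )

    # Resources that should use this provider receive it via ResourceOptions.
    dbfs_file = databricks.DbfsFile(
        "batch-inference-output",
        path="/mnt/output/batch_inference_results.csv",
        source="./batch_inference_results.csv",
        opts=pulumi.ResourceOptions(provider=databricks_provider),
    )
    ```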