Persistent Storage for Databricks Machine Learning Models
To provide persistent storage for your Databricks machine learning models, you typically integrate a cloud storage service into your Databricks environment. Databricks integrates with the major cloud storage services, and you can use them to store machine learning models, training data, and other artifacts.
The `databricks.Mount` resource from the Pulumi Databricks provider allows you to mount a supported cloud storage bucket into the Databricks File System (DBFS), making it accessible from your Databricks clusters and jobs. Here's how you would use Pulumi to set up a persistent storage mount in Databricks:
- Instantiate a cloud storage bucket: This could be an AWS S3 bucket, an Azure Blob Storage container, or a GCP Cloud Storage bucket.
- Create a Databricks cluster: A cluster is required to run jobs and access the DBFS.
- Mount the cloud storage bucket in the Databricks workspace: Use the `databricks.Mount` resource to mount the storage into DBFS.
Below is a Pulumi program that demonstrates how to create an AWS S3 bucket and mount it in a Databricks workspace. Replace the placeholders with appropriate values for your environment before running this code:
```python
import pulumi
import pulumi_aws as aws
import pulumi_databricks as databricks

# Create an AWS S3 bucket to store your machine learning models.
ml_bucket = aws.s3.Bucket("ml-models-storage")

# Set up a Databricks cluster.
# Replace the values of these configuration options with those appropriate for your use case.
cluster = databricks.Cluster(
    "ml-cluster",
    autoscale=databricks.ClusterAutoscaleArgs(
        min_workers=1,
        max_workers=3,
    ),
    node_type_id="i3.xlarge",          # An AWS node type, since this example mounts S3.
    spark_version="7.3.x-scala2.12",
)

# Mount the S3 bucket in Databricks DBFS using the S3 bucket name and the instance profile ARN,
# which must be configured to allow Databricks access to the S3 bucket.
# The instance profile ARN should be created as per AWS IAM roles and policies for S3 access.
dbfs_mount = databricks.Mount(
    "ml-models-mount",
    cluster_id=cluster.id,
    name="ml-models",  # The name of the DBFS mount point.
    s3=databricks.MountS3Args(  # Specify that this is an S3 mount.
        bucket_name=ml_bucket.bucket,  # Use the name of the S3 bucket you created.
        instance_profile="arn:aws:iam::123456789012:instance-profile/MyInstanceProfile",  # Replace with your instance profile ARN.
    ),
)

# Export the mount point and the S3 bucket URL so we can access them later.
pulumi.export("mount_point", dbfs_mount.name.apply(lambda name: f"/mnt/{name}"))
pulumi.export("s3_bucket_url", ml_bucket.bucket.apply(lambda name: f"s3://{name}"))
```
Explanation of this code:
- The `aws.s3.Bucket` resource creates a new Amazon S3 bucket, which will be used to store the machine learning models.
- The `databricks.Cluster` resource creates a new Databricks cluster where the ML models will be trained and executed.
- The `databricks.Mount` resource mounts the newly created S3 bucket into Databricks DBFS via the `cluster_id` of the previously created cluster.
- We then export the DBFS mount point and the S3 bucket URL using Pulumi's `export` function, making two outputs available for us to use or view: the mount point within DBFS and the S3 bucket URL.
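Once the stack is deployed, the mount behaves like any other DBFS path from inside the workspace. As a minimal sketch of how a notebook or job could persist and reload a model through the mount (assuming the `ml-models` mount name from the program above; the scikit-learn model and file name are illustrative):

```python
# Runs inside a Databricks notebook or job, not as part of the Pulumi program.
# Assumes the DBFS mount "ml-models" created above; the model and path are illustrative.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# DBFS mounts are exposed to local file APIs under /dbfs/mnt/<mount-name>.
joblib.dump(model, "/dbfs/mnt/ml-models/iris-logreg.joblib")

# Later, any cluster with access to the mount can load the same artifact.
restored = joblib.load("/dbfs/mnt/ml-models/iris-logreg.joblib")
print(restored.score(X, y))
```

Because the artifact lives in S3 rather than on cluster-local disk, it survives cluster restarts and can be shared across clusters and jobs.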
Please replace placeholders like `MyInstanceProfile` and the `arn:...` value with your AWS IAM instance profile details. The IAM role associated with this instance profile needs the necessary permissions to access the S3 bucket. Keep in mind that you also need to configure the Pulumi AWS and Databricks providers properly, setting up credentials and any other required configuration according to the best practices for your organization.
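If you prefer to manage the instance profile with Pulumi as well, a minimal sketch could look like the following. This assumes your Databricks workspace runs on AWS and reuses the `ml_bucket` resource from the program above; the role, policy, and resource names are illustrative, and the trust policy may need to match your specific deployment:

```python
import json
import pulumi_aws as aws
import pulumi_databricks as databricks

# IAM role that Databricks cluster nodes assume via an instance profile.
role = aws.iam.Role(
    "databricks-s3-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Grant the role read/write access to the ML models bucket.
aws.iam.RolePolicy(
    "databricks-s3-access",
    role=role.id,
    policy=ml_bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
            "Resource": [arn, f"{arn}/*"],
        }],
    })),
)

instance_profile = aws.iam.InstanceProfile(
    "databricks-instance-profile",
    role=role.name,
)

# Register the instance profile with the Databricks workspace so clusters can use it.
databricks.InstanceProfile(
    "databricks-registered-profile",
    instance_profile_arn=instance_profile.arn,
)
```

The providers themselves are typically configured through Pulumi config or environment variables (for example, a Databricks host and token set via `pulumi config set`), following your organization's credential-management practices.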