1. Mounting Databricks File Systems for Distributed AI Training


    When setting up a distributed AI training environment with Databricks, you often need to read datasets and write output models to a centralized file system. Databricks lets you mount storage systems such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage onto its native file system, DBFS (Databricks File System). Datasets stored on those platforms then become directly accessible within your Databricks workspace, enabling scalable, distributed training runs.
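    Once a bucket is mounted, every node in the cluster reaches it through an ordinary DBFS path. As a rough sketch (using the hypothetical mount name myS3Mount that we create later in this article, and example dataset paths), a training job running in a Databricks notebook might read its data and write results back to the same mount; the spark session is provided by the Databricks runtime:

    # Inside a Databricks notebook or job; `spark` is provided by the runtime.
    # The paths below assume the mount name myS3Mount and are illustrative.
    train_df = spark.read.parquet("/mnt/myS3Mount/datasets/train")

    # ... train a model on train_df ...

    # Write results back through the mount so they land in the underlying S3 bucket
    predictions_path = "/mnt/myS3Mount/output/predictions"
    train_df.limit(100).write.mode("overwrite").parquet(predictions_path)

    Because the mount is backed by S3, anything written there is visible to every worker and to other clusters that share the workspace.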

    We'll use Pulumi's Databricks provider to create mounts for our cloud storage within Databricks. The databricks.Mount resource lets us manage mounted storage systems in Databricks, while the databricks.DbfsFile resource lets us work directly with files in DBFS. Below is a Pulumi program, written in Python, that mounts an Amazon S3 bucket onto DBFS in a Databricks workspace.

    First, ensure you have the pulumi-databricks provider installed in your working environment. If not, you can install it with pip:

    pip install pulumi-databricks

    Now, let’s proceed with the Pulumi program:

    import pulumi
    import pulumi_databricks as databricks

    # Configure your Databricks cluster and AWS S3 details
    databricks_cluster_id = "<your-databricks-cluster-id>"  # an existing cluster in your workspace
    aws_s3_bucket_name = "<your-s3-bucket-name>"
    instance_profile_arn = "<instance-profile-arn>"

    # Create a mount point for the S3 bucket on DBFS using Databricks
    s3_mount = databricks.Mount(
        "s3Mount",
        cluster_id=databricks_cluster_id,
        name="myS3Mount",  # the mount will be accessible at /mnt/myS3Mount within Databricks
        s3=databricks.MountS3Args(
            bucket_name=aws_s3_bucket_name,
            instance_profile=instance_profile_arn,
        ),
    )

    # Optional: you might want to add a file in DBFS under the new mount point
    dbfs_file = databricks.DbfsFile(
        "dbfsFile",
        path="/mnt/myS3Mount/some-directory/your-file.txt",
        content_base64=pulumi.Output.secret("SGVsbG8sIFB1bHVtaSE="),  # base64-encoded "Hello, Pulumi!"
        opts=pulumi.ResourceOptions(depends_on=[s3_mount]),  # ensure the mount exists first
    )

    # Export the mount point's DBFS path for easy access
    pulumi.export("mount_dbfs_path", pulumi.Output.concat("/mnt/", s3_mount.name))

    This program achieves the following:

    1. It imports the necessary Pulumi Databricks module.
    2. It uses the databricks.Mount resource to create a mount point named myS3Mount for the specified S3 bucket, referencing an existing cluster in your Databricks workspace.
    3. It uses the databricks.DbfsFile resource to create a file on the mounted DBFS path, containing a simple "Hello, Pulumi!" base64-encoded message. Note that this step is optional and specific to your requirements.
    4. The DBFS path to the mount point is exported, allowing you to reference it in other Pulumi stacks or for programmatic access (see the sketch after this list).
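    As a minimal sketch of that last point, another stack could consume the exported path through a StackReference. The stack name my-org/databricks-mounts/dev and the datasets/train suffix below are placeholders:

    import pulumi

    # Reference the stack that created the mount (placeholder stack name)
    mount_stack = pulumi.StackReference("my-org/databricks-mounts/dev")

    # Resolve the exported DBFS path and derive, say, a training-data location from it
    mount_path = mount_stack.get_output("mount_dbfs_path")
    pulumi.export("training_data_path", mount_path.apply(lambda p: f"{p}/datasets/train"))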

    Be sure to replace the placeholder strings with actual values that correspond to your infrastructure: databricks_cluster_id, aws_s3_bucket_name, and instance_profile_arn. The instance profile must be set up in AWS IAM with the permissions Databricks clusters need to access the S3 bucket.
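    If you manage that IAM setup with Pulumi as well, it might look roughly like the sketch below. The role, policy, and resource names are illustrative; the snippet continues the program above (reusing the aws_s3_bucket_name variable and the databricks import), and the final step registers the profile with the Databricks workspace:

    import json
    import pulumi_aws as aws

    # IAM role that Databricks cluster EC2 instances assume (names are illustrative)
    s3_access_role = aws.iam.Role(
        "databricksS3AccessRole",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }),
    )

    # Allow the role to read and write the training bucket
    aws.iam.RolePolicy(
        "databricksS3AccessPolicy",
        role=s3_access_role.id,
        policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{aws_s3_bucket_name}",
                    f"arn:aws:s3:::{aws_s3_bucket_name}/*",
                ],
            }],
        }),
    )

    # Instance profile whose ARN is passed to the mount
    instance_profile = aws.iam.InstanceProfile(
        "databricksInstanceProfile",
        role=s3_access_role.name,
    )

    # Register the profile with the Databricks workspace so clusters can use it
    databricks.InstanceProfile(
        "registeredInstanceProfile",
        instance_profile_arn=instance_profile.arn,
    )

    In that case you would pass instance_profile.arn to the mount's s3 block instead of a hard-coded ARN, letting Pulumi track the dependency between the IAM resources and the mount.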

    Run this Pulumi program from a directory configured with the right Pulumi and cloud credentials, and the DBFS mount point will be created and available across notebooks and jobs in your Databricks workspace. This is essential for collaborative and reproducible data science work on scalable compute clusters.
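    Concretely, a typical workflow from a fresh stack might look like the following, assuming you configure the Databricks provider through Pulumi config rather than environment variables (the region, host, and token values are your own; --secret keeps the token encrypted in the stack configuration):

    pulumi stack init dev
    pulumi config set aws:region us-east-1   # example region
    pulumi config set databricks:host https://<your-workspace>.cloud.databricks.com
    pulumi config set databricks:token <your-personal-access-token> --secret
    pulumi up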