1. Databricks External Locations for Federated Machine Learning


    When working with federated machine learning using Databricks on cloud platforms like AWS, Azure, or GCP, you need to set up external data sources where different parties can share and access datasets without moving them into a central location. This concept aligns with the principle of federated learning where you perform model training across multiple decentralized edge devices or servers while keeping data localized.

    To facilitate this within Databricks, you can use the databricks.ExternalLocation resource, which lets you reference external storage locations such as an S3 bucket, Azure Data Lake Storage, or Google Cloud Storage.
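    An external location does not itself hold AWS credentials; it references a Unity Catalog storage credential by name. If you do not already have one, a minimal sketch of creating it with the same provider might look like the following, assuming an existing AWS IAM role that Unity Catalog can assume (the resource names and the role ARN are placeholders, not values from your environment):

    import pulumi_databricks as databricks

    # Hypothetical storage credential backed by an AWS IAM role that can read the bucket.
    # The role ARN and names are placeholders; replace them with values from your setup.
    s3_credential = databricks.StorageCredential(
        "federated-ml-s3-credential",
        name="federated-ml-s3-credential",
        aws_iam_role=databricks.StorageCredentialAwsIamRoleArgs(
            role_arn="arn:aws:iam::<account-id>:role/<unity-catalog-read-role>",
        ),
        comment="IAM role used by Unity Catalog to read the federated ML bucket.",
    )

    The credential's name is what you would later pass as the credential name when defining the external location.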

    For this example, let’s assume we want to create an external location in an AWS environment that points to an S3 bucket. Our goal is to reference this bucket from within Databricks so that we can access data stored within it for machine learning purposes without moving the data into Databricks managed storage. We will configure the external location to be read-only to preserve the integrity of the data.

    Here's a Pulumi program in Python that will create a new Databricks external location pointing to an existing S3 bucket. Before running this code, you'd need to ensure you have the pulumi-databricks provider configured and authenticated to interact with your Databricks workspace. This may involve setting up appropriate tokens or authentication mechanisms that have permissions to manage resources within Databricks.
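    One way (of several) to do this is to instantiate the Databricks provider explicitly and feed it a workspace URL and personal access token from Pulumi configuration. The sketch below assumes config keys named databricksHost and databricksToken, which are illustrative rather than required names:

    import pulumi
    import pulumi_databricks as databricks

    # Minimal explicit provider setup; the config key names here are assumptions for illustration.
    cfg = pulumi.Config()
    workspace_provider = databricks.Provider(
        "workspace-provider",
        host=cfg.require("databricksHost"),
        token=cfg.require_secret("databricksToken"),
    )
    # Resources can opt into this provider with:
    #   opts=pulumi.ResourceOptions(provider=workspace_provider)

    With the provider configured, the main program that creates the external location is shown below.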

    import pulumi
    import pulumi_databricks as databricks

    # Configure your existing Databricks workspace details and S3 bucket information
    DATABRICKS_WORKSPACE_URL = "https://<your-databricks-workspace-url>"  # typically supplied to the provider configuration
    S3_BUCKET_URL = "s3://<your-s3-bucket-name>"
    OWNER_NAME = "<owner-name>"
    METASTORE_ID = "<metastore-id>"
    CREDENTIAL_NAME = "<credentials-name-for-S3-access>"

    # Define the external location that points at the S3 bucket
    external_location = databricks.ExternalLocation(
        "federated-ml-external-location",
        url=S3_BUCKET_URL,
        name="FederatedMLBucket",
        owner=OWNER_NAME,
        comment="Read-only bucket for federated machine learning data.",
        read_only=True,
        metastore_id=METASTORE_ID,
        credential_name=CREDENTIAL_NAME,
    )

    # Export the ID of the Databricks external location so it can be referenced if needed
    pulumi.export("external_location_id", external_location.id)

    In this program:

    • We first import the core Pulumi SDK and the pulumi_databricks module.
    • We then define several string constants that are placeholders for the details of your Databricks workspace, the S3 bucket you wish to reference, the owner's name, the metastore ID, and the credentials name for S3 access. You’ll need to replace these placeholders with the correct values from your own setup.
    • We create an instance of databricks.ExternalLocation that points to the specified S3 bucket URL, sets the owner and a descriptive comment, marks the location as read-only, and supplies the metastore ID and credential name. The owner is typically the entity that manages the resource within Databricks, and the metastore ID ties this external location to a particular Databricks metastore.
    • Finally, we export the ID of the external location, which can be useful if you need to reference it from other parts of your Pulumi program or from another stack; a sketch of consuming the export follows this list.
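    As an illustration, another stack (for example, one that provisions your ML training infrastructure) could pick up the exported ID through a stack reference. The stack name below is hypothetical:

    import pulumi

    # Hypothetical stack reference to the stack that created the external location.
    infra = pulumi.StackReference("my-org/databricks-infra/prod")
    external_location_id = infra.get_output("external_location_id")

    pulumi.export("referenced_external_location_id", external_location_id)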

    You can now proceed by creating models in Databricks that will utilize this external data location for federated learning purposes. Ensure your Databricks workspace has the necessary permissions and configurations to access this S3 data.
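    Unity Catalog privileges on the external location can also be managed from Pulumi. As a minimal sketch, building on the external_location resource defined above, granting read access to a workspace group through the databricks.Grants resource might look like this (the group name "data-scientists" is a placeholder):

    import pulumi_databricks as databricks

    # Hypothetical grant: let a workspace group read files through the external location.
    # "data-scientists" is a placeholder principal name.
    external_location_grants = databricks.Grants(
        "federated-ml-read-grants",
        external_location=external_location.name,
        grants=[
            databricks.GrantsGrantArgs(
                principal="data-scientists",
                privileges=["READ_FILES"],
            ),
        ],
    )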

    Remember to replace the placeholder values with actual information from your environment, and to set up the necessary authentication and authorization for both Pulumi and Databricks to communicate with AWS services and manage resources.