1. Secure Data Integration with Databricks Mount Points


    Data integration with Databricks usually means connecting your Databricks workspace to external data storage. For cloud storage, secure integration can be achieved by mounting storage buckets or containers, typically from AWS S3, Azure Blob Storage, or Google Cloud Storage, into the Databricks filesystem (DBFS). Mount points connect your Databricks workspace to the data stored in these services without exposing sensitive credentials inside your Databricks notebooks or jobs.

    When you mount a storage service to Databricks, your data can be accessed as if it were a local directory within DBFS. This setup is essential for seamless data access across various Databricks components like notebooks, jobs, and workflows.
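
    For example, once a (hypothetical) mount point /mnt/data exists, notebook code can browse and read it like any other DBFS path; dbutils, display, and spark are provided by the Databricks notebook runtime, and the file paths below are illustrative only:

    # Inside a Databricks notebook: /mnt/data is a hypothetical mount point.
    # List the files behind the mount as if it were a local DBFS directory.
    display(dbutils.fs.ls("/mnt/data"))

    # Read a (hypothetical) CSV dataset through the mount with Spark.
    df = spark.read.option("header", "true").csv("/mnt/data/sales/*.csv")
    df.show(5)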

    In the following program, we'll demonstrate how to securely integrate cloud storage with Databricks using Pulumi. We'll create the necessary resources and set up a Databricks workspace, along with the required storage credentials. The mount command itself is not executed by Pulumi (it typically runs within a Databricks notebook or another environment where Databricks utilities are available), so we will set up everything needed up to that point.

    Pulumi provides infrastructure as code in many languages, including Python, which we'll use here. We'll assume that you have an existing Pulumi project and the appropriate cloud provider credentials configured for AWS, Azure, or GCP. Remember to install the providers used below by running pip install pulumi_databricks pulumi_azure.
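
    The Databricks resources in the program also need to know which workspace to talk to. A minimal sketch of explicit provider configuration is shown below; the config key names databricksHost and databricksToken are just examples, and you can instead rely on the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables if you prefer:

    import pulumi
    import pulumi_databricks as databricks

    config = pulumi.Config()

    # Hypothetical config keys; set them with, for example:
    #   pulumi config set databricksHost https://adb-1234567890123456.7.azuredatabricks.net
    #   pulumi config set --secret databricksToken <personal-access-token>
    databricks_provider = databricks.Provider(
        "databricksProvider",
        host=config.require("databricksHost"),
        token=config.require_secret("databricksToken"))

    # Pass the provider explicitly to each Databricks resource, e.g.:
    #   databricks.SecretScope("exampleSecretScope", ...,
    #       opts=pulumi.ResourceOptions(provider=databricks_provider))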

    Below is a Pulumi program that sets up secure storage credentials in Databricks, which in practice would then be used to create mount points in DBFS:

    import pulumi
    import pulumi_azure as azure
    import pulumi_databricks as databricks

    # Replace the following variables with your details
    db_workspace_name = 'my-databricks-workspace'
    db_resource_group = 'my-resource-group'
    db_managed_resource_group = 'my-managed-resource-group'
    storage_account_name = 'mydatalakestorage'  # referenced in the mount commands below
    db_location = 'westus2'

    # Create an Azure Databricks workspace. The workspace itself is provisioned
    # through the Azure provider; the Databricks provider below manages
    # resources inside an existing workspace.
    databricks_workspace = azure.databricks.Workspace(
        "exampleWorkspace",
        name=db_workspace_name,
        resource_group_name=db_resource_group,
        managed_resource_group_name=db_managed_resource_group,
        location=db_location,
        sku="standard")

    # Assuming you have storage set up in Azure (Blob or ADLS Gen2), create a
    # Databricks-backed secret scope to hold the storage account access key.
    # (An Azure Key Vault-backed scope is an alternative: pass
    # keyvault_metadata=databricks.SecretScopeKeyvaultMetadataArgs(...) with your
    # Key Vault DNS name and resource ID, and manage the secrets in Key Vault
    # instead of with databricks.Secret.)
    storage_account_key = '...your_storage_account_access_key...'
    secret_scope = databricks.SecretScope(
        "exampleSecretScope",
        name="exampleScope",
        initial_manage_principal="users")

    # Create a Unity Catalog storage credential to represent the security
    # credential for the storage integration (requires a Unity Catalog-enabled
    # workspace).
    storage_credential = databricks.StorageCredential(
        "exampleStorageCredential",
        name="exampleCredential",
        owner="exampleOwner",
        azure_service_principal=databricks.StorageCredentialAzureServicePrincipalArgs(
            application_id="...your_service_principal_app_id...",
            directory_id="...your_tenant_id...",
            client_secret="...your_service_principal_client_secret..."))

    # Store the access key in the secret scope created above
    storage_secret = databricks.Secret(
        "exampleSecret",
        scope=secret_scope.name,
        key="storage-key",
        string_value=storage_account_key)

    # Now, normally you'd mount the storage account inside Databricks with the
    # Databricks CLI or a notebook/job, but Pulumi does not run dbutils directly.
    # Therefore, you'll need to follow up with Databricks-specific commands to
    # create the mount point. Those commands would generally look like the
    # following (not to be executed within Pulumi):
    #
    # dbutils.fs.mount(
    #     source="wasbs://example-container@mydatalakestorage.blob.core.windows.net/",
    #     mount_point="/mnt/data",
    #     extra_configs={
    #         "fs.azure.account.key.mydatalakestorage.blob.core.windows.net":
    #             dbutils.secrets.get(scope="exampleScope", key="storage-key")})

    pulumi.export("workspaceUrl", databricks_workspace.workspace_url)
    pulumi.export("secretScopeName", secret_scope.name)
    pulumi.export("storageCredentialName", storage_credential.name)

    In this program, the following key steps are performed:

    1. A Databricks workspace is provisioned using the Azure provider's databricks.Workspace resource (the Pulumi Databricks provider manages resources inside a workspace rather than the workspace itself).
    2. A secret scope is created for the storage account using the databricks.SecretScope resource, which will securely store the account's access key.
    3. A databricks.StorageCredential resource is then created to represent the security credential for the storage integration.
    4. An access key for the storage account is then stored in the secret scope via the databricks.Secret resource (a sketch after this list shows how to look the key up with the Azure provider instead of hardcoding it).
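
    If the storage account already exists in Azure, you do not have to paste its access key into the program at all. The minimal sketch below (placeholder names such as mydatalakestorage, my-resource-group, and exampleScope are assumptions that must match your environment) looks the key up with the pulumi_azure provider and feeds it into the same databricks.Secret:

    import pulumi
    import pulumi_azure as azure
    import pulumi_databricks as databricks

    # Look up the existing storage account; names are placeholders.
    storage_account = azure.storage.get_account_output(
        name="mydatalakestorage",
        resource_group_name="my-resource-group")

    # Wrap the key in a Pulumi secret so it is encrypted in the Pulumi state
    # and never appears in plain text in the program or its outputs.
    storage_secret = databricks.Secret(
        "exampleSecret",
        scope="exampleScope",  # the secret scope created in the main program
        key="storage-key",
        string_value=pulumi.Output.secret(storage_account.primary_access_key))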

    After setting up these resources, you will need to use the Databricks CLI or a Databricks notebook/job to mount the storage account to DBFS using the credentials stored in the secret scope. The placeholder code commented out at the bottom of the program gives a rough idea of what those commands look like, although the details vary with your cloud storage and Databricks configuration.
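
    Because dbutils.fs.mount raises an error if the directory is already mounted, a common notebook-side idiom is to check the current mounts first. Here is a minimal sketch using the same placeholder container, account, scope, and key names as the program above:

    # Inside a Databricks notebook: mount only if /mnt/data is not already mounted.
    mount_point = "/mnt/data"
    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source="wasbs://example-container@mydatalakestorage.blob.core.windows.net/",
            mount_point=mount_point,
            extra_configs={
                "fs.azure.account.key.mydatalakestorage.blob.core.windows.net":
                    dbutils.secrets.get(scope="exampleScope", key="storage-key")})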

    This complete setup ensures that your data integration is secure, as sensitive credentials are not exposed directly within your data processing scripts or notebooks.