Secure Data Integration with Databricks Mount Points
Data integration with Databricks often involves connecting your Databricks workspace to various data storage services. For cloud storage, secure data integration can be achieved by mounting storage buckets or containers, typically from AWS S3, Azure Blob Storage, or Google Cloud Storage, into the Databricks File System (DBFS). Creating mount points connects your Databricks workspace to the data stored in these services without exposing sensitive credentials inside your Databricks notebooks or jobs.
When you mount a storage service to Databricks, your data can be accessed as if it were a local directory within DBFS. This setup is essential for seamless data access across various Databricks components like notebooks, jobs, and workflows.
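As a quick illustration of what that access looks like from the notebook side, here is a minimal sketch that lists and reads files under a hypothetical mount point /mnt/data; the path and file layout are placeholders, not something created by the Pulumi program later in this article:

# Run inside a Databricks notebook after a mount has been created.
# "/mnt/data" and the parquet path below are hypothetical placeholders.
display(dbutils.fs.ls("/mnt/data"))

# A mounted path behaves like any other DBFS path for Spark readers.
df = spark.read.format("parquet").load("/mnt/data/events/")
df.show(5)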
In the following program, we'll demonstrate how to securely integrate cloud storage with Databricks using Pulumi. We'll provision a Databricks workspace and set up the required storage credentials. The mount command itself typically runs inside a Databricks notebook or job, where the Databricks utilities (dbutils) are available, rather than through Pulumi, so we will set up everything necessary up to that point.
Pulumi provides infrastructure as code in many languages, including Python, which we'll use here. We'll assume that you have an existing Pulumi project and the appropriate cloud provider credentials configured (this example targets Azure, but the same pattern applies to AWS or GCP). Install the providers used here by running
pip install pulumi_databricks pulumi_azure_native
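The Databricks provider also needs to know which workspace to talk to and how to authenticate. You can supply this through stack configuration (for example, pulumi config set databricks:host and a secret databricks:token), or declare an explicit provider in code. The snippet below is a minimal sketch with placeholder values and a hypothetical config key, not part of the main program that follows:

import pulumi
import pulumi_databricks as databricks

# Minimal sketch: explicit Databricks provider configuration.
# The host URL is a placeholder; the token is read from Pulumi config,
# where it should be stored as a secret (pulumi config set --secret databricksToken ...).
config = pulumi.Config()

databricks_provider = databricks.Provider("databricksProvider",
    host="https://adb-1234567890123456.7.azuredatabricks.net",
    token=config.require_secret("databricksToken"))

# Databricks resources can then opt in to this provider explicitly, e.g.:
# databricks.SecretScope("exampleSecretScope", name="exampleScope",
#     opts=pulumi.ResourceOptions(provider=databricks_provider))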
Below is a Pulumi program that outlines how to set up secure storage credentials in Databricks, which in practice would be used to create mount points in DBFS:
import pulumi
import pulumi_azure_native as azure_native
import pulumi_databricks as databricks

# Replace the following variables with your details
db_workspace_name = 'my-databricks-workspace'
db_resource_group = 'my-resource-group'
db_managed_resource_group = 'my-managed-resource-group'
storage_account_name = 'mydatalakestorage'
db_location = 'westus2'

# Create an Azure Databricks workspace
databricks_workspace = azure_native.databricks.Workspace("exampleWorkspace",
    workspace_name=db_workspace_name,
    resource_group_name=db_resource_group,
    location=db_location,
    # The managed resource group must be given as a full resource ID;
    # replace {subscription-id} with your Azure subscription ID.
    managed_resource_group_id=f"/subscriptions/{{subscription-id}}/resourceGroups/{db_managed_resource_group}",
    sku=azure_native.databricks.SkuArgs(name="standard"))

# Assuming you have storage set up in Azure (Blob Storage or ADLS),
# create a Databricks secret scope to hold the storage account access key.
# In practice, read the key from Pulumi config as a secret instead of hard-coding it.
storage_account_key = '...your_storage_account_access_key...'

secret_scope = databricks.SecretScope("exampleSecretScope",
    name="exampleScope",
    initial_manage_principal="users")
# To back the scope with Azure Key Vault instead, pass keyvault_metadata when creating it;
# note that secrets in a Key Vault-backed scope are managed in Key Vault, not via databricks.Secret:
#   keyvault_metadata=databricks.SecretScopeKeyvaultMetadataArgs(
#       dns_name=f"https://{storage_account_name}.vault.azure.net/",  # replace with your Key Vault DNS name
#       resource_id="/subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.KeyVault/vaults/{keyvault-name}",  # replace with your Key Vault resource ID
#   )

# Create a storage credential representing the service principal used for storage access
storage_credential = databricks.StorageCredential("exampleStorageCredential",
    name="exampleCredential",
    owner="exampleOwner",
    azure_service_principal=databricks.StorageCredentialAzureServicePrincipalArgs(
        application_id="...your_service_principal_app_id...",
        directory_id="...your_tenant_id...",
        client_secret="...your_service_principal_client_secret..."))

# Store the storage account access key in the secret scope created above
storage_secret = databricks.Secret("exampleSecret",
    scope=secret_scope.name,
    key="storage-key",
    string_value=storage_account_key)

# Mounting the storage account is normally done inside Databricks with the Databricks CLI
# or a notebook/job, not by Pulumi directly, so you'll need to follow up with
# Databricks-specific commands to create the mount point. They generally look like the
# following (not to be executed within Pulumi):
#
# dbutils.fs.mount(
#     source="wasbs://example-container@mydatalakestorage.blob.core.windows.net/",
#     mount_point="/mnt/data",
#     extra_configs={"fs.azure.account.key.mydatalakestorage.blob.core.windows.net":
#                    dbutils.secrets.get(scope="exampleScope", key="storage-key")})

pulumi.export("workspaceUrl", databricks_workspace.workspace_url)
pulumi.export("secretScopeName", secret_scope.name)
pulumi.export("storageCredentialName", storage_credential.name)
In this program, the following key steps are performed:
- A Databricks workspace is provisioned using the azure_native.databricks.Workspace resource.
- A secret scope for the storage account is created using the databricks.SecretScope resource; it will securely hold the account's access key.
- A databricks.StorageCredential resource is created to represent the security credential for the storage integration.
- The storage account's access key is then stored in the secret scope via the databricks.Secret resource.
After setting up these resources, you will need to use the Databricks CLI or a notebook/job inside Databricks to mount the storage account to DBFS, using the access key stored in the secret scope. The commented-out placeholder at the bottom of the program gives a rough idea of what those commands look like, though the details vary with your cloud storage and Databricks configuration.
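As a sketch of that follow-up step, the notebook cell below performs the mount only if it does not already exist, reusing the placeholder names from the program above (exampleScope, storage-key, /mnt/data); adjust the container, storage account, and URI scheme (wasbs vs. abfss) to match your storage:

# Run in a Databricks notebook, not in Pulumi.
# Names below are placeholders carried over from the program above.
mount_point = "/mnt/data"

# Only mount if the mount point does not already exist (mounts persist across clusters).
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="wasbs://example-container@mydatalakestorage.blob.core.windows.net/",
        mount_point=mount_point,
        extra_configs={
            "fs.azure.account.key.mydatalakestorage.blob.core.windows.net":
                dbutils.secrets.get(scope="exampleScope", key="storage-key"),
        })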
This complete setup ensures that your data integration is secure, as sensitive credentials are not exposed directly within your data processing scripts or notebooks.