1. Multi-Format Data Storage using Databricks External Locations


    To set up multi-format data storage using Databricks external locations, we need to create an external storage location that Databricks can connect to. An external location in Databricks is a Unity Catalog object that points to data stored outside the Databricks file system, pairing a storage path with a credential that can access it. The data itself can be in multiple formats such as CSV, JSON, or Parquet, and can live on storage services such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage.

    In this example, we will assume you are using AWS S3 for storage; however, the concepts apply similarly to Azure and GCP. We will set up an AWS S3 bucket as an external location where Databricks can access data in different formats.

    The Python code below is a Pulumi program that creates an S3 bucket configured as an external location in Databricks. We'll use the pulumi_databricks package to interact with Databricks resources.

    Before you begin, ensure you have the Pulumi CLI installed and configured with the appropriate cloud provider credentials. You should also have access credentials for Databricks, allowing you to manage resources there.

    Here is the structure of the setup we will follow:

    1. Create an AWS S3 bucket that will serve as the data storage.
    2. Define an external location in Databricks that points to the S3 bucket.

    Let's take a look at the program:

    import pulumi
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    # Create an AWS S3 bucket to store the data
    s3_bucket = aws.s3.Bucket(
        "data_bucket",
        acl="private",
        force_destroy=True,  # Allows the bucket to be destroyed even if it still contains objects
    )

    # Databricks external location pointing at the S3 bucket.
    # The s3:// URL is built from the bucket's actual (auto-generated) name.
    # Replace 'METASTORE_ID' with your Unity Catalog metastore ID.
    external_location = databricks.ExternalLocation(
        "my_external_location",
        url=pulumi.Output.concat("s3://", s3_bucket.bucket, "/"),
        metastore_id="METASTORE_ID",  # Replace with your Unity Catalog metastore ID
        credential_name="my_credentials",  # Must match a storage credential already known to Databricks
        # If using SSE-KMS for encryption, specify it in encryption_details, e.g.:
        # encryption_details=databricks.ExternalLocationEncryptionDetailsArgs(
        #     sse_encryption_details=databricks.ExternalLocationEncryptionDetailsSseEncryptionDetailsArgs(
        #         algorithm="AWS_SSE_KMS",
        #         aws_kms_key_arn="arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID",  # Replace with your KMS key ARN
        #     ),
        # ),
    )

    # Export the S3 bucket name and Databricks external location URL
    pulumi.export("s3_bucket", s3_bucket.id)
    pulumi.export("external_location_url", external_location.url)

    In this code, you're defining two main resources:

    • An S3 bucket named data_bucket that will hold your multi-format data. We've set the ACL to private for security purposes and enabled force_destroy to clean up the bucket easily during the tear-down process.
    • A Databricks external location resource named my_external_location. This resource references the S3 bucket using the s3:// URL format that Databricks expects, built from the bucket's actual name with pulumi.Output.concat. It also specifies the Unity Catalog metastore ID and the name of the storage credential configured in Databricks to access the S3 bucket.

    To connect this external location to your Databricks workspace correctly, replace METASTORE_ID with the ID of the Unity Catalog metastore attached to your workspace, and set credential_name to the name of the storage credential Databricks will use to access the S3 bucket. If you enable the optional SSE-KMS block, also fill in REGION, ACCOUNT_ID, and KEY_ID for your KMS key.
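    If you prefer not to hard-code those values, one option is to read them from Pulumi stack configuration. The sketch below assumes config key names of our own choosing (metastoreId and credentialName); they are not mandated by either provider:

    import pulumi

    # Read deployment-specific values from Pulumi config instead of hard-coding them.
    # Set them per stack, e.g.: pulumi config set metastoreId <your-metastore-id>
    config = pulumi.Config()
    metastore_id = config.require("metastoreId")
    credential_name = config.require("credentialName")

    These variables can then be passed to the ExternalLocation resource in place of the placeholder strings.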

    Be sure the Pulumi Databricks provider is configured with the correct workspace URL and authentication, and that a storage credential exists in Databricks with permission to read and write the S3 bucket. On AWS, Unity Catalog storage credentials are typically backed by an IAM role that Databricks can assume, rather than long-lived access and secret keys.
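    If that storage credential does not exist yet, it can be managed from the same Pulumi program. The snippet below is a sketch using an IAM role; the role ARN is a placeholder you would replace with a role that trusts Unity Catalog and has access to the bucket:

    import pulumi_databricks as databricks

    # Sketch: register a Unity Catalog storage credential backed by an IAM role.
    storage_credential = databricks.StorageCredential(
        "my_credentials",
        name="my_credentials",  # Matches credential_name on the external location
        aws_iam_role=databricks.StorageCredentialAwsIamRoleArgs(
            role_arn="arn:aws:iam::ACCOUNT_ID:role/databricks-s3-access",  # Placeholder ARN
        ),
        comment="Credential for the multi-format data bucket",
    )

    Defining the credential and the external location in the same stack also lets you express the dependency explicitly, for example by passing storage_credential.name as the credential_name value.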

    This program will configure your cloud resources and can be deployed with the Pulumi CLI. After the deployment is complete, Databricks will be able to use the designated S3 bucket as an external storage location, and you can start storing and accessing data in different formats within your Databricks applications.
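    As a quick illustration of the multi-format part, once the external location is in place you can read different file formats directly from the bucket in a Databricks notebook. The paths below are hypothetical; substitute your bucket name and prefixes:

    # In a Databricks notebook, the `spark` session is provided automatically.
    # Replace <your-bucket> and the prefixes with real paths under your external location.
    csv_df = spark.read.format("csv").option("header", "true").load("s3://<your-bucket>/raw/csv/")
    json_df = spark.read.format("json").load("s3://<your-bucket>/raw/json/")
    parquet_df = spark.read.format("parquet").load("s3://<your-bucket>/raw/parquet/")

    display(csv_df)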