1. Secure Data Sharing through Databricks External Locations


    To set up secure data sharing through Databricks External Locations, you will use the databricks.ExternalLocation resource. This resource lets you define an external location that governs access to data in cloud storage such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage, so that data can be shared securely with other Databricks workspaces.

    Below I will show you how to use Pulumi to create an external location in Databricks that is read-only and configured with encryption details for security. We will assume Amazon S3 as the external data storage for this example.

    We will follow these steps to accomplish this task:

    1. Set up an Amazon S3 bucket to hold the data you want to share. This bucket is the secure location where your shared data will reside.
    2. Create a Databricks secret scope with the necessary S3 access credentials. This will allow Databricks to authenticate with S3 without hardcoding your credentials in the code.
    3. Define an External Location in Databricks linked to the S3 bucket using the credentials stored in the secret scope.

    Here is the complete Pulumi program written in Python:

    import pulumi
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    # First, we define an S3 bucket in AWS where our data will reside.
    # Ensure that AWS access credentials are configured in your environment for Pulumi to use.
    s3_bucket = aws.s3.Bucket("data-share-bucket",
        acl="private")

    # Now, we create a Databricks secret scope.
    # Make sure to secure your credentials and do not expose them in code or version control.
    secret_scope = databricks.SecretScope("my-secret-scope",
        backend_type="DATABRICKS")

    # With the secret scope created, we put our S3 credentials into it.
    # Insert your AWS access key ID and secret access key here.
    aws_access_key = databricks.Secret("aws-access-key",
        key="AWS_ACCESS_KEY_ID",
        string_value="your-access-key-id",
        scope=secret_scope.name)

    aws_secret_key = databricks.Secret("aws-secret-key",
        key="AWS_SECRET_ACCESS_KEY",
        string_value="your-secret-access-key",
        scope=secret_scope.name)

    # Now we define the encryption details. Replace `your-kms-key-arn` with your actual key ARN.
    encryption_details = {
        "sseEncryptionDetails": {
            "algorithm": "AES-256",
            "awsKmsKeyArn": "your-kms-key-arn",
        },
    }

    # Finally, we create an External Location in Databricks that references the S3 bucket.
    external_location = databricks.ExternalLocation("my-external-location",
        url=s3_bucket.bucket.apply(lambda name: f"s3://{name}"),  # Build the S3 URL from the bucket name
        credential_name="my-credentials",  # Name of an existing Databricks storage credential
        encryption_details=encryption_details,
        metastore_id="your-metastore-id",
        owner="your-databricks-username",
        read_only=True,      # Expose the location as read-only for data sharing
        force_destroy=True)  # Allow the external location to be destroyed even if it is still referenced

    # Output the External Location URL for reference.
    pulumi.export("external_location_url", external_location.url)

    In this program:

    • We started by creating an Amazon S3 bucket to store the data.
    • We then created a secret scope in Databricks, which is a secure way to store and reference secrets like the AWS access and secret keys.
    • We added secrets to the secret scope: the AWS access key ID and secret access key. These will be used by Databricks to access the S3 bucket securely.
    • Next, we specified the encryption details that Databricks will use when accessing the S3 data. This example uses server-side encryption with an AWS KMS-managed key.
    • Using databricks.ExternalLocation, we created an external location that points to the S3 bucket. We passed the S3 URL, the name of a storage credential (my-credentials), and the encryption details; a sketch of defining such a credential follows this list.
    • Finally, we exported the URL of the external location to be accessible from outside the Pulumi program.
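    The program above references a storage credential named my-credentials but does not create one. In Unity Catalog, an external location must be backed by a storage credential, typically an IAM role that Databricks is allowed to assume. Below is a minimal sketch of how such a credential could be defined with the databricks.StorageCredential resource; the role ARN and names are placeholder assumptions, not values from your environment.

    # Sketch: the storage credential referenced by `credential_name` above.
    # The IAM role ARN below is a placeholder assumption; replace it with your own role.
    storage_credential = databricks.StorageCredential("my-credentials",
        name="my-credentials",  # Must match the credential_name used by the external location
        aws_iam_role=databricks.StorageCredentialAwsIamRoleArgs(
            role_arn="arn:aws:iam::123456789012:role/databricks-data-share-role",
        ),
        comment="Credential for the data-share external location")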

    Make sure to replace placeholders like your-access-key-id, your-secret-access-key, your-kms-key-arn, your-metastore-id, and your-databricks-username with actual values for your setup.
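    Rather than hardcoding the AWS keys as string literals, you can pull them from Pulumi's encrypted configuration. This is a minimal sketch assuming you have set the values with pulumi config set --secret awsAccessKeyId and pulumi config set --secret awsSecretAccessKey; the config key names here are arbitrary choices, not part of the program above.

    config = pulumi.Config()

    # These config keys are illustrative; set them with `pulumi config set --secret <key> <value>`.
    access_key_id = config.require_secret("awsAccessKeyId")
    secret_access_key = config.require_secret("awsSecretAccessKey")

    aws_access_key = databricks.Secret("aws-access-key",
        key="AWS_ACCESS_KEY_ID",
        string_value=access_key_id,  # Outputs are accepted here and stay marked as secrets
        scope=secret_scope.name)

    aws_secret_key = databricks.Secret("aws-secret-key",
        key="AWS_SECRET_ACCESS_KEY",
        string_value=secret_access_key,
        scope=secret_scope.name)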

    When this Pulumi program runs, it will provision the resources specified and output the URL of the external location, which indicates where your data resides for secure sharing.

    Keep in mind that proper permissions on both the S3 bucket and the Databricks workspace must be configured to allow secure data sharing; one way to handle the Databricks side is sketched below. Also, handle your credentials with care and avoid exposing them in your code or version control.
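    As one example of the Databricks side of those permissions, access to the external location can be granted with the databricks.Grants resource. The sketch below assumes a Unity Catalog group named data_engineers exists; the principal and privileges are placeholders to adapt to your workspace.

    # Sketch: grant read access on the external location to a group.
    # The group name `data_engineers` is an assumption; replace it with a real principal.
    location_grants = databricks.Grants("external-location-grants",
        external_location=external_location.name,
        grants=[databricks.GrantsGrantArgs(
            principal="data_engineers",
            privileges=["READ_FILES"],
        )])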