Unified Data Analytics with Databricks Storage Credentials

Question

Pulumi · Accepted Answer

Setting up storage credentials lets Databricks securely access external storage services. With a credential in place, Databricks can read and analyze large datasets stored in, for example, AWS S3 buckets, Azure Blob Storage, or Google Cloud Storage.

In this setup, Databricks analyzes data that lives in your cloud storage rather than holding it itself. To do this securely, you need to supply Databricks with credentials, and how you provide them depends on the cloud provider you are using.

Let's say you're using AWS and you want a storage credential that grants Databricks read-only access to an S3 bucket via an IAM role. The following Pulumi program demonstrates how you can provision such a credential using the `pulumi_databricks` package.

Here's what each part of the program does:

1. **Import dependencies**: We import the Pulumi SDK and the Pulumi Databricks provider package.

2. **Set up the storage credential**: We create an instance of `databricks.StorageCredential` and specify the resource name, the owner, the AWS IAM role, and the metastore ID. The read-only flag is set to `True`, so this credential can only be used for read access.

3. **Export the storage credential ID**: By exporting the `storage_credential_id`, you can reference this credential ID in other parts of your infrastructure as code.

Here is the program:

```python
import pulumi
import pulumi_databricks as databricks

# Define the AWS IAM role that Databricks will assume to access the S3 bucket
aws_iam_role = databricks.StorageCredentialAwsIamRoleArgs(
    role_arn="arn:aws:iam::123456789012:role/databricks-s3-access-role",
)

# Create a read-only storage credential in Databricks for accessing data in S3 using the IAM role
storage_credential = databricks.StorageCredential("readOnlyS3Credential",
    # Replace with the user or group that should own the storage credential
    owner="owner@example.com",
    # The Python SDK uses snake_case arguments: attach the IAM role and mark the credential read-only
    aws_iam_role=aws_iam_role,
    read_only=True,
    # You must specify the metastore ID associated with your Databricks workspace
    metastore_id="your-metastore-id",
)

# Output the storage credential ID
pulumi.export("storage_credential_id", storage_credential.id)
```

Remember to replace `"arn:aws:iam::123456789012:role/databricks-s3-access-role"` with the actual ARN of the IAM role you've configured for Databricks access to S3, `"owner@example.com"` with the user or group that should own the storage credential, and `"your-metastore-id"` with the actual metastore ID associated with your Databricks workspace.
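A mistyped ARN usually only surfaces when the deployment fails, so it can help to sanity-check the value before running the program. The following is a minimal sketch using only the Python standard library. The helper name `is_iam_role_arn` and the regular expression are illustrative assumptions, not part of Pulumi or the Databricks provider, and the pattern checks only the general shape of an IAM role ARN.

```python
import re

# Hypothetical helper: checks only the general shape of an IAM role ARN
# (partition, 12-digit account ID, role path/name). It does not confirm
# that the role exists or that Databricks is allowed to assume it.
_IAM_ROLE_ARN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+"
)


def is_iam_role_arn(value: str) -> bool:
    """Return True if value looks like an IAM role ARN."""
    return _IAM_ROLE_ARN.fullmatch(value) is not None


# The sample ARN from the program passes; a leftover placeholder does not.
print(is_iam_role_arn("arn:aws:iam::123456789012:role/databricks-s3-access-role"))
print(is_iam_role_arn("your-role-arn"))
```

A check like this could run at the top of the Pulumi program and raise an error before any resources are created.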

This program uses the Databricks provider for Pulumi to provision a storage credential. It would typically be part of a larger Pulumi deployment in which you also define the necessary cloud resources, such as the S3 buckets themselves and potentially the IAM role, along with the other Databricks resources needed for your unified data analytics setup.
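As one illustration of the IAM side of that larger deployment, the role referenced by the credential needs a policy that permits reads on the bucket. The sketch below builds such a policy document with the standard library only. The bucket name `my-databricks-data` and the helper `read_only_s3_policy` are hypothetical, and the action list is an assumption about what read-only access typically requires, so check it against the Databricks and AWS documentation before relying on it. The role's trust policy, which lets Databricks assume the role, is not covered here.

```python
import json


def read_only_s3_policy(bucket: str) -> str:
    """Build a JSON IAM policy allowing read-only access to one bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed read-only set: fetch objects, list keys, locate the bucket
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket-level actions
                    f"arn:aws:s3:::{bucket}/*",  # object-level actions
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical bucket name, for illustration only
print(read_only_s3_policy("my-databricks-data"))
```

The resulting JSON string could then be attached to the IAM role, for instance through an IAM policy resource in the Pulumi AWS provider, if you manage the role in the same program.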

For detailed information about the Databricks provider in Pulumi, you can refer to the [Pulumi Databricks Provider documentation](https://www.pulumi.com/registry/packages/databricks/). Pulumi's programming model allows you to define such credentials in a declarative manner, enabling secure, auditable, and reproducible deployments.

Make sure to run `pulumi up` to deploy this configuration and provision the resources in your own Databricks environment. The output of the command will show you the ID of the created read-only storage credential.