1. Data Lakes on Databricks with External Storage Mounting

    To build a data lake on Databricks with external storage mounting, you'll need to create a Databricks workspace and then configure mounts that link Databricks with external storage systems such as an S3 bucket or Azure Data Lake Storage (ADLS). Mounts let the Databricks File System (DBFS) access data stored in external storage as if it were local.
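
    To make the mount idea concrete, here is a minimal sketch of what working with mounted storage looks like from inside a Databricks notebook once a mount exists. The mount path /mnt/data-lake is a hypothetical example, and dbutils, spark, and display are provided by the Databricks runtime, so this snippet runs in a notebook rather than in the Pulumi program itself.

    # Runs inside a Databricks notebook, not in the Pulumi program.
    # "/mnt/data-lake" is a hypothetical mount path; use the one your mount resource creates.

    # List files in the mounted bucket as if it were a local filesystem path.
    display(dbutils.fs.ls("/mnt/data-lake"))

    # Read raw data from the mount and write curated results back to it.
    df = spark.read.json("/mnt/data-lake/raw/events/")
    df.write.format("delta").mode("overwrite").save("/mnt/data-lake/curated/events/")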

    First, we'll create a Databricks workspace where data processing and analytics will take place. Then we'll set up a mount point to an S3 bucket, which serves as our example of external storage. This lets the data lake read and write data in S3 as if it were a local filesystem.

    Here is how you can do this using the Pulumi Python SDK:

    1. Create a Databricks workspace: A workspace is your environment for accessing all of Databricks' features.
    2. Mount the S3 bucket: This gives the workspace access to your S3 data lake.
    3. Cluster & Notebook: Create a Databricks cluster to process data (the mount step needs one to run on) and, optionally, a notebook to write your analytics code.

    Let's write a Pulumi program to achieve this:

    import json

    import pulumi
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    # Note: Ensure your AWS and Databricks providers are configured properly

    # Provision a new Databricks workspace
    databricks_workspace = databricks.Workspace("my-databricks-workspace",
        tags={
            "Environment": "Production"
        },
        sku="premium"  # SKU can be "standard", "premium", or "trial" depending on your requirements
    )

    # Use an AWS IAM role for the workspace to access the S3 bucket
    s3_access_role = aws.iam.Role("s3-access-role",
        assume_role_policy=databricks_workspace.workspace_url.apply(
            lambda url: json.dumps({
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "databricks.amazonaws.com",
                    },
                    "Action": "sts:AssumeRole",
                    "Condition": {
                        "StringEquals": {
                            "sts:ExternalId": url,
                        },
                    },
                }]
            })
        )
    )

    # Define the policy that grants read access to the S3 data
    s3_access_policy = aws.iam.RolePolicy("s3-access-policy",
        role=s3_access_role.name,
        policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*",
                ],
                "Resource": "*",
            }]
        })
    )

    # Wrap the role in an instance profile so the cluster can use it when reading S3
    s3_instance_profile = aws.iam.InstanceProfile("s3-access-instance-profile",
        role=s3_access_role.name
    )

    # Create an S3 bucket to be used as the data lake storage
    data_lake_bucket = aws.s3.Bucket("data-lake-bucket",
        acl="private",
        tags={
            "Purpose": "Databricks Data Lake Storage"
        }
    )

    # Create a small cluster to perform the mount and process data
    # (the Spark version and node type below are examples; pick ones available in your workspace)
    databricks_cluster = databricks.Cluster("data-lake-cluster",
        spark_version="13.3.x-scala2.12",
        node_type_id="i3.xlarge",
        autotermination_minutes=20,
        num_workers=1
    )

    # Mount the S3 bucket to Databricks
    s3_mount = databricks.Mount("s3-mount",
        cluster_id=databricks_cluster.id,
        uri=data_lake_bucket.bucket.apply(lambda name: f"s3a://{name}"),
        s3=databricks.MountS3Args(
            bucket_name=data_lake_bucket.bucket,
            # The instance profile ARN we just created
            instance_profile=s3_instance_profile.arn
        )
    )

    # Outputs
    pulumi.export('databricksWorkspaceUrl', databricks_workspace.workspace_url)
    pulumi.export('dataLakeBucket', data_lake_bucket.bucket)

    In the above program, we create a Databricks workspace and an AWS S3 bucket. We then set up an IAM role with a trust policy that lets Databricks assume it, attach a role policy granting the necessary S3 read permissions, and wrap the role in an instance profile. Finally, we create a small cluster and mount the S3 bucket to Databricks using the databricks.Mount resource. The program exports the workspace URL and the name of the S3 bucket, which is now part of your data lake and mounted to Databricks.
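
    Step 3 above also mentions an optional notebook. If you want to manage a notebook alongside the rest of the stack, a minimal sketch using the databricks.Notebook resource could look like the following; the workspace path, the mount path, and the inline source are hypothetical placeholders, so adjust them to your own layout.

    import base64

    import pulumi_databricks as databricks

    # A trivial notebook body; in practice you would load this from a file in your repo.
    # The "/mnt/data-lake" path is a hypothetical mount point.
    notebook_source = """
    # Databricks notebook source
    df = spark.read.json("/mnt/data-lake/raw/events/")
    display(df.limit(10))
    """

    analytics_notebook = databricks.Notebook("analytics-notebook",
        path="/Shared/data-lake/analytics",  # hypothetical workspace path
        language="PYTHON",
        content_base64=base64.b64encode(notebook_source.encode("utf-8")).decode("utf-8")
    )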

    Remember that each pulumi.export surfaces its value once the program has run, so you can retrieve the workspace URL or the S3 bucket name at any time with pulumi stack output. Replace the placeholder values with ones that match your AWS account and Databricks setup, and make sure both providers are configured before running the program.
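
    If you prefer to configure the providers explicitly in code rather than through stack configuration alone, a minimal sketch might look like this. The region and the databricksHost/databricksToken config keys are placeholder assumptions; passing explicit provider instances via opts is optional if your stack configuration already supplies these settings.

    import pulumi
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    config = pulumi.Config()

    # Explicit provider instances; the values shown here are placeholders.
    aws_provider = aws.Provider("aws-provider", region="us-east-1")
    databricks_provider = databricks.Provider("databricks-provider",
        host=config.require("databricksHost"),           # e.g. your workspace URL
        token=config.require_secret("databricksToken"),  # stored as a Pulumi secret
    )

    # Resources can then opt in to these providers, for example:
    bucket = aws.s3.Bucket("data-lake-bucket",
        opts=pulumi.ResourceOptions(provider=aws_provider),
    )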

    For more detailed documentation on these resources, see the Pulumi Registry pages for the Databricks and AWS providers.