1. Mounting S3 Buckets for Databricks Notebooks.


    To mount an Amazon S3 bucket for use with Databricks Notebooks, we will perform a few steps using Pulumi to provision the necessary cloud resources and configurations:

    1. Create an S3 Bucket: This is where your data will be stored. An S3 bucket is a container for objects stored in Amazon S3.
    2. Set up an IAM Role and Policy: Databricks needs permissions to access S3, which we can grant by creating an IAM role and attaching a policy that allows access to the bucket.
    3. Configure Databricks: We'll set up the Databricks workspace and notebook, but the actual mounting step (linking S3 to Databricks) is performed by commands run inside a Databricks notebook, not by Pulumi. The mounting script uses the Databricks CLI or REST API together with the IAM role provisioned earlier.
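    Before wiring everything into Pulumi, it can help to see the two policy documents at the heart of step 2 as plain Python dictionaries. This is an illustrative sketch only: the helper names and the sample account ID are hypothetical, but the policy contents match what the Pulumi program below creates.

```python
import json

def trust_policy(account_id: str) -> str:
    # Trust relationship: which principal may assume the role.
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        }],
    })

def bucket_access_policy(bucket_arn: str) -> str:
    # Permissions: what the role may do with the bucket and its objects.
    # ListBucket applies to the bucket ARN itself; Get/PutObject apply
    # to the objects inside it, hence the "/*" resource.
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
        }],
    })
```

    Note that the object-level actions need the `/*` suffix on the bucket ARN, while `s3:ListBucket` is granted on the bucket ARN itself; forgetting either is a common cause of AccessDenied errors.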

    Below is a Pulumi program that sets up the cloud resources. Please note that you need to configure your Databricks workspace separately and include the mounting script in a notebook or job to complete the process. We assume that you have set up authentication and have appropriate AWS and Databricks credentials.

    import json

    import pulumi
    import pulumi_aws as aws

    # Create an S3 bucket for storing data.
    s3_bucket = aws.s3.Bucket("data-bucket")

    # IAM role that Databricks will assume to get access to the S3 bucket.
    databricks_s3_role = aws.iam.Role("databricks-s3-role",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    # Replace account-ID with your AWS account ID.
                    "AWS": "arn:aws:iam::account-ID:root"
                },
            }]
        }))

    # IAM policy granting access to the S3 bucket.
    s3_access_policy = aws.iam.Policy("s3-access-policy",
        policy=s3_bucket.arn.apply(lambda arn: json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [arn, f"{arn}/*"]
            }]
        })))

    # Attach the policy to the role.
    role_policy_attachment = aws.iam.RolePolicyAttachment("role-policy-attachment",
        role=databricks_s3_role.name,
        policy_arn=s3_access_policy.arn)

    # Output the ARN of the S3 bucket and the IAM role for use in Databricks.
    pulumi.export("s3_bucket_arn", s3_bucket.arn)
    pulumi.export("databricks_s3_role_arn", databricks_s3_role.arn)

    In this Pulumi program, we created an S3 bucket and an IAM role along with a policy that grants Databricks the necessary permissions to read and write to the S3 bucket. The ARNs of both resources are exported for use in your Databricks notebooks.

    Mounting S3 in Databricks notebooks is typically performed from a notebook cell using the dbutils library (or via a Hadoop configuration). An example command in a Databricks notebook cell would look like this (not executed by Pulumi):

    dbutils.fs.mount(
        source="s3a://<your-bucket-name>",
        mount_point="/mnt/data",
        extra_configs={
            "fs.s3a.aws.credentials.provider":
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
        }
    )

    Ensure that you replace <your-bucket-name> with the name of the S3 bucket you provisioned.

    In practice, to secure access:

    • Store AWS keys in a secure place like Databricks secrets or environment variables, not hardcoded in your notebook.
    • Use fine-grained IAM policies to limit access, following the principle of least privilege.
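    As a concrete illustration of the secrets guidance above, a key-based mount can pull credentials from a Databricks secret scope at run time instead of hardcoding them. This is a hedged sketch: the scope name `aws-creds`, the key names, and the `mount_source` helper are all hypothetical, and the `dbutils` calls only work inside a Databricks notebook, so they are shown commented out.

```python
from urllib.parse import quote

def mount_source(bucket_name: str, access_key: str, secret_key: str) -> str:
    # Build an s3a URL that embeds the credentials. AWS secret keys can
    # contain '/' characters, so the secret key must be URL-encoded.
    return f"s3a://{access_key}:{quote(secret_key, safe='')}@{bucket_name}"

# Inside a Databricks notebook, fetch the keys from a secret scope
# (scope and key names are hypothetical):
# access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
# secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")
# dbutils.fs.mount(
#     source=mount_source("<your-bucket-name>", access_key, secret_key),
#     mount_point="/mnt/data")
```

    The IAM-role approach from the Pulumi program is still preferable where available, since no long-lived keys are involved at all.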

    Lastly, remember to run pulumi up to deploy these resources before attempting to mount the bucket in Databricks.