Persisting Databricks Delta Tables on S3.

Pulumi · Accepted Answer

Persisting Databricks Delta tables on Amazon S3 takes three steps across AWS and Databricks. First, set up an S3 bucket to store the Delta tables. Next, mount that bucket in the Databricks workspace. Finally, create the tables in Databricks, specifying the S3 location where the data should persist.

The Pulumi program below, written in Python, does the following:

1. Sets up an S3 bucket using the aws.s3.Bucket resource.
2. Mounts the S3 bucket to the Databricks workspace using the databricks.Mount resource.
3. Creates the Delta table using the databricks.Table resource, specifying the S3 location as the storage location.

Let's walk through the code, which accomplishes these tasks:

import pulumi
import pulumi_aws as aws
import pulumi_databricks as databricks

# Create an S3 bucket to store Databricks Delta Tables.
s3_bucket = aws.s3.Bucket("delta-data-bucket")

# Use the Databricks provider to mount the S3 bucket into Databricks.
s3_mount = databricks.Mount("delta-mount",
    cluster_id="<CLUSTER-ID>",  # Replace with your Databricks cluster ID.
    s3=databricks.MountS3Args(
        bucket_name=s3_bucket.bucket,   # Reference the created S3 bucket.
        instance_profile="<INSTANCE-PROFILE>",  # Replace this with the instance profile associated with S3 access.
    ))

# Now, let's define the Databricks Delta Table.
# The location should point to a path within the mounted S3 bucket.
delta_table = databricks.Table("delta-table",
    name="my_delta_table",
    schema_name="default",
    catalog_name="hive_metastore",   # This refers to the Hive metastore service that Databricks uses.
    columns=[                       # Define the schema of the table.
        databricks.TableColumnArgs(
            name="id",
            type_name="INT",
            nullable=False,
            position=1,
        ),
        databricks.TableColumnArgs(
            name="data",
            type_name="STRING",
            nullable=True,
            position=2,
        ),
    ],
    table_type="EXTERNAL",        # The data lives at an external location (the S3 path below).
    data_source_format="DELTA",   # Store the table in Delta format.
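    # Set the table storage location to a path within the S3 bucket.
    # This must be an s3:// URI built from the bucket name; the bucket ARN is not a valid location.
    storage_location=s3_bucket.bucket.apply(lambda name: f"s3://{name}/delta-tables/my_delta_table"),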
)

# Export the bucket's regional domain name and the table's storage location.
pulumi.export('bucket_url', s3_bucket.bucket_regional_domain_name)
pulumi.export('table_storage_location', delta_table.storage_location)

Replace <CLUSTER-ID> with the ID of your Databricks cluster, and replace <INSTANCE-PROFILE> with the ARN of the AWS instance profile that has the necessary permissions to read from and write to the S3 bucket.
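Because an unreplaced placeholder or a role ARN pasted in place of an instance profile ARN is an easy mistake here, a small check before deployment can help. The following is a minimal sketch, not part of the original program: the helper name is_instance_profile_arn is hypothetical, and the pattern assumes the usual arn:<partition>:iam::<12-digit account ID>:instance-profile/<name> shape.

```python
import re

# Assumed shape of an IAM instance profile ARN:
#   arn:<partition>:iam::<12-digit account id>:instance-profile/<optional path/><name>
# The partition is "aws", or a variant such as "aws-cn" or "aws-us-gov".
_INSTANCE_PROFILE_ARN = re.compile(
    r"arn:aws(-[a-z]+)*:iam::\d{12}:instance-profile/[\w+=,.@/-]+"
)


def is_instance_profile_arn(value: str) -> bool:
    """Return True if value looks like an IAM instance profile ARN."""
    return _INSTANCE_PROFILE_ARN.fullmatch(value) is not None


# Hypothetical example values, for illustration only.
print(is_instance_profile_arn("arn:aws:iam::123456789012:instance-profile/databricks-s3-access"))
print(is_instance_profile_arn("<INSTANCE-PROFILE>"))
```

The check only validates the format of the string; it does not confirm that the instance profile exists or that it grants access to the bucket.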

In this code, aws.s3.Bucket creates the S3 bucket that stores the Delta table data. The databricks.Mount resource mounts that bucket as a storage location in your Databricks workspace. The databricks.Table resource defines the Delta table and sets its storage_location property to a path inside the bucket, which is where the table data is persisted.
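The storage location needs to be an s3:// URI built from the bucket name, not the bucket's ARN (which has the form arn:aws:s3:::bucket-name). As a minimal sketch, the hypothetical helper below (delta_table_location is not part of any library) shows how such a URI can be assembled from a bucket name and a path inside the bucket:

```python
def delta_table_location(bucket_name: str, table_path: str) -> str:
    """Build an s3:// URI for a table directory inside a bucket.

    Leading and trailing slashes on table_path are dropped so the
    result never contains a doubled or dangling slash.
    """
    return f"s3://{bucket_name}/{table_path.strip('/')}"


# Hypothetical bucket name, for illustration only.
print(delta_table_location("delta-data-bucket-1a2b3c", "/delta-tables/my_delta_table/"))
# prints: s3://delta-data-bucket-1a2b3c/delta-tables/my_delta_table
```

Inside a Pulumi program the bucket name is an output rather than a plain string, so the helper would be called inside apply, for example s3_bucket.bucket.apply(lambda name: delta_table_location(name, "delta-tables/my_delta_table")).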

Lastly, the program exports the bucket's regional domain name and the storage location of the persisted Delta table, so they can be easily accessed or referenced.

Make sure the Databricks cluster and the AWS configuration are set up with the right permissions to access the S3 bucket. This will likely involve configuring AWS IAM roles and policies, as well as setting up the Databricks instance profile. The program assumes all of this is in place before it runs.
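To make the permissions requirement more concrete, here is a minimal sketch of the kind of IAM policy document the role behind the instance profile might carry. The helper name delta_bucket_policy and the exact action list are assumptions for illustration; your environment may need more or fewer permissions.

```python
import json


def delta_bucket_policy(bucket_name: str) -> str:
    """Return a JSON IAM policy granting read/write access to one bucket.

    The action list is an assumed minimum for reading and writing
    table files; adjust it to your own security requirements.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to the keys under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical bucket name, for illustration only.
print(delta_bucket_policy("delta-data-bucket-1a2b3c"))
```

The resulting JSON string could be attached to the IAM role used by the instance profile, whether through the AWS console or another Pulumi resource.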