1. Multi-Format Data Storage using Databricks External Locations


    To set up multi-format data storage with Databricks external locations, we need cloud storage that Databricks can connect to. An external location is a Unity Catalog object that pairs a cloud storage path with a storage credential, giving Databricks governed access to data stored outside its managed storage. The data can be in multiple formats such as CSV, JSON, and Parquet, and can live on storage services such as AWS S3, Azure Data Lake Storage, or Google Cloud Storage.

    In this example, we will assume you are using AWS S3 for storage; however, the concepts apply similarly to Azure and GCP. We will set up an AWS S3 bucket as an external location where Databricks can access data in different formats.

    The Python code below is a Pulumi program that creates an S3 bucket and registers it as an external location in Databricks. We'll use the pulumi_aws package for the bucket and the pulumi_databricks package for the Databricks resources.

    Before you begin, ensure you have the Pulumi CLI installed and configured with the appropriate cloud provider credentials. You should also have access credentials for Databricks, allowing you to manage resources there.

    Here is the structure of the setup we will follow:

    1. Create an AWS S3 bucket that will serve as the data storage.
    2. Define an external location in Databricks that points to the S3 bucket.

    Let's take a look at the program:

    import pulumi
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    # Create an AWS S3 bucket to store the data.
    # The logical name uses a hyphen because Pulumi derives the physical bucket
    # name from it, and S3 bucket names cannot contain underscores.
    s3_bucket = aws.s3.Bucket(
        "data-bucket",
        acl="private",
        force_destroy=True,  # Allows the bucket to be destroyed even if it still contains objects
    )

    # Databricks external location backed by the S3 bucket.
    # Replace 'METASTORE_ID' with the ID of your Unity Catalog metastore
    # (this is a metastore ID, not the workspace URL).
    external_location = databricks.ExternalLocation(
        "my_external_location",
        # Build the s3://<bucket-name>/ URL from the bucket's generated name
        url=s3_bucket.bucket.apply(lambda name: f"s3://{name}/"),
        metastore_id="METASTORE_ID",
        credential_name="my_credentials",  # Must match a storage credential that already exists in Databricks
        # If using SSE-KMS for encryption, specify it in encryption_details:
        # encryption_details=databricks.ExternalLocationEncryptionDetailsArgs(
        #     sse_encryption_details=databricks.ExternalLocationEncryptionDetailsSseEncryptionDetailsArgs(
        #         algorithm="AWS_SSE_KMS",
        #         aws_kms_key_arn="arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID",  # Replace with your KMS key ARN
        #     ),
        # ),
    )

    # Export the S3 bucket name and Databricks external location URL
    pulumi.export("s3_bucket", s3_bucket.id)
    pulumi.export("external_location_url", external_location.url)

    In this code, you're defining two main resources:

    • An S3 bucket with the logical name data-bucket that will hold your multi-format data. Pulumi appends a random suffix to form the physical bucket name. The ACL is set to private for security, and force_destroy is enabled so that the bucket can be removed during teardown even if it still contains objects.
    • A Databricks external location resource named my_external_location. Its url is built from the bucket's generated name in the s3://<bucket-name>/ form that Databricks expects. It also takes the ID of the Unity Catalog metastore and the name of the storage credential that Databricks uses to access the bucket.

    Before deploying, replace METASTORE_ID with the ID of your Unity Catalog metastore and set credential_name to the name of a storage credential that already exists in Databricks. If you enable the commented-out encryption_details block, also replace REGION, ACCOUNT_ID, and KEY_ID in the KMS key ARN.

    The storage credential must grant access to the bucket. On AWS, a Unity Catalog storage credential references an IAM role that Databricks assumes, rather than long-lived access and secret keys, and that role needs the required permissions on the S3 bucket.
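
    However the credentials are provisioned, the AWS identity behind them needs permissions on the bucket. The following is a minimal sketch of such a policy document, not a definitive one: the action list is an assumption covering basic read and write access, the function name s3_access_policy and the bucket name are hypothetical, and the role's trust relationship with Databricks is not shown. Check the Databricks documentation for the exact policy your setup requires.

```python
import json


def s3_access_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read/write access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
            {
                # Bucket-level actions apply to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [bucket_arn],
            },
        ],
    }


if __name__ == "__main__":
    # "my-data-bucket" is a placeholder; use the bucket name exported by the Pulumi program.
    print(json.dumps(s3_access_policy("my-data-bucket"), indent=2))
```

    Scoping the policy to a single bucket ARN keeps the credential from reaching other buckets in the account.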

    This program configures your cloud resources and can be deployed with the Pulumi CLI by running pulumi up. After the deployment is complete, Databricks can use the designated S3 bucket as an external storage location, and you can start storing and accessing data in different formats from your Databricks applications.
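
    To illustrate the multi-format part, here is a minimal sketch of reading several formats from the external location with Spark. It assumes a Databricks notebook where spark is predefined and a principal that has been granted read access on the external location. The helper names dataset_url and read_dataset, the bucket name, and the folder layout are all hypothetical.

```python
def dataset_url(base_url: str, relative_path: str) -> str:
    """Join the external location URL and a relative path with a single slash."""
    return f"{base_url.rstrip('/')}/{relative_path.lstrip('/')}"


def read_dataset(spark, base_url: str, relative_path: str, fmt: str, **options):
    """Load one dataset from the external location as a Spark DataFrame."""
    reader = spark.read.format(fmt).options(**options)
    return reader.load(dataset_url(base_url, relative_path))


# Usage in a Databricks notebook (placeholders throughout):
# base = "s3://my-data-bucket/"
# csv_df = read_dataset(spark, base, "raw/events_csv/", "csv", header="true")
# json_df = read_dataset(spark, base, "raw/events_json/", "json")
# parquet_df = read_dataset(spark, base, "curated/events/", "parquet")
```

    Passing the format as a parameter lets the same helper serve CSV, JSON, and Parquet data, with format-specific reader options supplied as keyword arguments.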