Databricks External Locations for Scalable Data Ingestion

Pulumi · Accepted Answer

When working with Databricks on a cloud platform, you often need to ingest and process large volumes of data from various sources. A Databricks *External Location* is a Unity Catalog object that pairs a cloud storage path (S3, ADLS, GCS, etc.) with a storage credential authorized to access it. It is not a mount: once the External Location is defined, users and jobs that have been granted access can read from and write to that path by its URL, without handling cloud credentials themselves.

This pattern is useful when dealing with large datasets or when integrating with existing data lakes. It is common practice to read from and write to these external locations from your Databricks notebooks or jobs, enabling large-scale data processing tasks.

Let's create an External Location in Databricks using Pulumi. This program sets up an External Location that points to an S3 bucket. We need the URL of the S3 bucket, the name we want to give this location in Databricks, the owner of the resource, and the name of a storage credential that can access the bucket. If the data is encrypted at rest, encryption details can be supplied as well.

Here is a Pulumi program that demonstrates how to set up an External Location in Databricks:

```python
import pulumi
import pulumi_databricks as databricks

# Create an External Location that references an S3 bucket for data ingestion.
external_data_location = databricks.ExternalLocation("external-data-location",
    url="s3://my-external-data-bucket",
    name="MyExternalDataLocation",
    owner="owner@example.com",
    credential_name="my-external-storage-credential",
    metastore_id="metastore-id",  # Replace with your metastore ID
    force_destroy=True,           # Allow deletion of the external location even if dependent objects exist.
    # If the data is encrypted at rest, specify the details in the `encryption_details` argument.
)

# Export the ID of the External Location
pulumi.export("external_location_id", external_data_location.id)
```

In this program:

- We import the required Pulumi packages.
- Using the `databricks.ExternalLocation` class, we create an External Location that points to an `s3` bucket with its URL.
- We provide a `name` which is how this External Location will be referred to in the Databricks workspace.
- The `owner` is the principal that will own this resource, typically the email of a Databricks user; a group or service principal can also be used.
- The `credential_name` is the name of an existing Unity Catalog storage credential that grants access to the S3 bucket. It is not a Databricks secret.
- `metastore_id` is specific to the metastore in use and you would replace "metastore-id" with the actual metastore ID for your setup.
- `force_destroy` is a boolean indicating whether or not to allow Databricks to delete this external location even if there are dependent objects. Be cautious with this setting in production.
- Finally, we export the `id` of the External Location as an output of our Pulumi program, so we can easily reference it elsewhere.
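The `credential_name` above has to refer to a storage credential that already exists. As a minimal sketch, assuming the credential is managed in the same Pulumi program and backed by an AWS IAM role, the two resources could be wired together as follows. The role ARN and the resource names are placeholders, not values from the original program.

```python
import pulumi
import pulumi_databricks as databricks

# Storage credential backed by an IAM role (placeholder ARN; replace with your own).
storage_credential = databricks.StorageCredential("external-storage-credential",
    name="my-external-storage-credential",
    aws_iam_role=databricks.StorageCredentialAwsIamRoleArgs(
        role_arn="arn:aws:iam::123456789012:role/my-databricks-access-role",
    ),
    comment="Managed by Pulumi",
)

# Referencing the credential's output lets Pulumi create it before the location.
external_data_location = databricks.ExternalLocation("external-data-location",
    url="s3://my-external-data-bucket",
    name="MyExternalDataLocation",
    credential_name=storage_credential.name,
)

pulumi.export("external_location_id", external_data_location.id)
```

Passing `storage_credential.name` rather than a string literal gives Pulumi the dependency ordering for free, so the External Location is only created once the credential exists.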

Note that the Databricks provider must be configured before you run this code. This typically means supplying the workspace URL and an access token, either as environment variables (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`) or as values in your Pulumi stack configuration (`databricks:host`, `databricks:token`).
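If you rely on environment variables for token-based authentication, a small pre-flight check can surface a missing setting before `pulumi up` fails partway through. The helper below is a hypothetical convenience, not part of Pulumi or the Databricks provider, and it assumes the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` variables are your chosen configuration method.

```python
import os

# Environment variables used for token-based authentication (assumed setup).
REQUIRED_ENV_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_databricks_env()` with no arguments inspects the current process environment; an empty list means both values are present. If you configure the provider through stack configuration instead, this check does not apply.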

Remember that environment-specific values such as bucket URLs, credential names, and metastore IDs should come from configuration rather than being hardcoded as in this example. Sensitive values such as tokens and keys belong in Pulumi's encrypted config or your cloud provider's secrets manager.
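One way to avoid hardcoding is to read these values from stack configuration. The sketch below assumes hypothetical config keys named `bucketUrl`, `credentialName`, and `owner`; pick whatever names suit your project.

```python
import pulumi
import pulumi_databricks as databricks

config = pulumi.Config()

# Hypothetical keys, set with e.g. `pulumi config set bucketUrl s3://my-external-data-bucket`.
bucket_url = config.require("bucketUrl")            # fails fast if the key is missing
credential_name = config.require("credentialName")
owner = config.get("owner")                         # optional; None if unset

external_data_location = databricks.ExternalLocation("external-data-location",
    url=bucket_url,
    name="MyExternalDataLocation",
    credential_name=credential_name,
    owner=owner,
)
```

For values that are actually secret, `pulumi config set --secret` together with `config.require_secret(...)` keeps them encrypted in the stack file and masked in output.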

This program implicitly assumes that you have the correct permissions set up on the cloud provider side to access the specified S3 bucket and that the resources specified (like the metastore) exist. Ensure that your IAM roles, policies, and resource provisioning align with this setup for the code to execute successfully.
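To make the permissions requirement more concrete, here is a rough sketch of the kind of IAM policy document the role behind the storage credential might carry for this bucket. The action list is an assumption about a typical read/write setup, so check the Databricks documentation for the exact permissions and trust policy your workspace requires. The helper name `bucket_access_policy` is hypothetical.

```python
import json


def bucket_access_policy(bucket_name):
    """Build an IAM policy document granting read/write access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Bucket-level actions need the bucket ARN; object-level ones need "/*".
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


policy_json = json.dumps(bucket_access_policy("my-external-data-bucket"), indent=2)
```

The resulting `policy_json` string can be attached to the IAM role with whatever tooling you already use for AWS.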