Centralized Data Hub Creation with Databricks External Locations

Pulumi · Accepted Answer

A centralized data hub in Databricks typically relies on an external location: a Unity Catalog object that registers a cloud storage path so that Databricks clusters can read and process the data stored there. External locations connect Databricks to data stored outside of DBFS (Databricks File System), in cloud storage such as AWS S3, Azure Data Lake Storage Gen2, or Google Cloud Storage.

An external location acts as a pointer to remote storage, pairing a storage URL with the credential used to reach it. Databricks can then access the data in place instead of copying it into DBFS, which avoids duplication and keeps access control and governance in one place.

In this case, I will demonstrate how to create an external location using Pulumi and the Databricks provider. The external location will point to an S3 bucket in this example, as AWS is a common cloud provider. If you're using Azure or GCP, the process would be similar, but you would point to the appropriate storage services for those environments.
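The main difference between clouds is the URL scheme passed as `url`. As a rough sketch, the helper below builds the storage URL for each provider. `storage_url` is a hypothetical function written for illustration (it is not part of Pulumi or the Databricks provider), and the bucket, container, and account names are placeholders.

```python
def storage_url(cloud: str, bucket: str, path: str = "", account: str = "") -> str:
    """Build a cloud storage URL of the kind an external location points to.

    Hypothetical helper for illustration; all names are placeholders.
    """
    trimmed = path.strip("/")
    suffix = f"/{trimmed}" if trimmed else ""
    if cloud == "aws":
        return f"s3://{bucket}{suffix}"
    if cloud == "gcp":
        return f"gs://{bucket}{suffix}"
    if cloud == "azure":
        if not account:
            raise ValueError("Azure URLs need a storage account name")
        # For ADLS Gen2, the container plays the role of the bucket.
        return f"abfss://{bucket}@{account}.dfs.core.windows.net{suffix}"
    raise ValueError(f"unsupported cloud: {cloud!r}")


print(storage_url("aws", "my-central-data-hub-bucket"))
print(storage_url("azure", "datahub", "raw", account="mystorageacct"))
print(storage_url("gcp", "my-central-data-hub-bucket", "raw/events"))
```

The rest of the resource definition should look the same across providers, apart from the kind of credential that backs the location.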

Here's how you can create an external location with Pulumi using Python:

```python
import pulumi
import pulumi_databricks as databricks

# Create a Databricks external location that points to an S3 bucket.
# Ensure you have already defined your Databricks workspace and S3 bucket outside of this script.
external_location = databricks.ExternalLocation("central-data-hub",
    name="central-data-hub",
    owner="datahub-owner",  # Owner of the external location in Databricks, such as a user or group
    comment="Centralized Data Hub for company analytics",  # Optional description for clarity
    url="s3://my-central-data-hub-bucket",  # URL pointing to the intended S3 bucket
    credential_name="my-databricks-s3-credentials",  # Name of the storage credential used to access the S3 bucket
    metastore_id="my-databricks-metastore-id",  # ID of the metastore in Databricks associated with this location
)

# Export the ID of the external location to be available after deployment.
pulumi.export("external_location_id", external_location.id)
```

This Pulumi program creates an `ExternalLocation` resource with the following arguments. Note that the Pulumi Python SDK expects snake_case keyword arguments (`credential_name`, not `credentialName`).

- `name`: The user-friendly name of the external location within Databricks.
- `owner`: The user or group within Databricks that owns the external location.
- `comment`: An optional description that gives other users context about what the location is for.
- `url`: The remote storage location where your data resides (in this case, an S3 bucket).
- `credential_name`: The name of the Databricks storage credential that grants access to the specified storage. This credential should be set up in Databricks beforehand.
- `metastore_id`: The ID of the Databricks metastore associated with this location, which holds the metadata for the data in the hub.

Note: This program assumes that the Databricks workspace, the S3 bucket, and the storage credential already exist and are configured in your environment.
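Because those prerequisites live outside the program, a typo in a name or URL would otherwise only surface at deploy time. One option is a small pre-flight check before constructing the resource. The sketch below is plain Python with no Pulumi dependency; `ExternalLocationSettings` and its `validate` method are hypothetical names, and the set of accepted URL schemes is an assumption covering the three clouds mentioned earlier.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse

# Assumption: only these storage URL schemes are expected in this project.
ALLOWED_SCHEMES = {"s3", "abfss", "gs"}


@dataclass(frozen=True)
class ExternalLocationSettings:
    """Hypothetical container for the arguments of one external location."""

    name: str
    url: str
    credential_name: str
    comment: str = ""

    def validate(self) -> None:
        """Raise ValueError if any setting is obviously malformed."""
        parsed = urlparse(self.url)
        if parsed.scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URL scheme in {self.url!r}")
        if not parsed.netloc:
            raise ValueError(f"no bucket or container in {self.url!r}")
        for field_name in ("name", "credential_name"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")


settings = ExternalLocationSettings(
    name="central-data-hub",
    url="s3://my-central-data-hub-bucket",
    credential_name="my-databricks-s3-credentials",
    comment="Centralized Data Hub for company analytics",
)
settings.validate()

# Keys are snake_case, matching the keyword arguments of the Python SDK.
kwargs = asdict(settings)
print(kwargs)
```

The resulting dictionary could then be unpacked into the resource constructor, for example `databricks.ExternalLocation("central-data-hub", **kwargs)`. This only catches malformed input; it does not confirm that the bucket or credential actually exists.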

For more details and configuration options, you can refer to the documentation for the [`databricks.ExternalLocation`](https://www.pulumi.com/registry/packages/databricks/api-docs/externallocation/) resource.

After you deploy this program with `pulumi up`, the stack output `external_location_id` contains the ID of the external location you created. You can use this ID to reference the external location in other Pulumi programs or Databricks tasks.