1. Serverless SQL Queries on Delta Lake using Databricks SQL Endpoint


    To execute serverless SQL queries on Delta Lake with a Databricks SQL Endpoint, you need a Databricks workspace, a SQL Endpoint (the serverless compute that runs your queries), and the appropriate security and networking configuration. With Pulumi, you use the Databricks provider to create and manage these resources.

    Here is a step-by-step guide with a corresponding Pulumi program in Python that accomplishes the following:

    1. Databricks Workspace Setup: The first step is to create a workspace where you can run your SQL queries. A workspace is an environment for accessing all of your Databricks assets.

    2. SQL Endpoint Creation: After setting up the workspace, you will create a SQL Endpoint. This will serve as your serverless SQL query execution environment. It is a fully managed compute resource.

    3. Delta Lake and SQL Queries: Once the workspace and SQL Endpoint are ready, you are set to execute SQL queries against Delta Lake. Delta Lake is a storage layer that brings ACID transactions to Apache Spark and other big-data engines.

    4. Security and Access Control: We will also include an example of setting up an access control policy for your SQL Endpoint (see the sketch after the main program below).

    Below is a Pulumi program in Python that demonstrates how to create these resources. Remember to replace placeholder values with your actual configuration values.

    import pulumi
    import pulumi_databricks as databricks

    # Create a Databricks workspace.
    # Replace 'resource_group_name' with your Azure resource group or other cloud provider specifics.
    # The workspace is where all your Databricks assets, like notebooks and clusters, reside.
    workspace = databricks.Workspace("my-databricks-workspace",
        location="westus",                        # Replace with your desired region.
        sku="premium",                            # Choose the SKU that fits your needs.
        resource_group_name="my-resource-group",  # Replace with your resource group name.
    )

    # Define the Databricks SQL Endpoint.
    # This creates a lightweight compute resource for running SQL queries.
    # auto_stop_mins stops the endpoint when it is not in use to save costs.
    sql_endpoint = databricks.SqlEndpoint("my-sql-endpoint",
        channel="CURRENT",               # The channel specifies the version of the runtime to use.
        cluster_size="Small",            # The size of the cluster that runs queries.
        auto_stop_mins=120,              # Automatically stop the SQL Endpoint after 2 hours of inactivity.
        enable_serverless_compute=True,  # Enable the serverless compute option.
        data_source_id=workspace.id,     # Associate the SQL Endpoint with our workspace.
        tags={"env": "production"},      # Add tags for resource categorization and filtering.
    )

    # Export the SQL Endpoint JDBC URL for use outside of Pulumi.
    # You can use this URL to connect to the SQL Endpoint with JDBC/ODBC clients or within Databricks notebooks.
    pulumi.export("sqlEndpointJdbcUrl", sql_endpoint.jdbc_url)
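    To cover the access control mentioned in step 4, here is a minimal sketch using the provider's Permissions resource (the Pulumi counterpart of Terraform's databricks_permissions). The group name "analysts" and the exact argument names are assumptions; verify them against your provider version before use.

    # Hedged sketch: grant a workspace group the right to use the SQL Endpoint.
    # "analysts" is a hypothetical group name; CAN_USE allows running queries on the endpoint.
    endpoint_permissions = databricks.Permissions("sql-endpoint-permissions",
        sql_endpoint_id=sql_endpoint.id,
        access_controls=[
            databricks.PermissionsAccessControlArgs(
                group_name="analysts",       # Hypothetical group; replace with a group in your workspace.
                permission_level="CAN_USE",  # Lets group members run queries on the endpoint.
            )
        ],
    )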

    Explanation:

    • The databricks.Workspace resource sets up the Databricks workspace. Workspaces are collaborative environments where teams can work on data analysis tasks; they include code repositories, notebook storage, and access to data sources.

    • The databricks.SqlEndpoint resource represents the SQL Endpoint within Databricks. SQL Endpoints provide a serverless interface for running SQL queries against data stored in Delta Lake, without requiring a persistent cluster. This makes it easy to execute queries on an ad hoc basis without incurring continuous costs. The auto_stop_mins parameter is particularly important for cost management, as it automatically terminates the endpoint after a specified period of inactivity.

    • The pulumi.export line at the end of the script outputs the JDBC URL of the SQL Endpoint. This URL can be used to connect to the endpoint from various clients that support JDBC, such as BI tools or custom applications.
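    As an illustration of step 3, the sketch below uses the separate databricks-sql-connector package (installed with pip install databricks-sql-connector) to run a query against a Delta table through the endpoint. The hostname, HTTP path, access token, and table name are placeholders you would take from your own workspace.

    from databricks import sql

    # Placeholder connection details; copy the real values from the SQL Endpoint's connection details page.
    connection = sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # Hypothetical workspace hostname.
        http_path="/sql/1.0/endpoints/abcdef1234567890",               # Hypothetical endpoint HTTP path.
        access_token="dapiXXXXXXXXXXXXXXXX",                           # Personal access token; keep it secret.
    )

    with connection.cursor() as cursor:
        # Query a Delta table; tables created in Databricks SQL use the Delta Lake format by default.
        cursor.execute("SELECT * FROM my_schema.my_delta_table LIMIT 10")
        for row in cursor.fetchall():
            print(row)

    connection.close()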

    Remember to adjust the resource_group_name, location, and other configurations based on your cloud provider and environment. Before running this program, ensure that you have set up authentication with your cloud provider so that Pulumi can manage resources on your behalf. You will also need a Pulumi account and the Pulumi CLI installed and configured on your development machine.
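    One way to avoid hardcoding those values is Pulumi's stack configuration. The sketch below assumes config keys named location and resourceGroupName, set with pulumi config set; the key names themselves are arbitrary choices.

    import pulumi

    # Read deployment settings from the stack configuration instead of hardcoding them.
    # Set them with: pulumi config set location westus
    #                pulumi config set resourceGroupName my-resource-group
    config = pulumi.Config()
    location = config.get("location") or "westus"
    resource_group_name = config.get("resourceGroupName") or "my-resource-group"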

    This program should be saved in the Python entry point of a Pulumi project (by default __main__.py). To create a new Pulumi project, you can follow the instructions in the Pulumi documentation or use the Pulumi CLI (pulumi new python). Once your Pulumi project is set up, run pulumi up in the project directory to deploy the resources described in the program.

    For more detailed information on the resources used above, refer to the Databricks provider documentation in the Pulumi Registry.