Restricting Inbound Traffic for Databricks Clusters

Question

Pulumi · Accepted Answer

To restrict inbound traffic for Databricks clusters, you would typically set up networking rules that control the traffic flow to the clusters within your cloud environment. With Pulumi, you can configure these settings using the appropriate cloud provider resources (like security groups in AWS, network security groups in Azure, or firewall rules in GCP) and by using Databricks-specific settings to ensure your clusters only accept traffic from allowed sources.

Below is a Python program that uses Pulumi to set up a Databricks cluster with restricted inbound traffic. It uses the `pulumi_databricks` provider to create the cluster and, assuming AWS for this example, the `pulumi_aws` provider to create a security group that restricts inbound traffic to the cluster's nodes.

First, we define a security group whose inbound rules admit traffic only from the allowed IP ranges. Then we make sure the cluster's nodes run behind that security group. Here's a step-by-step guide written as a Pulumi program in Python:

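```python
import pulumi
import pulumi_aws as aws
import pulumi_databricks as databricks

# Stack configuration (assumed to exist): the VPC and subnets that host cluster nodes,
# plus the Databricks account, credentials, and storage configuration IDs
config = pulumi.Config()
vpc_id = config.require('vpcId')
subnet_ids = config.require_object('subnetIds')
account_id = config.require('databricksAccountId')

# Create a security group to restrict inbound traffic
security_group = aws.ec2.SecurityGroup('databricks-sg',
    vpc_id=vpc_id,
    description='Allow inbound traffic from specified IP ranges to Databricks',
    ingress=[
        # Replace 'YOUR_IP_ADDRESS/CIDR' with the allowed IP address ranges
        aws.ec2.SecurityGroupIngressArgs(
            from_port=443,  # HTTPS
            to_port=443,
            protocol='tcp',
            cidr_blocks=['YOUR_IP_ADDRESS/CIDR'],
        ),
        # Cluster nodes talk to each other, so allow traffic from this same group
        aws.ec2.SecurityGroupIngressArgs(
            from_port=0,
            to_port=0,
            protocol='-1',  # '-1' means all protocols
            self=True,
        ),
        # You can add more ingress rules here as needed
    ],
    egress=[
        # Allowing all outbound traffic by default
        aws.ec2.SecurityGroupEgressArgs(
            from_port=0,
            to_port=0,
            protocol='-1',
            cidr_blocks=['0.0.0.0/0'],
        ),
    ])

# Account-level provider; credentials are read from the environment
account_provider = databricks.Provider('databricks-account',
    host='https://accounts.cloud.databricks.com',
    account_id=account_id)
account_opts = pulumi.ResourceOptions(provider=account_provider)

# Register the VPC, subnets, and security group with Databricks
databricks_network = databricks.MwsNetworks('my-databricks-network',
    account_id=account_id,
    network_name='my-databricks-network',
    vpc_id=vpc_id,
    subnet_ids=subnet_ids,
    security_group_ids=[security_group.id],
    opts=account_opts)

# Create a Databricks workspace (if it doesn't already exist) on that network
databricks_workspace = databricks.MwsWorkspaces('my-databricks-workspace',
    account_id=account_id,
    workspace_name='my-databricks-workspace',
    aws_region=pulumi.Config('aws').require('region'),
    credentials_id=config.require('credentialsId'),
    storage_configuration_id=config.require('storageConfigurationId'),
    network_id=databricks_network.network_id,
    opts=account_opts)

# Workspace-level provider for resources that live inside the workspace
workspace_provider = databricks.Provider('databricks-workspace',
    host=databricks_workspace.workspace_url)

# Provision a Databricks cluster; its nodes launch in the registered network,
# behind the security group defined above
databricks_cluster = databricks.Cluster('my-databricks-cluster',
    cluster_name='my-databricks-cluster',
    autoscale=databricks.ClusterAutoscaleArgs(
        min_workers=1,
        max_workers=4,
    ),
    aws_attributes=databricks.ClusterAwsAttributesArgs(
        availability='SPOT',
    ),
    node_type_id='i3.xlarge',
    spark_version='15.4.x-scala2.12',  # example LTS runtime; use one your workspace offers
    # Additional configuration here...
    opts=pulumi.ResourceOptions(provider=workspace_provider))

# Export the cluster ID
pulumi.export('cluster_id', databricks_cluster.id)
```

In this program, we are doing the following:

1. **Creating a Security Group**: We create an AWS security group in your VPC that specifies which inbound traffic is allowed. External traffic is allowed only on TCP port 443 (used for HTTPS) from a range of IP addresses that you should replace with your own trusted sources. A second rule lets members of the group reach each other, which the cluster nodes need for their internal communication.
   
2. **Registering the Network and Creating a Databricks Workspace**: Databricks on AWS does not take a security group per cluster; security groups belong to the workspace's network configuration. We therefore register the VPC, subnets, and security group with `MwsNetworks` and create a workspace on that network with `MwsWorkspaces`. This step assumes that a workspace is needed; if you already have a workspace that uses this network, creating one is unnecessary.

3. **Provisioning a Databricks Cluster**: We provision a new Databricks cluster inside the workspace, so its nodes launch behind the security group. There are some placeholders in this code that you would fill out with your actual configuration details, such as the minimum and maximum number of workers, the node type, and the runtime version.

4. **Exporting Output**: The cluster ID is exported so that it can be accessed easily outside of Pulumi, for instance from the Pulumi console or in a CI/CD pipeline.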
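
The security group above filters traffic at the network layer. The Databricks-specific side mentioned at the start can be sketched with workspace IP access lists, which limit the source addresses allowed to reach the workspace UI and REST API. This is a minimal sketch, assuming the `pulumi_databricks` resources `WorkspaceConf` and `IpAccessList`, a workspace-level provider configuration, and a pricing tier that includes the feature. The helper names `allow_list_args` and `declare_ip_allow_list`, the resource names, and any sample ranges are illustrative, not part of the original program.

```python
def allow_list_args(label, cidrs):
    """Build keyword arguments for an ALLOW-type IP access list (hypothetical helper)."""
    if not cidrs:
        raise ValueError('an allow list needs at least one address or CIDR range')
    return {
        'label': label,
        'list_type': 'ALLOW',
        'ip_addresses': list(cidrs),
    }


def declare_ip_allow_list(label, cidrs):
    """Declare the access-list resources; call this from inside a Pulumi program."""
    # Imported lazily so allow_list_args stays usable without Pulumi installed
    import pulumi
    import pulumi_databricks as databricks

    # Access lists are only enforced once the feature is switched on for the workspace
    conf = databricks.WorkspaceConf('ip-access-conf',
        custom_config={'enableIpAccessLists': 'true'})

    # Create the list after the feature flag so it takes effect immediately
    return databricks.IpAccessList('allowed-sources',
        opts=pulumi.ResourceOptions(depends_on=[conf]),
        **allow_list_args(label, cidrs))
```

Calling `declare_ip_allow_list('office', ['203.0.113.0/24'])` from the program would add the list next to the other resources. Make sure the address you deploy from is inside the allowed ranges, or the next update may be unable to reach the workspace.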

Remember that you need the Pulumi AWS and Databricks providers installed and configured with credentials before running this program, and that placeholders such as `'YOUR_IP_ADDRESS/CIDR'` must be replaced with real values. The example also requests spot instances for the cluster to potentially reduce costs; change the `availability` setting if your workload needs on-demand capacity.
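
A placeholder or mistyped range is only rejected at deployment time, so it can help to check the values before they reach the security group. The following is a minimal sketch using Python's standard `ipaddress` module. The helper name `validate_cidrs` and the choice to reject a zero-length prefix are assumptions for illustration, not part of Pulumi or the AWS provider.

```python
import ipaddress


def validate_cidrs(cidrs):
    """Return normalized CIDR strings, rejecting malformed or wide-open ranges."""
    validated = []
    for cidr in cidrs:
        # strict=True raises ValueError for bad syntax or host bits set (e.g. 10.0.0.5/24)
        network = ipaddress.ip_network(cidr, strict=True)
        if network.prefixlen == 0:
            # A /0 range matches every address, which defeats the restriction
            raise ValueError(f'{cidr} allows every address; use a narrower range')
        validated.append(str(network))
    return validated
```

The result can be passed straight to the ingress rule, for example `cidr_blocks=validate_cidrs(['203.0.113.0/24'])`, so that an unreplaced placeholder fails with a clear error during `pulumi preview`.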

You might also need to set up proper permissions and roles in your cloud environment to enable the Databricks cluster to access other AWS resources. Ensure that your Pulumi automation has the required access to manage these resources within your AWS account.
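
As one way to picture the permissions step, the sketch below creates an IAM role that EC2 instances can assume, wraps it in an instance profile, and registers that profile with the workspace. It assumes the `pulumi_aws` resources `iam.Role` and `iam.InstanceProfile` and the `pulumi_databricks` resource `InstanceProfile`. The helper names and resource name suffixes are hypothetical, and the policies granting access to your actual AWS resources are left out.

```python
import json


def ec2_trust_policy():
    """Return the trust policy that lets EC2 instances assume the role."""
    return json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'ec2.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    })


def declare_instance_profile(name):
    """Declare the role and instance profile; call this from inside a Pulumi program."""
    # Imported lazily so ec2_trust_policy stays usable without Pulumi installed
    import pulumi_aws as aws
    import pulumi_databricks as databricks

    # Role assumed by the cluster's EC2 instances; attach data-access policies to it
    role = aws.iam.Role(f'{name}-role', assume_role_policy=ec2_trust_policy())
    profile = aws.iam.InstanceProfile(f'{name}-profile', role=role.name)

    # Register the profile with the workspace so clusters are allowed to use it
    return databricks.InstanceProfile(f'{name}-registration',
        instance_profile_arn=profile.arn)
```

The registered profile's ARN can then be set as `instance_profile_arn` in the cluster's `aws_attributes`. The role that Databricks uses to launch instances in your account typically also needs permission to pass this role.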