1. Managing Costs of AI Workloads Using Databricks Cluster Policies

    Managing costs for AI workloads is crucial, especially when you are working with elastically scalable resources in the cloud. Pulumi provides native support for cloud providers such as AWS, GCP, and Azure, as well as for Databricks itself, allowing you to manage your infrastructure as code.

    For Databricks on AWS, you'll primarily interact with Databricks clusters, which can be configured and managed through cluster policies. These policies enable you to set constraints on clusters, such as the maximum number of instances, which instance types are allowed, and so on. This helps control costs by preventing the over-provisioning of resources.
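
    A policy definition is simply a JSON document keyed by cluster attribute paths. A convenient pattern, sketched below with placeholder node types, is to build that document as a Python dictionary and serialize it with json.dumps instead of hand-writing a JSON string; the attribute names mirror the ones used in the full program later in this section:

        import json

        # A cluster policy definition is keyed by cluster attribute paths.
        # "allowlist" restricts an attribute to an explicit set of values, while
        # "range" bounds a numeric attribute; the node types here are placeholders.
        policy_definition = json.dumps({
            "node_type_id": {
                "type": "allowlist",
                "values": ["m4.large", "m4.xlarge"],
            },
            "num_workers": {
                "type": "range",
                "minValue": 1,
                "maxValue": 10,
                "defaultValue": 2,
            },
        })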

    In the program below, I'm demonstrating how to define a Databricks cluster policy using Pulumi. The policy restricts the size of clusters to prevent overuse of resources. For illustration purposes, let's assume we're setting a maximum number of workers and a specific list of allowed node types to manage usage and costs effectively.

    Here's how you can create such a policy with Pulumi:

        import pulumi
        import pulumi_databricks as databricks

        # Define a Databricks cluster policy.
        cluster_policy = databricks.ClusterPolicy(
            "ai-workload-policy",
            # The definition is a JSON document describing the allowed and disallowed
            # configurations. The 'spark_conf.spark.databricks.cluster.profile' setting
            # pins the cluster to a serverless compute pool to manage costs without
            # compromising on performance.
            definition="""{
                "spark_conf.spark.databricks.cluster.profile": {
                    "type": "fixed",
                    "value": "serverless",
                    "hidden": true
                },
                "spark_conf.spark.databricks.repl.allowedLanguages": {
                    "type": "allowlist",
                    "values": ["sql", "python", "r", "scala"]
                },
                "aws_attributes.instance_profile_arn": {
                    "type": "fixed",
                    "value": "arn:aws:iam::123456789012:instance-profile/my-instance-profile"
                },
                "node_type_id": {
                    "type": "allowlist",
                    "values": ["m4.large", "m4.xlarge", "m4.2xlarge"]
                },
                "num_workers": {
                    "type": "range",
                    "minValue": 1,
                    "maxValue": 10,
                    "defaultValue": 2
                }
            }""",
        )

        # pulumi.export outputs the ID of the cluster policy, which is useful for
        # referencing this policy in other configurations or applications.
        pulumi.export("cluster_policy_id", cluster_policy.id)

    In this program, we're using the pulumi_databricks.ClusterPolicy resource to define constraints that will govern our Databricks clusters:

    • We're fixing the cluster profile to "serverless" to optimize resource allocation.
    • The policy restricts the Databricks REPL to an allowlist of languages (SQL, Python, R, and Scala).
    • A specific AWS instance profile ARN is pinned in the policy, so clusters always run with a known IAM role.
    • We're allowing only specific node types, which ensures that only the instance types covered by our cost plan can be used.
    • Finally, we're setting a range for the number of workers so clusters cannot grow beyond a certain size, which keeps costs in check (a sketch of a cluster that attaches this policy follows this list).
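
    A policy only affects clusters that are created with it attached. Below is a minimal sketch, assuming the policy defined above, of launching a cluster governed by it; the Spark runtime version is a placeholder, so substitute one available in your workspace:

        # Continues the program above (imports and cluster_policy already defined).
        # Minimal sketch: a cluster governed by the policy defined above.
        # The spark_version value is a placeholder; pick a runtime that exists
        # in your workspace. Autotermination further limits idle-cluster spend.
        ai_cluster = databricks.Cluster(
            "ai-workload-cluster",
            cluster_name="ai-workload-cluster",
            policy_id=cluster_policy.id,
            spark_version="14.3.x-scala2.12",
            node_type_id="m4.large",
            num_workers=2,
            autotermination_minutes=30,
        )

    Because node_type_id and num_workers fall inside the policy's allowlist and range, the cluster is accepted; values outside those constraints would be rejected when the cluster is created.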

    You'd use this Pulumi program to manage Databricks cluster policies, which in turn help control the costs of running AI workloads. To use it, make sure the Pulumi CLI is installed and the Databricks provider is configured with your workspace credentials, save the script in a Python file, and run pulumi up.
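
    If you prefer to configure the Databricks connection in code rather than through pulumi config, a minimal sketch of an explicit provider looks like this; the workspace URL is a placeholder and the token is read from Pulumi config as a secret rather than hardcoded:

        import pulumi
        import pulumi_databricks as databricks

        # Minimal sketch: an explicit Databricks provider. The host below is a
        # placeholder; the token comes from Pulumi config as a secret.
        config = pulumi.Config()
        databricks_provider = databricks.Provider(
            "databricks",
            host="https://dbc-12345678-90ab.cloud.databricks.com",
            token=config.require_secret("databricksToken"),
        )

        # Resources opt in to this provider explicitly, for example:
        # databricks.ClusterPolicy(..., opts=pulumi.ResourceOptions(provider=databricks_provider))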

    Remember, the pulumi.export at the end of the script makes it easy to retrieve the ID of the new cluster policy, which you might need to reference in other parts of your infrastructure or applications.
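
    For example, if the policy is managed in its own stack, another stack can read the exported ID with a stack reference; the stack name below is a placeholder:

        import pulumi

        # Minimal sketch: consuming the exported policy ID from another stack.
        # "my-org/databricks-policies/prod" is a placeholder stack name.
        policies = pulumi.StackReference("my-org/databricks-policies/prod")
        policy_id = policies.get_output("cluster_policy_id")

        # policy_id can now be passed to databricks.Cluster(policy_id=policy_id, ...)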