Securing AI Training Clusters with DigitalOcean Firewalls

Question

Pulumi · Accepted Answer

Firewalls are one of the primary ways to protect AI training clusters on DigitalOcean from unwanted access. You would typically control traffic at two levels:
1. **Firewall**: This guards your infrastructure by defining rules for inbound and outbound traffic to your Droplets (virtual machines), including the Droplets that serve as Kubernetes worker nodes.
2. **Database Firewall**: If your architecture includes managed databases, you control which sources may connect to them with database firewall rules.

For AI training clusters running in Kubernetes on DigitalOcean, we can use Pulumi with the `digitalocean` provider: `digitalocean.KubernetesCluster` creates and manages the cluster itself, and `digitalocean.Firewall` restricts traffic to the Droplets behind its nodes. We'll create both, with the firewall admitting only the traffic the nodes need.

The `digitalocean.Firewall` resource lets us specify inbound and outbound rules that filter traffic by protocol, port range, and source or destination address. You will typically configure it to allow only necessary traffic to your nodes, such as SSH (port 22), HTTPS (port 443), and any other ports your applications specifically need.
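One way to keep the inbound rules tight is to generate them from an explicit allowlist rather than typing CIDRs by hand. The following is a minimal sketch, not part of the original program: `inbound_rule_specs` is a hypothetical helper, and the split it makes (SSH limited to administrator CIDRs, application ports open to everyone) is an assumption you should adapt. The dictionary keys mirror the `protocol`, `port_range`, and `source_addresses` arguments used in the program further down.

```python
import ipaddress


def inbound_rule_specs(admin_cidrs, app_ports=(443,)):
    """Build keyword dicts for inbound firewall rules (hypothetical helper).

    SSH (port 22) is limited to admin_cidrs; each port in app_ports is
    opened to any IPv4 or IPv6 source.
    """
    # ip_network raises ValueError for malformed CIDRs and for CIDRs with
    # host bits set, such as "10.0.0.5/24".
    sources = [str(ipaddress.ip_network(cidr)) for cidr in admin_cidrs]
    if not sources:
        raise ValueError("at least one admin CIDR is required for SSH access")

    specs = [{"protocol": "tcp", "port_range": "22", "source_addresses": sources}]
    for port in app_ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        specs.append({
            "protocol": "tcp",
            "port_range": str(port),
            "source_addresses": ["0.0.0.0/0", "::/0"],
        })
    return specs
```

In a Pulumi program, each dictionary can be unpacked into a rule, for example `do.FirewallInboundRuleArgs(**spec)`. The address range `203.0.113.0/24` used in testing this sketch is a documentation range, so substitute your own.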

The `digitalocean.KubernetesCluster` resource sets up a managed Kubernetes cluster. Tags set on the node pool are applied to the Droplets behind each node, so a firewall that targets the same tag covers every node in the pool. You can also configure the Kubernetes version, the node size, and the node count.

Below is a Pulumi program that defines both a DigitalOcean Firewall and a Kubernetes cluster:

```python
import pulumi
import pulumi_digitalocean as do

# Create a DigitalOcean Kubernetes cluster
k8s_cluster = do.KubernetesCluster(
    "ai-training-cluster",
    region="nyc3",
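    version="1.21.5-do.0",  # Example slug only; old versions are retired, so check which slugs DigitalOcean currently supports.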
    node_pool=do.KubernetesClusterNodePoolArgs(
        name="default",
        size="s-2vcpu-2gb",
        node_count=3,
        tags=["k8s-node"]
    )
)

# Create a firewall for the Kubernetes nodes
firewall = do.Firewall(
    "k8s-firewall",
    # Assuming the Kubernetes nodes are tagged "k8s-node", we reference that tag to apply rules to those nodes.
    tags=["k8s-node"],
    inbound_rules=[
        do.FirewallInboundRuleArgs(
            protocol="tcp",
            port_range="22",
            source_addresses=["0.0.0.0/0"]  # Be cautious with this; restrict to known IPs if possible.
        ),
        do.FirewallInboundRuleArgs(
            protocol="tcp",
            port_range="6443",
            source_addresses=["0.0.0.0/0"]  # Kubernetes API server port; on a managed cluster the API server may not run on these nodes, so confirm this rule is needed.
        ),
        # Add other rules as required for your AI training cluster
    ],
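    outbound_rules=[
        # "1-65535" is the documented way to open every port for a protocol.
        do.FirewallOutboundRuleArgs(
            protocol="tcp",
            port_range="1-65535",
            destination_addresses=["0.0.0.0/0", "::/0"]
        ),
        # UDP is needed for DNS lookups, among other things.
        do.FirewallOutboundRuleArgs(
            protocol="udp",
            port_range="1-65535",
            destination_addresses=["0.0.0.0/0", "::/0"]
        ),
        # ICMP rules take no port range.
        do.FirewallOutboundRuleArgs(
            protocol="icmp",
            destination_addresses=["0.0.0.0/0", "::/0"]
        )
    ]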
)

# Output the cluster endpoint and firewall ID
pulumi.export("k8s_cluster_endpoint", k8s_cluster.endpoint)
pulumi.export("firewall_id", firewall.id)
```

This script automates the infrastructure setup for your AI training cluster with a baseline security configuration. Here are the critical points of the script:
- A Kubernetes cluster is instantiated with version `1.21.5-do.0`. Treat this as a placeholder: DigitalOcean retires old versions, so pick a currently supported one that fits your requirements.
- The cluster nodes are tagged `k8s-node`, allowing us to reference this tag in the firewall settings.
- The firewall allows inbound SSH on port 22 from any address, which you should narrow to your own IP range, and port 6443 for the Kubernetes API. DigitalOcean's managed control plane runs separately from the worker Droplets, so confirm that you need the 6443 rule before keeping it.
- The outbound rules are permissive. Consider restricting them to the destinations your training jobs actually need, such as container registries, package mirrors, and dataset storage.
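The program above covers only the Droplet-level firewall. For the database firewall mentioned at the start, the provider offers `digitalocean.DatabaseFirewall`, whose rules name trusted sources by type. The sketch below assumes you already have a managed database cluster; `database_rule_specs` and `attach_database_firewall` are hypothetical helper names, and trusting the whole Kubernetes cluster plus a few administrator IPs is just one reasonable policy.

```python
def database_rule_specs(k8s_cluster_id, admin_ips=()):
    """Return trusted-source rules: the Kubernetes cluster plus optional admin IPs."""
    rules = [{"type": "k8s", "value": k8s_cluster_id}]
    rules += [{"type": "ip_addr", "value": ip} for ip in admin_ips]
    return rules


def attach_database_firewall(name, database_cluster_id, k8s_cluster_id, admin_ips=()):
    """Create a DatabaseFirewall for an existing managed database (hypothetical helper)."""
    # Imported here so the rule-building logic above can be exercised
    # without Pulumi installed.
    import pulumi_digitalocean as do

    specs = database_rule_specs(k8s_cluster_id, admin_ips)
    return do.DatabaseFirewall(
        name,
        cluster_id=database_cluster_id,
        rules=[do.DatabaseFirewallRuleArgs(**spec) for spec in specs],
    )
```

Inside the Pulumi program you might call `attach_database_firewall("training-db-firewall", db.id, k8s_cluster.id)`, where `db` stands for a `do.DatabaseCluster` that is not defined in this answer.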

These resources give your cluster a baseline security posture on which you can deploy and train your AI models. If you need to expose specific services, define additional firewall rules for only the ports those services use. Always restrict access to what is necessary, and consider using a VPN or another secure channel to reach your resources when possible.
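To check that rules stay restricted as they evolve, you could audit them before they are deployed. This is a minimal sketch under assumptions: rules are represented as plain dictionaries with `port_range` and `source_addresses` keys (matching the firewall arguments above), and `world_open_rules` is a hypothetical helper that flags any rule exposing a sensitive port to every address.

```python
WORLD = {"0.0.0.0/0", "::/0"}


def port_in_range(port, port_range):
    """True if port falls inside a range string such as "22" or "8000-9000"."""
    low, _, high = port_range.partition("-")
    return int(low) <= port <= int(high or low)


def world_open_rules(rules, sensitive_ports=(22,)):
    """Return the rules that expose any sensitive port to every address."""
    return [
        rule for rule in rules
        if WORLD & set(rule["source_addresses"])
        and any(port_in_range(p, rule["port_range"]) for p in sensitive_ports)
    ]
```

Run against the program above, a check like this would flag the SSH rule that allows `0.0.0.0/0`. A non-empty result could fail a CI step before `pulumi up` runs.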