Enforcing Network Policies in ML Training Clusters with Cilium

Pulumi · Accepted Answer

Enforcing network policies in machine learning (ML) training clusters keeps network communication between the resources in your clusters secure and controlled. It restricts traffic to the pathways that are actually needed, so sensitive data is not exposed to unrelated services or external networks.

[Cilium](https://cilium.io/) is an open-source project that provides and secures network connectivity and load balancing for workloads. It implements services and network policies as eBPF programs that run inside the Linux kernel. With Cilium, you can define fine-grained network policies that are enforced dynamically, without modifying application code or container configuration.

In a Kubernetes environment, you can use Pulumi to deploy a standard `NetworkPolicy` resource, which Cilium then enforces. The following program defines a simple network policy using Pulumi with the Kubernetes provider.

```python
import pulumi
import pulumi_kubernetes as k8s

# This policy restricts ingress to Pods labeled 'role: ml-trainee' in the 'ml-training' namespace:
# they accept traffic only from Pods labeled 'role: ml-server', and only on TCP port 6006
# (TensorBoard's default port). Egress from these Pods is not restricted by this policy.
network_policy = k8s.networking.v1.NetworkPolicy(
    "ml-training-policy",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="ml-training-policy",
        namespace="ml-training",
    ),
    spec=k8s.networking.v1.NetworkPolicySpecArgs(
        # Selects the Pods this policy applies to.
        pod_selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"role": "ml-trainee"}
        ),
        # Define the ingress (incoming traffic) rules.
        ingress=[
            k8s.networking.v1.NetworkPolicyIngressRuleArgs(
                # Allows connections from Pods labeled 'role: ml-server'.
                from_=[
                    k8s.networking.v1.NetworkPolicyPeerArgs(
                        pod_selector=k8s.meta.v1.LabelSelectorArgs(
                            match_labels={"role": "ml-server"}
                        ),
                    ),
                ],
                # Restrict to TCP port 6006 (TensorBoard).
                ports=[
                    k8s.networking.v1.NetworkPolicyPortArgs(
                        protocol="TCP",
                        port=6006,
                    ),
                ],
            ),
        ],
        # Specify that this policy is for controlling ingress traffic (inbound connections).
        policy_types=["Ingress"],
    )
)

# Export the name of the network policy
pulumi.export('network_policy_name', network_policy.metadata["name"])
```

In the above Pulumi program:

1. We import the required Pulumi modules for Kubernetes resources.
2. We create an instance of `NetworkPolicy` which defines the desired network policy.
3. In the metadata stanza, we provide a name and the namespace for the policy.
4. In the spec stanza, we define policy specifications:
   - `pod_selector` selects the pods to which the policy will apply. Here, it applies to Pods with the label `role: ml-trainee`.
   - `ingress` defines rules for incoming traffic. We allow traffic only from Pods labeled `role: ml-server`, and only to `TCP` port `6006` on the selected Pods. Port `6006` is TensorBoard's default port.
5. Finally, we specify `policy_types`, setting it to `["Ingress"]` to indicate that the policy governs incoming traffic only. Outgoing traffic from the selected Pods is left unrestricted.

Once a Pod is selected by an ingress policy, any inbound traffic that no policy allows is rejected. As a result, your ML training Pods receive network traffic only from the `ml-server` Pods on TCP port 6006, which reduces their exposure inside the cluster.
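The effect of the policy can be illustrated with a small, self-contained model. This is a simplified sketch and not how Kubernetes or Cilium implement enforcement: it ignores namespaces, `ipBlock` and `namespaceSelector` peers, and the way multiple policies combine. The names `labels_match`, `ingress_allowed`, and `POLICY` are hypothetical and exist only for this illustration.

```python
# Simplified model of the ingress rule above (illustration only, not the real enforcement path).
POLICY = {
    "pod_selector": {"role": "ml-trainee"},
    "ingress": [
        {"from": {"role": "ml-server"}, "ports": {("TCP", 6006)}},
    ],
}


def labels_match(selector: dict, labels: dict) -> bool:
    """Return True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policy: dict, dst_labels: dict, src_labels: dict,
                    protocol: str, port: int) -> bool:
    """Decide whether a connection from src to dst is allowed under this single policy."""
    # Pods the policy does not select are not restricted by it.
    if not labels_match(policy["pod_selector"], dst_labels):
        return True
    # Selected Pods accept a connection only if some rule matches both source and port.
    return any(
        labels_match(rule["from"], src_labels) and (protocol, port) in rule["ports"]
        for rule in policy["ingress"]
    )


trainee = {"role": "ml-trainee"}
print(ingress_allowed(POLICY, trainee, {"role": "ml-server"}, "TCP", 6006))  # True
print(ingress_allowed(POLICY, trainee, {"role": "ml-server"}, "TCP", 8080))  # False
print(ingress_allowed(POLICY, trainee, {"role": "notebook"}, "TCP", 6006))   # False
```

In this model, a Pod without the `role: ml-trainee` label is unaffected, which mirrors the fact that a `NetworkPolicy` only constrains the Pods its `pod_selector` selects.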

To apply this policy, you need a running Kubernetes cluster with Cilium installed and an existing `ml-training` namespace, since the program above does not create one. You would typically run this Pulumi program with the Pulumi CLI (`pulumi up`) after setting up your Pulumi project and stack. Pulumi then calls the Kubernetes API server of your cluster and creates the network policy as defined.

To actually enforce these policies using Cilium, make sure that Cilium is properly installed and configured as the CNI plugin in your Kubernetes cluster. A `NetworkPolicy` has no effect unless the cluster's network plugin implements it. Cilium then enforces the policy using eBPF, without any changes to the application Pods.
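Cilium also provides its own `CiliumNetworkPolicy` custom resource (API group `cilium.io/v2`), which can express the same rule in Cilium's native format. The original answer does not use it, so the following is only a hedged sketch of what an equivalent manifest might look like, built as a plain Python dictionary. The helper name `cilium_ingress_policy` and the resource name `ml-training-cnp` are hypothetical.

```python
def cilium_ingress_policy(name: str, namespace: str, target_labels: dict,
                          source_labels: dict, port: int, protocol: str = "TCP") -> dict:
    """Build a CiliumNetworkPolicy manifest allowing ingress from one set of Pods on one port."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods (endpoints) the policy applies to.
            "endpointSelector": {"matchLabels": dict(target_labels)},
            "ingress": [
                {
                    # Allowed sources.
                    "fromEndpoints": [{"matchLabels": dict(source_labels)}],
                    # Allowed destination ports; Cilium expects the port as a string.
                    "toPorts": [
                        {"ports": [{"port": str(port), "protocol": protocol}]}
                    ],
                }
            ],
        },
    }


# Same rule as the NetworkPolicy above, using hypothetical names.
manifest = cilium_ingress_policy(
    name="ml-training-cnp",
    namespace="ml-training",
    target_labels={"role": "ml-trainee"},
    source_labels={"role": "ml-server"},
    port=6006,
)
print(manifest["kind"], manifest["spec"]["ingress"][0]["toPorts"])
```

In a Pulumi program, a manifest like this could be passed to `k8s.apiextensions.CustomResource` through its `api_version`, `kind`, `metadata`, and `spec` arguments, assuming the Cilium CRDs are already installed in the cluster.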