1. Enforcing Network Policies in ML Training Clusters with Cilium


    Enforcing network policies in Machine Learning (ML) training clusters is essential for securing and controlling network communication between the resources in your clusters. It restricts communication to the necessary pathways and ensures sensitive data is not exposed to unintended services or external networks.

    Cilium is an open-source project that provides and secures network connectivity and load balancing for workloads, implementing services and network policies directly with eBPF, a programmable technology built into the Linux kernel. With Cilium, you can define fine-grained network policies that are enforced dynamically, without modifying application code or container configuration.

    In a Kubernetes environment, you can use Pulumi to deploy a NetworkPolicy to enforce network policies in your cluster. The following program demonstrates how to define a simple network policy using Pulumi with the Kubernetes provider.

```python
import pulumi
import pulumi_kubernetes as k8s

# This policy enforces that Pods labeled 'role: ml-trainee' in the 'ml-training'
# namespace can only receive traffic from Pods labeled 'role: ml-server' on
# TCP port 6006 (typically used for TensorBoard).
network_policy = k8s.networking.v1.NetworkPolicy(
    "ml-training-policy",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="ml-training-policy",
        namespace="ml-training",
    ),
    spec=k8s.networking.v1.NetworkPolicySpecArgs(
        # Select the Pods this policy applies to.
        pod_selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"role": "ml-trainee"}
        ),
        # Define the ingress (incoming traffic) rules.
        ingress=[
            k8s.networking.v1.NetworkPolicyIngressRuleArgs(
                # Allow connections from Pods labeled 'role: ml-server'.
                from_=[
                    k8s.networking.v1.NetworkPolicyPeerArgs(
                        pod_selector=k8s.meta.v1.LabelSelectorArgs(
                            match_labels={"role": "ml-server"}
                        ),
                    ),
                ],
                # Restrict to TCP port 6006 (TensorBoard).
                ports=[
                    k8s.networking.v1.NetworkPolicyPortArgs(
                        protocol="TCP",
                        port=6006,
                    ),
                ],
            ),
        ],
        # This policy controls ingress traffic (inbound connections) only.
        policy_types=["Ingress"],
    ),
)

# Export the name of the network policy.
pulumi.export("network_policy_name", network_policy.metadata["name"])
```

    In the above Pulumi program:

    1. We import the required Pulumi modules for Kubernetes resources.
    2. We create a NetworkPolicy resource that defines the desired network policy.
    3. In the metadata stanza, we provide a name and the namespace for the policy.
    4. In the spec stanza, we define policy specifications:
      • pod_selector selects the pods to which the policy will apply. Here, it applies to Pods with the label role: ml-trainee.
      • ingress defines rules for incoming traffic. We allow traffic from Pods labeled role: ml-server arriving on TCP port 6006, the port typically used by TensorBoard in ML workflows.
    5. Finally, we specify policy_types, setting it to ["Ingress"] to indicate that these rules concern incoming traffic.
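    The selector semantics in steps 4-5 reduce to simple label matching: a selector matches a Pod when every key/value pair in match_labels appears among the Pod's labels (extra labels are ignored, and an empty selector matches everything). A minimal sketch of that check:

```python
def matches(match_labels: dict, pod_labels: dict) -> bool:
    """True if every key/value pair in match_labels is present in pod_labels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())

# The ingress rule admits Pods carrying role=ml-server; extra labels are fine.
assert matches({"role": "ml-server"}, {"role": "ml-server", "app": "tensorboard"})
# Pods with a different role label are rejected.
assert not matches({"role": "ml-server"}, {"role": "ml-trainee"})
# An empty selector (match_labels={}) matches every Pod.
assert matches({}, {"role": "ml-trainee"})
```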

    This policy ensures that your ML training Pods only receive network traffic from specific, defined sources, increasing the security of your Kubernetes cluster.
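    Targeted allow rules like this are often paired with a namespace-wide default-deny ingress policy, so that any traffic not explicitly permitted is dropped. A sketch of such a manifest, expressed here as a plain Python dict (the same structure you would pass to Pulumi's NetworkPolicy resource or serialize to YAML; the name default-deny-ingress is our own choice):

```python
# Default-deny ingress for the 'ml-training' namespace.
# An empty podSelector matches every Pod in the namespace; listing "Ingress"
# in policyTypes while providing no ingress rules denies all inbound traffic
# unless another policy (like the one above) explicitly allows it.
default_deny_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {
        "name": "default-deny-ingress",
        "namespace": "ml-training",
    },
    "spec": {
        "podSelector": {},           # empty selector = all Pods in the namespace
        "policyTypes": ["Ingress"],  # no ingress rules listed => deny all ingress
    },
}
```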

    To apply this policy, you need a running Kubernetes cluster with Cilium installed. You would typically run this Pulumi program with the Pulumi CLI after setting up your Pulumi project and stack. When executed, Pulumi communicates with your cluster's API server and applies the network policy as defined.

    To actually enforce these policies with Cilium, make sure Cilium's CNI plugin is properly installed and configured in your Kubernetes cluster. Cilium then automatically applies and enforces the policies using eBPF, with no changes required to the application Pods.
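    Since Cilium is the enforcing CNI, the same rule can also be expressed as a Cilium-native CiliumNetworkPolicy (API group cilium.io/v2), which additionally supports L7 rules. The sketch below shows the equivalent manifest as a plain Python dict; field names follow the CiliumNetworkPolicy schema, and such a manifest could be applied through Pulumi's k8s.apiextensions.CustomResource:

```python
# Cilium-native equivalent of the NetworkPolicy above.
# endpointSelector and fromEndpoints play the roles of podSelector and from;
# note that CiliumNetworkPolicy expresses port numbers as strings.
cilium_policy = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {
        "name": "ml-training-policy",
        "namespace": "ml-training",
    },
    "spec": {
        # Apply to Pods labeled role=ml-trainee.
        "endpointSelector": {"matchLabels": {"role": "ml-trainee"}},
        "ingress": [
            {
                # Allow traffic only from Pods labeled role=ml-server...
                "fromEndpoints": [
                    {"matchLabels": {"role": "ml-server"}},
                ],
                # ...and only on TCP port 6006 (TensorBoard).
                "toPorts": [
                    {"ports": [{"port": "6006", "protocol": "TCP"}]},
                ],
            },
        ],
    },
}
```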