1. Scalable AI Inference Services on Kubernetes with Cilium


    Let's set up a scalable AI inference service on a Kubernetes cluster, using Cilium network policies to secure traffic in transit. Pulumi lets us define this infrastructure in Python, leveraging cloud-native tools and resources effectively.

    We will go through the steps of defining this infrastructure as code:

    1. Kubernetes Cluster: We need a Kubernetes cluster to run our services. For scalability, you might choose a managed Kubernetes service like AWS EKS, GCP GKE, or Azure AKS.

    2. Cilium: As a CNI (Container Network Interface) for Kubernetes, Cilium provides networking and security. It is scalable and offers advanced features such as API-aware network security, transparent load balancing, and multi-cluster connectivity.

    3. Deployments and Services: We will define Kubernetes Deployment resources for our AI services. A Service object will then expose the deployments so that the inference services are accessible.

    4. Horizontal Pod Autoscaler (HPA): To scale the AI inference pods based on load, we will use an HPA, which automatically adjusts the number of pods in a deployment based on observed CPU utilization.

    5. NetworkPolicies with Cilium: We will define Kubernetes NetworkPolicy resources to control traffic to the AI inference services. Cilium enforces these standard policies at Layers 3 and 4, and its own CiliumNetworkPolicy CRD extends them with API-aware filtering at Layer 7.
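    Before the full program, it helps to see the HPA's core scaling rule from step 4. The Kubernetes documentation gives it as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured bounds. A minimal sketch (the real controller also applies a tolerance band and stabilization windows, which this omits):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """Sketch of the HPA scaling rule: scale so average utilization
    moves back toward the target, clamped to [min, max] replicas."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# At 90% average CPU against a 60% target, 2 replicas become 3.
print(desired_replicas(2, 90, 60))  # → 3
```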

    Here's a Pulumi program in Python that sets up such an environment:

```python
import pulumi
from pulumi_kubernetes import Provider
from pulumi_kubernetes.apps.v1 import Deployment, DeploymentSpecArgs
from pulumi_kubernetes.core.v1 import Service, ServiceSpecArgs
from pulumi_kubernetes.autoscaling.v2 import HorizontalPodAutoscaler, HorizontalPodAutoscalerSpecArgs
from pulumi_kubernetes.networking.v1 import NetworkPolicy, NetworkPolicySpecArgs
from pulumi_kubernetes.helm.v3 import Chart, ChartOpts, FetchOpts

# Create a Kubernetes provider (uses the ambient kubeconfig by default)
k8s_provider = Provider("k8s_provider")

# Deploy Cilium using its Helm chart
cilium_chart = Chart(
    "cilium",
    ChartOpts(
        chart="cilium",
        version="1.9.5",  # pin to a current Cilium release for production
        namespace="kube-system",  # Cilium is typically installed into kube-system
        fetch_opts=FetchOpts(repo="https://helm.cilium.io/"),
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Define the Kubernetes Deployment for the AI inference service
inference_deployment = Deployment(
    "inference-deployment",
    spec=DeploymentSpecArgs(
        selector={"matchLabels": {"app": "inference-service"}},
        replicas=2,  # start with two replicas
        template={
            "metadata": {"labels": {"app": "inference-service"}},
            "spec": {
                "containers": [{
                    "name": "inference-container",
                    "image": "my-inference-service:latest",
                    "ports": [{"containerPort": 8080}],
                    # Define resource requirements as needed for the AI workload
                    "resources": {
                        "requests": {"cpu": "1", "memory": "2Gi"},
                        "limits": {"cpu": "2", "memory": "4Gi"},
                    },
                }],
            },
        },
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Create a Service object to expose the AI inference service.
# For external access you might choose type="LoadBalancer" instead.
inference_service = Service(
    "inference-service",
    spec=ServiceSpecArgs(
        selector={"app": "inference-service"},
        ports=[{"port": 80, "targetPort": 8080}],
        type="ClusterIP",
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Set up a Horizontal Pod Autoscaler. Note this uses autoscaling/v2,
# which supports the `metrics` field; autoscaling/v1 only supports a
# bare CPU-utilization target.
inference_autoscaler = HorizontalPodAutoscaler(
    "inference-autoscaler",
    spec=HorizontalPodAutoscalerSpecArgs(
        scale_target_ref={
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            # Reference the auto-named Deployment instead of hard-coding its name
            "name": inference_deployment.metadata.name,
        },
        min_replicas=2,
        max_replicas=10,  # scale up to ten replicas
        metrics=[{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 60},
            },
        }],
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Define a NetworkPolicy (enforced by Cilium) to control ingress and
# egress traffic to the inference service
inference_network_policy = NetworkPolicy(
    "inference-network-policy",
    spec=NetworkPolicySpecArgs(
        pod_selector={"matchLabels": {"app": "inference-service"}},
        policy_types=["Ingress", "Egress"],
        # Ingress rules: for example, allow traffic from a specific namespace.
        # NetworkPolicies match pod ports, so target the container port (8080),
        # not the Service port (80).
        ingress=[{
            "from": [{"namespaceSelector": {"matchLabels": {"project": "my-project"}}}],
            "ports": [{"protocol": "TCP", "port": 8080}],
        }],
        # Egress rules can be defined similarly; an empty rule allows all egress
        egress=[{}],
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Export the inference Service name
pulumi.export("inference_service_endpoint", inference_service.metadata.apply(lambda meta: meta.name))
```
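    Note that the program assumes an existing cluster: `Provider("k8s_provider")` with no arguments uses your ambient kubeconfig. If you want Pulumi to create the cluster from step 1 as well, one option is the `pulumi_eks` package; a sketch under that assumption (cluster name and node-group sizing here are illustrative, not prescriptive):

```python
import json
import pulumi
import pulumi_eks as eks
from pulumi_kubernetes import Provider

# Provision a managed EKS cluster (names and sizes are illustrative)
cluster = eks.Cluster(
    "inference-cluster",
    desired_capacity=2,
    min_size=2,
    max_size=5,
)

# Point the Kubernetes provider at the new cluster's kubeconfig so the
# Deployment, Service, HPA, and policies above deploy into it
k8s_provider = Provider("k8s_provider", kubeconfig=cluster.kubeconfig.apply(json.dumps))

pulumi.export("kubeconfig", cluster.kubeconfig)
```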

    This program defines a Cilium installation, a Kubernetes deployment for an AI inference service, a service to expose the deployment, an autoscaler to adjust the number of pods based on load, and network policies to secure the network traffic.

    Make sure you adjust the program with the correct container image and the resources your workload requires. Also tailor the network policies to your specific needs in terms of namespace selectors and allowed sources.
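    The standard NetworkPolicy above filters only at Layers 3 and 4. To get the API-aware Layer 7 filtering mentioned in step 5, you can define Cilium's own CiliumNetworkPolicy CRD, for example via Pulumi's generic CustomResource. A sketch (the `/v1/predict` path and the `app: client` selector are assumptions about your inference API and callers):

```python
import pulumi
from pulumi_kubernetes.apiextensions import CustomResource

# L7 policy: only allow HTTP POSTs to the (hypothetical) /v1/predict endpoint
l7_policy = CustomResource(
    "inference-l7-policy",
    api_version="cilium.io/v2",
    kind="CiliumNetworkPolicy",
    spec={
        "endpointSelector": {"matchLabels": {"app": "inference-service"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"app": "client"}}],
            "toPorts": [{
                "ports": [{"port": "8080", "protocol": "TCP"}],
                "rules": {"http": [{"method": "POST", "path": "/v1/predict"}]},
            }],
        }],
    },
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)
```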

    Running this Pulumi program provisions the infrastructure described in the code, leaving you with a scalable AI inference platform on Kubernetes managed by Pulumi. Before running it, make sure the Pulumi CLI is installed and configured with credentials for your Kubernetes cluster.