1. Scaling LLM Inference with Kubernetes & Infinispan Caching


    Scaling a large language model (LLM) inference workload using Kubernetes and Infinispan caching involves several components:

    • Kubernetes Deployment to manage the pods running the inference service.
    • Infinispan, a distributed in-memory key/value data store for caching, which can be deployed on Kubernetes.
    • Services in Kubernetes to expose the Deployments and enable network access.

    In Pulumi, you declare the desired state of your resources, and Pulumi drives the cloud environment toward that state. In this example, we'll create a Kubernetes Deployment to scale the LLM inference service, and we'll deploy Infinispan for caching to improve performance.
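
    As a small illustration of that model, a Pulumi program is ordinary Python that instantiates resource objects; running pulumi up then reconciles the cluster to match what the code declares. The snippet below only declares a Namespace (the resource name here is an arbitrary example):

    import pulumi
    import pulumi_kubernetes as k8s

    # Declaring a resource describes desired state; `pulumi up` makes the cluster match it.
    inference_ns = k8s.core.v1.Namespace("llm-inference-ns")

    # Exported values appear as stack outputs once the deployment completes.
    pulumi.export("namespace_name", inference_ns.metadata.apply(lambda m: m.name))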

    We will not create Infinispan resources through a dedicated Pulumi resource type, since none exists for Infinispan itself. Instead, we will deploy it using a generic kubernetes.yaml.ConfigGroup, which can apply raw YAML manifest files (for a Helm-based install, kubernetes.helm.v3.Chart would be the corresponding resource, but this example sticks to plain manifests).

    Here's how you can structure this in Pulumi with Python:

    1. Create a Kubernetes Deployment for the LLM service.
    2. Use a ConfigMap or a similar approach to inject configuration into the service.
    3. Create a Kubernetes Service to expose the Deployment.
    4. Apply the Infinispan manifest to deploy Infinispan on Kubernetes.

    Below is a program that carries out these steps:

    import pulumi
    import pulumi_kubernetes as k8s

    # Define the Kubernetes Provider if needed (not shown here).
    # We're assuming the provider is already configured.

    # 1. Define the Deployment for the LLM inference service.
    llm_deployment = k8s.apps.v1.Deployment(
        "llm-inference-deployment",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            # Number of replicas for the LLM service.
            replicas=3,
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "llm-inference"}
            ),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "llm-inference"}
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name="inference-container",
                        image="your-inference-container-image",  # Replace with your container image
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                        # Configure caching client to point to Infinispan if needed.
                        env=[k8s.core.v1.EnvVarArgs(
                            name="CACHE_HOST",
                            value="infinispan-service"  # Assuming Infinispan service is named 'infinispan-service'
                        )]
                    )]
                ),
            ),
        )
    )

    # 2. Create a Service to expose the Deployment.
    llm_service = k8s.core.v1.Service(
        "llm-inference-service",
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "llm-inference"},
            ports=[k8s.core.v1.ServicePortArgs(
                port=8080,
                target_port=8080
            )],
            type="LoadBalancer"  # Expose the service outside of the cluster.
        )
    )

    # 3. Apply the Infinispan manifest using ConfigGroup.
    # The YAML manifest needs to define the Infinispan deployment.
    # We assume the file 'infinispan-deployment.yaml' contains the right configuration.
    infinispan_manifest = k8s.yaml.ConfigGroup(
        "infinispan-configs",
        files=["infinispan-deployment.yaml"]
    )

    # Export the service endpoint for easy access.
    pulumi.export(
        'llm_inference_endpoint',
        llm_service.status.apply(
            lambda status: status.load_balancer.ingress[0].ip if status.load_balancer.ingress else None
        )
    )

    This code sets up the basic Kubernetes resources needed to scale an LLM inference service and deploy Infinispan. Here is a breakdown of what the Pulumi program does:

    • Creates a Deployment named llm-inference-deployment with 3 replicas. Each replica runs a container built from the your-inference-container-image placeholder, which you need to replace with the image that actually contains your LLM inference code. The container also receives a CACHE_HOST environment variable pointing at the Infinispan service so the inference code knows where to reach the cache.
    • Creates a Service named llm-inference-service to expose the Deployment to the internet via a LoadBalancer. This allows the LLM inference service to be called from outside the Kubernetes cluster.
    • Uses ConfigGroup to apply the Infinispan manifest. You must create a separate YAML manifest file named infinispan-deployment.yaml with the configuration for deploying Infinispan on Kubernetes.
    • Exports the external IP address assigned to the LoadBalancer, making it easy to access the inference service.
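
    The CACHE_HOST environment variable only tells the inference container where Infinispan lives; the application code still has to use it. The sketch below shows one way a Python inference service could do read-through caching over Infinispan's REST API. The cache name, credentials, and authentication scheme here are assumptions (Infinispan servers typically require BASIC or DIGEST authentication depending on how the security realm is configured), the named cache must already exist on the server, and a Hot Rod client would be the lower-latency alternative.

    import hashlib
    import os
    import requests

    CACHE_HOST = os.environ.get("CACHE_HOST", "infinispan-service")
    CACHE_NAME = "llm-responses"  # Hypothetical cache name; it must already exist on the server.
    BASE_URL = f"http://{CACHE_HOST}:11222/rest/v2/caches/{CACHE_NAME}"
    # Credentials and auth scheme depend on your Infinispan configuration; BASIC is assumed here.
    AUTH = ("admin", "changeme")

    def cached_generate(prompt: str, generate_fn) -> str:
        """Return a cached response for the prompt if present, otherwise compute and store it."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()  # Stable key across replicas.
        hit = requests.get(f"{BASE_URL}/{key}", auth=AUTH)
        if hit.status_code == 200:
            return hit.text                       # Cache hit: skip inference entirely.
        answer = generate_fn(prompt)              # Cache miss: run the expensive LLM inference.
        requests.put(f"{BASE_URL}/{key}", data=answer, auth=AUTH)
        return answer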

    Please note that you'll need to replace the your-inference-container-image placeholder with your real container image and provide the infinispan-deployment.yaml file with the appropriate content for deploying Infinispan in your cluster.
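
    If you would rather keep everything in Python instead of maintaining a separate manifest file, the Infinispan pieces can also be declared directly as Pulumi resources, as a rough equivalent of what infinispan-deployment.yaml would contain. The sketch below is a minimal single-node setup: the image tag, the 11222 port, and the USER/PASS environment variables are assumptions based on the public infinispan/server image, and a production deployment would normally use the Infinispan Operator, persistence, and proper secret management instead.

    # Single-node Infinispan Deployment; values marked as assumptions should be verified
    # against the image version you actually deploy.
    infinispan_deployment = k8s.apps.v1.Deployment(
        "infinispan-deployment",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=1,  # One node for illustration; clustering needs extra configuration.
            selector=k8s.meta.v1.LabelSelectorArgs(match_labels={"app": "infinispan"}),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(labels={"app": "infinispan"}),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name="infinispan",
                        image="infinispan/server:latest",  # Assumed image; pin a concrete version.
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=11222)],
                        env=[
                            k8s.core.v1.EnvVarArgs(name="USER", value="admin"),     # Assumed image convention.
                            k8s.core.v1.EnvVarArgs(name="PASS", value="changeme"),  # Use a Secret in practice.
                        ],
                    )]
                ),
            ),
        )
    )

    # The Service name is set explicitly so it matches the CACHE_HOST value ("infinispan-service")
    # used by the inference Deployment; otherwise Pulumi would auto-name it with a random suffix.
    infinispan_service = k8s.core.v1.Service(
        "infinispan-service",
        metadata=k8s.meta.v1.ObjectMetaArgs(name="infinispan-service"),
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "infinispan"},
            ports=[k8s.core.v1.ServicePortArgs(port=11222, target_port=11222)],
            type="ClusterIP",  # The cache only needs to be reachable inside the cluster.
        )
    )

    If you take this route, the ConfigGroup resource and the separate infinispan-deployment.yaml file are no longer needed.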