1. StatefulSet Storage for LLM Checkpointing

    StatefulSets in Kubernetes are typically used to manage stateful applications, where it's crucial to preserve each pod's identity and storage. This is particularly important for tasks like LLM (large language model) checkpointing, where you need to reliably save and restore model weights and other training state.

    To achieve this with Pulumi and Kubernetes, we'll take advantage of two key Kubernetes concepts: StatefulSets and PersistentVolumeClaims (PVCs). A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of those Pods. Alongside this, we use PVCs, which let pods persist data even after a pod is rescheduled or deleted.

    Here's how we will set that up using Pulumi with Python:

    1. We'll create a StatefulSet resource to manage our stateful pods. This will define the desired state for our pods, including the container image to use, the number of replicas, and the volume mounts.

    2. We'll define a PersistentVolumeClaim template for the pods in the StatefulSet. A PVC is a request for storage, which Kubernetes satisfies by provisioning a volume from the default or a specified StorageClass.

    3. The StatefulSet will include this claim in its volume claim templates, so Kubernetes creates a dedicated PVC for each replica and each pod gets its own persistent storage.

    Each pod in a StatefulSet has a stable hostname derived from the name of the StatefulSet and the ordinal index of the pod, which is a perfect fit for scenarios like LLM checkpointing, where you might need to ensure that a specific pod handles specific parts of a dataset.
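
    Before looking at the infrastructure code, here is a minimal sketch of how a training script running inside one of these pods might derive its shard of the work from the pod's stable hostname. The `<name>-<ordinal>` naming convention is standard StatefulSet behavior; the `REPLICAS` constant and the sharding logic are purely illustrative assumptions.

    import socket

    REPLICAS = 3  # assumed to match the StatefulSet's replica count

    def pod_ordinal() -> int:
        # StatefulSet pods are named <statefulset-name>-<ordinal>, so the
        # trailing number tells this pod which slice of the work it owns.
        hostname = socket.gethostname()
        return int(hostname.rsplit("-", 1)[-1])

    # Purely illustrative sharding: each replica takes every Nth record.
    ordinal = pod_ordinal()
    my_records = [i for i in range(1000) if i % REPLICAS == ordinal]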

    Now let's see what the Pulumi code looks like:

    import pulumi
    import pulumi_kubernetes as k8s

    # Define the PersistentVolumeClaim template used by the StatefulSet.
    # The StatefulSet controller creates one PVC per replica from this template.
    checkpoint_claim_template = k8s.core.v1.PersistentVolumeClaimArgs(
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="checkpointing-storage",  # Must match the volume mount name below
        ),
        spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
            access_modes=["ReadWriteOnce"],  # Each PVC is mounted as read-write by a single node
            resources=k8s.core.v1.ResourceRequirementsArgs(
                requests={
                    "storage": "10Gi",  # Requesting 10Gi of storage per replica - adjust as needed
                },
            ),
        ),
    )

    # Define the StatefulSet
    stateful_set = k8s.apps.v1.StatefulSet(
        "llm-checkpointing-statefulset",
        spec=k8s.apps.v1.StatefulSetSpecArgs(
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "llm-checkpointing"},  # Must match the pod template labels below
            ),
            service_name="llm-checkpointing",  # The name of the headless Service that governs this StatefulSet
            replicas=3,  # Number of desired pods
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "llm-checkpointing"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="llm-checkpointing-container",
                            image="your-llm-container-image",  # Use your actual image here
                            volume_mounts=[
                                k8s.core.v1.VolumeMountArgs(
                                    name="checkpointing-storage",  # Refers to the claim template above
                                    mount_path="/data",  # Your container writes checkpoints to this directory
                                ),
                            ],
                        ),
                    ],
                ),
            ),
            volume_claim_templates=[checkpoint_claim_template],  # One PVC per replica is created from this template
        ),
    )

    # Export the StatefulSet name so you can easily retrieve it with `pulumi stack output`
    pulumi.export("statefulset_name", stateful_set.metadata["name"])

    Here's what each part of the code is doing:

    • The PersistentVolumeClaim template is a storage request that asks Kubernetes to provision a 10Gi disk, mounted as read-write by a single node, for each replica. Kubernetes names each resulting PVC after the template and the pod, e.g. checkpointing-storage-<statefulset-name>-0.
    • The StatefulSet describes the desired state for our application. It specifies that we want three replicas (adjust as needed for your use case).
    • Each pod in the StatefulSet mounts its persistent storage at the /data path; this is where checkpoints should be written and read.
    • The StatefulSet is associated with a Service named llm-checkpointing, which provides a stable network identity for each pod (a sketch of such a Service follows below).
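
    The Pulumi program above assumes that governing Service already exists; it does not create one. A minimal sketch of a headless Service that could fill that role might look like the following. The explicit metadata name is chosen so it matches the service_name in the StatefulSet spec, and the port is purely illustrative.

    import pulumi_kubernetes as k8s

    # A headless Service (cluster_ip="None") gives each StatefulSet pod a stable
    # DNS name of the form <pod-name>.llm-checkpointing.<namespace>.svc.
    headless_service = k8s.core.v1.Service(
        "llm-checkpointing-service",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="llm-checkpointing",  # Must match service_name in the StatefulSet spec
        ),
        spec=k8s.core.v1.ServiceSpecArgs(
            cluster_ip="None",  # The literal string "None" marks the Service as headless
            selector={"app": "llm-checkpointing"},
            ports=[k8s.core.v1.ServicePortArgs(name="http", port=8080)],  # Illustrative; match your container
        ),
    )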

    Make sure to replace your-llm-container-image with the actual image you want to use for your LLM application.
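
    One way to avoid hardcoding the image is to read it from Pulumi config. This is a small sketch; the config key llm-image is an arbitrary name you would set yourself.

    import pulumi

    config = pulumi.Config()
    # Set the value with: pulumi config set llm-image <your-image>
    llm_image = config.require("llm-image")
    # ...then pass image=llm_image to ContainerArgs instead of the literal string.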

    This code is Python and uses the Pulumi Kubernetes SDK. To run it, your Pulumi project needs access to your Kubernetes cluster; by default the provider uses your local kubeconfig context.
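
    If you prefer not to rely on the ambient kubeconfig, you can bind the resources to an explicit provider. This is a minimal sketch; the kubeconfig path is an assumption.

    import pulumi
    import pulumi_kubernetes as k8s

    # Explicit provider bound to a specific kubeconfig file (path is illustrative).
    with open("/path/to/kubeconfig") as f:
        kubeconfig_contents = f.read()

    cluster_provider = k8s.Provider("target-cluster", kubeconfig=kubeconfig_contents)

    # Resources that should target this cluster then take:
    #   opts=pulumi.ResourceOptions(provider=cluster_provider)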

    Once you apply this configuration with pulumi up, the Kubernetes cluster will keep your LLM application running with the desired number of replicas, each with its own persistent storage for checkpointing. If a pod fails, Kubernetes recreates it under the same name and reattaches the same PVC, so checkpoint data survives the restart; rescheduling onto a different node also works, provided the underlying volume can be attached there (this depends on your StorageClass and the ReadWriteOnce access mode).
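
    To make the checkpointing flow concrete, here is a minimal, framework-agnostic sketch of what the process inside the container might do with the /data mount. The file name and pickle format are illustrative assumptions, not part of the Pulumi program.

    import os
    import pickle
    import tempfile
    from typing import Optional

    CHECKPOINT_DIR = "/data"  # matches mount_path in the StatefulSet above
    CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "latest.ckpt")

    def save_checkpoint(state: dict) -> None:
        # Write to a temporary file and rename it, so a crash mid-write never
        # leaves a truncated checkpoint (rename is atomic within one filesystem).
        fd, tmp_path = tempfile.mkstemp(dir=CHECKPOINT_DIR)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, CHECKPOINT_PATH)

    def load_checkpoint() -> Optional[dict]:
        # If this pod was restarted or rescheduled, resume from the previous state.
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH, "rb") as f:
                return pickle.load(f)
        return None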

    For more information, here are some relevant pages from the Pulumi documentation: