Kubernetes Stateful Workloads for AI with Rook-Ceph Block Storage

Pulumi · Accepted Answer

We're going to create a Kubernetes StatefulSet that uses Rook-Ceph as its storage backend for running an AI workload. StatefulSets are ideal for stateful applications and databases that require stable, unique network identifiers, stable persistent storage, and ordered, graceful deployment and scaling.
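To make "stable, unique network identifiers" concrete, here is a minimal sketch of the naming scheme Kubernetes applies to StatefulSet replicas. The function name and its arguments are hypothetical, and it assumes the default `cluster.local` cluster domain and that the claims come from a `volumeClaimTemplates` entry called `storage`.

```python
def statefulset_identities(statefulset_name, service_name, namespace,
                           replicas, claim_template="storage"):
    """Return the pod, DNS, and PVC names Kubernetes derives for each replica.

    Assumes the default cluster domain (cluster.local) and a headless
    governing Service; claim_template is the volumeClaimTemplates entry name.
    """
    identities = []
    for ordinal in range(replicas):
        # Pods are named <statefulset>-<ordinal>, starting at 0.
        pod = f"{statefulset_name}-{ordinal}"
        identities.append({
            "pod": pod,
            # Per-pod DNS record published through the headless Service.
            "dns": f"{pod}.{service_name}.{namespace}.svc.cluster.local",
            # PVCs created from a claim template are named <template>-<pod>.
            "pvc": f"{claim_template}-{pod}",
        })
    return identities


if __name__ == "__main__":
    for identity in statefulset_identities("ai-workload", "ai-service", "default", 3):
        print(identity["pod"], identity["dns"], identity["pvc"])
```

Because these names survive rescheduling, a replica that restarts on another node reattaches to the same claim and keeps the same DNS name.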

Rook is an open-source cloud-native storage orchestrator for Kubernetes, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud-native environments. Ceph is a highly scalable distributed storage solution and provides block, object, and file storage in a unified system.

Here's how to get started with a StatefulSet that uses Rook-Ceph for its volumes:

1. **Ensure Rook-Ceph is Deployed**: Before setting up your StatefulSet, you need to have a Rook-Ceph cluster running in your Kubernetes cluster. This involves setting up the Rook operator and creating a CephCluster resource.

2. **Create a StorageClass**: This defines the storage provisioner (in this case, Ceph RBD) and pool details. Persistent volumes (PVs) will be dynamically provisioned as needed by PersistentVolumeClaims (PVCs) using this StorageClass.

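3. **Define PersistentVolumeClaims**: These request storage from the StorageClass on behalf of the StatefulSet. Because a `ReadWriteOnce` block volume can be attached to only one node at a time, each replica needs its own claim, which a StatefulSet generates from its `volumeClaimTemplates`.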

4. **Create the StatefulSet**: This will run your application, referencing the PVCs for storage needs.
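Steps 1 and 2 can be sketched as plain manifests. The block below is an illustration rather than a definitive configuration: it builds a `CephBlockPool` and a matching `StorageClass` as Python dictionaries, modeled on Rook's example block-storage manifests. The pool name `replicapool`, the `rook-ceph` namespace (assumed to hold both the operator and the cluster), the CSI secret names, and the helper function itself are assumptions to check against your own Rook installation.

```python
import json


def rook_block_manifests(pool="replicapool", namespace="rook-ceph",
                         storage_class="rook-ceph-block", replica_size=3):
    """Build CephBlockPool and StorageClass manifests as plain dicts.

    All defaults are assumptions based on Rook's example manifests.
    """
    block_pool = {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        "metadata": {"name": pool, "namespace": namespace},
        "spec": {
            "failureDomain": "host",
            "replicated": {"size": replica_size},
        },
    }
    csi = "csi.storage.k8s.io"
    storage_class_manifest = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": storage_class},
        # The provisioner is prefixed with the Rook operator namespace.
        "provisioner": f"{namespace}.rbd.csi.ceph.com",
        "parameters": {
            "clusterID": namespace,
            "pool": pool,
            "imageFormat": "2",
            "imageFeatures": "layering",
            f"{csi}/provisioner-secret-name": "rook-csi-rbd-provisioner",
            f"{csi}/provisioner-secret-namespace": namespace,
            f"{csi}/controller-expand-secret-name": "rook-csi-rbd-provisioner",
            f"{csi}/controller-expand-secret-namespace": namespace,
            f"{csi}/node-stage-secret-name": "rook-csi-rbd-node",
            f"{csi}/node-stage-secret-namespace": namespace,
            f"{csi}/fstype": "ext4",
        },
        "reclaimPolicy": "Delete",
        "allowVolumeExpansion": True,
    }
    return block_pool, storage_class_manifest


if __name__ == "__main__":
    for manifest in rook_block_manifests():
        print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, each printed manifest could be saved to a file and applied once the Rook operator and CephCluster are running.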

Below is a Pulumi program written in Python that outlines the steps you need to create the StatefulSet and its associated resources. Please ensure you've configured Pulumi with access to your Kubernetes cluster.

```python
import pulumi
import pulumi_kubernetes as k8s

# The following code assumes that Rook-Ceph is already deployed in the cluster,
# that a Ceph Block Pool and a StorageClass named "rook-ceph-block" exist, and
# that a headless Service named "ai-service" selects the pods labeled app=ai-workload.

# Step 1: Describe the PersistentVolumeClaim that each replica needs.
# A ReadWriteOnce RBD volume can be attached to only one node at a time, so a
# single shared PVC cannot reliably back three replicas. Passing the claim as a
# volume claim template makes the StatefulSet create one PVC per pod, named
# storage-<statefulset-name>-<ordinal>.
storage_claim = k8s.core.v1.PersistentVolumeClaimArgs(
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="storage"  # Must match the volume mount name in the container
    ),
    spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
        access_modes=["ReadWriteOnce"],  # One node at a time, the usual mode for block volumes
        resources=k8s.core.v1.ResourceRequirementsArgs(
            requests={
                "storage": "10Gi"  # Size of the volume requested for each replica
            },
        ),
        storage_class_name="rook-ceph-block"  # This should match the StorageClass name created by Rook-Ceph
    ),
)

# Step 2: Create a StatefulSet whose pods each get their own PVC from the template.
statefulset = k8s.apps.v1.StatefulSet(
    "ai-workload-statefulset",
    spec=k8s.apps.v1.StatefulSetSpecArgs(
        service_name="ai-service",  # The name of the headless service that governs this StatefulSet
        replicas=3,  # Number of desired replicas
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={
                "app": "ai-workload"
            }
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={
                    "app": "ai-workload"  # Must match the selector above
                }
            ),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="ai-container",
                        image="ai-application-image",  # Replace with the actual image you need for AI workload
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                        volume_mounts=[k8s.core.v1.VolumeMountArgs(
                            name="storage",  # The name must match the volume claim template name.
                            mount_path="/data"  # The path in the container where the volume should be mounted.
                        )],
                    ),
                ],
            ),
        ),
        # One PVC is created from this template for every replica.
        volume_claim_templates=[storage_claim],
        # "RollingUpdate" is the StatefulSet default; it is set explicitly here
        # to show that pods are replaced one at a time to limit downtime.
        update_strategy=k8s.apps.v1.StatefulSetUpdateStrategyArgs(
            type="RollingUpdate"
        ),
    )
)
```
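The program sets `service_name="ai-service"` but does not create that Service. A StatefulSet relies on a headless Service (one with `clusterIP: None`) to publish per-pod DNS records, so a minimal sketch of one is shown here as a plain manifest. The helper name is hypothetical, and the port is assumed to mirror the container port used in the program.

```python
import json


def headless_service_manifest(name="ai-service", app_label="ai-workload", port=8080):
    """Build a headless Service manifest for the StatefulSet's pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": {"app": app_label}},
        "spec": {
            # "None" makes the Service headless: no virtual IP is allocated,
            # and DNS resolves directly to the individual pod addresses.
            "clusterIP": "None",
            "selector": {"app": app_label},
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(headless_service_manifest(), indent=2))
```

The selector has to match the pod labels in the StatefulSet template, and the Service name has to match `service_name`, otherwise the per-pod DNS names will not resolve.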

Replace `ai-application-image` with the actual container image for your AI workload, and adjust the `storage` request (`10Gi` here) to match the requirements of your workload.

Lastly, apply this configuration with `pulumi up`, and Pulumi will provision the necessary Kubernetes resources described above.