1. Stateful AI Model Training on Kubernetes with OpenEBS


    To accomplish stateful AI model training on Kubernetes with OpenEBS, the following steps and resources will be used:

    1. Kubernetes Cluster: You need a Kubernetes cluster to deploy your workloads. This can be an existing cluster or one created on a cloud provider such as AWS, Azure, or GCP, or any other provider supported by Pulumi (a brief provisioning sketch follows this list).

    2. OpenEBS for Persistent Storage: OpenEBS provides persistent, container-attached block storage for Kubernetes environments. It is essential for stateful applications that must keep their data across pod rescheduling and restarts (an installation sketch follows this list).

    3. StatefulSet: Kubernetes StatefulSet is used to manage stateful applications. It provides stable, unique network identifiers, stable, persistent storage, and ordered, graceful deployment and scaling.

    4. PersistentVolume (PV) and PersistentVolumeClaim (PVC): PVs represent storage resources in the cluster, and PVCs are requests for storage by users. OpenEBS manages PVs to provide persistent storage for your stateful application.

    5. StorageClass: To define how volumes should be created, you need to set up a StorageClass. For OpenEBS, this would involve setting parameters specific to OpenEBS storage provisioners.

    6. AI Model Training Job: This can be represented as a Kubernetes Job resource or as a Pod managed by the StatefulSet (a Job-based alternative is sketched at the end of this section).
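
    For step 1, the cluster itself can also be provisioned with Pulumi. Below is a minimal sketch using the pulumi_eks package to stand up a small EKS cluster; the resource name, node counts, and instance type are illustrative assumptions, not requirements of the rest of this guide.

    import pulumi
    import pulumi_eks as eks

    # Provision a small EKS cluster to host the training workload.
    # The sizing below is an illustrative assumption; adjust it to your needs.
    cluster = eks.Cluster(
        "ai-training-cluster",
        instance_type="t3.large",
        desired_capacity=2,
        min_size=1,
        max_size=3,
    )

    # Export the kubeconfig so kubectl and the Kubernetes provider can reach the cluster.
    pulumi.export("kubeconfig", cluster.kubeconfig)

    The exported kubeconfig can be passed to a k8s.Provider instance so that the OpenEBS and StatefulSet resources defined later are deployed into this cluster.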
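
    For step 2, OpenEBS must be installed in the cluster before its provisioner and StorageClass can be used. A minimal sketch using Pulumi's Helm support is shown below; the chart repository URL, namespace, and unpinned chart version are assumptions, so verify them against the OpenEBS release you intend to run.

    import pulumi_kubernetes as k8s

    # Install OpenEBS from its Helm chart into a dedicated namespace.
    # Chart name and repository URL are assumptions; check them against
    # the OpenEBS documentation for your target version.
    openebs = k8s.helm.v3.Release(
        "openebs",
        k8s.helm.v3.ReleaseArgs(
            chart="openebs",
            namespace="openebs",
            create_namespace=True,
            repository_opts=k8s.helm.v3.RepositoryOptsArgs(
                repo="https://openebs.github.io/charts",
            ),
        ),
    )

    Resources that rely on OpenEBS, such as the StorageClass and PVC in the program below, can declare pulumi.ResourceOptions(depends_on=[openebs]) so they are created only after the chart has finished installing.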

    The program below sets up a StatefulSet with a PVC backed by OpenEBS, on which you can then run your AI model training workloads. Replace the placeholder values in the code, such as the container image, with the configuration details of your own AI model training workload.

    import pulumi
    import pulumi_kubernetes as k8s

    # Create a StorageClass resource for OpenEBS.
    storage_class = k8s.storage.v1.StorageClass(
        "openebs-sc",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="openebs-sc",
        ),
        provisioner="openebs.io/provisioner-iscsi",
        parameters={
            "openebs.io/cas-type": "iscsi",
            "openebs.io/fstype": "xfs",
        },
    )

    # Create a PersistentVolumeClaim resource which will be used by the StatefulSet.
    persistent_volume_claim = k8s.core.v1.PersistentVolumeClaim(
        "openebs-pvc",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="openebs-pvc",
        ),
        spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
            access_modes=["ReadWriteOnce"],
            storage_class_name=storage_class.metadata.name,
            resources=k8s.core.v1.ResourceRequirementsArgs(
                requests={"storage": "10Gi"},
            ),
        ),
    )

    # Define the StatefulSet with the OpenEBS PVC for the AI model training.
    stateful_set = k8s.apps.v1.StatefulSet(
        "ai-model-training-ss",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="ai-model-training",
        ),
        spec=k8s.apps.v1.StatefulSetSpecArgs(
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "ai-model-training"},
            ),
            service_name="ai-training-service",
            replicas=1,
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "ai-model-training"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="training-container",
                            image="your-ai-model-training-image:latest",  # Replace with your AI training container image
                            volume_mounts=[
                                k8s.core.v1.VolumeMountArgs(
                                    name="model-data",
                                    mount_path="/data",
                                ),
                            ],
                            # Define additional container specs as needed for the training job
                        ),
                    ],
                    volumes=[
                        k8s.core.v1.VolumeArgs(
                            name="model-data",
                            persistent_volume_claim=k8s.core.v1.PersistentVolumeClaimVolumeSourceArgs(
                                claim_name=persistent_volume_claim.metadata.name,
                            ),
                        ),
                    ],
                ),
            ),
        ),
    )

    # Export the StatefulSet name.
    pulumi.export("stateful_set_name", stateful_set.metadata.name)

    The above Pulumi program defines a Kubernetes StatefulSet that uses a PersistentVolumeClaim backed by OpenEBS. It specifies an OpenEBS StorageClass and requests a 10Gi persistent volume with ReadWriteOnce access mode, which allows the volume to be mounted read-write by a single node at a time; this suits AI/ML workloads where one pod needs to retain its data even if it goes down.

    The StatefulSet has a single replica and mounts the persistent storage at the /data path inside the container, which is intended to hold the AI model training data. This path must match the storage path that your AI training application expects.
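
    To illustrate how the /data mount is typically used, here is a minimal sketch of the checkpointing logic a training script inside the container might implement: write periodic checkpoints under /data and resume from the newest one after a restart. The directory layout and JSON checkpoint format are assumptions; adapt them to whatever framework your image actually uses.

    import json
    import os
    from pathlib import Path
    from typing import Optional

    # Same path as the volume mount defined in the StatefulSet above.
    CHECKPOINT_DIR = Path("/data/checkpoints")


    def save_checkpoint(epoch: int, state: dict) -> None:
        """Persist training state to the OpenEBS-backed volume."""
        CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
        with open(CHECKPOINT_DIR / f"epoch-{epoch}.json", "w") as f:
            json.dump({"epoch": epoch, "state": state}, f)


    def load_latest_checkpoint() -> Optional[dict]:
        """Resume from the newest checkpoint if the pod was restarted."""
        if not CHECKPOINT_DIR.exists():
            return None
        files = sorted(CHECKPOINT_DIR.glob("epoch-*.json"), key=os.path.getmtime)
        if not files:
            return None
        with open(files[-1]) as f:
            return json.load(f)


    if __name__ == "__main__":
        resumed = load_latest_checkpoint()
        start_epoch = resumed["epoch"] + 1 if resumed else 0
        for epoch in range(start_epoch, 10):
            # ... run one epoch of training here ...
            save_checkpoint(epoch, {"note": "placeholder state"})

    Because the checkpoints live on the OpenEBS volume rather than in the container filesystem, the loop picks up where it left off after the pod is rescheduled.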

    Make sure to replace the container image and other configuration values to fit your AI model's requirements. With this setup, data written to the mounted path persists across container restarts and rescheduling onto other nodes in the Kubernetes cluster.

    After deploying this Pulumi program (for example, with pulumi up), you can proceed with your AI model training job, which will have a persistent storage backend provided by OpenEBS for stateful data such as model checkpoints and training datasets.
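
    If your training workload is a one-shot run rather than a long-lived process, step 6 can instead be modelled as a Kubernetes Job that mounts the same PVC. The following hedged sketch assumes it lives in the same Pulumi program as the code above (so that persistent_volume_claim is in scope); the image name is a placeholder.

    # Alternative to the StatefulSet: a one-shot training run as a Kubernetes Job.
    training_job = k8s.batch.v1.Job(
        "ai-model-training-job",
        spec=k8s.batch.v1.JobSpecArgs(
            backoff_limit=2,
            template=k8s.core.v1.PodTemplateSpecArgs(
                spec=k8s.core.v1.PodSpecArgs(
                    restart_policy="Never",
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="training-container",
                            image="your-ai-model-training-image:latest",  # Placeholder image
                            volume_mounts=[
                                k8s.core.v1.VolumeMountArgs(
                                    name="model-data",
                                    mount_path="/data",
                                ),
                            ],
                        ),
                    ],
                    volumes=[
                        k8s.core.v1.VolumeArgs(
                            name="model-data",
                            persistent_volume_claim=k8s.core.v1.PersistentVolumeClaimVolumeSourceArgs(
                                claim_name=persistent_volume_claim.metadata.name,
                            ),
                        ),
                    ],
                ),
            ),
        ),
    )

    Note that a ReadWriteOnce volume can be attached to only one node at a time, so the Job should not run concurrently with the StatefulSet pod if the two could land on different nodes.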