1. Kubernetes Persistent Volumes for Distributed Training Checkpoints


    In a Kubernetes cluster, a PersistentVolume (PV) is a piece of storage that has been provisioned by an administrator. It is a resource in the cluster just like a node is a cluster resource. PersistentVolumes are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This resource is used when you need persistent storage for your application that survives pod restarts or failures.

    A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod in that Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Similarly, PVCs can request specific size and access modes (e.g., they can be mounted once read/write or many times read-only).

    To use Persistent Volumes for distributed training checkpoints in a Kubernetes cluster, you need to:

    1. Create a PersistentVolume that represents the physical storage.
    2. Create a PersistentVolumeClaim that a pod will use to request the physical storage.

    Here's a simple program that demonstrates how to define a PersistentVolume and a PersistentVolumeClaim in Pulumi using the Kubernetes provider:

```python
import pulumi
import pulumi_kubernetes as kubernetes

# Define a PersistentVolume using a local path on the node.
# You can change the storage source to match your requirement
# (e.g., NFS, iSCSI, or cloud-specific storage).
persistent_volume = kubernetes.core.v1.PersistentVolume(
    "pv-checkpoints",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="checkpoints-vol",  # Name of the PV
    ),
    spec=kubernetes.core.v1.PersistentVolumeSpecArgs(
        capacity={"storage": "10Gi"},               # Size of the volume
        access_modes=["ReadWriteOnce"],             # Access mode
        persistent_volume_reclaim_policy="Retain",  # Retain the PV after use
        host_path=kubernetes.core.v1.HostPathVolumeSourceArgs(
            path="/mnt/data",  # Path on the host node
        ),
    ),
)

# Define a PersistentVolumeClaim for the pod to use.
# The PVC will match the PV based on access modes and storage size.
persistent_volume_claim = kubernetes.core.v1.PersistentVolumeClaim(
    "pvc-checkpoints",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name="checkpoints-pvc",  # Name of the PVC
    ),
    spec=kubernetes.core.v1.PersistentVolumeClaimSpecArgs(
        access_modes=["ReadWriteOnce"],  # Must match the access modes of the PV
        resources=kubernetes.core.v1.ResourceRequirementsArgs(
            requests={"storage": "10Gi"},  # Requested size of the volume
        ),
    ),
)

# Export the names of the volume and claim so they can be referenced later.
pulumi.export("persistent_volume_name", persistent_volume.metadata["name"])
pulumi.export("persistent_volume_claim_name", persistent_volume_claim.metadata["name"])
```

    In this program:

    • We begin by importing the required Pulumi packages for Kubernetes.
    • We create a PersistentVolume named pv-checkpoints.
      • The PV is set up to use the local storage on a node in the cluster under the /mnt/data path.
      • We set the capacity to 10Gi and declare the ReadWriteOnce access mode, which means the volume can be mounted read-write by a single node.
      • We set the persistent_volume_reclaim_policy to Retain, which tells Kubernetes to retain the underlying storage when the PV is released from a claim.
    • We define a PersistentVolumeClaim named pvc-checkpoints.
      • The PVC requests access modes and a storage size that match the PersistentVolume definition, so Kubernetes will bind the claim to the PersistentVolume we created.
    • Finally, we export the names of the PV and PVC so they can be easily referenced, such as when attaching the PVC to pods in your deployment that perform distributed training.

    This example uses local storage; however, you should adjust the storage source (host_path in this case) in the PersistentVolume definition according to the storage solution you're actually using. Whether it's a network file system like NFS, block storage like AWS EBS, or any other storage type supported by Kubernetes, the corresponding configuration goes in the PersistentVolume's spec.
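    For instance, an NFS-backed volume would replace the host_path source with an nfs source. This is only a sketch: the server address and export path below are placeholders for your actual NFS server, not values from this example.

```python
import pulumi_kubernetes as kubernetes

# Hypothetical NFS-backed PersistentVolume; replace the server address
# and export path with those of your actual NFS server.
nfs_volume = kubernetes.core.v1.PersistentVolume(
    "pv-checkpoints-nfs",
    spec=kubernetes.core.v1.PersistentVolumeSpecArgs(
        capacity={"storage": "10Gi"},
        # NFS supports ReadWriteMany, so pods on multiple nodes can mount it.
        access_modes=["ReadWriteMany"],
        persistent_volume_reclaim_policy="Retain",
        nfs=kubernetes.core.v1.NFSVolumeSourceArgs(
            server="10.0.0.5",            # Placeholder NFS server address
            path="/exports/checkpoints",  # Placeholder export path
        ),
    ),
)
```

    A PVC bound to this PV would then request the ReadWriteMany access mode instead of ReadWriteOnce.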

    You can use the persistent_volume_claim_name in your Pod spec to mount the volume and store checkpoints during the distributed training process. Note that with ReadWriteOnce, only pods scheduled on the node where the volume is mounted can share it; for multi-node distributed training, choose a backend that supports ReadWriteMany (such as NFS) so that every worker can retrieve and update the same training checkpoints.
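    As a sketch of that last step, the pod below mounts the claim defined earlier; the container image, command, and mount path are placeholders for your actual training workload.

```python
import pulumi_kubernetes as kubernetes

# Hypothetical training pod that mounts the PVC defined above.
training_pod = kubernetes.core.v1.Pod(
    "training-worker",
    spec=kubernetes.core.v1.PodSpecArgs(
        containers=[
            kubernetes.core.v1.ContainerArgs(
                name="trainer",
                image="my-registry/trainer:latest",  # Placeholder image
                command=["python", "train.py"],      # Placeholder entrypoint
                volume_mounts=[
                    kubernetes.core.v1.VolumeMountArgs(
                        name="checkpoints",
                        mount_path="/checkpoints",  # Checkpoints written here persist
                    ),
                ],
            ),
        ],
        volumes=[
            kubernetes.core.v1.VolumeArgs(
                name="checkpoints",
                persistent_volume_claim=kubernetes.core.v1.PersistentVolumeClaimVolumeSourceArgs(
                    claim_name="checkpoints-pvc",  # The PVC created earlier
                ),
            ),
        ],
    ),
)
```

    The training script then simply reads and writes under /checkpoints; because the directory is backed by the PV, checkpoints survive pod restarts and rescheduling.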