1. Distributed Training File System via AWS EFS Access Points


    To set up a distributed training file system on AWS, we'll use Amazon Elastic File System (EFS) to create a centrally located file system that can be accessed by multiple EC2 instances training models in parallel. We'll also use EFS Access Points, which are application-specific entry points into an EFS file system that simplify managing access to shared datasets for distributed workloads.

    The EFS file system allows training jobs running on different EC2 instances to share common data, enabling synchronized training while ensuring that each job has the right permissions and its own workspace within the file system.

    Below is a Pulumi program in Python that creates an EFS file system and an access point into that file system. Each access point enforces a specific POSIX user and group, and can optionally specify a root directory in the file system to which the access point will point.

    The program includes:

    1. An EFS File System: the distributed file system that your training jobs will access.
    2. An EFS Access Point: this access point will enforce a specific POSIX user and group, allowing you to control access to the file system. Optional properties to control the root directory can also be set.

    Here is the Python program that sets up your distributed training file system:

    ```python
    import pulumi
    import pulumi_aws as aws

    # Create an EFS File System
    efs_file_system = aws.efs.FileSystem("trainingEfsFileSystem",
        # Enforce encryption for data at rest
        encrypted=True,
        # Define a lifecycle policy to save costs by transitioning files
        # not accessed over a period to Infrequent Access
        lifecycle_policy=aws.efs.FileSystemLifecyclePolicyArgs(
            transition_to_ia="AFTER_30_DAYS",
        ),
        # Provisioned Throughput mode with specified throughput (in MiB/s)
        # can be used for predictable load
        # throughput_mode="provisioned",
        # provisioned_throughput_in_mibps=1024,
        # Tags can be used to add metadata to the file system
        tags={"Name": "trainingFileSystem"},
    )

    # Specify root directory and POSIX permissions with the EFS Access Point.
    # This could be the working directory for a specific training job or user.
    efs_access_point = aws.efs.AccessPoint("trainingAccessPoint",
        file_system_id=efs_file_system.id,
        posix_user=aws.efs.AccessPointPosixUserArgs(
            # Define POSIX user with UID and GID
            uid=1001,
            gid=1001,
        ),
        root_directory=aws.efs.AccessPointRootDirectoryArgs(
            # Define root directory for the training job
            path="/trainingData",
            creation_info=aws.efs.AccessPointRootDirectoryCreationInfoArgs(
                # Define permissions and ownership for the directory
                owner_gid=1001,
                owner_uid=1001,
                permissions="700",
            ),
        ),
        tags={"Name": "trainingAccessPoint"},
    )

    # Export the IDs of the created resources to the output
    pulumi.export("efs_file_system_id", efs_file_system.id)
    pulumi.export("efs_access_point_id", efs_access_point.id)
    ```
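    Note that before any EC2 instance can mount the file system, EFS also requires a mount target in each subnet the instances run in; the program above does not create them. A minimal sketch of adding one, assuming the `efs_file_system` resource from above and hypothetical placeholder subnet and security group IDs (substitute your own VPC resources):

    ```python
    import pulumi
    import pulumi_aws as aws

    # Hypothetical IDs for illustration -- replace with your own VPC resources.
    # The security group must allow inbound NFS traffic (TCP port 2049)
    # from the training instances.
    mount_target = aws.efs.MountTarget("trainingMountTarget",
        file_system_id=efs_file_system.id,           # from the program above
        subnet_id="subnet-0123456789abcdef0",        # placeholder subnet ID
        security_groups=["sg-0123456789abcdef0"],    # placeholder security group
    )
    ```

    One mount target per Availability Zone is typically enough, since all instances in that AZ can share it.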

    This program does the following:

    • It starts by importing the necessary Pulumi and AWS libraries.
    • Then, it defines the EFS file system with encryption enabled and a policy to transition files to infrequent access storage class to save costs.
    • It creates an EFS access point with a specified POSIX user and group to enforce ownership and permissions, as well as a root directory that new training jobs can use to store their data. It also sets the directory permission to 700, meaning that only the owner will have read, write, and execute permissions.
    • Finally, it exports the EFS file system and access point IDs, which can be used in other parts of your infrastructure or for reference.
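    The `permissions="700"` value is an octal POSIX mode string. A quick illustration with Python's standard `stat` module of exactly which bits that mode grants:

    ```python
    import stat

    mode = 0o700  # the "700" from creation_info, as an octal literal

    # 700 = read, write, and execute for the owner; nothing for group or others
    assert mode == stat.S_IRWXU
    assert mode & stat.S_IRWXG == 0
    assert mode & stat.S_IRWXO == 0

    # Rendered the way `ls -l` would show it for a directory
    print(stat.filemode(mode | stat.S_IFDIR))  # drwx------
    ```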

    This setup is the foundation of a distributed file system that can be used by multiple EC2 instances hosting training jobs. Each EC2 instance can mount the EFS file system through the access point, ensuring that every instance sees the same training data while maintaining isolation and security through POSIX permissions.
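    With the `amazon-efs-utils` package installed on an instance, mounting through the access point is a single command. A sketch using hypothetical file system and access point IDs (in practice you would use the IDs exported by the Pulumi program):

    ```shell
    # Install the EFS mount helper (Amazon Linux; use your distro's package manager otherwise)
    sudo yum install -y amazon-efs-utils

    # Mount the file system via the access point, with TLS-encrypted transit.
    # fs-... and fsap-... are placeholder IDs -- substitute the exported values.
    sudo mkdir -p /mnt/training
    sudo mount -t efs -o tls,accesspoint=fsap-0123456789abcdef0 \
        fs-0123456789abcdef0 /mnt/training
    ```

    Because the access point enforces UID/GID 1001 and roots the mount at `/trainingData`, every process writing under `/mnt/training` operates as that POSIX identity inside that directory, regardless of which instance it runs on.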