Mounting EFS for Distributed Machine Learning Training

Question

Pulumi · Accepted Answer

To set up a distributed machine learning training system using Amazon Elastic File System (EFS), you'll need to mount your EFS file system on multiple Amazon EC2 instances. This lets all of the training processes read and write a common set of files, which many machine learning frameworks need for shared datasets, checkpoints, and parameters.

Here's what we need to do for this setup:

1. Create an EFS FileSystem: This is the storage that will be shared across all EC2 instances.
2. Set up Mount Targets: These are network interfaces in your VPC subnets that allow EC2 instances to connect to the file system.
3. Configure Security Groups: These control the traffic to and from the mount targets and the EC2 instances.
4. Launch your EC2 Instances and mount the file system within each instance.

The following Pulumi program written in Python accomplishes these tasks:

```python
import pulumi
import pulumi_aws as aws

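# Create a security group shared by the EC2 instances and the EFS mount targets.
# No vpc_id is given, so it is created in the default VPC; the subnets used
# below must belong to the same VPC.
sg = aws.ec2.SecurityGroup('ml-training-sg',
    description='Enable SSH and EFS',
    ingress=[
        # SSH access from anywhere (narrow this to trusted IPs outside of testing)
        aws.ec2.SecurityGroupIngressArgs(
            protocol='tcp',
            from_port=22,
            to_port=22,
            cidr_blocks=['0.0.0.0/0'],
        ),
        # NFS (port 2049) access for EFS. 0.0.0.0/0 is a placeholder:
        # restrict it to your VPC's CIDR range.
        aws.ec2.SecurityGroupIngressArgs(
            protocol='tcp',
            from_port=2049,
            to_port=2049,
            cidr_blocks=['0.0.0.0/0'],
        ),
    ],
    egress=[
        # Allow all outbound traffic so instances in this group can reach the
        # mount targets; the provider does not keep AWS's default egress rule.
        aws.ec2.SecurityGroupEgressArgs(
            protocol='-1',
            from_port=0,
            to_port=0,
            cidr_blocks=['0.0.0.0/0'],
        ),
    ],
)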

# Create an EFS file system
efs_filesystem = aws.efs.FileSystem('ml-training-efs')

# Create one mount target per Availability Zone (EFS allows at most one per zone
# for a given file system). Two zones are hardcoded here for simplicity; in a
# real deployment, derive them from your VPC and EC2 instance configuration.
availability_zones = ['us-west-2a', 'us-west-2b']
# Replace with your real subnet IDs, listed in the same order as the zones above
subnet_ids = ['subnet-abcdefgh', 'subnet-ijklmnop']

mount_targets = []
for zone, subnet_id in zip(availability_zones, subnet_ids):
    mount_target = aws.efs.MountTarget(
        f"mount-target-{zone}",
        file_system_id=efs_filesystem.id,
        subnet_id=subnet_id,
        security_groups=[sg.id],
    )
    mount_targets.append(mount_target)

# Export each mount target's ID as a stack output
for i, mount_target in enumerate(mount_targets):
    pulumi.export(f'mount_target_{i}', mount_target.id)
```

This program starts by importing the required Pulumi and AWS modules.

It then creates a security group whose ingress rules allow SSH (port 22) and NFS (port 2049, the protocol EFS uses) connections.
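The inline comment in the program notes that the NFS rule should ideally be limited to your VPC's IP range. As a minimal sketch of that idea, the hypothetical helper `nfs_ingress_rule` below (not part of Pulumi or the original program) builds the rule's fields for a given CIDR and rejects a wide-open range. The CIDR `10.0.0.0/16` is only an example value.

```python
import ipaddress


def nfs_ingress_rule(vpc_cidr: str) -> dict:
    """Build the fields of an NFS (TCP 2049) ingress rule limited to one CIDR."""
    # ip_network raises ValueError if the CIDR is malformed
    network = ipaddress.ip_network(vpc_cidr)
    if network.prefixlen == 0:
        raise ValueError("refusing to open NFS to every address")
    return {
        "protocol": "tcp",
        "from_port": 2049,
        "to_port": 2049,
        "cidr_blocks": [str(network)],
    }


# Example CIDR only; substitute your VPC's actual range
rule = nfs_ingress_rule("10.0.0.0/16")
print(rule)
```

The keys match the keyword arguments used in the program above, so the result could be passed along as `aws.ec2.SecurityGroupIngressArgs(**rule)`.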

Next, it creates an EFS filesystem.

Following that, it sets up mount targets for the EFS file system in two subnets. In a real-world scenario, create a mount target in each Availability Zone where you run EC2 instances, so that every instance can reach the file system through a mount target in its own zone; the two zones here are examples. Make sure to replace `'subnet-abcdefgh'` and `'subnet-ijklmnop'` with the actual subnet IDs where you want the mount targets to be located.
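If you have several subnets per zone, you need to pick exactly one per Availability Zone, because EFS accepts only one mount target per zone for a file system. The sketch below shows one way to do that in plain Python. The helper `one_subnet_per_zone` and the record shape (dicts with `subnet_id` and `availability_zone` keys) are assumptions for illustration, and the third subnet ID is made up.

```python
def one_subnet_per_zone(subnets):
    """Return {availability_zone: subnet_id}, keeping the first subnet seen in each zone."""
    chosen = {}
    for subnet in subnets:
        # setdefault only stores a value the first time a zone is encountered
        chosen.setdefault(subnet["availability_zone"], subnet["subnet_id"])
    return chosen


subnets = [
    {"subnet_id": "subnet-abcdefgh", "availability_zone": "us-west-2a"},
    {"subnet_id": "subnet-ijklmnop", "availability_zone": "us-west-2b"},
    # Hypothetical second subnet in us-west-2a; it is skipped
    {"subnet_id": "subnet-qrstuvwx", "availability_zone": "us-west-2a"},
]

zone_to_subnet = one_subnet_per_zone(subnets)
for zone, subnet_id in zone_to_subnet.items():
    print(f"mount-target-{zone} -> {subnet_id}")
```

The resulting mapping can drive the mount target loop in place of the two hardcoded lists.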

Finally, it exports the ID of each mount target as a stack output, in a separate loop that runs after the mount targets are created. Note that mounting from an instance uses the file system's ID or DNS name rather than a mount target ID, so you may also want to export `efs_filesystem.id`.

After running this Pulumi program and standing up your infrastructure, the next steps are to launch your EC2 instances using your machine learning AMI of choice, associate them with the created security group, and mount the EFS file system within each instance using Linux's `mount` command or an `/etc/fstab` entry. The instances need an NFS client (or the `amazon-efs-utils` package) installed for this.
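To make the mount step concrete, here is a small sketch that assembles the `mount` command and a matching `/etc/fstab` line for the standard NFS client. The helper names, the file system ID `fs-12345678`, and the mount point `/mnt/efs` are placeholders chosen for illustration. The DNS name format and NFS options follow the AWS mounting guide linked below, so check them against the current documentation before relying on them.

```python
# NFS options recommended by AWS for mounting EFS with the standard NFS client
NFS_OPTIONS = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"


def efs_dns_name(file_system_id, region):
    """Regional DNS name that resolves to the mount target in the caller's zone."""
    return f"{file_system_id}.efs.{region}.amazonaws.com"


def mount_command(file_system_id, region, mount_point="/mnt/efs"):
    """Shell command that mounts the file system's root at mount_point."""
    dns = efs_dns_name(file_system_id, region)
    return f"sudo mount -t nfs4 -o {NFS_OPTIONS} {dns}:/ {mount_point}"


def fstab_entry(file_system_id, region, mount_point="/mnt/efs"):
    """Line for /etc/fstab; _netdev delays the mount until networking is up."""
    dns = efs_dns_name(file_system_id, region)
    return f"{dns}:/ {mount_point} nfs4 {NFS_OPTIONS},_netdev 0 0"


# Placeholder ID; use the ID of the file system created by the Pulumi program
print(mount_command("fs-12345678", "us-west-2"))
print(fstab_entry("fs-12345678", "us-west-2"))
```

Create the mount point first (for example with `sudo mkdir -p /mnt/efs`) and run the command on every training instance so that all of them see the same directory.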

For details on mounting the file system from EC2 instances, refer to the AWS documentation: [Mounting file systems](https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html).

For more details, see the documentation for the Pulumi resources used here:
- [AWS EFS FileSystem](https://www.pulumi.com/registry/packages/aws/api-docs/efs/filesystem/)
- [AWS EFS MountTarget](https://www.pulumi.com/registry/packages/aws/api-docs/efs/mounttarget/)
- [AWS EC2 SecurityGroup](https://www.pulumi.com/registry/packages/aws/api-docs/ec2/securitygroup/)