Durable File Storage for AI Model Checkpoints

Pulumi · Accepted Answer

When training AI models, you need durable, reliable storage for model checkpoints. A checkpoint is a snapshot of your model saved at intervals during training, so you can resume training from that point if needed, or load the saved model for predictions later.
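To make the save-and-resume idea concrete, here is a minimal sketch that uses only the Python standard library. It assumes the checkpoint state is a small JSON-serializable dictionary; a real training job would use its framework's own serialization instead. The function names `save_checkpoint` and `load_latest_checkpoint` and the file naming scheme are illustrative assumptions, not part of any library.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory, step, state):
    """Write a checkpoint for `step` and return its final path.

    The data is written to a temporary file in the same directory and then
    renamed, so a reader never sees a half-written checkpoint file.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexical order equal to numeric order.
    final_path = directory / f"checkpoint-{step:08d}.json"

    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # Push the data to storage before renaming.
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(directory):
    """Return the newest checkpoint as a dict, or None if there is none."""
    candidates = sorted(Path(directory).glob("checkpoint-*.json"))
    if not candidates:
        return None
    with open(candidates[-1]) as f:
        return json.load(f)
```

A training loop would call `save_checkpoint(checkpoint_dir, step, state)` every N steps and call `load_latest_checkpoint(checkpoint_dir)` once at startup to decide where to resume. Here `checkpoint_dir` is whatever directory the shared storage is mounted at, for example a hypothetical `/mnt/efs/checkpoints`.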

For this purpose, we'll use Amazon Elastic File System (EFS), a fully managed service that makes it easy to set up and scale file storage in the AWS Cloud. EFS offers high availability and durability, and multiple EC2 instances and containerized services can access the same file system simultaneously. That makes it a good fit for distributed training and for sharing state across compute instances.

Below is a program written in Python using Pulumi, which will create an EFS file system, along with the necessary networking components to make it accessible.

```python
import pulumi
import pulumi_aws as aws

# Create a new VPC for our file system, to keep things isolated and manageable.
vpc = aws.ec2.Vpc("ai_model_vpc",
    cidr_block="10.100.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True
)

# Create a subnet, a section of the VPC with its own CIDR block. The EFS mount target will live in this subnet.
subnet = aws.ec2.Subnet("ai_model_subnet",
    vpc_id=vpc.id,
    cidr_block="10.100.1.0/24",
    map_public_ip_on_launch=True # Instances launched into this subnet receive a public IP address
)

# Create an internet gateway to allow communication between the VPC and the internet.
internet_gateway = aws.ec2.InternetGateway("ai_model_ig",
    vpc_id=vpc.id
)

# Create a routing table and a route that directs traffic from the subnet to the internet via the internet gateway.
route_table = aws.ec2.RouteTable("ai_model_rt",
    vpc_id=vpc.id,
    routes=[
        {
            "cidr_block": "0.0.0.0/0",
            "gateway_id": internet_gateway.id
        }
    ]
)

# Associate the route table with our subnet.
route_table_assoc = aws.ec2.RouteTableAssociation("ai_model_rta",
    subnet_id=subnet.id,
    route_table_id=route_table.id
)

# Create a security group to control the flow of traffic to our instances and EFS.
security_group = aws.ec2.SecurityGroup("ai_model_sg",
    description="Allow access to EFS from within the VPC",
    vpc_id=vpc.id,
    ingress=[ # Allow NFS (TCP 2049) from any address inside the VPC, matching the description above.
        {"protocol": "tcp", "from_port": 2049, "to_port": 2049, "cidr_blocks": [vpc.cidr_block]}
    ],
    egress=[ # Allow all outbound traffic.
        {"protocol": "-1", "from_port": 0, "to_port": 0, "cidr_blocks": ["0.0.0.0/0"]}
    ]
)

# Create the Elastic File System
efs_file_system = aws.efs.FileSystem("ai_model_efs",
    lifecycle_policies=[ # Current pulumi_aws releases take a list of policies here
        {"transition_to_ia": "AFTER_30_DAYS"} # Move files not accessed for 30 days to the Infrequent Access storage class
    ],
    tags={
        "Name": "AIModelCheckpoints",
    }
)

# Create a mount target in our subnet. This is the network endpoint that EC2 instances connect to when they mount the file system.
efs_mount_target = aws.efs.MountTarget("ai_model_efs_mt",
    file_system_id=efs_file_system.id,
    subnet_id=subnet.id,
    security_groups=[security_group.id]
)

# Export the EFS file system ID and the mount target IP address. EC2 instances use these values to mount the file system.
pulumi.export('efs_id', efs_file_system.id)
pulumi.export('mount_target_ip_address', efs_mount_target.ip_address)
```

In the above program:
- We set up a dedicated VPC and a subnet for networking. The mount target only has a private IP address, so the file system is reachable only from inside our defined network.
- We create an internet gateway and a route table so that instances in the subnet can reach the internet. EFS traffic itself stays inside the VPC and does not depend on them.
- We create a security group that allows inbound TCP traffic on port 2049, the standard NFS port (EFS is accessed over NFS).
- We define an EFS file system with a lifecycle policy that lowers storage cost by moving files that haven't been accessed for 30 days to a cheaper storage class.
- We create a mount target in the subnet so that EC2 instances can connect to the file system.
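A mount target lives in exactly one subnet, and EFS allows one mount target per Availability Zone. If you later want mount targets in several zones, you need one non-overlapping subnet per zone. The sketch below uses the standard `ipaddress` module to carve such blocks out of the VPC range used above. The helper name `plan_subnets` and the choice of `/24` blocks are assumptions for illustration.

```python
import ipaddress
from itertools import islice

VPC_CIDR = "10.100.0.0/16"     # Same block as the VPC in the program above
SUBNET_CIDR = "10.100.1.0/24"  # Same block as the subnet in the program above


def plan_subnets(vpc_cidr, count, new_prefix=24, skip=1):
    """Return `count` non-overlapping subnet CIDRs carved from `vpc_cidr`.

    `skip` leaves the first blocks unused; with skip=1 the plan starts at
    the second /24, which matches the subnet used in the program above.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)
    return [str(block) for block in islice(blocks, skip, skip + count)]


def is_inside(subnet_cidr, vpc_cidr):
    """Check that a subnet block lies entirely within the VPC block."""
    return ipaddress.ip_network(subnet_cidr).subnet_of(ipaddress.ip_network(vpc_cidr))
```

For example, `plan_subnets(VPC_CIDR, 3)` yields three `/24` blocks starting at `10.100.1.0/24`. Each block could back one `aws.ec2.Subnet` and one `aws.efs.MountTarget` in its own Availability Zone.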

After deploying this infrastructure with `pulumi up`, we can mount the file system on an EC2 instance that can reach the mount target and store our AI model checkpoints there.
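As a rough sketch of the mounting step, the helper below builds the NFS mount command from the two exported values. It only formats strings and does not run anything. The mount options are the ones AWS documents for mounting EFS with the standard NFS client; the mount point `/mnt/efs`, the function names, and the example ID and IP address are placeholders, not values from a real deployment.

```python
# NFS client options that AWS recommends for EFS.
NFS_OPTIONS = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"


def efs_dns_name(file_system_id, region):
    """Build the regional DNS name of an EFS file system.

    The name resolves only inside a VPC with DNS support and DNS hostnames
    enabled, which the VPC in the program above has.
    """
    return f"{file_system_id}.efs.{region}.amazonaws.com"


def mount_command(target, mount_point="/mnt/efs"):
    """Return the shell command that mounts the file system root.

    `target` is either the DNS name from efs_dns_name() or the exported
    mount target IP address.
    """
    return f"sudo mount -t nfs4 -o {NFS_OPTIONS} {target}:/ {mount_point}"


if __name__ == "__main__":
    # Placeholder values; substitute the outputs of `pulumi stack output`.
    print(mount_command(efs_dns_name("fs-0123456789abcdef0", "us-east-1")))
    print(mount_command("10.100.1.25"))
```

The instance needs an NFS client installed and the mount point directory must exist before the command is run.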

This program does not include the creation of EC2 instances and assumes that appropriate AWS configurations are in place. If you plan to use this in a production environment, make sure to update the security rules to fit your specific use case and properly manage access control.